<html xmlns="http://www.w3.org/1999/xhtml"><head><title>Text Encoding</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body><div class="sect1" title="Text Encoding"><div class="titlepage"><div><div><h1 class="title"><a id="learnjava3-CHP-4-SECT-1"/>Text Encoding</h1></div></div></div><p><a id="idx10144" class="indexterm"/> <a id="I_indexterm4_id647012" class="indexterm"/> <a id="idx10207" class="indexterm"/> <a id="idx10211" class="indexterm"/> <a id="I_indexterm4_id647048" class="indexterm"/> Java is a language for the Internet. Since the citizens of
the Net speak and write in many different human languages, Java must be
able to handle a large number of languages as well. One of the ways in
which Java supports internationalization is through the Unicode character
set. Unicode is a worldwide standard that supports the scripts of most
languages.<sup>[<a id="learnjava3-CHP-4-FNOTE-1" href="#ftn.learnjava3-CHP-4-FNOTE-1" class="footnote">6</a>]</sup> Java 7 bases its character and string
data on the Unicode 6.0 standard and uses at least two bytes to
represent each character internally.</p><p>Java source code can be written using Unicode and stored in any
number of character encodings, ranging from a full binary form to
ASCII-encoded Unicode character values. This makes Java a friendly
language for non-English-speaking programmers, who can use their native
language for class, method, and variable names just as they can for the
text displayed by the application.</p><p>The Java <a id="I_indexterm4_id647092" class="indexterm"/><code class="literal">char</code> type and <a id="I_indexterm4_id647105" class="indexterm"/><a id="I_indexterm4_id647110" class="indexterm"/><a id="I_indexterm4_id647118" class="indexterm"/><a id="I_indexterm4_id647124" class="indexterm"/><code class="literal">String</code> class natively
support Unicode values. Internally, text is stored as 16-bit code units
in the UTF-16 encoding; the Java language and APIs make this largely
transparent, so you will not generally have to think
about it. Unicode is also very ASCII-friendly (ASCII is the most common
character encoding for English). The first 256 characters are defined to
be identical to the first 256 characters in the ISO 8859-1 (Latin-1)
character set, so Unicode is effectively backward-compatible with the most
common English character sets. Furthermore, one of the most common file
encodings for Unicode, called UTF-8, preserves ASCII values in their
single-byte form. A slightly modified form of this encoding is used for string data in compiled Java class
files, so storage remains compact for English text.</p><p>Most platforms can’t display all currently defined Unicode
characters. As a result, Java programs can be written with special Unicode
escape sequences. A Unicode character can be represented with this escape
sequence:</p><a id="I_4_tt113"/><pre class="programlisting"> <code class="err">\</code><code class="n">u</code><em class="replaceable"><code><code class="n">xxxx</code></code></em></pre><p><em class="replaceable"><code>xxxx</code></em> is a sequence of exactly four
hexadecimal digits. The escape sequence indicates an ASCII-encoded Unicode
character. Some Java tools also use this form to write Unicode
characters in environments that don’t otherwise support them. Java
also comes with classes to read and write Unicode character streams in
specific encodings, including UTF-8.<a id="I_indexterm4_id647169" class="indexterm"/><a id="I_indexterm4_id647176" class="indexterm"/><a id="I_indexterm4_id647183" class="indexterm"/></p><div class="footnotes"><br/><hr/><div class="footnote"><p><sup>[<a id="ftn.learnjava3-CHP-4-FNOTE-1" href="#learnjava3-CHP-4-FNOTE-1" class="para">6</a>] </sup>For more information about Unicode, see <a class="ulink" href="http://www.unicode.org">http://www.unicode.org</a>. Ironically, one of the scripts
long listed as “obsolete and archaic” and unsupported by the
Unicode standard was Javanese, the traditional script of the people of
the island of Java; it was added only in Unicode 5.2.</p></div></div></div></body></html>