<html xmlns="http://www.w3.org/1999/xhtml"><head><title>XML Basics</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body><div class="sect1" title="XML Basics"><div class="titlepage"><div><div><h1 class="title"><a id="learnjava3-CHP-24-SECT-2"/>XML Basics</h1></div></div></div><p>The basic syntax of XML is extremely simple. If you’ve worked with
HTML, you’re already halfway there. As with HTML, XML represents
information as text using <a id="I_indexterm24_id828088" class="indexterm"/><span class="emphasis"><em>tags</em></span> to add structure. A tag begins
with a name sandwiched between less than (<) and greater than (>)
characters. Unlike HTML tags, XML tags must always be
<span class="emphasis"><em>balanced</em></span>; in other words, every opening tag must
be matched by a corresponding closing tag. A closing tag looks just like the opening
tag but starts with a less-than sign and a slash (</). An opening tag,
closing tag, and any content in between are collectively referred to as an
<span class="emphasis"><em>element</em></span> of the XML document. Elements can contain
other elements, but they must be properly nested (all tags started within
an element must be closed before the element itself is closed). Elements
can also contain plain text or a mixture of elements and text (called
mixed content). Comments are enclosed between <code class="literal"><!--</code> and <code class="literal">--></code> markers. Here are a few examples:</p><a id="I_24_tt1293"/><pre class="programlisting"><code class="o"><!--</code> <code class="n">Simple</code> <code class="o">--></code>
<code class="o"><</code><code class="n">Sentence</code><code class="o">></code><code class="n">This</code> <code class="n">is</code> <code class="n">text</code><code class="o">.</</code><code class="n">Sentence</code><code class="o">></code>
<code class="o"><!--</code> <code class="n">Element</code> <code class="o">--></code>
<code class="o"><</code><code class="n">Paragraph</code><code class="o">><</code><code class="n">Sentence</code><code class="o">></code><code class="n">This</code> <code class="n">is</code> <code class="n">text</code><code class="o">.</</code><code class="n">Sentence</code><code class="o">></</code><code class="n">Paragraph</code><code class="o">></code>
<code class="o"><!--</code> <code class="n">Mixed</code> <code class="o">--></code>
<code class="o"><</code><code class="n">Paragraph</code><code class="o">></code>
<code class="o"><</code><code class="n">Sentence</code><code class="o">></code><code class="n">This</code> <code class="o"><</code><code class="n">verb</code><code class="o">></code><code class="n">is</code><code class="o"></</code><code class="n">verb</code><code class="o">></code> <code class="n">text</code><code class="o">.</</code><code class="n">Sentence</code><code class="o">></code>
<code class="o"></</code><code class="n">Paragraph</code><code class="o">></code>
<code class="o"><!--</code> <code class="n">Empty</code> <code class="o">--></code>
<code class="o"><</code><code class="n">PageBreak</code><code class="o">></</code><code class="n">PageBreak</code><code class="o">></code></pre><p>An empty tag can be written more compactly in a special form using a
single tag ending with a slash and a greater-than sign (/>):</p><a id="I_24_tt1294"/><pre class="programlisting"><code class="o"><</code><code class="n">PageBreak</code><code class="o">/></code></pre><div class="sect2" title="Attributes"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-24-SECT-2.1"/>Attributes</h2></div></div></div><p><a id="idx11188" class="indexterm"/> <a id="idx11209" class="indexterm"/>An XML element can contain
<span class="emphasis"><em>attributes</em></span>, which are simple name-value pairs
supplied inside the start tag.</p><a id="I_24_tt1295"/><pre class="programlisting"><code class="o"><</code><code class="n">Document</code> <code class="n">type</code><code class="o">=</code><code class="s">"LEGAL"</code> <code class="n">id</code><code class="o">=</code><code class="s">"42"</code><code class="o">>...</</code><code class="n">Document</code><code class="o">></code>
<code class="o"><</code><code class="n">Image</code> <code class="n">name</code><code class="o">=</code><code class="s">"truffle.jpg"</code><code class="o">/></code></pre><p><a id="I_indexterm24_id828191" class="indexterm"/> <a id="I_indexterm24_id828197" class="indexterm"/> <a id="I_indexterm24_id828204" class="indexterm"/> <a id="I_indexterm24_id828210" class="indexterm"/>The attribute value must always be enclosed in quotes. You
can use double (<code class="literal">"</code>) or single
(<code class="literal">'</code>) quotes. Single quotes are useful
if the value contains double quotes.</p><p>Attributes are intended to be used for simple, unstructured
properties or compact identifiers associated with the element data. It
is always possible to make an attribute into a child element, so,
strictly speaking, there is no real need for attributes. But they often
make the XML easier to read and more logical. In the case of the
<code class="literal">Document</code> element in our preceding
snippet, the attributes <code class="literal">type</code> and
<code class="literal">id</code> represent metadata about the
document. We might expect that a Java class representing the <code class="literal">Document</code> would have an enumeration of document
types such as <code class="literal">LEGAL</code>. In the case of
the <code class="literal">Image</code> element, the attribute is
simply a more compact way of including the filename. As a rule,
attributes should be compact, with little significant internal structure
(URLs push the envelope); by contrast, child elements can have arbitrary
complexity.</p><p>The <code class="literal">id</code> attribute in the
previous example may have special significance when used with a
corresponding <code class="literal">idref</code> attribute.
Together, these standard attributes are used with document validation to
enforce referential integrity in documents. When validated, an <code class="literal">id</code> attribute value must be unique within the
document and an <code class="literal">idref</code> attribute value
must refer to a valid <code class="literal">id</code> within the
document.<a id="I_indexterm24_id828303" class="indexterm"/><a id="I_indexterm24_id828310" class="indexterm"/></p></div><div class="sect2" title="XML Documents"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-24-SECT-2.2"/>XML Documents</h2></div></div></div><p><a id="idx11189" class="indexterm"/> <a id="I_indexterm24_id828334" class="indexterm"/> <a id="I_indexterm24_id828340" class="indexterm"/>An XML document begins with a header like the following
and contains exactly one <span class="emphasis"><em>root element</em></span>:</p><a id="I_24_tt1296"/><pre class="programlisting"><code class="o"><?</code><code class="n">xml</code> <code class="n">version</code><code class="o">=</code><code class="s">"1.0"</code> <code class="n">encoding</code><code class="o">=</code><code class="s">"UTF-8"</code><code class="o">?></code>
<code class="o"><</code><code class="n">MyDocument</code><code class="o">></code>
<code class="o"></</code><code class="n">MyDocument</code><code class="o">></code></pre><p>The header identifies the version of XML and the character
encoding used. The root element is simply the top of the element
hierarchy, which can be considered a tree. Strictly speaking, XML 1.0 treats
the header as optional, but including it is good practice. XML text without a
single root element (as in our earlier simple examples) is technically called an XML
<span class="emphasis"><em>fragment</em></span>.</p></div><div class="sect2" title="Encoding"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-24-SECT-2.3"/>Encoding</h2></div></div></div><p><a id="idx11192" class="indexterm"/> <a id="idx11212" class="indexterm"/>The default encoding for an XML document is UTF-8, the
ASCII-friendly, variable-length Unicode encoding built on 8-bit units. This encoding preserves ASCII
values, so plain English text is unaltered by it. It also allows other Unicode
values to be stored in a reasonably efficient way. An XML document may
specify another encoding using the encoding attribute of the XML
header.</p><p>Within an XML document, certain characters are necessarily
sacrosanct: for example, the <code class="literal"><</code> and
<a id="I_indexterm24_id828414" class="indexterm"/><a id="I_indexterm24_id828419" class="indexterm"/><code class="literal">></code> characters that
indicate element tags. When you need to include these in your text, you
must encode them. XML provides an escape mechanism called “entities”
that allows for encoding special structures. XML has five predefined
entities, as shown in <a class="xref" href="ch24s03.html#learnjava3-CHP-24-TABLE-1" title="Table 24-1. XML entities">Table 24-1</a>.<a id="I_indexterm24_id828439" class="indexterm"/></p><div class="table"><a id="learnjava3-CHP-24-TABLE-1"/><p class="title">Table 24-1. XML entities</p><div class="table-contents"><table summary="XML entities" style="border-collapse: collapse;border-top: 0.5pt solid ; border-bottom: 0.5pt solid ; "><colgroup><col/><col/></colgroup><thead><tr><th style="text-align: left"><p>Entity</p></th><th style="text-align: left"><p>Encodes</p></th></tr></thead><tbody><tr><td style="text-align: left"><p> <a id="I_indexterm24_id828490" class="indexterm"/> <code class="literal">&amp;</code>
</p></td><td style="text-align: left"><p>& (ampersand)</p></td></tr><tr><td style="text-align: left"><p> <code class="literal">&lt;</code> </p></td><td style="text-align: left"><p>< (less than)</p></td></tr><tr><td style="text-align: left"><p> <a id="I_indexterm24_id828531" class="indexterm"/> <a id="I_indexterm24_id828538" class="indexterm"/> <code class="literal">&gt;</code>
</p></td><td style="text-align: left"><p>> (greater than)</p></td></tr><tr><td style="text-align: left"><p> <code class="literal">&quot;</code> </p></td><td style="text-align: left"><p> <code class="literal">"</code>
(quotation mark)</p></td></tr><tr><td style="text-align: left"><p> <code class="literal">&apos;</code> </p></td><td style="text-align: left"><p> <code class="literal">'</code>
(apostrophe)</p></td></tr></tbody></table></div></div><p>An alternative to encoding text in this way is to use a special
“unparsed” section of text called a character data (CDATA) section. A
CDATA section starts with the cryptic string <a id="I_indexterm24_id828606" class="indexterm"/><code class="literal"><![CDATA[</code> and ends
with <code class="literal">]]></code>, like this:</p><a id="I_24_tt1297"/><pre class="programlisting"><code class="o"><![</code><code class="n">CDATA</code><code class="o">[</code> <code class="n">Learning</code> <code class="n">Java</code><code class="o">,</code> <code class="n">O</code><code class="err">'</code><code class="n">Reilly</code> <code class="o">&</code> <code class="n">Associates</code> <code class="o">]]></code></pre><p>The CDATA section looks a little like a comment, but the data is
still part of the document, just opaque to the parser.</p><p>There is one more alternative, which is to use a special
<a id="I_indexterm24_id828640" class="indexterm"/><code class="literal"><include></code>
directive to include the contents of a URL or file either as pre-escaped
text or optionally parsed as XML. XML includes are very convenient, and
we’ll talk about them later in this chapter.<a id="I_indexterm24_id828656" class="indexterm"/><a id="I_indexterm24_id828664" class="indexterm"/></p></div><div class="sect2" title="Namespaces"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-24-SECT-2.4"/>Namespaces</h2></div></div></div><p><a id="idx11198" class="indexterm"/> <a id="idx11216" class="indexterm"/>You’ve probably seen that HTML has a <code class="literal"><body></code> tag that is used to structure web
pages. Suppose for a moment that we are writing XML for a funeral home
that also uses the tag <code class="literal"><body></code>
for some other, more macabre, purpose. This could be a problem if we
want to mix HTML with our mortuary information.</p><p>If you consider HTML and the funeral home tags to be languages in
this case, the elements (tag names) used in a document are really the
vocabulary of those languages. An XML <span class="emphasis"><em>namespace</em></span> is
a way of saying whose dictionary you are using for a given element,
allowing us to mix them freely. (Later, we’ll talk about XML Schemas,
which enforce the grammar and syntax of the language.)</p><p>A namespace is specified with the <a id="I_indexterm24_id828728" class="indexterm"/><code class="literal">xmlns</code> attribute, whose
value is a <a id="I_indexterm24_id828739" class="indexterm"/><a id="I_indexterm24_id828744" class="indexterm"/>Uniform Resource Identifier (URI) that uniquely defines
the set (and usually the meaning) of tags from that namespace:</p><a id="I_24_tt1298"/><pre class="programlisting"><code class="o"><</code><code class="n">element</code> <code class="n">xmlns</code><code class="o">=</code><code class="s">"namespaceURI"</code><code class="o">></code></pre><p>Recall from <a class="xref" href="ch14.html" title="Chapter 14. Programming for the Web">Chapter 14</a> that a URI is not
necessarily a URL. URIs are more general than URLs. In practical terms,
a URI is to be treated as a unique string. Often, the URI is in fact
also a URL for a document describing the namespace, but when it is, that is
only by convention.</p><p>An <code class="literal">xmlns</code> namespace attribute
can be applied to an element and affects all its (nested) children; this
is called a default namespace for the element:</p><a id="I_24_tt1299"/><pre class="programlisting"><code class="o"><</code><code class="n">body</code> <code class="n">xmlns</code><code class="o">=</code><code class="s">"http://funeral-procedures.org/"</code><code class="o">></code></pre><p>Often it is desirable to mix and match namespaces on a tag-by-tag
basis. To do this, we can add a colon and a name to the <code class="literal">xmlns</code> attribute to define a short identifier
for the namespace and then use that identifier as a prefix on the tags in
question. For example:</p><a id="I_24_tt1300"/><pre class="programlisting"><code class="o"><</code><code class="n">funeral</code> <code class="nl">xmlns:</code><code class="n">fun</code><code class="o">=</code><code class="s">"http://funeral-procedures.org/"</code><code class="o">></code>
<code class="o"><</code><code class="n">html</code><code class="o">><</code><code class="n">head</code><code class="o">></</code><code class="n">head</code><code class="o">><</code><code class="n">body</code><code class="o">></code>
<code class="o"><</code><code class="nl">fun:</code><code class="n">body</code><code class="o">></code><code class="n">Corpse</code> <code class="err">#</code><code class="mi">42</code><code class="o"></</code><code class="nl">fun:</code><code class="n">body</code><code class="o">></code>
<code class="o"></</code><code class="n">body</code><code class="o">></</code><code class="n">html</code><code class="o">></code>
<code class="o"></</code><code class="n">funeral</code><code class="o">></code></pre><p>In the preceding snippet of XML, we’ve qualified the body tag with
the prefix “fun:”, which we defined in the <code class="literal"><funeral></code> tag. In this case, we should
qualify the root tag as well, reflexively:</p><a id="I_24_tt1301"/><pre class="programlisting"><code class="o"><</code><code class="nl">fun:</code><code class="n">funeral</code> <code class="nl">xmlns:</code><code class="n">fun</code><code class="o">=</code><code class="s">"http://funeral-procedures.org/"</code><code class="o">></code></pre><p>The XML parser factories supplied with Java have a switch to
specify whether you want the parser to interpret namespaces. This switch
defaults to off for historical reasons.</p><a id="I_24_tt1302"/><pre class="programlisting"><code class="n">parserFactory</code><code class="o">.</code><code class="na">setNamespaceAware</code><code class="o">(</code> <code class="kc">true</code> <code class="o">);</code></pre><p>We’ll talk more about parsing in the sections on SAX and DOM later
in this chapter.<a id="I_indexterm24_id828844" class="indexterm"/><a id="I_indexterm24_id828851" class="indexterm"/></p></div><div class="sect2" title="Validation"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-24-SECT-2.5"/>Validation</h2></div></div></div><p><a id="idx11206" class="indexterm"/> <a id="I_indexterm24_id828877" class="indexterm"/>A document that conforms to the basic rules of XML with
proper encoding and balanced tags is called a <a id="I_indexterm24_id828887" class="indexterm"/><span class="emphasis"><em>well-formed</em></span> document. Just because a
document is syntactically correct, however, doesn’t mean that it makes
sense. Two related sets of tools, DTDs and XML Schemas, define ways to
provide a grammar for your XML elements. They allow you to create
syntactic rules, such as “a <code class="literal">City</code>
element can appear only once inside an <code class="literal">Address</code> element and comes before a <code class="literal">State</code> element.” XML Schema goes further to
provide a flexible language for describing the validity of data content
of the tags, including both simple and compound data types made of
numbers and strings.</p><p>A document that is checked against a DTD or XML Schema description
and follows the rules is called a <a id="I_indexterm24_id828924" class="indexterm"/><span class="emphasis"><em>valid</em></span> document. A document can be
well formed without being valid, but not vice versa.<a id="I_indexterm24_id828933" class="indexterm"/></p></div><div class="sect2" title="HTML to XHTML"><div class="titlepage"><div><div><h2 class="title"><a id="learnjava3-CHP-24-SECT-2.6"/>HTML to XHTML</h2></div></div></div><p><a id="I_indexterm24_id828947" class="indexterm"/> <a id="idx11207" class="indexterm"/>To speak very loosely, we could say that the most popular
and widely used form of XML in the world today is HTML. The terminology
is loose because HTML is not really well-formed XML. HTML tags violate
XML’s rule forbidding unbalanced elements; the common <a id="I_indexterm24_id828969" class="indexterm"/><code class="literal"><p></code> tag is
typically used without a closing tag, for example. HTML attributes also
don’t require quotes. XML tags are also case-sensitive; <code class="literal"><P></code> and <code class="literal"><p></code> are two different tags in XML. We
could generously say that HTML is “forgiving” with respect to details
like this, but as a developer, you know that sloppy syntax results in
ambiguity. XHTML is an alternative, strict XML version of HTML that is
clear and unambiguous. This form of HTML works in modern browsers.
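</p><p>The differences described above are easy to observe with a parser. The following is a minimal sketch, assuming the XML parser bundled with the JDK; the class name <code class="literal">WellFormedCheck</code> and the method <code class="literal">isWellFormed()</code> are hypothetical names chosen for this illustration. It feeds a few HTML-style snippets to a <code class="literal">DocumentBuilder</code> and reports whether each one is accepted as well-formed XML:</p>

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper class; the name is ours, not part of any library.
class WellFormedCheck {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String text) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Rethrow parse problems instead of letting the default
            // handler report them on System.err.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXException { throw e; }
                public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(text)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the markup
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String[] samples = {
            "<p>one<p>two",                           // unbalanced tags
            "<a href=index.html>home</a>",            // unquoted attribute value
            "<P>text</p>",                            // case mismatch
            "<p><a href=\"index.html\">home</a></p>"  // XHTML-style markup
        };
        for (String sample : samples) {
            System.out.println(isWellFormed(sample) + "  " + sample);
        }
    }
}
```

<p>Only the last sample, which closes every tag, quotes its attribute value, and uses consistent case, should be accepted; the first three are ordinary, forgiving HTML that an XML parser rejects.</p><p>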
Fortunately, if you want to switch, you don’t have to manually clean up
all your HTML documents; <a class="ulink" href="http://tidy.sourceforge.net">Tidy</a> is an open source program
that automatically converts HTML to XHTML, validates it, and corrects
common mistakes.<a id="I_indexterm24_id829009" class="indexterm"/></p></div></div></body></html>