SCDJWS Study Guide: XML Basic

XML Syntax

An XML document must be well formed in order to be considered XML at all. To be well formed, a document must adhere to the following strict syntax rules:

XML documents use a self-describing and simple syntax

<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>

The first line in the document - the XML declaration - defines the XML version of the document. In this case the document conforms to the 1.0 specification of XML.

Note that XML is case-sensitive, and the XML declaration must be written in lowercase.

The next line describes the root element of the document (as if it were saying: "this document is a note"):

<note>


The next 4 lines describe 4 child elements of the root (to, from, heading, and body):

<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>

And finally the last line defines the end of the root element:

</note>
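
The walk-through above can be checked with a parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the variable names and the text inside the `to`, `from` and `heading` elements are illustrative assumptions, not part of the rules themselves.

```python
import xml.etree.ElementTree as ET

# The note document: declaration, root element, four children, end of root.
NOTE_XML = """<?xml version="1.0"?>
<note>
<to>Tove</to>
<from>Jani</from>
<heading>Reminder</heading>
<body>Don't forget me this weekend!</body>
</note>"""

root = ET.fromstring(NOTE_XML)

# The root element is "note"; iterating over it yields its child elements.
child_names = [child.tag for child in root]

print(root.tag)
print(child_names)
```

The declaration does not show up as an element: the parser consumes it and reports only the root and its children.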


XML building blocks

Elements are the basic building blocks of XML markup. An element is delimited by a start tag and an end tag, both of which carry the element type name. The content between the start tag and the end tag of an element is the value of that element. If there is no content between the start tag and the end tag, the element is an empty element. Elements may have zero to n attributes; even empty elements can have associated attributes.
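
As a rough illustration of these building blocks, the sketch below parses a small document with Python's standard `xml.etree.ElementTree`. The `order`, `item` and `shipped` element names and their attributes are made up for the example.

```python
import xml.etree.ElementTree as ET

# One element with content, one empty element, and attributes on two of them.
doc = ET.fromstring(
    '<order id="42"><item>pen</item><shipped date="2001-05-01"/></order>'
)

item = doc.find("item")        # element with content and no attributes
shipped = doc.find("shipped")  # empty element that still carries an attribute

print(item.text, item.attrib)
print(shipped.text, shipped.attrib)
```

In this API the value of an empty element is reported as `None`, while its attributes are still available through `attrib`.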

All XML elements MUST have a closing tag

In HTML some elements do not have to have a closing tag. However, all XML elements must have a closing tag. The following code is legal in HTML:

<p>This is a paragraph
<p>This is another paragraph

In XML all elements must have a closing tag like this:

<p>This is a paragraph</p>
<p>This is another paragraph</p>

Note: You might have noticed from the previous example that the XML declaration did not have a closing tag. This is not an error. The declaration is not a part of the XML document itself. It is not an XML element, and it should not have a closing tag.
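
A parser makes the closing-tag rule easy to observe. This is a minimal sketch with Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the `doc` wrapper element are assumptions added for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# HTML-style paragraphs: the <p> elements are never closed.
unclosed = "<doc><p>This is a paragraph<p>This is another paragraph</doc>"

# The same content with every element closed.
closed = "<doc><p>This is a paragraph</p><p>This is another paragraph</p></doc>"

# The XML declaration has no closing tag and needs none.
with_declaration = '<?xml version="1.0"?><doc></doc>'

print(is_well_formed(unclosed), is_well_formed(closed), is_well_formed(with_declaration))
```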

All XML documents must have a root tag

The first tag in an XML document is the root tag. Every XML document must contain exactly one pair of tags that defines the root element, and all other elements must be nested within the root element.

All elements can have sub elements (children). Sub elements must come in pairs of start and end tags and be correctly nested within their parent element:

<root>
  <child>
    <subchild>.....</subchild>
  </child>
</root>
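
The single-root rule can be exercised with a short sketch, again assuming Python's standard `xml.etree.ElementTree`. The helper `is_well_formed` and the sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Exactly one root element, with everything else nested inside it.
single_root = "<root><child><subchild>text</subchild></child></root>"

# Two top-level elements: the second one falls outside the root.
two_roots = "<first>1</first><second>2</second>"

# Character data after the root element has closed.
trailing_text = "<root></root>stray text"

print(is_well_formed(single_root), is_well_formed(two_roots), is_well_formed(trailing_text))
```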


XML tags are case sensitive

Unlike HTML, XML tags are case sensitive. With XML, the tag <Message> is different from the tag <message>. According to the above syntax rule, opening and closing tags must therefore be written with the same case:

<Message>This is incorrect</message>
<message>This is correct</message>
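
The two lines above can be fed to a parser to see the difference. This sketch assumes Python's standard `xml.etree.ElementTree`; the `doc` wrapper element in the last example is made up.

```python
import xml.etree.ElementTree as ET

# Start and end tags differ only in case, so the parser rejects the element.
try:
    ET.fromstring("<Message>This is incorrect</message>")
    mixed_case_accepted = True
except ET.ParseError:
    mixed_case_accepted = False

# Matching case parses without complaint.
correct = ET.fromstring("<message>This is correct</message>")

# <Message> and <message> are two different element types.
tags = [child.tag for child in ET.fromstring("<doc><Message/><message/></doc>")]

print(mixed_case_accepted, correct.text, tags)
```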

All XML elements MUST be properly nested

Improper nesting of tags makes no sense to XML. Overlapping elements are not allowed: an element that is opened inside another element must be closed before that enclosing element is closed.

In HTML some elements can be improperly nested within each other like this:

<b><i>This text is bold and italic</b></i>

Because XML is strictly hierarchical, you have to be careful to close your child elements before you close their parents. (This is called properly nesting your tags.) In XML all elements must be properly nested within each other like this:

<b><i>This text is bold and italic</i></b>

Nesting tags can be used to express various structures. We can represent a list by using the same tag repeatedly.
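
Both points, the nesting rule and lists built from a repeated tag, can be sketched with Python's standard `xml.etree.ElementTree`. The `list` and `item` element names and their contents are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Overlapping elements: <b> is closed while <i> is still open.
try:
    ET.fromstring("<b><i>This text is bold and italic</b></i>")
    overlap_accepted = True
except ET.ParseError:
    overlap_accepted = False

# Properly nested: the child <i> is closed before its parent <b>.
nested = ET.fromstring("<b><i>This text is bold and italic</i></b>")

# A list expressed by repeating the same tag inside a parent element.
shopping = ET.fromstring(
    "<list><item>milk</item><item>bread</item><item>eggs</item></list>"
)
items = [item.text for item in shopping.findall("item")]

print(overlap_accepted, nested[0].tag, items)
```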

Attribute values MUST always be quoted and are case sensitive. It is illegal to omit quotation marks around attribute values. Each attribute of an element can be specified ONLY once, but in any order.

XML elements can carry attributes as name/value pairs in the element start tag. In XML the attribute value must always be quoted. Study the two XML documents below. The first one is incorrect, the second is correct:

<?xml version="1.0"?>
<note date=12/11/99>
<body>Don't forget me this weekend!</body>
</note>

<?xml version="1.0"?>
<note date="12/11/99">
<body>Don't forget me this weekend!</body>
</note>
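
The attribute rules (quoted values, each attribute at most once, any order) can be checked with a short sketch that assumes Python's standard `xml.etree.ElementTree`. The helper `is_well_formed` and the `id` attribute are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts text as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


unquoted = "<note date=12/11/99><body>Don't forget me this weekend!</body></note>"
quoted = '<note date="12/11/99"><body>Don\'t forget me this weekend!</body></note>'

# The same attribute name given twice on one element.
duplicated = '<note date="12/11/99" date="13/11/99"></note>'

# Attribute order carries no meaning: both forms yield the same mapping.
a_then_b = ET.fromstring('<note date="12/11/99" id="n1"/>').attrib
b_then_a = ET.fromstring('<note id="n1" date="12/11/99"/>').attrib

print(is_well_formed(unquoted), is_well_formed(quoted), is_well_formed(duplicated))
print(a_then_b == b_then_a)
```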

Attributes are used to attach additional, secondary information to an element. In a DTD, attributes can also be given default values, while elements cannot.

Legal XML names

The first character of a legal XML name must be a Unicode letter, an underscore or a colon. Each following character may be a Unicode letter, a Unicode digit, an underscore, a colon, a hyphen or a period. The colon character should not be used except as a namespace delimiter.
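
One rough way to try out the naming rules is to let a parser decide whether a candidate works as an element name. The sketch below assumes Python's standard `xml.etree.ElementTree`; the helper `is_legal_element_name` and the sample names are made up. Colons are left out because this parser is namespace-aware and treats a colon as a prefix separator.

```python
import xml.etree.ElementTree as ET


def is_legal_element_name(name: str) -> bool:
    """Return True if the parser accepts <name/> as an empty element."""
    try:
        return ET.fromstring(f"<{name}/>").tag == name
    except ET.ParseError:
        return False


# Start with a letter or underscore; digits, hyphens, periods may follow.
legal = ["note", "_note", "note-1.a", "Note2"]

# Start with a digit, hyphen or period, or contain a space.
illegal = ["1note", "-note", ".note", "my note"]

for name in legal + illegal:
    print(name, is_legal_element_name(name))
```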

White space in XML is preserved

With XML, the white space in your document is NOT truncated. This is unlike HTML. With HTML, a sentence like this:

Hello       my name is Tove,

will be displayed like this:

Hello my name is Tove,

because HTML reduces multiple, consecutive white space characters to a single white space.

With XML, white space is preserved. White space is a special category of characters that includes the space character, new lines (what you get when you hit the Enter key), and tabs. It is used to separate words, as well as to make text more readable.

White space stripping is very advantageous for a language like HTML, which has become primarily a means for displaying information. It allows the source for an HTML document to be formatted in a readable way for the person writing the HTML, while displaying it formatted in a readable, and possibly quite different, way for the user.

It is important to understand how XML deals with white space. Adding white space to the markup doesn't affect the document's content. Typically, you will add white space to indent child elements. These spaces could be there just to make the document easier to read, while not actually being part of its data. This "readability" white space is called extraneous white space.

But adding white space to the data between tags does affect the document's content: none of that white space will be stripped out. White space is significant if it is located within the data of an XML document.
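
The following sketch, assuming Python's standard `xml.etree.ElementTree`, shows that a parser hands white space in element data to the application unchanged. The `greeting` element name is made up for the example.

```python
import xml.etree.ElementTree as ET

# Runs of spaces inside element data survive parsing.
greeting = ET.fromstring("<greeting>Hello       my name is Tove,</greeting>")

# Indentation between tags is also reported; the application decides
# whether to treat it as extraneous.
indented = ET.fromstring("<note>\n  <to>Tove</to>\n</note>")
before_child = indented.text    # white space between <note> and <to>
after_child = indented[0].tail  # white space between </to> and </note>

print(repr(greeting.text))
print(repr(before_child), repr(after_child))
```

In this API the indentation shows up as the `text` of the parent and the `tail` of the child, so code that only reads the child elements is unaffected by it.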

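An XML parser normalizes white space within attribute values (a literal tab or line break in an attribute value is reported to the application as a space), so attribute values cannot be relied on to preserve white space exactly. XML authors should keep white space issues in mind when developing.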
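
Attribute-value normalization can be observed with a small sketch, assuming Python's standard `xml.etree.ElementTree` and no DTD. The element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

# A literal tab and a literal line break inside an attribute value.
literal = ET.fromstring('<item label="one\ttwo\nthree"/>').get("label")

# The same line break written as a character reference is kept as-is.
referenced = ET.fromstring('<item label="one&#10;two"/>').get("label")

print(repr(literal))
print(repr(referenced))
```

Writing the character as a numeric reference is the usual way to keep a real tab or line break in an attribute value.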

With XML, a new line is always stored as LF

Do you know what a typewriter is? Well, a typewriter is a mechanical device which was used last century to produce printed documents. After you have typed one line of text on a typewriter, you have to manually return the printing carriage to the left margin position and manually feed the paper up one line.

With XML, a new line is always stored as LF. This is the one form of white space normalization that XML performs on PCDATA: the handling of new line characters. The problem is that there are two characters that are used for new lines - the line feed (LF) character and the carriage return (CR) - and computers running Windows, computers running Unix, and Macintosh computers all use these characters differently.

In Windows applications, a new line in the text is normally stored as a CR LF pair (carriage return, line feed). In Unix applications, a new line is normally stored as an LF character. Some applications use only a CR character to store a new line.

For this reason, it was decided that XML parsers would change all new lines to a single line feed character before processing. This means that any XML application will know, no matter which operating system it's running under, that a new line will be represented by a single line feed character. This makes data exchange between multiple computers running different operating systems that much easier, since programmers don't have to deal with the (sometimes annoying) end-of-line logic.
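
A short sketch of this normalization, assuming Python's standard `xml.etree.ElementTree`; the `lines` element name is made up. The input mixes all three line-ending conventions.

```python
import xml.etree.ElementTree as ET

# CR LF (Windows), lone CR, and LF (Unix) in the same piece of text.
mixed = "<lines>one\r\ntwo\rthree\nfour</lines>"

# After parsing, every line break is a single line feed.
text = ET.fromstring(mixed).text

print(repr(text))
print(text.split("\n"))
```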

The use of the ampersand "&" symbol is RESERVED. XML uses this to define an entity reference

The ampersand symbol cannot be used by itself. XML predefines five entity references that every parser recognizes without a DTD: &lt;, &gt;, &amp;, &apos; and &quot;. These characters, and other symbols that you would want to place inside the XML file, can also be written as numeric character references using their decimal character code. Here is a good list:

Less-Than [ < ] "&#60;"
Greater-Than [ > ] "&#62;"
Ampersand [ & ] "&#38;"
Apostrophe [ ' ] "&#39;"
Quote [ " ] "&#34;"
Non Breaking Space (a forced space) "&#160;" (note that "&#32;" is an ordinary space)
Emdash "&#8212;" (two hyphens [ -- ] are "&#45;&#45;")
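
Entity and character references can be tried out with the sketch below, which assumes Python's standard `xml.etree.ElementTree` and `xml.sax.saxutils.escape`. The `m` element name and the sample strings are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entities and numeric references are resolved by the parser.
parsed_text = ET.fromstring("<m>1 &lt; 2 &amp; 3 &#62; 2</m>").text

# A bare ampersand is not allowed in element data.
try:
    ET.fromstring("<m>AT&T</m>")
    bare_ampersand_accepted = True
except ET.ParseError:
    bare_ampersand_accepted = False

# escape() replaces &, < and > so arbitrary text can be embedded safely.
escaped = escape("AT&T <rocks>")
round_trip = ET.fromstring(f"<m>{escaped}</m>").text

print(parsed_text)
print(bare_ampersand_accepted)
print(escaped, "->", round_trip)
```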
