SCDJWS Study Guide: XML Basic
An XML document usually starts with a declaration. It is often handy to be able to identify a document as being of a certain type. The XML declaration identifies a document as XML and gives the parser a few other pieces of information. An XML declaration is not strictly required, but you should include one anyway.
A typical XML declaration looks like this:
<?xml version='1.0' encoding='UTF-16' standalone='yes'?>
<name nickname='Shiny John'>
<!--John lost his middle name in a fire-->
The first line of the above example is the declaration:
<?xml version='1.0' encoding='UTF-16' standalone='yes'?>
The declaration has the following format:
<?xml version="version_number" encoding="encoding_declaration" standalone="standalone_status"?>
The XML declaration is not required in order for the XML document to be considered well formed; however, there are very few instances when you should not use the XML declaration. Unless you have a specific reason not to (such as working with a document fragment), you should always use it.
The XML declaration starts with the characters <?xml and ends with the characters ?>. Spaces are not allowed between the question marks and the angle brackets in the processing instruction delimiters.
The XML declaration has no closing tag; there is no such thing as </?xml>.
The XML declaration must be in lower case (except for the encoding declarations).
If you include it, it must be situated at the first character of the first line in the XML document. That is, the first character in the file should be that <; no line breaks or spaces. Some parsers are more forgiving about this than others.
If you include it, you must include the version number attribute, but the encoding and standalone attributes are optional.
The version, encoding, and standalone attributes must appear in the order shown above.
For almost all documents, the version should be 1.0. XML 1.1 was later published as a separate W3C Recommendation, but it is rarely used and many parsers support only 1.0. A parser written for the version 1.0 specification may reject a document that declares a version it does not support. The version number in the XML declaration signals which version of the specification your document claims to support.
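The encoding attribute is not required. If it is not specified, the parser will assume UTF-8, the standard 8-bit Unicode encoding, unless the document begins with a byte order mark that identifies it as UTF-16.
Use standalone="yes" if the document does not depend on any external markup declarations (for example, its DTD, if any, is entirely internal). Use standalone="no" if declarations in an external DTD can affect the document's content. If the document has no external markup declarations at all, the attribute has no effect.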
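Several of the rules above can be observed by feeding small documents to a real parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (which is backed by the Expat parser); the helper name is_accepted and the sample documents are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET


def is_accepted(document: bytes) -> bool:
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# A declaration that follows all of the rules.
good = b"<?xml version='1.0' encoding='UTF-8' standalone='yes'?><name/>"
# A space before the declaration: it no longer starts at the first character.
leading_space = b" <?xml version='1.0'?><name/>"
# The version attribute is missing.
no_version = b"<?xml encoding='UTF-8'?><name/>"
# standalone appears before encoding, which breaks the required order.
wrong_order = b"<?xml version='1.0' standalone='yes' encoding='UTF-8'?><name/>"

for label, doc in [
    ("good", good),
    ("leading space", leading_space),
    ("no version", no_version),
    ("wrong order", wrong_order),
]:
    print(label, "->", "accepted" if is_accepted(doc) else "rejected")
```

Other parsers may report these problems differently, so treat the output as the behavior of this one parser rather than a statement about all of them.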
The following table shows a list of the possible attributes that may be used in the XML declaration:
| Attribute Name | Possible Attribute Values | Attribute Description |
|---|---|---|
| version | 1.0 | Specifies the version of the XML standard that the XML document conforms to. The version attribute must be included if the XML declaration is present. |
| encoding | UTF-8, UTF-16, ISO-10646-UCS-2, ISO-10646-UCS-4, ISO-8859-1 to ISO-8859-9, ISO-2022-JP, Shift_JIS, EUC-JP | These are the encoding names of the most common character sets in use today. |
| standalone | yes, no | Use 'yes' if the XML document has no DTD or only an internal DTD. Use 'no' if the XML document relies on declarations in an external DTD or other external markup declarations. |
It should come as no surprise to us that text is stored in computers using numbers, since numbers are all that computers really understand.
A character code is a one-to-one mapping between a set of characters and the corresponding numbers to represent those characters.
A character encoding is the method used to represent the numbers in a character code digitally (in other words, how many bytes should be used for each number, and so on).
One character code/encoding that you might have come across is the American Standard Code for Information Interchange (ASCII). For example, in ASCII the character "a" is represented by the number 97, and the character "A" is represented by the number 65.
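The mapping between characters and numbers can be inspected directly. This short sketch uses Python's built-in ord and chr functions; the variable names are chosen here only for illustration.

```python
# Character code: each character maps to a number.
code_lower_a = ord("a")
code_upper_a = ord("A")
print(code_lower_a, code_upper_a)

# The mapping is one-to-one, so it can be reversed.
print(chr(97), chr(65))

# Character encoding: in ASCII, each number is stored as a single byte.
ascii_bytes = "aA".encode("ascii")
print(list(ascii_bytes))
```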
ASCII itself is a seven-bit code that defines 128 characters. The eight-bit encodings often loosely called "8-bit ASCII" use one byte (8 bits) for each character, which can store only 256 different values, so they are limited to 256 characters. That's enough to easily handle all of the characters needed for English, which is why ASCII was the predominant character encoding used on personal computers in the English-speaking world for many years. But there are far more than 256 characters in all of the world's languages, so ASCII can handle only a small subset of them. This is the reason that Unicode was invented.
Unicode is a character code designed from the ground up with internationalization in mind, aiming to have enough possible characters to cover all of the characters in any human language. There are two major character encodings for Unicode: UTF-16 and UTF-8. UTF-16 takes the simpler approach and uses two bytes for most characters (two bytes = 16 bits = 65,536 possible values); characters outside that range are encoded as a pair of two-byte units.
UTF-8 is more clever: it uses one byte for the characters covered by 7-bit ASCII, and then uses some tricks so that any other character may be represented by two or more bytes. This means that ASCII text can actually be considered a subset of UTF-8, and processed as such. For text written in English, where most of the characters fit into the ASCII character encoding, UTF-8 can result in smaller file sizes, but for text in scripts whose characters need three bytes in UTF-8, such as Chinese or Japanese, UTF-16 is usually smaller.
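The size trade-off between the two encodings can be measured. The sketch below is illustrative only: the helper encoded_sizes and the sample strings are made up for this example, and the "utf-16-le" codec is used so that the byte order mark Python would otherwise prepend does not distort the count.

```python
def encoded_sizes(text: str) -> dict:
    """Return the number of bytes the text occupies in UTF-8 and in UTF-16."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # no byte order mark
    }


english = "Hello, world"                      # 12 ASCII characters
japanese = "\u3053\u3093\u306b\u3061\u306f"   # 5 hiragana characters ("konnichiwa")

print("english :", encoded_sizes(english))
print("japanese:", encoded_sizes(japanese))

# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
same_bytes = english.encode("ascii") == english.encode("utf-8")
print("ASCII bytes equal UTF-8 bytes:", same_bytes)
```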
Because of the work done with Unicode to make it international, the XML specification states that all XML processors must use Unicode internally. Unfortunately, very few of the documents in the world are encoded in Unicode. Most are encoded in ISO-8859-1, or windows-1252, or EBCDIC, or one of a large number of other character encodings. (Many of these encodings, such as ISO-8859-1 and windows-1252, are actually variants of ASCII. They are not, however, subsets of UTF-8 in the same way that "pure" ASCII is.)
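The point that ISO-8859-1 is not a subset of UTF-8 can be checked with a single accented character. This is a small sketch using Python's built-in codecs; the variable names are only for illustration.

```python
e_acute = "\u00e9"  # the character "e with acute accent"

latin1_bytes = e_acute.encode("iso-8859-1")  # one byte
utf8_bytes = e_acute.encode("utf-8")         # two bytes
print(latin1_bytes, utf8_bytes)

# The ISO-8859-1 byte on its own is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
print("ISO-8859-1 byte decodes as UTF-8:", decodes_as_utf8)

# Plain ASCII characters, by contrast, are identical in both encodings.
ascii_matches = "a".encode("iso-8859-1") == "a".encode("utf-8")
print("ASCII character matches:", ascii_matches)
```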
Specifying Character Encoding for XML
This is where the encoding attribute in our XML declaration comes in. It allows us to specify, to the XML parser, what character encoding our text is in. The XML parser can then read the document in the proper encoding, and translate it into Unicode internally. If no encoding is specified, UTF-8 or UTF-16 is assumed (parsers must support at least UTF-8 and UTF-16). If no encoding is specified, and the document is not UTF-8 or UTF-16, it results in an error.
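The effect of the encoding attribute can be seen by parsing the same ISO-8859-1 bytes with and without a declaration. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the sample element and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The element content contains one non-ASCII character, encoded as ISO-8859-1.
body = "<name>Jos\u00e9</name>".encode("iso-8859-1")

# With a declaration, the parser knows how to read the bytes.
declared = b"<?xml version='1.0' encoding='ISO-8859-1'?>" + body
parsed_text = ET.fromstring(declared).text
print(parsed_text == "Jos\u00e9")

# Without a declaration, the parser assumes UTF-8 and the byte 0xE9 is invalid.
try:
    ET.fromstring(body)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print("undeclared document accepted:", undeclared_ok)
```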
Sometimes an XML processor is allowed to ignore the encoding specified in the XML declaration. If the document is being sent via a network protocol such as HTTP, there may be protocol-specific headers which specify a different encoding than the one specified in the document. In such a case, the HTTP header would take precedence over the encoding specified in the XML declaration. However, if there are no external sources for the encoding, and the encoding specified is different from the actual encoding of the document, it results in an error.
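The precedence described above can be sketched as a small function. This is an illustration under stated assumptions, not how any particular XML processor is implemented: the names effective_encoding and charset_from_content_type are hypothetical, the regular expression only handles declarations in ASCII-compatible encodings, and the final fallback to UTF-8 ignores byte order marks.

```python
import re
from email.message import Message


def charset_from_content_type(content_type: str):
    """Extract the charset parameter from a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


# Matches the encoding attribute of a declaration at the very start of the document.
DECL_ENCODING = re.compile(
    rb"""^<\?xml[^>]*?encoding=["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def effective_encoding(xml_bytes: bytes, content_type=None) -> str:
    """Pick the encoding: protocol header first, then the declaration, then UTF-8."""
    if content_type:
        charset = charset_from_content_type(content_type)
        if charset:
            return charset
    match = DECL_ENCODING.match(xml_bytes)
    if match:
        return match.group(1).decode("ascii").lower()
    return "utf-8"


doc = b"<?xml version='1.0' encoding='windows-1252'?><name/>"
print(effective_encoding(doc, "application/xml; charset=ISO-8859-1"))
print(effective_encoding(doc, "application/xml"))
print(effective_encoding(b"<name/>"))
```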
If you're creating XML documents in Notepad on a machine running a Microsoft Windows operating system, the character encoding you are using by default is windows-1252. So the XML declarations in your documents should look like this:
<?xml version="1.0" encoding="windows-1252"?>
However, not all XML parsers understand the windows-1252 character set. If that's the case, try substituting ISO-8859-1, which happens to be very similar. Or, if your document doesn't contain any special characters (like accented characters, for example), you could use ASCII instead, or leave the encoding attribute out, and let the XML parser treat the document as UTF-8.
If you're running Windows NT or Windows 2000, Notepad also gives you the option of saving your text files in Unicode, in which case you can leave out the encoding attribute in your XML declarations.
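When a document is generated by a program rather than typed in an editor, the library can keep the declaration and the actual bytes in agreement. The following sketch assumes Python's standard xml.etree.ElementTree module; the element name and text are made up for illustration.

```python
import xml.etree.ElementTree as ET

name = ET.Element("name")
name.text = "Jos\u00e9"  # contains one accented character

# Serialize as windows-1252; for this encoding the library writes the
# matching XML declaration and encodes the text accordingly.
data = ET.tostring(name, encoding="windows-1252")
print(data)
```

In the output, the accented character is stored as the single byte 0xE9, which matches the encoding named in the declaration.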
If the standalone attribute is included in the XML declaration, it must be either yes or no.
yes specifies that this document exists entirely on its own, without depending on any other files.
no indicates that the document may depend on other files.
This little attribute actually has its own name: the Standalone Document Declaration, or SDD. The XML specification doesn't actually require a parser to do anything with the SDD. It is considered more of a hint to the parser than anything else.
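A parser can at least report what the declaration says. The sketch below assumes Python's standard xml.parsers.expat module, whose XmlDeclHandler callback receives the version, encoding, and standalone values; the helper name read_declaration is hypothetical. In this API, standalone arrives as 1 for "yes", 0 for "no", and -1 when the attribute is omitted.

```python
import xml.parsers.expat


def read_declaration(document: bytes) -> dict:
    """Parse the document and return what its XML declaration stated, if anything."""
    seen = {}

    def on_declaration(version, encoding, standalone):
        seen.update(version=version, encoding=encoding, standalone=standalone)

    parser = xml.parsers.expat.ParserCreate()
    parser.XmlDeclHandler = on_declaration
    parser.Parse(document, True)
    return seen


with_yes = read_declaration(b"<?xml version='1.0' standalone='yes'?><name/>")
with_no = read_declaration(b"<?xml version='1.0' standalone='no'?><name/>")
omitted = read_declaration(b"<?xml version='1.0'?><name/>")

print(with_yes["standalone"], with_no["standalone"], omitted["standalone"])
```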