SCDJWS Study Guide: XML Basic
The segment of an XML document between an opening tag (i.e., <tagname>) and a corresponding closing tag (i.e., </tagname>) is called an element. XML is designed to hold any kind of information in elements. An element may contain a mixture of sub-elements and PCDATA (text) between the opening and closing tags.
We usually refer to an element by its element type, which is the name used in the start/end tag pair. For example, in <state>Virginia</state> we have a state element whose content is the text Virginia.
Elements have relationships with other elements in a document. Some are parents and some are children. A child element can appear only inside its parent element, so the parent's start tag must come first. Because XML elements can contain other elements, a document can have an arbitrarily deep hierarchy of elements within elements.
The following terms are used to describe the hierarchical relationships.
- Nesting -- Refers to the process of elements containing other elements.
- Child -- A child element is an element that is contained within another element.
- Parent -- A parent element is an element that contains another element.
- Sibling -- Sibling elements are elements with the same parent.
As mentioned in the XML Syntax section, an XML document must have exactly one root element. The root element is the ultimate parent element and must contain all the other elements and data, except the XML declaration, XML comments, and certain processing instructions.
The following XML document describes a movie:
<title>Harry Potter and the Goblet of Fire</title>
The movie element is the root element. The title, release_by, directed_by, run_time, rating and actors elements are child elements of movie, and movie is their parent element. The title, release_by, directed_by, run_time, rating and actors elements are siblings (or sister elements) because they have the same parent.
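The parent, child, and sibling relationships described above can be inspected with a parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module. The document in it is a hypothetical stand-in for the movie example: only the title text comes from this guide, and every other value is a placeholder.

```python
import xml.etree.ElementTree as ET

# Hypothetical movie document. Only the title text appears in the guide;
# all other element values are placeholders.
MOVIE_XML = """
<movie>
  <title>Harry Potter and the Goblet of Fire</title>
  <release_by>placeholder</release_by>
  <directed_by>placeholder</directed_by>
  <run_time>placeholder</run_time>
  <rating>placeholder</rating>
  <actors>placeholder</actors>
</movie>
""".strip()

root = ET.fromstring(MOVIE_XML)

# Iterating over an element yields its direct children in document order.
child_names = [child.tag for child in root]

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}

print("root:", root.tag)
print("children:", child_names)
# Siblings are simply the elements that share the same parent.
print("all share one parent:", all(parent_of[c] is root for c in root))
```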
Tags are a very important part of XML. They are what you use to mark the beginning and ending of elements in your XML documents. An XML element is everything from (including) the element's start tag to (including) the element's end tag.
An element can have element content, mixed content, simple content, or empty content. Independently of its content, an element can also carry attributes.
In the example above, movie has element content, because it contains only other elements. The actors element has mixed content because it contains both text and other elements. The release_by element has simple content (or text content) because it contains only text. The reviews element has empty content, because it contains neither text nor child elements.
In the example above, only the reviews element has attributes: it carries an attribute named total.
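The four kinds of content can be told apart programmatically. The sketch below uses a hypothetical helper, content_kind, and a reduced, hypothetical version of the movie document. The attribute value and the text inside actors are placeholders, not values from the original example.

```python
import xml.etree.ElementTree as ET

# Reduced, hypothetical movie document; values other than the title are placeholders.
DOC = """
<movie>
  <title>Harry Potter and the Goblet of Fire</title>
  <actors>Starring <actor>placeholder name</actor> and others</actors>
  <reviews total="placeholder"/>
</movie>
""".strip()

def content_kind(elem):
    """Classify an element as 'element', 'mixed', 'simple', or 'empty' content."""
    has_children = len(elem) > 0
    # Text can sit before the first child (elem.text) or after any child (child.tail).
    # Whitespace-only text is ignored here, which is a simplification.
    has_text = bool((elem.text or "").strip()) or any(
        (child.tail or "").strip() for child in elem
    )
    if has_children and has_text:
        return "mixed"
    if has_children:
        return "element"
    if has_text:
        return "simple"
    return "empty"

root = ET.fromstring(DOC)
for elem in [root, root.find("title"), root.find("actors"), root.find("reviews")]:
    print(elem.tag, "->", content_kind(elem), "attributes:", elem.attrib)
```

Note that reviews is reported as empty even though it has an attribute, because attributes are not part of an element's content.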
Parsed Character Data
XML documents are read and processed by a specific piece of software called an XML parser. When a document is processed by the XML parser, each character in the document is read, or parsed, in order to create a representation of the data.
Any text that gets read by the parser is Parsed Character Data, or PCDATA. This is important because you will see the term PCDATA pop up all over. Element content is considered either other elements or PCDATA. Attribute values are parsed as well, in that entity references inside them are expanded, although a DTD declares their type as CDATA rather than PCDATA.
By definition, PCDATA is parsed, which means that the parser looks at each of the characters and tries to determine their meaning. For example, if the parser encounters a <, it knows that markup follows, such as the start tag of an element. When the parser encounters </, it knows that it has reached an end tag.
Because PCDATA is parsed, it cannot contain literal < or & characters, as these begin markup and entity references. Write them as &lt; and &amp; instead. A literal > is allowed (it must be escaped only in the sequence ]]>), although it is commonly written as &gt; for symmetry, and a / needs no escaping. For example,
<!--This is not well-formed XML!-->
<order>0 is < 1 & 1 > 0</order>
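To see the parser reject this document, and to see how escaping fixes it, here is a minimal sketch using Python's standard library. The variable names are illustrative only; escape from xml.sax.saxutils replaces &, <, and > with the corresponding entity references.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The document from the guide: the unescaped < and & make it not well-formed.
bad = "<order>0 is < 1 & 1 > 0</order>"
try:
    ET.fromstring(bad)
    parsed_ok = True
except ET.ParseError as err:
    parsed_ok = False
    print("rejected:", err)

# Escape the text before placing it inside the element.
text = "0 is < 1 & 1 > 0"
good = "<order>" + escape(text) + "</order>"
print(good)

# The parser turns the entity references back into the original characters.
order = ET.fromstring(good)
print(order.text)
```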
If we're going to be creating elements we're going to have to give them names, and XML is very generous in the names we're allowed to use. For example, there aren't any reserved words to avoid in XML, as there are in most programming languages, so we have a lot of flexibility in this regard.
However, there are some rules that we must follow:
- Names can contain letters, numbers, and other characters.
- Names can start with letters (including non-Latin characters) or the "_" character, but not numbers or other punctuation characters.
- After the first character, numbers are allowed, as are the characters "-" and ".".
- Names can't contain spaces.
- Names shouldn't contain the ":" character. Strictly speaking, this character is allowed, but the XML specification says that it's "reserved". You should avoid using it in your documents, unless you are working with namespaces.
- Names can't start with the letters "xml", in uppercase, lowercase, or mixed case: you can't start a name with "xml", "XML", "XmL", or any other combination.
- There can't be a space after the opening "<" character; the name of the element must come immediately after it. However, there can be space before the closing ">" character, if desired.
- Names are case sensitive.
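The rules above can be approximated with a short check. The sketch below is an illustration only: the function is_valid_name is hypothetical, the regular expression is looser than the full Name production in the XML specification, and it rejects ":" outright, following this guide's advice for documents that don't use namespaces.

```python
import re

# First character: a letter (any script) or "_".
# Remaining characters: letters, digits, "_", "-", or ".".
# This is an approximation of the XML Name production, not the exact grammar.
_NAME_PATTERN = re.compile(r"[^\W\d][\w.\-]*")

def is_valid_name(name):
    """Return True if name follows the simplified naming rules in this guide."""
    if name.lower().startswith("xml"):
        return False  # names beginning with "xml" in any case are reserved
    return _NAME_PATTERN.fullmatch(name) is not None

for candidate in ["first_name", "_id", "first-name", "a.b", "1st",
                  "first name", "XmL_data", "ns:item"]:
    print(candidate, "->", is_valid_name(candidate))

# Names are case sensitive, so these are two different element names.
print("Item" == "item")
```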
Good practice for element names builds on these simple rules:
- Any name can be used, no words are reserved, but the idea is to make names descriptive. Names with an underscore separator are nice. Examples: <first_name>, <last_name>.
- Avoid "-" and "." in names. It could be a mess if your software tried to subtract name from first (first-name) or think that "name" is a property of the object "first" (first.name).
- Element names can be as long as you like, but don't exaggerate. Names should be short and simple, like this: <book_title> not like this: <the_title_of_the_book>.
- XML documents often have a corresponding database, in which fields exist corresponding to elements in the XML document. A good practice is to use the naming rules of your database for the elements in the XML documents.
- Non-English letters like é, ò, or á are perfectly legal in XML element names, but watch out for problems if your software vendor doesn't support them.
- The ":" should not be used in element names because it is reserved to be used for something called namespaces (more later).
If an element contains no subelements or character data, that element is said to be "empty." In most cases, an empty element will contain an attribute-value pair inside of a single tag that is "terminated" by a forward slash before its closing bracket. The slash before the ending bracket serves the same function as an end tag's forward slash. The special empty element syntax is <tagname/>.
An element containing nothing more than attributes is still considered "empty" and "without content" because attribute values count as markup, not character data. For example:
<item id="1234" category="food"/>
Technically, an empty element can also be expressed using element start and end tags. For the above example, <item id="1234" category="food"></item> is also correct syntax.
Recall from our discussion of element names that the only place we can have a space within the tag is before the closing ">". This rule is slightly different when it comes to empty elements. The "/" and ">" characters always have to be together, so you can create an empty element like this:
<item />
but not like these:
<item/ >
<item / >
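The whitespace rule for empty-element tags can be checked against a real parser. This is a minimal sketch, assuming Python's standard xml.etree.ElementTree module; the helper is_well_formed is a hypothetical name introduced for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text as a complete document."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Space is allowed before "/>", but never between "/" and ">",
# and never directly after the opening "<".
candidates = ["<item/>", "<item />", "<item/ >", "<item / >", "< item/>"]
results = {c: is_well_formed(c) for c in candidates}

for candidate, ok in results.items():
    print(repr(candidate), "->", "well-formed" if ok else "rejected")
```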
Empty elements really don't buy you anything - except that they take less typing - so you can use them, or not, at your discretion. Keep in mind, however, that as far as XML is concerned <item></item> is exactly the same as <item/>; for this reason, XML parsers will sometimes change your XML from one form to the other. You should never count on your empty elements being in one form or the other, but since they're syntactically exactly the same, it doesn't matter.
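The equivalence of the two forms is easy to observe with a parser. In the sketch below, which assumes Python's standard xml.etree.ElementTree module, both spellings produce the same in-memory element, and the serializer picks one form on output regardless of how the input was written.

```python
import xml.etree.ElementTree as ET

long_form = ET.fromstring('<item id="1234" category="food"></item>')
short_form = ET.fromstring('<item id="1234" category="food"/>')

# After parsing, nothing records which spelling the source used.
same_tag = long_form.tag == short_form.tag
same_attributes = long_form.attrib == short_form.attrib
both_empty = (long_form.text is None and short_form.text is None
              and len(long_form) == 0 and len(short_form) == 0)

# Serializing both elements yields identical output.
same_output = ET.tostring(long_form) == ET.tostring(short_form)

print(same_tag, same_attributes, both_empty, same_output)
print(ET.tostring(long_form).decode())
```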
Interestingly, nobody in the XML community seems to mind the empty element syntax, even though it doesn't add anything to the language. This is especially interesting considering the passionate debates that have taken place on whether attributes are really necessary.
One place where empty elements are very often used is for elements that have no (or optional) PCDATA, but instead have all of their information stored in attributes. So if we rewrote our <item> example without child elements, instead of a start-tag and end-tag we would probably use an empty element, like this:
<item id="1234" category="food"/>
Another common example is the case where just the element name is enough; for example, the HTML <BR> tag might be converted to an XML empty element, such as the XHTML <br/> tag. (XHTML is the latest "XML-compliant" version of HTML.)
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
When do we use empty elements?
- Element has no data other than attributes
- Used as placeholders for attributes
- Marking a point in the document rather than enclosing content, such as a line break (<br/>)