SCDJWS Study Guide: XML Basic



XML Introduction

XML is a general-purpose markup language recommended by the W3C. It was originally envisioned as a language for defining new document formats for the World Wide Web, and it was designed to facilitate the sharing of data across different systems, particularly systems connected via the Internet. XML is a subset of the Standard Generalized Markup Language (SGML) and can therefore be considered a meta-language: a language for defining markup languages.


SGML is the Standard Generalized Markup Language (ISO 8879:1986), the international standard for defining descriptions of the structure of different types of electronic documents.

SGML is very large, powerful, and complex. It has been in heavy industrial and commercial use for nearly two decades, and there is a significant body of expertise and software to go with it.

XML is a lightweight cut-down version of SGML which keeps enough of its functionality to make it useful but removes all the optional features which made SGML too complex to program for in a Web environment.

XML is a derivative of SGML, but it is more restrictive than SGML:

  • XML provides a dramatic improvement in the ease of writing programs that can parse documents written in XML-derived markup languages.
  • XML greatly simplifies the task of creating custom markup languages that are meaningful to one's own enterprise.
  • XML-derived markup languages are slightly less expressive than SGML-based languages.
  • XML-derived markup languages are somewhat wordier than SGML-based languages.
  • XML-derived markup languages are less forgiving of syntactical variances than SGML-based languages.

Markup Languages

A markup language is merely a set of conventions for denoting which parts of a document should be treated differently from other parts.

Historically, that goal has been achieved for written documents by using different styles for different parts of the document. For example, the title can be printed or displayed centered on a page, in bold face type, while the body of a document can be presented or displayed as a continuous stream of text, separated from the title by one or more blank "lines".

This technique works well when there's a human around to look at the document, and when that human understands that a string of text in bold letters at the top of a page should be understood to be the document title.

A machine, of course, has no such understanding. Someone must write a program to parse the document and somehow identify which parts are the title and which parts are the body text. Every new style requires a new program, or a new fix to the old program. This scenario quickly becomes unmanageable, ruling out all but the simplest operations on documents.

A markup language embeds special "tags" in text that help programs identify the various parts of a document.

Here, for example, is a recipe marked up with a whimsical markup language:

A recipe written in the "Whimsical Markup Language":

title{Chocolate Cake}
ingredient{howmuch{3 tablespoons}what{Dark Chocolate}}
ingredient{howmuch{1 pound}what{Flour}}
ingredient{howmuch{2 cups}what{Milk}}
ingredient{howmuch{1 dozen}what{Eggs}}
instructions{Mix, beat, bake at 325 degrees for 40 minutes.}

Note that the marked-up document shown here is NOT in any official markup language, such as LaTeX, RTF, or HTML. It is just an example of a document whose component parts have been distinguished using a simple and obvious syntax.
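To illustrate how embedded tags let a program find the parts of a document, here is a minimal sketch of a parser for the whimsical recipe syntax. The function name parse_whimsical and the rule that every item has the form name{content} are assumptions made for this illustration, since the whimsical language has no real specification.

```python
def parse_whimsical(text):
    """Parse name{...} items into a list of (name, content) pairs.

    Content is a plain string when the braces hold only text, or a nested
    list of pairs when they hold further name{...} items.
    """
    pos = 0

    def parse_content():
        nonlocal pos
        children = []
        buffer = []
        while pos < len(text):
            ch = text[pos]
            pos += 1
            if ch == "{":
                # The text collected so far is the tag name
                name = "".join(buffer).strip()
                buffer = []
                children.append((name, parse_content()))
            elif ch == "}":
                break
            else:
                buffer.append(ch)
        # No nested tags were found: the content is the plain text itself
        return children if children else "".join(buffer).strip()

    return parse_content()


SAMPLE = """title{Chocolate Cake}
ingredient{howmuch {3 tablespoons}what{Dark Chocolate}}
instructions{Mix, beat, bake at 325 degrees for 40 minutes.}
"""

recipe = parse_whimsical(SAMPLE)
for name, content in recipe:
    print(name, "->", content)
```

Because the tags name each part explicitly, the program never has to guess from layout which text is the title and which is an ingredient.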


A markup language is a mechanism for identifying structures in a document, and the XML specification defines a standard way to add markup to documents. Strictly speaking, XML is a set of abstract rules for building markup languages rather than a single markup language with a fixed vocabulary. XML was designed to describe data: its tags are not predefined, so you must define your own. A DTD (Document Type Definition) or an XML schema describes the structure of that data, and XML combined with a DTD or XML schema is designed to be self-descriptive.
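As a small sketch of what "define your own tags" can look like, the recipe from the earlier example could be written in XML and read with Python's standard xml.etree.ElementTree module. The element and attribute names (recipe, title, ingredient, howmuch, instructions) are invented for this illustration and are not part of XML itself.

```python
import xml.etree.ElementTree as ET

# A recipe vocabulary invented for this example; XML predefines none of these tags.
RECIPE_XML = """<recipe>
  <title>Chocolate Cake</title>
  <ingredient howmuch="3 tablespoons">Dark Chocolate</ingredient>
  <ingredient howmuch="1 pound">Flour</ingredient>
  <instructions>Mix, beat, bake at 325 degrees for 40 minutes.</instructions>
</recipe>"""

root = ET.fromstring(RECIPE_XML)

# The same generic parser handles any vocabulary; only the tag names change.
title = root.findtext("title")
ingredients = [(item.get("howmuch"), item.text) for item in root.findall("ingredient")]

print(title)
for amount, what in ingredients:
    print(amount, "of", what)
```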

Structure and Entities

XML data is represented and exchanged between software applications in units called XML documents. An XML document is made up of declarations, elements, attributes, text data, comments and other components. Each of these components will be described in more detail in other sections.

XML documents have both logical and physical structure. The logical structure is simply the elements (and attributes) in the document and their order. Logically, the document is composed of declarations, elements, comments, character references, and processing instructions, all of which are indicated in the document by explicit markup.
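The logical components listed above can be seen by walking a parsed document. The following sketch uses Python's standard xml.dom.minidom module; the sample document, including the render processing instruction and the note element, is made up for this illustration.

```python
from xml.dom import minidom, Node

# Hypothetical document containing a declaration, a processing instruction,
# a comment, an element with an attribute, and a character reference (&#65; is "A").
DOC = (
    '<?xml version="1.0"?>'
    '<?render mode="print"?>'
    '<!-- a comment -->'
    '<note priority="high">Call &#65;lice</note>'
)

dom = minidom.parseString(DOC)

# Classify the top-level nodes of the document in order.
kinds = []
for node in dom.childNodes:
    if node.nodeType == Node.PROCESSING_INSTRUCTION_NODE:
        kinds.append(("processing-instruction", node.target))
    elif node.nodeType == Node.COMMENT_NODE:
        kinds.append(("comment", node.data.strip()))
    elif node.nodeType == Node.ELEMENT_NODE:
        kinds.append(("element", node.tagName))

note = dom.documentElement
priority = note.getAttribute("priority")
# Join the text children; the parser has already resolved the character reference.
text = "".join(c.data for c in note.childNodes if c.nodeType == Node.TEXT_NODE)

print(kinds)
print(priority, text)
```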

XML documents use storage units called entities to arrange physical structures to produce a logical structure. Entities define blocks of text for reuse in documents or in DTDs, and include data from other storage units (such as files). Every entity is either internal or external. An internal entity is defined in a document's prolog (along with or within the DTD), and is not associated with any external file or data source. An external entity is also defined in the prolog, but depends on some external file or data source. Other characteristics also determine an entity's type, such as whether it is parsed or unparsed and whether it is a general entity or a parameter entity.
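As a minimal sketch of an internal general entity, the document below declares an entity in the DTD's internal subset and references it twice. The entity name company and the letter element are assumptions made for this example; external entities are not exercised here.

```python
import xml.etree.ElementTree as ET

# The internal entity "company" is declared in the prolog and reused in the body.
DOC = (
    '<!DOCTYPE letter ['
    '  <!ENTITY company "Example Widgets">'
    ']>'
    '<letter>Welcome to &company;! &company; thanks you.</letter>'
)

letter = ET.fromstring(DOC)

# The parser replaces each &company; reference with the declared text.
body = letter.text
print(body)
```

The physical structure (a declaration plus two references) differs from the logical structure the application sees, which is a single element containing the expanded text.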

Each XML document has a special text entity called the document entity or root entity. All entities referred to directly or indirectly from the root entity are regarded as parts of the physical structure of the document.

The developers of XML introduced a distinction between "well-formed" documents (which follow the XML syntax) and "valid" documents (which are well-formed and whose markup also follows the rules of a particular language developed from XML, typically stated in a DTD or schema). The concept of a merely "well-formed" document greatly eased the burden on the document writer, and may be the single most important reason for XML's acceptance. A well-formed XML document is one from which an XML processor can successfully build a tree structure.
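One way to sketch a well-formedness check is simply to ask a parser to build the tree and see whether it succeeds. The helper name is_well_formed is hypothetical. The parser in Python's standard library is non-validating, so this checks well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if a tree can be built from xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


# Properly nested tags: a tree can be built.
good = is_well_formed("<cake><title>Chocolate Cake</title></cake>")

# Overlapping tags: the closing tags are in the wrong order, so parsing fails.
bad = is_well_formed("<cake><title>Chocolate Cake</cake></title>")

print(good, bad)
```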
