XML Syntax & Well-Formed Documents
XML is a markup language for structured data. Unlike HTML, which browsers parse forgivingly, XML is strict: one misplaced character and the parser rejects the entire document. This lesson teaches the syntax rules, document structure, and the mental model that makes XML predictable and powerful.
What XML Actually Is
XML (eXtensible Markup Language) is a W3C standard for encoding documents and data in a format that is both human-readable and machine-parsable. Many widely used document formats, feeds, and messaging protocols are built on it.
Unlike HTML, XML has no predefined tags. You create your own vocabulary: <invoice>, <employee>, <sensor-reading> — whatever describes your data domain. This extensibility is the "X" in XML.
XML is self-describing. The tag names tell you what the data means. <price currency="USD">49.99</price> communicates both the value and its context without external documentation.
XML is also a meta-language — a language for defining other languages. RSS, SVG, XHTML, SOAP, XSD, XSLT, MathML, Android layouts, and dozens of other standards are all XML vocabularies.
The Anatomy of an XML Document
<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<catalog>
  <book id="bk101" genre="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <price currency="USD">12.99</price>
    <publish_date>1925-04-10</publish_date>
    <description>A portrait of the Jazz Age in all of its decadence.</description>
  </book>
  <book id="bk102" genre="non-fiction">
    <title>Thinking, Fast and Slow</title>
    <author>Daniel Kahneman</author>
    <price currency="USD">16.99</price>
    <publish_date>2011-10-25</publish_date>
  </book>
</catalog>
Components Explained
XML declaration — <?xml version="1.0" encoding="UTF-8"?> — optional but strongly recommended. Declares the XML version and character encoding. Always use UTF-8 unless you have a specific reason not to.
Elements — <book>...</book> — the fundamental building blocks. An element has an opening tag, content (text, child elements, or both), and a closing tag. Empty elements can self-close: <br/>, <image src="logo.png"/>.
Attributes — id="bk101" — key-value pairs inside the opening tag. Attribute values MUST be quoted (single or double). Use attributes for metadata about the element; use child elements for the data itself.
Text content — the actual data between tags. <title>The Great Gatsby</title> — "The Great Gatsby" is the text content.
Comments — <!-- comment --> — ignored by parsers. Cannot contain double hyphens (--) anywhere inside.
CDATA sections — <![CDATA[raw content with <special> characters & symbols]]> — escape blocks where the parser treats everything as literal text. Essential for embedding code snippets or HTML within XML without escaping every bracket.
Processing instructions — <?xml-stylesheet type="text/xsl" href="transform.xsl"?> — instructions for the processing application, not part of the XML data itself.
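As a rough illustration of how these components surface in code, the sketch below parses a shortened copy of the catalog with Python's standard-library `xml.etree.ElementTree`. The choice of Python and the `summary` dictionary are assumptions made for this example; any conforming XML parser exposes the same pieces.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the catalog document shown earlier.
CATALOG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!-- This is a comment -->
<catalog>
  <book id="bk101" genre="fiction">
    <title>The Great Gatsby</title>
    <price currency="USD">12.99</price>
  </book>
</catalog>"""

root = ET.fromstring(CATALOG)   # the root element: <catalog>
book = root.find("book")        # first <book> child element

summary = {
    "root": root.tag,                                # element name
    "id": book.get("id"),                            # attribute value
    "genre": book.get("genre"),                      # attribute value
    "title": book.findtext("title"),                 # text content
    "price": book.findtext("price"),                 # text content
    "currency": book.find("price").get("currency"),  # attribute on a child
}
print(summary)
```

Note that the comment does not appear anywhere in the parsed tree: with its default settings, this parser discards comments.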
Well-Formedness Rules
Well-formedness is the minimum bar for any XML document. A parser will reject the entire document if any rule is violated — no partial parsing, no error recovery.
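The all-or-nothing behavior is easy to observe. The following is a minimal sketch using Python's standard-library parser; the helper name `is_well_formed` and the sample strings are made up for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the whole document, False otherwise."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True

# Hypothetical inputs: one good document and several broken ones.
samples = {
    "ok": "<a><b>text</b></a>",
    "two roots": "<a/><b/>",
    "unclosed tag": "<a><b></a>",
    "bad nesting": "<a><b></a></b>",
    "unquoted attribute": "<book id=bk101/>",
    "duplicate attribute": '<book id="1" id="2"/>',
    "case mismatch": "<Book></book>",
    "bare ampersand": "<name>Tom & Jerry</name>",
}

results = {label: is_well_formed(doc) for label, doc in samples.items()}
for label, accepted in results.items():
    print(f"{label}: {'accepted' if accepted else 'rejected'}")
```

Only the first sample is accepted. For the others the parser raises an error and returns no tree at all, not even for the portion before the mistake.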
1. Exactly one root element. The entire document must be wrapped in a single top-level element. A document with two root elements is not well-formed.
2. Every opening tag must have a corresponding closing tag (or be self-closing). <name>Alice</name> or <br/>.
3. Tags must be properly nested. <a><b></a></b> is not well-formed. The last-opened tag must be the first to close. <a><b></b></a> is well-formed.
4. Attribute values must be quoted. Both id="bk101" and id='bk101' work; id=bk101 is not well-formed.
5. Attribute names must be unique within an element. <book id="1" id="2"> is not well-formed.
6. Element and attribute names are case-sensitive. <Book> and <book> are completely different elements.
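7. Special characters must be escaped. In text content, & and < must always be escaped; > must be escaped when it follows ]]; the quote characters need escaping only inside an attribute value delimited by the same quote. Escaping all five everywhere is always safe:
| Character | Escaped form |
|---|---|
| & | &amp; |
| < | &lt; |
| > | &gt; |
| " | &quot; |
| ' | &apos; |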
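In practice, escaping is usually delegated to a library rather than done by hand. The sketch below uses `escape` and `quoteattr` from Python's standard-library `xml.sax.saxutils`, and also shows that a CDATA section yields the same text after parsing. The `note` attribute and the sample strings are assumptions made for this example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

raw_text = 'if (x < 10 && y > 0) { alert("valid"); }'
raw_attr = 'Tom & "Jerry"'

# escape() rewrites &, < and > for use in text content.
# quoteattr() escapes the value and wraps it in suitable quotes.
escaped_doc = f"<script note={quoteattr(raw_attr)}>{escape(raw_text)}</script>"

# The same text inside a CDATA section needs no escaping,
# provided it never contains the sequence ]]>.
cdata_doc = f"<script><![CDATA[{raw_text}]]></script>"

escaped_root = ET.fromstring(escaped_doc)
cdata_root = ET.fromstring(cdata_doc)

print(escaped_doc)
print(escaped_root.text == cdata_root.text == raw_text)
```

Both documents parse back to the original, unescaped string, so the choice between entity escaping and CDATA is one of readability in the source, not of meaning.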
Well-Formed vs. Valid
These two terms mean different things in XML:
Well-formed — follows the syntax rules above. Every XML parser checks this before doing anything else. Fail and the entire document is rejected.
Valid — well-formed AND conforms to a schema (DTD, XSD, or Relax NG) that defines the allowed structure. Validation is optional but critical for data interchange — it is a contract between two systems that says "here is exactly what this document may contain."
A document can be well-formed but not valid (it follows XML syntax but not the agreed structure). A document cannot be valid without being well-formed first.
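The distinction can be sketched in code. Python's standard library checks well-formedness but does not validate against DTD, XSD, or RELAX NG, so the example below substitutes a hand-written structural check for a real schema. The `REQUIRED_CHILDREN` contract and the `check_catalog` helper are hypothetical and stand in for whatever the two systems have actually agreed on.

```python
import xml.etree.ElementTree as ET

# Hypothetical contract: every <book> must contain these child elements.
REQUIRED_CHILDREN = ("title", "author", "price")

def check_catalog(text: str) -> tuple[bool, list[str]]:
    """Return (well_formed, problems); problems lists contract violations."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError as exc:
        # Not well-formed: there is no tree to check against the contract.
        return False, [f"not well-formed: {exc}"]

    problems = []
    if root.tag != "catalog":
        problems.append(f"root element is <{root.tag}>, expected <catalog>")
    for book in root.findall("book"):
        for name in REQUIRED_CHILDREN:
            if book.find(name) is None:
                problems.append(f"book {book.get('id')!r} is missing <{name}>")
    return True, problems

complete = ('<catalog><book id="bk101"><title>T</title>'
            '<author>A</author><price>1</price></book></catalog>')
incomplete = '<catalog><book id="bk102"><title>T</title><author>A</author></book></catalog>'
broken = '<catalog><book id="bk103"></catalog>'

for doc in (complete, incomplete, broken):
    print(check_catalog(doc))
```

The second document is well-formed yet fails the contract; the third never reaches the contract check because the parser rejects it first. A real project would replace the hand-written check with a schema and a validating parser.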
Elements vs. Attributes
The design decision of what to put in elements versus attributes comes up in every XML schema design.
Use elements for:
- Data that has structure or may need child elements later
- Data you will query, sort, or aggregate
- Data that could have multiple values
- The core content of the document
Use attributes for:
- Metadata that describes the element (IDs, types, categories)
- Simple, single values that will not expand
- Lookup keys and identifiers
A common example of this trade-off: <price currency="USD">12.99</price> vs. <price><value>12.99</value><currency>USD</currency></price>. Both are well-formed XML. The attribute version is more concise; the element version is more extensible (you could add <exchange_rate> later without changing the outer structure). In practice: attributes for simple metadata, elements for data.
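To make the trade-off concrete, here is a small sketch that reads the price from either layout and then extends the element-based one. The `read_price` helper is hypothetical, and the exchange rate value is an arbitrary placeholder.

```python
import xml.etree.ElementTree as ET

attr_style = ET.fromstring('<price currency="USD">12.99</price>')
elem_style = ET.fromstring(
    "<price><value>12.99</value><currency>USD</currency></price>"
)

def read_price(price: ET.Element) -> tuple[float, str]:
    """Return (amount, currency) from either layout."""
    currency = price.get("currency")
    if currency is not None:
        # Attribute layout: amount is the element's own text.
        return float(price.text), currency
    # Element layout: amount and currency are child elements.
    return float(price.findtext("value")), price.findtext("currency")

# Extending the element layout: add a child without touching existing ones.
ET.SubElement(elem_style, "exchange_rate").text = "0.92"  # placeholder value

print(read_price(attr_style))
print(read_price(elem_style))
print(ET.tostring(elem_style, encoding="unicode"))
```

Code that reads `<value>` and `<currency>` keeps working after `<exchange_rate>` is added, which is the extensibility argument for the element layout.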
Encoding and Character Sets
Always declare encoding in the XML declaration. UTF-8 handles virtually all characters across all languages and is the correct choice for almost every new XML document.
<?xml version="1.0" encoding="UTF-8"?>
If your source data is in a different encoding (Windows-1252, ISO-8859-1, etc.), either declare it correctly or convert to UTF-8 before processing. Encoding mismatches cause parser errors that are frustrating to debug.
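A mismatch of this kind can be reproduced in a few lines. The sketch below encodes the same element as ISO-8859-1 bytes and parses it under a correct and an incorrect declaration; the sample text and variable names are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

body = "<name>Café</name>"
latin1_bytes = body.encode("iso-8859-1")  # é becomes the single byte 0xE9

declared_correctly = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes
declared_wrongly = b'<?xml version="1.0" encoding="UTF-8"?>' + latin1_bytes
converted = b'<?xml version="1.0" encoding="UTF-8"?>' + body.encode("utf-8")

# Declaration matches the bytes: parses as expected.
correct_text = ET.fromstring(declared_correctly).text

# Declaration says UTF-8 but the bytes are not valid UTF-8: parser error.
try:
    ET.fromstring(declared_wrongly)
    mismatch_rejected = False
except ET.ParseError as exc:
    mismatch_rejected = True
    print("rejected:", exc)

# Converting to UTF-8 first also works.
converted_text = ET.fromstring(converted).text

print(correct_text, converted_text, mismatch_rejected)
```

The error message points at a byte position rather than naming the encoding as the cause, which is part of why these problems are hard to diagnose.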
XML as a Specification Language
XML also works as a specification format for data structures. An XML Schema is literally a specification: it defines the allowed elements, their types, their constraints, and their relationships. The same rigor that makes a good design document makes a good XML schema.
See [DTD & XML Schema Validation](/tutorials/xml-fundamentals/dtd-xsd-validation) for how to write those specifications, and [Namespaces & Modular XML](/tutorials/xml-fundamentals/namespaces) for how to combine multiple XML vocabularies without naming conflicts.
Example
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
  <book id="bk101" genre="fiction">
    <title>The Great Gatsby</title>
    <author>F. Scott Fitzgerald</author>
    <price currency="USD">12.99</price>
    <publish_date>1925-04-10</publish_date>
  </book>
  <book id="bk102" genre="non-fiction">
    <title>Thinking, Fast and Slow</title>
    <author>Daniel Kahneman</author>
    <price currency="USD">16.99</price>
  </book>
</catalog>
<!-- Separate fragment, not part of the catalog document above: self-closing empty element -->
<image src="cover.jpg" width="200" height="300"/>
<!-- Separate fragment: CDATA for embedded code -->
<script><![CDATA[
if (x < 10 && y > 0) { alert("valid"); }
]]></script>