Text and Markup
This part of the HTML reference is
an explanation of SGML syntax as it applies to HTML. For lexical
issues, the purpose is to take the standard and reduce it from the
abstract system that is SGML to a concrete language, HTML. For
structural issues, the purpose is to give you enough background to
read the DTD.
Structured Text
An HTML document is a hierarchy of elements. Each element has a name,
some attributes, and some content. Most elements are represented in
the document as a start tag, which gives the name and attributes,
followed by the content, followed by the end tag. For example:
<HTML>
<TITLE>
A sample HTML document
</TITLE>
<H1>
An Example of Structure
</H1>
Here's a typical paragraph.
<P>
<UL>
<LI>
Item one has an
<A NAME=anchor>
anchor
</A>
<LI>
Here's item two.
</UL>
</HTML>
Some elements (e.g. P, LI) are "empty." They have no content. They
show up as just a start tag.
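As a rough illustration, the sample document can be run through a tokenizer to see which elements appear as a start tag only. This sketch uses Python's standard html.parser module, which is an HTML tokenizer rather than a conforming SGML parser; the class name TagCollector is hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser

# The sample document from the text above.
SAMPLE = """<HTML>
<TITLE>
A sample HTML document
</TITLE>
<H1>
An Example of Structure
</H1>
Here's a typical paragraph.
<P>
<UL>
<LI>
Item one has an
<A NAME=anchor>
anchor
</A>
<LI>
Here's item two.
</UL>
</HTML>"""


class TagCollector(HTMLParser):
    """Records which element names occur in start tags and in end tags."""

    def __init__(self):
        super().__init__()
        self.started = set()
        self.ended = set()

    def handle_starttag(self, tag, attrs):
        # html.parser reports element names in lower case.
        self.started.add(tag)

    def handle_endtag(self, tag):
        self.ended.add(tag)


collector = TagCollector()
collector.feed(SAMPLE)
collector.close()

# Elements that show up as a start tag with no matching end tag.
start_only = sorted(collector.started - collector.ended)
print(start_only)
```

In this sample, the elements left over are exactly the ones the text calls empty.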
For the rest of the elements, the content is a sequence of data
characters and nested elements. The content must match the element's
model group from its declaration in the
DTD.
Using the example from above, the content of the UL element is the
sequence "LI, #PCDATA, A, LI, #PCDATA". This matches the model group
from the UL element declaration: "(#PCDATA|LI|A)+".
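The check described above can be sketched in a few lines. This is only an illustration for a repeated "or" group such as the one quoted for UL; it is not a general SGML model group matcher, and the names UL_MODEL and matches_or_group are hypothetical.

```python
# Members of the model group (#PCDATA|LI|A)+ quoted in the text.
UL_MODEL = {"#PCDATA", "LI", "A"}


def matches_or_group(content, allowed=UL_MODEL):
    """Return True if content is non-empty and every item is in the group.

    The trailing "+" means one or more occurrences, so an empty
    sequence does not match.
    """
    return len(content) > 0 and all(item in allowed for item in content)


# The content of the UL element in the sample document.
ul_content = ["LI", "#PCDATA", "A", "LI", "#PCDATA"]
print(matches_or_group(ul_content))
```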
Parsing Content Into Data and Markup
An HTML document is like a text file, except that some of the
characters are interpreted as markup, rather than document content.
The following table lists the special character sequences that
separate data from markup in an HTML document.
- CRO
- Character Reference Open: "&#", when followed by a
letter or a digit, signals a character reference. SGML idioms include
things like "&#168;" and "&#SPACE;". It is not used in HTML.
- ERO
- Entity Reference Open: "&", when followed by a letter,
signals an entity reference.
- ETAGO
- End Tag Open: "</", when followed by
a letter, signals an end tag.
- MDO
- Markup Declaration Open: "<!", when followed by a
letter or "--" or "[", signals one of several SGML markup
declarations. The only purpose it serves in HTML is to introduce comments.
- MSC
- Marked Section Close: "]]", when followed by ">" signals
the end of a marked section. While marked sections are not used
by HTML, this sequence of characters is recognized and reported as an
error by conforming SGML parsers.
- PIO
- Processing Instruction Open: "<?" signals a processing instruction. It is not used
in HTML.
- STAGO
- Start Tag Open: "<", when followed by a letter,
signals a start tag.
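The contextual rules in the table can be sketched as a small scanner. This is a simplified illustration, not a conforming SGML parser: it looks only at the characters that follow each delimiter and ignores where in the document the delimiter occurs. The names DELIMITERS, SCANNER, and find_delimiters are hypothetical.

```python
import re

# Each delimiter from the table, with a lookahead for the context in which
# it is recognized. "&#" is listed before "&" so the longer delimiter wins.
DELIMITERS = [
    ("CRO", r"&#(?=[A-Za-z0-9])"),
    ("ERO", r"&(?=[A-Za-z])"),
    ("ETAGO", r"</(?=[A-Za-z])"),
    ("MDO", r"<!(?=[A-Za-z]|--|\[)"),
    ("MSC", r"\]\](?=>)"),
    ("PIO", r"<\?"),
    ("STAGO", r"<(?=[A-Za-z])"),
]

SCANNER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in DELIMITERS))


def find_delimiters(text):
    """Return (delimiter name, offset) pairs for each delimiter recognized."""
    return [(match.lastgroup, match.start()) for match in SCANNER.finditer(text)]


print(find_delimiters("<P>a &amp; b</P>"))
print(find_delimiters("1 < 2 & 3"))
```

The second call finds nothing, because "<" and "&" followed by a space are treated as data.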
In the DTD, the symbol PCDATA stands
for parsed character data, the normal text characters in an HTML
document.
The text consists of a stream of lines. The division into lines has no
significance apart from marking the end of a word.
All of the SGML delimiters listed in the table above are recognized in PCDATA.
In the DTD, the symbol CDATA stands
for character data, the text without markup in an SGML document. Only
the end tag open delimiter (ETAGO) is
recognized in CDATA.
The characters in an SGML document are organized into a hierarchy of
elements by the use of tags. Tags are set off from the data characters
by angle brackets: '<' and '>'.
Names
The element name immediately follows "<". Names consist of a letter
followed by up to 33 letters, digits, periods, or hyphens. Names are
not case sensitive.
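The rule for names translates directly into a pattern. The sketch below is illustrative only; the helper names is_valid_name and same_name are hypothetical.

```python
import re

# A letter, followed by up to 33 letters, digits, periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,33}")


def is_valid_name(name):
    """Return True if name follows the rule stated above."""
    return NAME_RE.fullmatch(name) is not None


def same_name(a, b):
    """Compare two names without regard to case."""
    return a.upper() == b.upper()


print(is_valid_name("H1"), is_valid_name("1H"), same_name("title", "TITLE"))
```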
Attributes
Following the element name, whitespace and attributes are allowed. An
attribute consists of a name, an equal sign, and a value. Spaces are
allowed around the equal sign.
The value is either a token or a literal. A token is up to 34 letters,
digits, periods, or hyphens. Tokens are case sensitive.
A literal is a string surrounded by single quotes or a string
surrounded by double quotes. Entity references are processed inside
attribute values as inside PCDATA. The length of an attribute value
(after entity processing) is limited to 1024 characters.
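A minimal sketch of reading attributes under these rules follows. It assumes the input is the text between the element name and the closing '>'. Entity processing in literals is approximated with Python's html.unescape, which is more lenient than an SGML parser, and the names ATTR_RE and parse_attributes are hypothetical.

```python
import html
import re

ATTR_RE = re.compile(
    r"""
    ([A-Za-z][A-Za-z0-9.-]{0,33})   # attribute name
    \s*=\s*                         # equal sign, spaces allowed around it
    (?:
        "([^"]*)"                   # literal in double quotes
      | '([^']*)'                   # literal in single quotes
      | ([A-Za-z0-9.-]{1,34})(?![A-Za-z0-9.-])   # bare token, at most 34 characters
    )
    """,
    re.VERBOSE,
)

MAX_VALUE_LENGTH = 1024  # limit after entity processing, as stated above


def parse_attributes(text):
    """Return a dict mapping upper-cased attribute names to their values."""
    attributes = {}
    for match in ATTR_RE.finditer(text):
        name = match.group(1).upper()  # names are not case sensitive
        if match.group(4) is not None:
            value = match.group(4)  # tokens are kept exactly as written
        else:
            raw = match.group(2) if match.group(2) is not None else match.group(3)
            value = html.unescape(raw)  # approximation of entity processing
        if len(value) > MAX_VALUE_LENGTH:
            raise ValueError(f"value of {name} exceeds {MAX_VALUE_LENGTH} characters")
        attributes[name] = value
    return attributes


print(parse_attributes('NAME=anchor HREF = "a&amp;b"'))
```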
Each attribute has a type, which puts constraints on the values it can
have. For example, the NAME attribute of the A element is an ID. An ID
is a name that must be unique among all IDs in the document.
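The uniqueness constraint on IDs can be checked with a simple count. This sketch assumes the ID values have already been collected from the document, and it folds case on the grounds that an ID is a name and names are not case sensitive. The function name duplicate_ids is hypothetical.

```python
from collections import Counter


def duplicate_ids(id_values):
    """Return the IDs, upper-cased and sorted, that occur more than once."""
    counts = Counter(value.upper() for value in id_values)
    return sorted(name for name, count in counts.items() if count > 1)


print(duplicate_ids(["anchor", "intro", "ANCHOR"]))
```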
In order to include characters that would otherwise be parsed as
markup, you can use entity references to refer to some of those
characters.
An entity reference is an ampersand, followed by a name, followed by a
semicolon. No spaces are allowed within an entity reference. For
example:
This is how you include a &lt;tag&gt; as data.
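A tool that generates HTML can produce such references mechanically. The sketch below assumes the entity names amp, lt, and gt for the ampersand and the two angle brackets; the names ESCAPES and escape_data are hypothetical.

```python
# Characters that would otherwise be parsed as markup, with the entity
# reference used in place of each one.
ESCAPES = {"&": "&amp;", "<": "&lt;", ">": "&gt;"}


def escape_data(text):
    """Replace markup characters in text with entity references."""
    return "".join(ESCAPES.get(character, character) for character in text)


print(escape_data("This is how you include a <tag> as data."))
```

Working one character at a time means an ampersand in the input is replaced exactly once, which avoids the double escaping that can occur when replacements are chained in the wrong order.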
Comment declarations can be used to include information aimed at persons
and tools that read the document in source form. This information will
be ignored when the document is processed by an SGML parser.
Comments begin with the character sequence "<!--" and end with
"--", which must be followed by '>'. (Technically, whitespace is
allowed between the closing "--" and '>'.) They are only allowed in
PCDATA.
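For a tool that only needs to discard comments, the description above can be approximated with a pattern. This is a simplification: it treats the first "--" that is followed by optional whitespace and '>' as the end of the comment, and does not implement the full SGML comment declaration syntax. The names COMMENT_RE and strip_comments are hypothetical.

```python
import re

# "<!--", the comment text, then "--", optional whitespace, and ">".
COMMENT_RE = re.compile(r"<!--(.*?)--\s*>", re.DOTALL)


def strip_comments(text):
    """Return text with comments removed."""
    return COMMENT_RE.sub("", text)


print(strip_comments("Visible<!-- a note for maintainers --> text"))
```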