1. Differences between XML and SGML

XML allows only documents that use the SGML declaration in this note. This declares all the following SGML features as NO:

• DATATAG
• OMITTAG
• RANK
• LINK (SIMPLE, IMPLICIT and EXPLICIT)
• CONCUR
• SUBDOC
• FORMAL


Note that it differs from the reference concrete syntax in a number of ways:

• It also declares no short reference delimiters; it follows that SHORTREF and USEMAP declarations cannot occur in XML
• The PIC (processing instruction close) delimiter is ?>
• Quantities and capacities are effectively unlimited
• Names are case sensitive (NAMECASE GENERAL is NO)
• Underscore and colon are allowed in names
• Names can use Unicode characters and are not restricted to ASCII

The following constructs, which SGML permits when SHORTTAG is YES, are not allowed in XML:

• Unclosed start-tags
• Unclosed end-tags
• Empty start-tags
• Empty end-tags
• Attribute values in attribute specifications entered directly rather than as literals
• Attribute specifications that omit the attribute name
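
As a rough illustration of the list above, the following sketch feeds a few SHORTTAG-style documents to Python's standard `xml.etree.ElementTree` parser and reports whether each one is accepted. The sample documents and the helper name `is_well_formed` are assumptions made for this example; they do not come from the note.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample documents, one per kind of SHORTTAG minimization.
samples = {
    "full markup": '<p class="note">text</p>',
    "unclosed start-tag": "<p<b>text</b></p>",
    "empty start-tag": "<p>one<>two</p>",
    "empty end-tag": "<p>text</>",
    "unquoted attribute value": "<p class=note>text</p>",
    "omitted attribute name": '<p "note">text</p>',
}

for label, doc in samples.items():
    verdict = "accepted" if is_well_formed(doc) else "rejected"
    print(f"{label}: {verdict}")
```

Only the fully marked-up document should be accepted; an SGML parser with SHORTTAG YES could accept the other forms.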

NET delimiters can be used only to close an empty element. In SGML without the Web SGML Adaptations Annex, the NET delimiter is declared as />. With this approach, XML does not allow null end-tags and allows net-enabling start-tags only for elements with no end-tag. In SGML with the Web SGML Adaptations Annex, there is a separate NESTC (net-enabling start-tag close) delimiter. This allows the XML <e/> syntax to be handled as a combination of a net-enabling start-tag <e/ and a null end-tag >. With this approach, XML allows a net-enabling start-tag only when it is immediately followed by a null end-tag.
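
A small sketch of what this means in practice, using Python's `xml.etree.ElementTree` as a stand-in for an XML parser: the empty-element form and the explicit start-tag/end-tag pair produce the same element, while an SGML-style net-enabling start-tag with content is rejected. The element names `doc` and `e` and the helper names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET


def first_child_shape(text: str):
    """Parse `text` and describe its first child as (tag, text, child count)."""
    child = ET.fromstring(text)[0]
    return (child.tag, child.text, len(child))


def parses(text: str) -> bool:
    """Return True if `text` is accepted by the XML parser."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# <e/> and <e></e> describe the same empty element.
same_element = first_child_shape("<doc><e/></doc>") == first_child_shape("<doc><e></e></doc>")
print("empty-element forms equivalent:", same_element)

# SGML NET usage with content (<e/text/) is not XML.
print("SGML null end-tag accepted:", parses("<doc><e/text/</doc>"))
```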

XML imposes the following restrictions not in SGML:

Entity references:
• General entity references in content are required to be synchronous
• External entity references in attribute values are not allowed

Character references:
• Named character references are not allowed
• Numeric character references to non-SGML characters are not allowed
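
The character reference rules can be tried out with a short sketch. It assumes Python's `xml.etree.ElementTree` as the parser; the sample strings are made up for illustration. `&#SPACE;` is an SGML named character reference, and `&#1;` refers to a control character that XML 1.0 does not permit.

```python
import xml.etree.ElementTree as ET


def text_or_none(doc: str):
    """Return the root element's text, or None if the document is rejected."""
    try:
        return ET.fromstring(doc).text
    except ET.ParseError:
        return None


decimal_ref = text_or_none("<p>a&#32;b</p>")     # numeric reference, decimal
hex_ref = text_or_none("<p>a&#x20;b</p>")        # numeric reference, hexadecimal
named_ref = text_or_none("<p>a&#SPACE;b</p>")    # SGML named character reference
control_ref = text_or_none("<p>a&#1;b</p>")      # character not allowed in XML 1.0

print(repr(decimal_ref), repr(hex_ref), repr(named_ref), repr(control_ref))
```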

Entity declarations:
• A #DEFAULT entity cannot be declared
• External SDATA entities are not allowed
• External CDATA entities are not allowed
• Internal SDATA entities are not allowed
• Internal CDATA entities are not allowed
• An ampersand in a parameter literal must be followed by a syntactically valid entity reference or numeric character reference

Attribute definition list declarations:
• Associated element type in attribute definition list declarations cannot be a name group
• Attributes cannot be declared for a notation
• A name token group must use the or connector (|)
• Attribute values specified as defaults in attribute definition list declarations must be literals

Element type declarations:
• Associated element type in element type declaration cannot be a name group
• In an element declaration, a generic identifier cannot be specified as a rank stem and rank suffix
• Minimization parameters in element declarations are not allowed
• RCDATA declared content is not allowed
• CDATA declared content is not allowed
• Content models cannot use the and connector (&)
• Content models for mixed content have a restricted form
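
The element type declaration restrictions can be probed by placing one declaration at a time in an internal subset and checking whether an XML parser accepts it. This is a minimal sketch that assumes the Expat-based parser behind Python's `xml.etree.ElementTree`, which checks declaration syntax without validating; the helper `declaration_accepted` and the sample declarations are hypothetical.

```python
import xml.etree.ElementTree as ET


def declaration_accepted(declaration: str) -> bool:
    """Wrap one markup declaration in an internal subset and try to parse it."""
    doc = f"<!DOCTYPE a [{declaration}]><a/>"
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Hypothetical declarations: the first two are XML, the rest are SGML-only.
declarations = {
    "sequence content model": "<!ELEMENT a (b, c)>",
    "mixed content, XML form": "<!ELEMENT a (#PCDATA | b)*>",
    "name group as element type": "<!ELEMENT (a | b) (c)>",
    "minimization parameters": "<!ELEMENT a - O (b)>",
    "CDATA declared content": "<!ELEMENT a CDATA>",
    "RCDATA declared content": "<!ELEMENT a RCDATA>",
    "and connector": "<!ELEMENT a (b & c)>",
    "mixed content, other form": "<!ELEMENT a (b | #PCDATA)*>",
}

for label, decl in declarations.items():
    verdict = "accepted" if declaration_accepted(decl) else "rejected"
    print(f"{label}: {verdict}")
```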

Comments:
• A parameter separator cannot contain comments; this means that markup declarations (other than comment declarations) cannot contain comments
• Empty comment declarations (<!> in the reference concrete syntax) are not allowed
• A comment declaration cannot contain more than one comment
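
The comment rules can be sketched in the same way. The documents below are invented for illustration and the parser is assumed to be Python's `xml.etree.ElementTree`: a single comment per declaration is fine, while an empty comment declaration, two comments in one declaration, or a comment inside another markup declaration is rejected.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if `text` is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


single_comment = parses("<a><!-- one comment --></a>")
empty_declaration = parses("<a><!></a>")
two_comments = parses("<a><!-- one -- -- two --></a>")
comment_in_declaration = parses("<!DOCTYPE a [<!ELEMENT a EMPTY -- note -->]><a/>")

print(single_comment, empty_declaration, two_comments, comment_in_declaration)
```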

Processing instructions:
• Processing instructions must start with a name (the PI target)
• A processing instruction whose PI target is xml can only occur at the beginning of an external entity and must be an XML declaration if it occurs in the document entity, and otherwise a text declaration
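
A brief sketch of the two processing instruction rules, again assuming Python's `xml.etree.ElementTree` as the parser. The target name `target` and the sample documents are placeholders chosen for this example.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if `text` is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


pi_with_target = parses("<a><?target some data?></a>")
pi_without_target = parses("<a><? some data?></a>")
declaration_at_start = parses('<?xml version="1.0"?><a/>')
declaration_in_content = parses('<a><?xml version="1.0"?></a>')

print(pi_with_target, pi_without_target, declaration_at_start, declaration_in_content)
```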

Marked sections:
• In marked section declarations, TEMP status keyword is not allowed
• In a marked section declaration, a status keyword specification that contains no status keywords is not allowed
• In a marked section declaration, a status keyword specification cannot contain more than one status keyword
• Marked sections are not allowed in the internal subset
• Parameter separators are not allowed in status keyword specifications in the document instance; in particular, parameter entity references are not allowed
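
The marked section rules can be illustrated with a few probes. This sketch assumes Python's `xml.etree.ElementTree`; all sample documents are hypothetical. It relies on the fact that in XML the only marked section that can appear in the document instance is a CDATA section written exactly as `<![CDATA[`.

```python
import xml.etree.ElementTree as ET


def parses(text: str) -> bool:
    """Return True if `text` is accepted as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# A CDATA section is accepted and its content is treated as data.
cdata_text = ET.fromstring("<a><![CDATA[x < y]]></a>").text
print(repr(cdata_text))

rejected_forms = {
    "TEMP status keyword": "<a><![TEMP CDATA[x]]></a>",
    "no status keyword": "<a><![[x]]></a>",
    "separators around keyword": "<a><![ CDATA [x]]></a>",
    "marked section in internal subset":
        "<!DOCTYPE a [<![INCLUDE[<!ELEMENT a EMPTY>]]>]><a/>",
}

for label, doc in rejected_forms.items():
    print(f"{label}: {'accepted' if parses(doc) else 'rejected'}")
```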

Other:
• Names beginning with [Xx][Mm][Ll] are reserved
• The SGML declaration must be implied and cannot be explicitly present in the document entity
• When < and & occur as data, they must be entered as &lt; and &amp;
• A parameter separator required by the formal syntax must always be present and cannot be omitted when it is adjacent to a delimiter
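
The escaping rule for < and & is the one most authors meet first. As a minimal sketch, Python's standard `xml.sax.saxutils.escape` can produce the escaped form, and a round trip through `xml.etree.ElementTree` recovers the original data. The sample string is made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "if a < b && c: AT&T"

# escape() replaces & with &amp; and < with &lt; (and also > with &gt;).
escaped = escape(raw)
print(escaped)

# The escaped text is valid element content and parses back to the raw data.
round_trip = ET.fromstring(f"<p>{escaped}</p>").text

# The unescaped text is not well-formed XML.
try:
    ET.fromstring(f"<p>{raw}</p>")
    unescaped_accepted = True
except ET.ParseError:
    unescaped_accepted = False
```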

XML predefines the semantics of the attributes xml:space and xml:lang. It also reserves all attribute, element type and notation names beginning with [Xx][Mm][Ll].

XML requires that an SGML parser use an entity manager that behaves as follows:
• Lines are terminated by newline (Unicode code #X000A) rather than being delimited by RS and RE as with a typical SGML entity manager
• System identifiers are treated as URLs
• The entity manager must support entities encoded in UTF-16 and UTF-8, and must be able to detect automatically which encoding an entity uses, based on the presence of the byte order mark
• The entity manager should be able to recognize the encoding declaration in the XML declaration and encoding PI and use it to determine the encoding of the entity
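
The encoding requirements can be sketched as a small detection routine. This is an illustrative simplification, not a complete entity manager: the names `detect_encoding` and `read_entity` are hypothetical, the regular expression only approximates the encoding declaration syntax, and entities without a byte order mark are assumed to be ASCII-compatible so the declaration can be read as bytes.

```python
import codecs
import re

# Approximate pattern for an encoding declaration at the start of an entity.
_ENCODING_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def detect_encoding(data: bytes) -> str:
    """Guess an entity's encoding from its byte order mark or declaration."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # Python codec that strips the UTF-8 BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # Python codec that reads and strips the BOM
    match = _ENCODING_DECL.match(data)
    if match:
        return match.group(1).decode("ascii")
    return "utf-8"          # default when nothing else is stated


def read_entity(data: bytes) -> str:
    """Decode an entity's bytes to text using the detected encoding."""
    return data.decode(detect_encoding(data))


utf16_entity = "<a>\u00e9</a>".encode("utf-16")  # Python prepends a BOM here
latin1_entity = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>\xe9</a>'

print(detect_encoding(utf16_entity), detect_encoding(latin1_entity))
```

Checking the byte order mark before the declaration mirrors the order of the two requirements above: the mark is mandatory to support, and the declaration is consulted only when no mark settles the question.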

XML imposes requirements on the information that a parser must make available to an application.