Using SGML spec/rules to parse DTDs

A very obscure technical question about SGML and DTDs, I’m sure, but there’s no harm in asking.

In the SGML specification (ISO 8879) the first production is:

1 SGML document =
    2 SGML document entity,
    (    3 SGML subdocument entity |
        4 SGML text entity |
        5.1 character data entity |
        5.2 specific data entity |
        6 non-SGML data entity    )*
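
Read as a grammar rule, the comma means “followed by”, the bar means “or”, and the star means “zero or more times”. Here is a minimal Python sketch of that reading, assuming each entity kind is abbreviated to a made-up one-letter code so the production can be written as a regular expression. The codes and the function name are illustrative only and are not part of ISO 8879.

```python
import re

# Hypothetical one-letter codes for the entity kinds named in production 1.
CODES = {
    "SGML document entity": "D",
    "SGML subdocument entity": "S",
    "SGML text entity": "T",
    "character data entity": "C",
    "specific data entity": "P",
    "non-SGML data entity": "N",
}

# "D" exactly once and first, then any mix of the other five, zero or more times.
PRODUCTION_1 = re.compile(r"D[STCPN]*")

def is_sgml_document(entities):
    """Return True if the sequence of entity kinds fits production 1."""
    encoded = "".join(CODES[kind] for kind in entities)
    return PRODUCTION_1.fullmatch(encoded) is not None
```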

Production 1 says that “2 SGML document entity” must occur, and must come first, followed by zero or more of the bracketed alternatives. 2 is:

2 SGML document entity =
    5 s*,
    171 SGML declaration,
    7 prolog,
    10 document instance set,

5 (s) is a separator: a space or one of three other non-printing characters (RS, RE or SEPCHAR), I think; either way it’s optional. Then 171 has to happen. 171 is:

171 SGML declaration =
    65 ps+,
    76 minimum literal,
    65 ps+,
    172 document character set,
    65 ps+,
    180 capacity set,
    65 ps+,
    181 concrete syntax scope,
    65 ps+,
    182 concrete syntax,
    65 ps+,
    195 feature use,
    65 ps+,
    199 application-specific information,
    65 ps*,

The start of that would be: <!SGML
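
A quick way to test a piece of text for that opening is sketched below. It assumes the reference concrete syntax, where the separator characters are space, tab, line feed and carriage return, and it assumes the keyword is written in upper case. The function name is made up for illustration.

```python
# Separator characters ("5 s"), assuming the reference concrete syntax:
# SPACE (32), RS = line feed (10), RE = carriage return (13), SEPCHAR = tab (9).
SEPARATORS = " \n\r\t"

def starts_with_sgml_declaration(text):
    """Skip any leading s*, then check for the literal '<!SGML'."""
    rest = text.lstrip(SEPARATORS)
    return rest.startswith("<!SGML")
```

Run against a DTD or a web page that begins with a doctype line, this returns False.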

Then there’s “7 prolog”. 7 prolog is:

7 prolog =
    8 other prolog*,
    9 base document type declaration,
    (    110 document type declaration |
        8 other prolog    )*,
    (    154 link type declaration |
        8 other prolog    )*

8 other prolog is a “91 comment declaration”, a “44 processing instruction” or some separator space, and it’s optional anyway. Then 9 is:

9 base document type declaration =
    110 document type declaration

Which, finally (this is what I was waiting for), is the document type declaration, recognisable from the start of any well-formed web page.
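
Getting to that declaration means stepping over any “other prolog” in front of it. A minimal sketch of that step follows. It is deliberately simplified: it only handles comment declarations of the plain `<!-- ... -->` form, and it treats a processing instruction as everything from `<?` to the next `>`. The function name is hypothetical.

```python
import re

# One "other prolog" item: a run of separators, a simple comment declaration,
# or a processing instruction. Every alternative consumes at least one character.
OTHER_PROLOG = re.compile(r"[ \t\r\n]+|<!--.*?-->|<\?.*?>", re.DOTALL)

def skip_other_prolog(text, pos=0):
    """Return the index of the first character after any run of other prolog."""
    while True:
        match = OTHER_PROLOG.match(text, pos)
        if match is None:
            return pos
        pos = match.end()
```

Whatever sits at the returned index should then be the start of the base document type declaration.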

I understand that you take the doc type line from the start of a web page, like

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">

and that gets put in front of the rest of the DTD when parsing it. So the DTD you parse is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "">

   Extensible HTML version 1.0 Strict DTD

   This is the same as HTML 4 Strict except for
   changes due to the differences between XML and SGML.

   Namespace =

   For further information, see:

   Copyright (c) 1998-2002 W3C (MIT, INRIA, Keio),
   All Rights Reserved. 

   This DTD module is identified by the PUBLIC and SYSTEM identifiers:

   PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
   SYSTEM ""

   $Revision: 1.1 $
   $Date: 2002/08/01 13:56:03 $

<!--================ Character mnemonic entities =========================-->

   "-//W3C//ENTITIES Latin 1 for XHTML//EN"

So what the SGML spec is saying is that you must have a “171 SGML declaration” first, which starts with <!SGML. Then you have the doctype line, and then, presumably, what’s in the DTD.

There is no <!SGML … line visible in a DTD. Where is the <!SGML … bit? Is it implied in some way? Without it, it seems a DTD isn’t proper SGML.

Ah, I’ve kind of answered this myself now. The missing <!SGML …> bits I’ve found on’s site.

xhtml 1’s SGML:
SGML Declaration of HTML 4:

So those need to come before the doc type to make the DTDs proper SGML. It’s odd, though, that their existence and location aren’t technically part of it in some way, I mean in a machine-linked, machine-readable and parsable way.

I thought it might be deducible from DTD urls.

This works: << url in doctype << the sgml

These don’t: << url in doctype << not found, doesn’t exist << url in doctype << not found, doesn’t exist

Ah, HTML4’s machine readable one is

Hmm, maybe I’ll just ignore the <!SGML … bit and start with the doc type line.
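
Starting at the doc type line could look roughly like the sketch below, which searches for the first `<!DOCTYPE` and pulls out the name and the PUBLIC and SYSTEM identifiers. It is an assumption-laden shortcut, not an SGML parser: it only accepts double-quoted literals, ignores any internal subset, and would be fooled by a `<!DOCTYPE` inside a comment. The function and group names are made up.

```python
import re

# Document type name, then optionally PUBLIC "pubid" ["sysid"] or SYSTEM "sysid".
DOCTYPE = re.compile(
    r'<!DOCTYPE\s+(?P<name>[^\s>\[]+)'
    r'(?:\s+PUBLIC\s+"(?P<public>[^"]*)"(?:\s+"(?P<system>[^"]*)")?'
    r'|\s+SYSTEM\s+"(?P<system_only>[^"]*)")?'
)

def parse_doctype(text):
    """Find the first doctype line and return its parts, or None if absent."""
    match = DOCTYPE.search(text)
    if match is None:
        return None
    system = match.group("system")
    if system is None:
        system = match.group("system_only")
    return {
        "name": match.group("name"),
        "public": match.group("public"),
        "system": system,
    }
```

For the doctype line quoted earlier, this gives the name `html` and the public identifier `-//W3C//DTD XHTML 1.0 Strict//EN`.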

What are you trying to achieve? Learning SGML is a waste of time.

If you’re trying to write correct HTML and are targeting browsers, forget about SGML. Browsers never used SGML parsers and never will. Historically browsers used different rules to parse HTML, but today they have all rewritten their HTML parsers according to the HTML spec. (Previous HTML specs maintained the fiction that HTML was SGML, but browsers didn’t comply and Web content didn’t, either.)

The HTML spec for writing HTML: