  1. #1
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,292
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)

    parsing dtd's. xhtml and html's flavour of sgml differs.

    i'm writing some php to parse dtd's (which are in sgml) -- dtd's which are pointed to by the doctype at the start of html and xhtml documents.

    i've just found out that html and xhtml use different versions of sgml. i think xhtml's sgml is the same as the sgml xml uses.

    the point is it's not just that the rules in xhtml dtd's are different to the rules in html dtd's, but the language the rules are written in differs -- a different version of sgml is used.

    so it seems that html uses an html version of sgml, and xhtml uses an xml version of sgml. the code i'm writing takes a doctype as its starting point. what should it base its answer to "parse using xml-sgml or html-sgml?" on?

    Code:
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
            "http://www.w3.org/TR/html4/loose.dtd">
    
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
            "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    also there doesn't seem to be anything in the dtd documents themselves which states which sgml to use.

    i suppose i could use the HTML or XHTML part of the doctype, which is part of its Formal Public Identifier (fpi), in particular the "label" or "public text description" part of the fpi (according to http://www.eskimo.com/~bloo/indexdot.../d/doctype.htm )

    but that doesn't seem a very reliable way to tell whether the xml sgml or html sgml should be used -- it would come down to just looking to see if the "HTML 4.01 Transitional"/"XHTML 1.0 Strict" part of the doctype starts with an X or not. especially as one term used to describe that bit of the doctype is "public text description" -- that doesn't sound like something a programmatic decision should be based on.
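
    a rough sketch of what that fpi check might look like, in python rather than php purely for illustration. the names `parse_doctype` and `looks_like_xhtml` are made up for this example, and the "description starts with XHTML" test is only the heuristic described above, not an official rule:

```python
import re

# Matches a PUBLIC doctype declaration and captures the root element name,
# the formal public identifier (FPI) and the optional system identifier.
DOCTYPE_RE = re.compile(
    r'<!DOCTYPE\s+(?P<root>\S+)\s+PUBLIC\s+"(?P<fpi>[^"]*)"'
    r'(?:\s+"(?P<system>[^"]*)")?\s*>',
    re.IGNORECASE,
)


def parse_doctype(doctype):
    """Split a doctype into its parts (hypothetical helper)."""
    m = DOCTYPE_RE.search(doctype)
    if m is None:
        raise ValueError("not a PUBLIC doctype declaration")
    fields = m.group("fpi").split("//")
    # Assumed layout: registration // owner // class + description // language
    if len(fields) != 4:
        raise ValueError("unexpected FPI layout")
    registration, owner, text, language = fields
    public_class, _, description = text.partition(" ")
    return {
        "root": m.group("root"),
        "registration": registration,
        "owner": owner,
        "class": public_class,
        "description": description,
        "language": language,
        "system": m.group("system"),
    }


def looks_like_xhtml(doctype):
    """Heuristic only: true if the public text description starts with XHTML."""
    return parse_doctype(doctype)["description"].upper().startswith("XHTML")
```

    with the two doctypes shown earlier, `looks_like_xhtml` would return False for the HTML 4.01 Transitional one and True for the XHTML 1.0 Strict one.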

    or i could base the decision on the sgml itself. for example, in html's sgml the element declarations contain a pair of hyphens and/or O's to indicate whether the opening and closing tags are required or optional:

    <!ELEMENT UL - - (LI)+>

    whereas those never occur in the xhtml's sgml:

    <!ELEMENT ul (li)+>

    does anyone know the proper thing to base the "use xhtml-sgml or html-sgml" decision on? thanks.

    (i'm not actually sure at the moment how much xhtml's sgml and html's sgml differs -- not the rules but the language -- it does differ at least a little bit)

  2. #2
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,159
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    HTML is an application of SGML.
    XML is a limited subset of SGML (with some extensions).
    XHTML is an application of XML.

    XML DTDs are more limited than full-blown SGML DTDs. Since XML doesn't permit any tags to be omitted, those markers (hyphens/O's) aren't used in XML DTDs. Also, XML's comment syntax is much more limited than SGML's, so comments in XML DTDs are always separate, whereas in an SGML DTD they can be embedded in the declarations.

    XML DTDs don't allow for inclusions or exclusions, either.
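
    To illustrate the comment difference, here is a minimal Python sketch that strips `-- ... --` comments embedded in a single SGML declaration. The function name `strip_embedded_comments` is hypothetical, and the sketch assumes no quoted literal in the declaration contains `--`:

```python
import re

# An SGML comment inside a markup declaration runs from "--" to the next "--".
COMMENT_RE = re.compile(r'--.*?--', re.DOTALL)


def strip_embedded_comments(decl):
    """Remove `-- ... --` comments from one markup declaration.

    Simplification: quoted literals are not inspected, so a "--" inside
    an attribute default would be treated as a comment delimiter.
    """
    if decl.startswith("<!--"):
        return ""  # the whole declaration is a comment
    stripped = COMMENT_RE.sub("", decl)
    stripped = re.sub(r'[ \t]+', ' ', stripped)  # collapse leftover spaces
    return re.sub(r'\s+>', '>', stripped)        # tidy space before ">"
```

    An XML DTD declaration passes through unchanged, since it cannot contain embedded comments in the first place.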

    You could parse either one using the generic SGML DTD parser, since XML DTDs are a subset of SGML DTDs (as far as I know, anyway). The only issue would be the flags for optional/required tags. You could set the defaults to 'required' and override the value if the flags are present (i.e., it's a non-XML DTD).
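
    A minimal Python sketch of that "default to required, override if the flags are present" idea. The names `ELEMENT_RE` and `parse_element_decl` are made up for this example, and it assumes any embedded `-- ... --` comments have already been removed from the declaration:

```python
import re

# <!ELEMENT name [start-flag end-flag] content-model>
# The two omission flags ("-" = required, "O" = omissible) are optional,
# so the same pattern accepts both SGML-style and XML-style declarations.
ELEMENT_RE = re.compile(
    r'<!ELEMENT\s+(?P<name>\S+)\s+'
    r'(?:(?P<start>[-Oo])\s+(?P<end>[-Oo])\s+)?'
    r'(?P<content>.+?)\s*>',
    re.DOTALL,
)


def parse_element_decl(decl):
    """Parse one ELEMENT declaration (hypothetical helper)."""
    m = ELEMENT_RE.match(decl.strip())
    if m is None:
        raise ValueError("not an ELEMENT declaration")
    has_flags = m.group("start") is not None
    return {
        "name": m.group("name"),
        # Default is "required"; only an explicit "O" flag overrides it.
        "start_tag_required": not has_flags or m.group("start") == "-",
        "end_tag_required": not has_flags or m.group("end") == "-",
        "content": m.group("content"),
        "has_omission_flags": has_flags,
    }
```

    For `<!ELEMENT UL - - (LI)+>` and `<!ELEMENT ul (li)+>` both tags come out as required; the only difference is `has_omission_flags`, which could double as a hint that the DTD is not an XML DTD.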

    If you want two separate parsers you'll have to look at the FPI. XHTML is XML, so the DTDs are XML DTDs. HTML4 and older are SGML, so the DTDs are SGML DTDs.

    Of course, if you want to be really generic and sophisticated, you could read the respective SGML declarations (for HTML and XML) and dynamically configure the parser according to that.
    Birnam wood is come to Dunsinane

  3. #3
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,292
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    > XML DTDs are more limited than full-blown SGML DTDs. Since XML doesn't permit any tags to be omitted, those markers (hyphens/O's) aren't used in XML DTDs.

    > You could parse either one using the generic SGML DTD parser, since XML DTDs are a subset of SGML DTDs (as far as I know, anyway).

    right.

    > The only issue would be the flags for optional/required tags. You could set the defaults to 'required' and override the value if the flags are present (i.e., it's a non-XML DTD).

    could it be that that's how sgml is supposed to work, i wonder? i wonder if sgml really has been tweaked for the xhtml version (which was my initial assumption in the original question: "html and xhtml use different versions of sgml")? probably not, i now suspect. from a reading/parsing point of view, i reckon a full, correct sgml parser would happily deal with xhtml's dtd sgml, because xhtml's dtd sgml is just a subset of html's dtd sgml.

    your suggestion above about using defaults for the optional/required markers, is that your idea, or is that how sgml is supposed to work? i bet/suspect now, that's how it's supposed to work.

    > If you want two separate parsers you'll have to look at the FPI. XHTML is XML, so the DTDs are XML DTDs. HTML4 and older are SGML, so the DTDs are SGML DTDs.

    no, i don't want two separate parsers at all. i just thought earlier on that it was going to be necessary. now it looks like that's not going to be required. i just need a fuller, more sgml-aware parser i reckon.

    > Also, XML's comment syntax is much more limited than SGML's, so comments in XML DTDs are always separate, whereas in an SGML DTD they can be embedded in the declarations.

    oh yes, i hadn't noticed that that didn't happen in xhtml's dtd sgml.

    right, excellent, thanks for the info.

    obviously i need to read up more on sgml.


    p.s. ah, the one thing you said that possibly makes what i'm saying above incorrect is:

    > XML is a limited subset of SGML (with some extensions).

    in particular the "with some extensions" bit. what are those extensions? just the fact that the required/can-be-omitted part is always omitted?

    thanks.

  4. #4
    Non-Member
    Join Date
    Oct 2008
    Location
    Banned
    Posts
    506
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    You cannot omit things in XHTML, such as closing tags or the trailing slash on self-closing tags. In HTML you can leave them out; in XHTML you cannot.

  5. #5
    SitePoint Author

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,159
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    SGML is very generic and an application of SGML can look like almost anything. HTML mainly uses the default SGML delimiters, like '<' for STAGO, '</' for ETAGO, '>' for TAGC, etc.

    An application of SGML needs an SGML declaration that specifies which delimiters to use, which characters are legal in names and identifiers, case sensitivity, and so on. The SGML declaration also allows a number of 'features' to be turned on or off. One feature is known as SHORTTAG and another is known as OMITTAG. Both are declared as YES for HTML.

    OMITTAG YES means the language allows some tags to be omitted, if their existence can be inferred. SHORTTAG YES implies several things, most of which aren't properly supported by the HTML parsers used in browsers (like empty start tags, <>, and end tags, </>).
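
    As a rough Python sketch of configuring a parser from the SGML declaration, the function below pulls simple YES/NO settings for OMITTAG and SHORTTAG out of the declaration text. The name `read_minimization_features` is hypothetical, and the sketch only understands the plain `FEATURE YES|NO` form; a declaration that spells SHORTTAG out as finer-grained sub-options would need a real parser:

```python
import re

# Matches e.g. "OMITTAG  YES" or "SHORTTAG NO" anywhere in the declaration.
FEATURE_RE = re.compile(r'\b(OMITTAG|SHORTTAG)\s+(YES|NO)\b')


def read_minimization_features(sgml_decl):
    """Return {feature name: bool} for the features found in the text."""
    return {name: value == "YES" for name, value in FEATURE_RE.findall(sgml_decl)}


# Illustrative fragment in the style of a FEATURES section (sample text only).
SAMPLE_DECL = """
FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  YES
    RANK     NO
    SHORTTAG YES
"""

features = read_minimization_features(SAMPLE_DECL)
```

    A DTD parser could then accept the hyphen/O omission flags only when `features.get("OMITTAG")` is true.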

    In HTML you can, in theory, also use 'null end-tags' (NET). Browser parsers don't support them, but you should be able to write things like this:
    Code:
    <p<abbr/HTML/ is an application of <abbr/SGML/.
    XML is a subset, or a specialisation, of SGML. Its SGML declaration says OMITTAG NO, so no tags can be omitted in XML. The authors of XML wanted more control than the regular SHORTTAG feature allows, so they use an extension to SGML that lets them fine-tune the behaviour. This makes it possible for XML to disallow unclosed tags, empty tags, unquoted attribute values and attribute minimisation, while still allowing null end-tags.

    In SGML, the NET separator ('/') terminates the start-tag and replaces the end-tag, as in <abbr/HTML/. The extension used by XML introduces a second delimiter, known as NESTC (NET-enabling start-tag close). In XML, NESTC is '/' and the NET delimiter is set to '>'. That's why we write <foo/>, which in HTML would be <foo//. The XML specification places an additional limitation on NET usage, saying that it may only be used for empty elements. Thus, <abbr/HTML> is not valid XML.
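
    To make the SGML-style NET shorthand concrete, here is a small Python sketch that expands `<name/content/` into an ordinary start-tag and end-tag pair. The helper `expand_net` is made up for this example; it ignores attributes and nested markup, and it is meant for SGML-style input only (it would misread XML's `<foo/>`):

```python
import re

# <name/ ... /  : the first "/" closes the start-tag, the second one
# stands in for the whole end-tag.
NET_RE = re.compile(r'<([A-Za-z][A-Za-z0-9]*)/([^/<]*)/')


def expand_net(sgml_text):
    """Rewrite null end-tag shorthand as explicit start and end tags."""
    return NET_RE.sub(r'<\1>\2</\1>', sgml_text)
```

    Under these assumptions, `expand_net('<abbr/HTML/')` gives `<abbr>HTML</abbr>`, and `expand_net('<foo//')` gives `<foo></foo>`.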

    There's more info in James Clark's Comparison of SGML and XML if you're interested.
    Birnam wood is come to Dunsinane

  6. #6
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,292
    Mentioned
    2 Post(s)
    Tagged
    0 Thread(s)
    right, excellent -- thanks very much for the info.

  7. #7
    SitePoint Zealot
    Join Date
    Mar 2008
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    It is best to use an XML Schema Validator for all XHTML documents (including those served as text/html) rather than the SGML-based W3C Markup Validator, because a well-formedness check is performed automatically.

    I will post links if ever I get to ten posts!

    JFP

  8. #8
    SitePoint Zealot
    Join Date
    Mar 2008
    Posts
    116
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Here is, IMO, the best (and most used) XML Schema Validator

    JFP

  9. #9
    om nom nom nom Stomme poes's Avatar
    Join Date
    Aug 2007
    Location
    Netherlands
    Posts
    10,233
    Mentioned
    47 Post(s)
    Tagged
    1 Thread(s)
    > For instance, it is unclear whether the form element is allowed to carry a name attribute.

    Whoa, I thought that was certain, no "name" on forms!

  10. #10
    Programming Since 1978 silver trophybronze trophy felgall's Avatar
    Join Date
    Sep 2005
    Location
    Sydney, NSW, Australia
    Posts
    16,603
    Mentioned
    24 Post(s)
    Tagged
    1 Thread(s)
    The W3C standard allows name to be used on a form tag but also states:

    > Note. This attribute has been included for backwards compatibility. Applications should use the id attribute to identify elements.

    So you should NOT give your form a name, you should give it an id instead if you want to correctly follow the standards.
    Stephen J Chapman


