  1. #1
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)

    parsing sgml - source of syntax details?

    i'm back on a project i started ages ago and didn't get very far with which includes parsing sgml. i haven't been able to find a good source of info about sgml's syntax. where can i get a good thorough detailed description of that? i've found things like

    http://www.w3.org/TR/html401/intro/sgmltut.html
    http://www.isgmlug.org/sgmlhelp/g-index.htm
    http://www.linuxfromscratch.org/alfs...td-syntax.html

    but none of them really detail it. for example you can have this in an attribute list:

    xmlns %URI; #FIXED 'http://www.w3.org/1999/xhtml'

    but i've not found anything which talks about what follows the fixed part in any detail; will it always be on the same line? if not how do you reliably, programmatically, distinguish between something which is connected with fixed, and the next entry in the attribute list? sometimes there's nothing following fixed, sometimes it's in single quotes, sometimes double... etc.

    so the kind of detail required to programmatically parse sgml: where can i find it, does anyone know?

    thanks.

  2. #2
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Perhaps my article The art of reading a DTD can help?

    In your example, the components are,
    xmlns – the name of the attribute
    %URI; – a parameter entity reference (that resolves to CDATA) that specifies the data type of the attribute value
    #FIXED – a reserved word (as evidenced by the leading RNI, the reserved name indicator '#') that states that the attribute can only have one, fixed value
    'http://www.w3.org/1999/xhtml' – the fixed value that is the only one allowed for this attribute

    So the declaration says that there's an attribute named xmlns which takes a CDATA value that must be exactly "http://www.w3.org/1999/xhtml".

    Web SGML and HTML 4.0 Explained is fairly old by now, but a useful introduction to SGML in my opinion.
    Birnam wood is come to Dunsinane

  3. #3
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by johnyboy View Post
    but i've not found anything which talks about what follows the fixed part in any detail
    That will be the fixed value that must be used for the attribute.

    Quote Originally Posted by johnyboy View Post
    will it always be on the same line?
    No, white-space can be used fairly freely in a DTD. The next attribute declaration could start on the same line, I believe.

    Quote Originally Posted by johnyboy View Post
    if not how do you reliably, programmatically, distinguish between something which is connected with fixed, and the next entry in the attribute list?
    There will always be exactly one value following #FIXED.

    Quote Originally Posted by johnyboy View Post
    sometimes there's nothing following fixed
    I think there must be. You can't state that an attribute has a fixed value without specifying what that fixed value is.

    Quote Originally Posted by johnyboy View Post
    sometimes it's in single quotes, sometimes double... etc.
    Yes, both single and double quotes can be used around string literals, just as in many programming languages. In a DTD they are equivalent, just as they are in HTML and JavaScript (but not in, e.g., PHP).
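
    To make the answers above concrete, here is a minimal sketch of reading the body of an attribute list. It is not a full SGML parser. It assumes that parameter entity references standing for whole definitions (such as %attrs;) have already been expanded, and that each definition is exactly "name, declared type, default". The names TOKEN, VALUELESS, unquote and parse_attlist_body are made up for this example. The point is that #FIXED always consumes exactly one following token, so line breaks do not matter.

```python
import re

# One token is a quoted literal, a parenthesised name group, or a run of
# other non-whitespace characters (names, #KEYWORDS, %entity; references).
TOKEN = re.compile(r"""
    "[^"]*"         # double-quoted literal
  | '[^']*'         # single-quoted literal
  | \([^)]*\)       # name group such as (ltr|rtl)
  | [^\s"'()]+      # name, reserved word, or entity reference
""", re.VERBOSE)

# Default-value keywords that are NOT followed by a value.
VALUELESS = {"#REQUIRED", "#IMPLIED", "#CURRENT", "#CONREF"}


def unquote(token):
    """Drop matching surrounding quotes, if any."""
    if len(token) >= 2 and token[0] in "\"'" and token[-1] == token[0]:
        return token[1:-1]
    return token


def parse_attlist_body(body):
    """Split an ATTLIST body into one dict per attribute definition."""
    tokens = TOKEN.findall(body)
    attrs, i = [], 0
    while i < len(tokens):
        name, declared_type, default = tokens[i:i + 3]
        i += 3
        keyword, value = None, None
        if default.upper() == "#FIXED":
            # Exactly one value token belongs to #FIXED.
            keyword, value = "#FIXED", unquote(tokens[i])
            i += 1
        elif default.upper() in VALUELESS:
            keyword = default.upper()
        else:
            # A plain default value, quoted or not.
            value = unquote(default)
        attrs.append({"name": name, "type": declared_type,
                      "keyword": keyword, "value": value})
    return attrs


body = """
  xmlns  %URI;          #FIXED
         'http://www.w3.org/1999/xhtml'
  dir    (ltr|rtl)      #IMPLIED
  shape  (rect|circle)  "rect"
"""
for attr in parse_attlist_body(body):
    print(attr)
```

    Because the tokenizer treats all white-space alike, the fixed value is found even though it sits on its own line in the sample input.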
    Birnam wood is come to Dunsinane

  4. #4
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I think you might want to buy the SGML spec in order to implement it properly, although it doesn't cover error handling so you have to make that up yourself or reverse engineer other implementations.

    http://www.iso.org/iso/iso_catalogue...csnumber=16387

    I'm curious, though, what's the purpose of the project?
    Simon Pieters

  5. #5
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    > I think there must be. You can't state that an attribute has a fixed value without specifying what that fixed value is.

    i was pretty sure i'd come across fixed with nothing after it but having just searched for it can't find it so i'm not so sure now. i guess you're right.

    great, thanks for all that other helpful info.

    another question about the thing which follows #fixed: is that always in quotes?

    thanks.

  6. #6
    SitePoint Author silver trophybronze trophy

    Join Date
    Nov 2004
    Location
    Ankh-Morpork
    Posts
    12,158
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by johnyboy View Post
    another question about the thing which follows #fixed: is that always in quotes?
    It might perhaps depend on the datatype of the attribute value. My guess is that a string literal is the normal case, since most attributes are of type CDATA anyway.

    You shouldn't have to care about the quotes, though. When your parser (or lexical analyser) encounters a single or double quote it should expect a string literal followed by a terminating quote, and use the literal (without quotes) as the resulting value.
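
    As a rough sketch of that idea (the function name read_literal is invented here, and it assumes the literal contains no entity references that need expanding), the lexer only has to remember which quote opened the literal and scan for the same one:

```python
def read_literal(text, pos):
    """Read a quoted literal starting at text[pos].

    Returns (value_without_quotes, position_after_closing_quote).
    """
    quote = text[pos]
    if quote not in ('"', "'"):
        raise ValueError("expected a quote at position %d" % pos)
    # The literal ends at the next occurrence of the SAME quote character,
    # so the other kind of quote may appear freely inside it.
    end = text.find(quote, pos + 1)
    if end == -1:
        raise ValueError("unterminated literal starting at %d" % pos)
    return text[pos + 1:end], end + 1


value, next_pos = read_literal("'http://www.w3.org/1999/xhtml' >", 0)
print(value)
```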
    Birnam wood is come to Dunsinane

  7. #7
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by zcorpan
    I think you might want to buy the SGML spec in order to implement it properly, although it doesn't cover error handling so you have to make that up yourself or reverse engineer other implementations.

    http://www.iso.org/iso/iso_catalogue...csnumber=16387
    not at that price. i would like something like that though. there's the sgml handbook by goldfarb which is available for not too much 2nd hand. maybe i should get that. google books has it http://books.google.com/books?id=RilvKya0EnwC. that's probably as good as i'm going to get for free. just realised there's various bits of it missing though.

    > I'm curious, though, what's the purpose of the project?

    a cms (and i suppose a kind of ide as well)

    Quote Originally Posted by AutisticCuckoo
    It might perhaps depend on the datatype of the attribute value. My guess is that a string literal is the normal case, since most attributes are of type CDATA anyway.

    You shouldn't have to care about the quotes, though. When your parser (or lexical analyser) encounters a single or double quote it should expect a string literal followed by a terminating quote, and use the literal (without quotes) as the resulting value.
    ok great, thanks.

  8. #8
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by johnyboy View Post
    > I'm curious, though, what's the purpose of the project?

    a cms (and i suppose a kind of ide as well)
    Interesting. Now I just wonder why using an XML parser or HTML5 parser doesn't suit your needs.
    Simon Pieters

  9. #9
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    > Now I just wonder why using an XML parser or HTML5 parser doesn't suit your needs.

    given the doctype and any little snippet of html, say '<ul>', could an XML parser or HTML5 parser say what's possible and/or mandatory next? (and the same kind of thing for attributes as well as elements). if so maybe they'd be good.

    or, stepping back a bit more, given just a doctype can a parser say what all the elements are and (optional + mandatory) attributes of all those elements? and all the other details like possible/mandatory values of attributes.

    a parser must have that kind of information in it, but how easy/possible is it to get that information out of it when there's nothing to parse, before there's anything to parse? or even when there is something to parse?; if the parsers just say "that's illegal" rather than "that's illegal, you should have an 'li' after a 'ul' not a 'p'" then a parser definitely would not be any good.

  10. #10
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Sounds like you want to do validation as well as parsing.

    An XML parser can do DTD validation just like an SGML parser can do DTD validation. However you could also do XML Schema validation or RELAX NG validation after parsing, which provide for more expressiveness than DTD validation.

    You can do XML Schema or RELAX NG validation with an HTML5 parser, too.

    You may want to look into the Validator.nu HTML parser and/or the Validator.nu Web service API:

    http://about.validator.nu/htmlparser/
    http://about.validator.nu/#api
    Simon Pieters

  11. #11
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    > Sounds like you want to do validation as well as parsing.

    i'm parsing a dtd to get a graph (nodes and edges type graph) of what's in a dtd. that graph would allow me to jump in at any point and know what elements are possible and/or mandatory before and after any element, and what attributes and values are possible and/or mandatory for that element -- without having any input (apart from which dtd is being used, and the arbitrary starting point). would an html validator be able to stand in, in place of what a graph of a dtd would allow/give? can you insert 'ul' into a validator and then get back what elements can/must come before and after a ul, and what attributes+values are allowed/mandatory? all the possibilities/options from any particular element?

    thanks.

  12. #12
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    The RELAX NG schema that Validator.nu uses for XHTML 1.0 strict contains:

    ul = element ul { ul.attlist, li+ }
    ul.attlist = Common.attrib
    li = element li { li.attlist, Flow.model }
    li.attlist = Common.attrib
    List.class = ul | ol | dl
    Block.class |= List.class
    -- http://s.validator.nu/xhtml10/list.rnc (which is referenced from http://s.validator.nu/xhtml10/xhtml-strict.rnc )

    ...which is basically the same information as what the DTD provides:
    <!ENTITY % lists "ul | ol | dl">
    ...
    <!ENTITY % block
    "p | %heading; | div | %lists; | %blocktext; | fieldset | table">
    ...
    <!ELEMENT ul (li)+>
    <!ATTLIST ul
    %attrs;
    >
    -- http://www.w3.org/TR/xhtml1/dtds.htm...rict.dtd_lists

    Jing generates messages that just say "foo is not allowed in this context" so Jing is probably not suitable for your usage (though maybe you could hack Jing to generate more useful messages). However you could write your own implementation that parses the RELAX NG schema just like you're writing your own DTD parser.

    Are you just writing the DTD parser or are you also parsing the document instance with the same SGML parsing architecture? For your purpose you might be able to use an off-the-shelf XML or HTML5 parser (without validation) to parse the document instance and build up a tree structure, and then separately parse the schema (DTD or RELAX NG) and then build up the graph.
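
    A very rough sketch of the "parse the schema, then build up the graph" half, for an XML-style DTD like the XHTML one quoted above. Everything here is an assumption for illustration: the names ENTITY, ELEMENT, PE_REF, NAME, expand and child_graph are invented, the regular expressions ignore comments, conditional sections and SGML tag-omission indicators, and the graph only records which child names may appear at all, not their order or whether they are mandatory.

```python
import re

ENTITY = re.compile(r'<!ENTITY\s+%\s+(\S+)\s+"([^"]*)"\s*>')
ELEMENT = re.compile(r'<!ELEMENT\s+(\S+)\s+(.*?)>', re.DOTALL)
PE_REF = re.compile(r'%([\w.-]+);')
NAME = re.compile(r'[A-Za-z][\w.-]*')


def expand(text, entities, max_rounds=20):
    """Replace %name; references until none are left (bounded for safety)."""
    for _ in range(max_rounds):
        new = PE_REF.sub(lambda m: entities.get(m.group(1), ""), text)
        if new == text:
            break
        text = new
    return text


def child_graph(dtd):
    """Map each declared element to the set of element names it may contain."""
    entities = dict(ENTITY.findall(dtd))
    graph = {}
    for name, model in ELEMENT.findall(dtd):
        model = expand(model, entities).replace("#PCDATA", "")
        graph[name] = set(NAME.findall(model)) - {"EMPTY", "ANY"}
    return graph


SAMPLE_DTD = """
<!ENTITY % lists "ul | ol | dl">
<!ENTITY % block "p | div | %lists;">
<!ELEMENT ul (li)+>
<!ELEMENT li (#PCDATA | %block;)*>
<!ELEMENT br EMPTY>
"""

graph = child_graph(SAMPLE_DTD)
for parent in sorted(graph):
    print(parent, "->", sorted(graph[parent]))
```

    To answer "what can or must come next" you would have to keep the content model as a tree (sequence, choice, occurrence indicators) instead of flattening it to a set as this sketch does.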

    (You could discuss this further in #whatwg on freenode with hsivonen and MikeSmith if you like.)
    Simon Pieters

  13. #13
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Simon Pieters

  14. #14
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    oh right that does look helpful. "validation" seemed so after, rather than before, the event/fact if you see what i mean. i shall look into all that. thanks very much.


    > Are you just writing the DTD parser or are you also parsing the document instance with the same SGML parsing architecture?

    just writing a dtd parser to get the graph to be able to know what's possible/mandatory. it hadn't occurred to me the same code which parses the dtd could also be used to parse html. if and when it comes to parsing html i'd probably (assuming i'd already made the graph or got something similar) use the graph plus a bit of code/functionality to go between the being parsed html and the graph in order to parse the html. i've not thought about this aspect much though.

  15. #15
    bronze trophy
    Join Date
    Dec 2004
    Location
    Sweden
    Posts
    2,670
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by johnyboy View Post
    oh right that does look helpful. "validation" seemed so after, rather than before, the event/fact if you see what i mean.
    That's true for RELAX NG and XML Schema, but as specified not actually true for DTDs.

    In SGML, you need to parse the DTD in order to parse the document instance (because the DTD says which tags are inferred, among other things). Thus DTD validation happens during parsing the document instance.

    XML removed syntactical features such as tag inference in order to allow for DTDless parsing. However, it kept DTDs in order to stay compatible with existing SGML parsers.

    Then namespaces in XML were invented, and namespace processing is done on a layer above XML processing and hence after DTD validation. This is why DTDs don't support namespaces. Instead of changing how DTD validation works, the W3C invented a new, better schema language called XML Schema which supports namespaces and happens after namespace processing.

    RELAX NG is an independently produced schema language that also happens after namespace processing.

    An HTML5 parser has namespace processing built-in and has no DTD validation facilities per spec. But one can still do DTD validation after an HTML5 parser by just comparing the qualified names and ignoring the namespace. (With a RELAX NG schema you would compare the local name,namespace pair instead of qualified name.)
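
    A small illustration of the difference between the two comparisons, using Python's xml.etree.ElementTree as a stand-in for whatever parser produces the tree. ElementTree reports tags as "{namespace}local", and for unprefixed elements the qualified name is the same as the local name, which is the assumption made here. The helpers split_tag, not_allowed_by_name and not_allowed_by_pair are hypothetical names for this sketch.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def split_tag(tag):
    """Split ElementTree's '{namespace}local' form into (namespace, local)."""
    if tag.startswith("{"):
        namespace, local = tag[1:].split("}", 1)
        return namespace, local
    return None, tag


def not_allowed_by_name(parent, allowed_names):
    """DTD-style check: compare names only and ignore the namespace."""
    return [split_tag(child.tag)[1] for child in parent
            if split_tag(child.tag)[1] not in allowed_names]


def not_allowed_by_pair(parent, allowed_pairs):
    """RELAX NG-style check: compare (namespace, local name) pairs."""
    return [split_tag(child.tag) for child in parent
            if split_tag(child.tag) not in allowed_pairs]


ul = ET.fromstring('<ul xmlns="%s"><li/><p/></ul>' % XHTML_NS)
print(not_allowed_by_name(ul, {"li"}))
print(not_allowed_by_pair(ul, {(XHTML_NS, "li")}))
```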
    Simon Pieters

  16. #16
    SitePoint Wizard
    Join Date
    Apr 2002
    Posts
    2,322
    Mentioned
    3 Post(s)
    Tagged
    0 Thread(s)
    right, ok, thanks for the info.

