  1. #1
    SitePoint Enthusiast
    Join Date
    Oct 2005
    Posts
    79

    multi-format text parser

    Hi all,

    I'm currently in the middle of a refactor on my application - a PHP message centre. It's a kind of distribution hub for messages from a variety of sources and destinations, which might be in any number of formats. For example, one simple use case might be to take a phpBB forum post as an input, and send the message in an HTML email, with all the formatting and attachments of the original.

    I have already written a tokenizer/lexer to convert BBCode (used by phpBB) into HTML and/or plain text. However, I'm considering supporting other formats such as Wiki formatting, Markdown and similar. This has got me wondering how feasible it would be to make a single parser which can take any format and output any other, rather than separate parsers such as BBCode->HTML, Markdown->BBCode, etc.

    On the face of it, it's a lot of work but I'm not sure how impossibly complex it might be. It means tokenizing the input based on a set of rules for the input format, with each token given a generic rather than a proprietary name, and then reassembling based on rules for the output format.

    I'd like some advice and observations about how I could do this, what the pitfalls might be, and any existing solutions which might help. Since HTMLPurifier seems so stable and successful, I have been wondering in the back of my mind if I might tap into its code somehow and build upon it as a framework. I haven't looked closely enough at the code yet to determine if this might be possible.

    Many thanks for any comments!
    ____________________


    George

  2. #2
    SitePoint Zealot
    Join Date
    Apr 2003
    Location
    Connecticut
    Posts
    173
    I'd do it the same way you explained.

    Say the post looked like this:

    Code:
    [b]Welcome to my website![/b] I have over [i]30[/i] members.
    or
    Code:
    <strong>Welcome to my website!</strong> I have over <em>30</em> members.
    I'd convert it to something like:

    Code:
    [code:strong]Welcome to my website![/code] I have over [code:em]30[/code] members.
    Then I would take this 'standard' code and convert it to any other format, so all we need to write is

    HTML->Standard, BBCode->Standard

    and

    Standard->HTML, Standard->BBCode

    rather than

    HTML->BBCode and BBCode->HTML
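
    The hub-and-spoke idea above can be sketched roughly as follows. This is only an illustration in Python (the application in this thread is PHP), and the generic token names, the mapping tables and the function names are all assumptions made up for the example. It does not check that tags are properly nested.

```python
import re
from html import escape

# Hypothetical generic names shared by every format.
BBCODE_TO_GENERIC = {"b": "strong", "i": "em"}
GENERIC_TO_BBCODE = {"strong": "b", "em": "i"}
GENERIC_TO_HTML = {"strong": "strong", "em": "em"}

TAG_RE = re.compile(r"\[(/?)(\w+)\]")


def bbcode_to_tokens(text):
    """Split BBCode into ('open', name), ('close', name) and ('text', str) tokens."""
    tokens, pos = [], 0
    for match in TAG_RE.finditer(text):
        name = BBCODE_TO_GENERIC.get(match.group(2).lower())
        if name is None:
            continue  # unknown tag: leave it inside the surrounding text run
        if match.start() > pos:
            tokens.append(("text", text[pos:match.start()]))
        tokens.append(("close" if match.group(1) else "open", name))
        pos = match.end()
    if pos < len(text):
        tokens.append(("text", text[pos:]))
    return tokens


def tokens_to_html(tokens):
    """Reassemble generic tokens as HTML, escaping the text runs."""
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(escape(value, quote=False))
        elif kind == "open":
            out.append("<%s>" % GENERIC_TO_HTML[value])
        else:
            out.append("</%s>" % GENERIC_TO_HTML[value])
    return "".join(out)


def tokens_to_bbcode(tokens):
    """Reassemble generic tokens as BBCode."""
    out = []
    for kind, value in tokens:
        if kind == "text":
            out.append(value)
        elif kind == "open":
            out.append("[%s]" % GENERIC_TO_BBCODE[value])
        else:
            out.append("[/%s]" % GENERIC_TO_BBCODE[value])
    return "".join(out)
```

    Adding a new format then means one reader into tokens and one writer out of tokens, instead of a converter for every pair of formats.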

  3. #3
    SitePoint Wizard stereofrog
    Join Date
    Apr 2004
    Location
    germany
    Posts
    4,324
    I'd use XML or even XHTML as the internal format, so that you can use XSL to generate everything you want from it.

  4. #4
    SitePoint Enthusiast
    Join Date
    Oct 2005
    Posts
    79
    Aah - now XSL is something I've still not looked into. It would certainly be interesting to learn a new technique at the same time.
    ____________________


    George

  5. #5
    SitePoint Enthusiast
    Join Date
    Oct 2005
    Posts
    79
    I'm just doing a bit of research into XSL for the first time. It's a fascinating subject!

    Would I need to write a DTD for my XML format, or is that not necessary?

    I'm beginning to understand how to do the XML to output format transformation using XSL. However, before I jump in I need to be sure I will be able to write converters from the various input formats into XML. Markdown and BBCode aren't such a problem, but how does one convert HTML (from an untrusted source) into valid XML? Would you recommend I run it through something like HTMLPurifier first? Then I could set up a whitelist of tags to allow through.
    ____________________


    George

  6. #6
    SitePoint Wizard Cups
    Join Date
    Oct 2006
    Location
    France, deep rural.
    Posts
    6,869
    but how does one convert HTML (from an untrusted source) into valid XML
    That sounds like a job for HTML Tidy, or the Tidy extension as it's known: tidy_clean_repair(). You didn't mention it explicitly, so I thought I would.
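
    To make the "untrusted HTML in, well-formed XML out" step concrete, here is a rough sketch of the whitelist idea in Python. It is not Tidy's or HTMLPurifier's API and is no substitute for either; the whitelist, the class and function names and the wrapping div root are all assumptions for illustration. It drops every attribute, discards script and style content, and closes any tags left open.

```python
from html import escape
from html.parser import HTMLParser

ALLOWED = {"p", "strong", "em"}      # hypothetical whitelist of tags to keep
DROP_CONTENT = {"script", "style"}   # tags whose contents are discarded too


class WhitelistFilter(HTMLParser):
    """Collects whitelisted tags and escaped text from tag-soup HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []   # whitelisted tags currently open
        self.skip_depth = 0   # greater than zero while inside script/style

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            self.open_tags.append(tag)
            self.out.append("<%s>" % tag)  # all attributes are dropped

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in self.open_tags and not self.skip_depth:
            # Close anything still open inside this element first.
            while True:
                top = self.open_tags.pop()
                self.out.append("</%s>" % top)
                if top == tag:
                    break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def html_to_xhtml(source):
    """Return a well-formed fragment wrapped in a single <div> root."""
    parser = WhitelistFilter()
    parser.feed(source)
    parser.close()
    closing = "".join("</%s>" % tag for tag in reversed(parser.open_tags))
    return "<div>%s%s</div>" % ("".join(parser.out), closing)
```

    The output can be handed straight to an XML parser, which is the property the interchange format needs. A real deployment would still want a maintained sanitiser in front of it.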

  7. #7
    SitePoint Enthusiast
    Join Date
    Oct 2005
    Posts
    79
    OK, so I've been doing some research on this tonight. Here's where I've got to so far:

    - It seems like a flavour of XML is the obvious interchange format. In the early stages, I must support BBCode, HTML, Markdown and plain text, and XHTML seems to be the natural superset.

    - I need to convert incoming messages into XML. For both BBCode and Markdown I would need to write or adapt a parser; with HTML, I imagine I could use HTMLPurifier to ensure it's valid XML and to remove malicious code.

    - The XML interchange format would effectively be XHTML with a limited whitelist of tags - similar to the standard phpBB set of BBCode tags. My users won't need more complex tags, so I will set HTMLPurifier to strip these from incoming HTML messages.

    - I have been looking into XSL for the first time, and it seems that if I define a standardised XML format as above, I can use a separate XSL template for each of the output formats I need. The interchange XML should be valid XHTML, so that needs no further processing.

    - The question now for me, before I get cracking, is about the parsers for BBCode -> XML and Markdown -> XML. I've done a bit of research and found an array of information, including an old article by Harry Fuecks which recommends the PEAR Text_Wiki library. I'm keen to find a solution which is future-proof, so want to use a Parser/Lexer which can be extended for each input format I want to support. I'd like to avoid using a separate solution for each format!

    - Is it crazy to be using a Lexer to split a BBCode string into tokens, parse them and reassemble into the XML interchange format, only to pass the XML to an XSL template to be transformed into another format? That seems like two costly operations where there should be only one. Should I instead use a Lexer to get an array of tokens and use that as the interchange format? Then I lose the power of XSL, which seems a shame.
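
    For a sense of what the second step involves, here is a small sketch of the output side in Python. It uses a plain recursive tree walk over the XHTML-style interchange document rather than XSL, so it is a stand-in for the stylesheet approach, not an example of it. The tag mapping and function names are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from interchange (XHTML) tags to BBCode tags.
BBCODE_TAGS = {"strong": "b", "em": "i"}


def render(node, tags):
    """Emit the text of node, wrapping mapped elements in [tag]...[/tag]."""
    parts = [node.text or ""]
    for child in node:
        parts.append(render(child, tags))
        parts.append(child.tail or "")
    inner = "".join(parts)
    name = tags.get(node.tag)
    if name is None:
        return inner  # unmapped element: keep its text, drop the markup
    return "[%s]%s[/%s]" % (name, inner, name)


def xhtml_to_bbcode(xml):
    return render(ET.fromstring(xml), BBCODE_TAGS)


def xhtml_to_text(xml):
    # An empty mapping strips all markup, giving plain text.
    return render(ET.fromstring(xml), {})
```

    Each output format is just a different mapping over the same parsed tree, which is roughly the role one XSL template per format would play.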
    ____________________


    George

  8. #8
    SitePoint Member
    Join Date
    Dec 2007
    Location
    Annecy, FRANCE
    Posts
    1
    You could have a look at the DokuWiki parser:
    http://wiki.splitbrain.org/wikiarser

    Harry Fuecks has done some work on it.

    It's based on the SimpleTest parser: http://www.phppatterns.com/docs/deve...st_lexer_notes
    http://www.sitepoint.com/blogs/2005/...d-performance/

  9. #9
    SitePoint Enthusiast
    Join Date
    Oct 2005
    Posts
    79
    Looks interesting. Thanks 2mx!
    ____________________


    George

