  1. #1
    <? echo "Kick me"; ?> petesmc's Avatar
    Join Date
    Nov 2000
    Location
    Hong Kong
    Posts
    1,508

    IMG bbcode regex

    Hi,

    I'm trying to write a regular expression to parse an img BBCode tag. I want it to handle this type of syntax:

    [img width=123 height=432 alt="asdfa"]img.gif[/img]

    The whitespace shouldn't matter, and the alt attribute can contain anything except " characters. So far I have:

    PHP Code:
    $text = preg_replace("/\[img( width=[\"]?([0-9]*?)[\"]?)?( height=[\"]?([0-9]*?)[\"]?)?( alt=\"([a-zA-Z0-9]*?)\")?\](.*?)\[\/img\]/eis", 'convertImage(\'$7\', \'$2\', \'$4\', \'$6\')', $text);
    This works, but if I change the order of width and height in the tag, it no longer matches. I've also had to limit alt to alphanumeric characters: if I instead try to exclude only " characters between the alt's quotes, the match ends at the closest quote, not the right one.

    The main problems are the whitespace and the reordering of the attributes. Any ideas how to correct this?

    -Peter

  2. #2
    SitePoint Wizard Young Twig's Avatar
    Join Date
    Dec 2003
    Location
    Albany, New York
    Posts
    1,355
    I don't know what convertImage is/does, but perhaps this for the regex?
    Code:
     
    /\[img([\s]+(width|height|alt)=\"?([a-zA-Z0-9]+)\"?)?([\s]+(width|height|alt)=\"?([a-zA-Z0-9]+)\"?)?([\s]+(width|height|alt)=\"?([a-zA-Z0-9]+)\"?)?\](.*?)\[\/img\]/eis
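
    Another way to sidestep the ordering problem is to split the work into two passes: one pattern captures the whole attribute string, and a second pattern pulls out the name=value pairs in whatever order they appear. The sketch below assumes PHP 7.1 or later, where the /e modifier no longer exists and preg_replace_callback takes its place. The convertImage here is a hypothetical stand-in (the real one isn't shown in this thread), and parseImgTags is a made-up name for illustration.

    ```php
    <?php
    // Hypothetical stand-in for convertImage(); the thread never shows the real one.
    function convertImage(string $src, string $width, string $height, string $alt): string
    {
        $html = '<img src="' . htmlspecialchars($src, ENT_QUOTES) . '"';
        if ($width !== '') {
            $html .= ' width="' . $width . '"';
        }
        if ($height !== '') {
            $html .= ' height="' . $height . '"';
        }
        return $html . ' alt="' . htmlspecialchars($alt, ENT_QUOTES) . '">';
    }

    // Illustrative name; converts every [img ...]url[/img] tag found in $text.
    function parseImgTags(string $text): string
    {
        // Pass 1: capture the raw attribute string ($m[1]) and the URL ($m[2]).
        // Quoted values may contain anything except a double quote.
        $tag = '/\[img((?:\s+\w+=(?:"[^"]*"|\d+))*)\s*\](.*?)\[\/img\]/is';

        return preg_replace_callback($tag, function (array $m): string {
            $attrs = ['width' => '', 'height' => '', 'alt' => ''];

            // Pass 2: collect name=value pairs regardless of their order.
            preg_match_all('/(\w+)=(?:"([^"]*)"|(\d+))/', $m[1], $pairs, PREG_SET_ORDER);
            foreach ($pairs as $p) {
                $name  = strtolower($p[1]);
                $value = ($p[2] !== '') ? $p[2] : ($p[3] ?? '');
                if (!array_key_exists($name, $attrs)) {
                    continue; // ignore unknown attributes
                }
                if ($name !== 'alt' && !ctype_digit($value)) {
                    continue; // width and height must be plain digits
                }
                $attrs[$name] = $value;
            }

            return convertImage($m[2], $attrs['width'], $attrs['height'], $attrs['alt']);
        }, $text);
    }
    ```

    Because the second pass looks attributes up by name, `[img alt="a b" height=2 width=1]x.png[/img]` is handled the same way as the width-first form, and the alt text is no longer limited to alphanumeric characters.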

  3. #3
    Resident Java Hater
    Join Date
    Jul 2004
    Location
    Gerodieville Central, UK
    Posts
    446
    There was a similar post about using regex to match nested <div> tags.

    For any "complex" text format like BBCode, XML, or CSS, I would use a lexer. Regular expressions get messy and hard to test, because you need many different strings as test cases. I assume you are also using regexps to match other BBCode tags, in which case a lexer is cleaner. Its advantage is that parsing is state-based, so you can apply a different set of regexp rules in each state. BBCode is clearly state-dependent: inside [code] / [php] blocks you don't want tags like [b] / [u] to be parsed.

    Regular expressions are great for matching text patterns, but on their own they don't cope with different parsing states.

    Young Twig appears to have a solution for your scenario, but you'll find that regular expressions on their own won't generalise, assuming you are parsing other BBCode tags.
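
    To make the state idea concrete, here is a minimal sketch of a two-state tokenizer, assuming PHP 7.1 or later. It is not the SimpleTest lexer or any existing library; lexBBCode and its token format are invented for illustration. In the 'text' state it recognises [b], [u] and [code]; in the 'code' state the only rule is the closing [/code], so anything else inside the block stays plain text.

    ```php
    <?php
    // Hypothetical two-state BBCode tokenizer; returns a list of [type, value] pairs.
    function lexBBCode(string $text): array
    {
        // One rule set per state. The 'code' state only knows its own closing tag.
        $patterns = [
            'text' => '/\[(\/?)(b|u|code)\]/i',
            'code' => '/\[(\/)(code)\]/i',
        ];

        $tokens = [];
        $state  = 'text';
        $pos    = 0;
        $len    = strlen($text);

        while ($pos < $len) {
            // Search from the current position using the active state's rules.
            if (!preg_match($patterns[$state], $text, $m, PREG_OFFSET_CAPTURE, $pos)) {
                $tokens[] = ['text', substr($text, $pos)]; // no more tags
                break;
            }

            [$tag, $start] = $m[0];
            if ($start > $pos) {
                // Everything before the tag is plain text in this state.
                $tokens[] = ['text', substr($text, $pos, $start - $pos)];
            }

            $closing  = ($m[1][0] === '/');
            $name     = strtolower($m[2][0]);
            $tokens[] = [$closing ? 'close' : 'open', $name];

            // Entering or leaving a [code] block switches the rule set.
            if ($name === 'code') {
                $state = $closing ? 'text' : 'code';
            }
            $pos = $start + strlen($tag);
        }

        return $tokens;
    }
    ```

    For the input `[b]hi[/b] [code][b]raw[/b][/code]`, the inner `[b]raw[/b]` comes back as a single text token, which is the behaviour a single stateless regex struggles to express.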

  4. #4
    <? echo "Kick me"; ?> petesmc's Avatar
    Join Date
    Nov 2000
    Location
    Hong Kong
    Posts
    1,508
    I haven't had time to try out the regex that you provided, but it looks as if it should work. I'll get back to you on that.

    On the topic of lexers, I've googled around a bit and found some useful information, but would it be overkill to implement a lexer for a CMS? Has anyone actually coded a lexer in PHP here? As I see it, a lexer parses the text, builds a tree of the tags found in the document, and then applies rules to those tags?

    I'm actually thinking this would be a nice project to try, if I can get more information on lexer design, and specifically on how to parse tag attributes.

  5. #5
    SitePoint Addict been's Avatar
    Join Date
    May 2002
    Location
    Gent, Belgium
    Posts
    284
    Quote Originally Posted by petesmc
    Has anyone actually coded a lexer in PHP here?
    Marcus (lastcraft) did, it's in the SimpleTest toolkit.
    Per
    Everything
    works on a PowerPoint slide

  6. #6
    Resident Java Hater
    Join Date
    Jul 2004
    Location
    Gerodieville Central, UK
    Posts
    446
    A lexer is a bit of overkill, but it's the most reusable way to solve the problem. I use one because the same lexer class serves many other jobs. If you need to parse HTML or other complex text formats, the lexer will pay off and be much better than overly complex regexps that are hard for others to understand.

    I use an adapted version of the SimpleTest lexer. The original lexer Marcus made doesn't support sub-patterns in the rules, so I made a small hack to get round this. The lexer Marcus wrote is very compact (14k). It's not super fast, but you can often cache the parsed results to avoid re-parsing. BBCode should parse quickly, as it has very simple rules and a ***VERY*** low token density.
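
    The caching idea might look roughly like the sketch below, assuming PHP 7 or later and a writable cache directory. cachedParse is a hypothetical helper, not part of SimpleTest, and $parser stands for whatever callable turns BBCode into HTML. The cache key is a hash of the input, so an edited post simply gets a new entry.

    ```php
    <?php
    // Hypothetical file cache for parsed BBCode, keyed by a hash of the source text.
    function cachedParse(string $bbcode, callable $parser, string $cacheDir): string
    {
        $file = $cacheDir . '/' . md5($bbcode) . '.html';

        // Reuse an earlier result when one exists and can be read.
        if (is_file($file)) {
            $cached = file_get_contents($file);
            if ($cached !== false) {
                return $cached;
            }
        }

        // Otherwise parse once and store the output for next time.
        $html = $parser($bbcode);
        file_put_contents($file, $html);

        return $html;
    }
    ```

    Stale files are never cleaned up in this sketch, so a real setup would need some expiry or pruning policy on top.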

