  1. #1
    SitePoint Member
    Join Date
    Aug 2005
    Posts
    23
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Validating HTML line by line

    I'm writing a validation script for a small CMS I'm working on. The CMS uses special tags to mark where editable content will go; the content itself is stored in a database. I'm trying to decide on the best way to validate the attributes in these special tags.

    Now I'm wondering about the best way to go about it. Should I validate line by line in a loop, running a regex on each line, or should I do it in a more convoluted way using PHP's string functions? Do you reckon regex would be very detrimental to the script's performance compared with doing lots of string manipulation with string functions? I'm basically trying to work out a good way of doing this and I'm not sure which route to go down. Does anyone have any ideas? The HTML in each page could potentially be quite long (100+ lines). I want to do it line by line so I can report the line numbers that contain errors, making it easy for someone to find and fix them before saving.
    Last edited by sturreal; Sep 20, 2007 at 14:54.

  2. #2
    SitePoint Wizard
    Join Date
    Dec 2003
    Location
    USA
    Posts
    2,582
    Mentioned
    29 Post(s)
    Tagged
    0 Thread(s)
    Even with several hundred lines, I don't think it would be difficult for PHP to parse it using the string functions. That is the route I would go, primarily because I am more familiar with them than I am with regex. I don't think either approach would have a noticeable advantage or disadvantage as far as performance is concerned.
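
    A minimal sketch of the string-function route, for illustration only. The `<cms:` tag prefix and the single rule checked here (a tag must be closed on the line where it opens) are assumptions, since the thread never shows the actual tag syntax.

```php
<?php
// Hypothetical tag prefix "<cms:"; the thread does not show the real syntax.

/**
 * Returns the 1-based numbers of lines where a "<cms:" tag
 * is opened but not closed with ">" on the same line.
 */
function findUnclosedTagLines(string $html): array
{
    $badLines = [];
    foreach (explode("\n", $html) as $index => $line) {
        $pos = 0;
        // Walk through every "<cms:" occurrence on this line.
        while (($start = strpos($line, '<cms:', $pos)) !== false) {
            $end = strpos($line, '>', $start);
            if ($end === false) {
                $badLines[] = $index + 1; // explode() keys are 0-based
                break;
            }
            $pos = $end + 1;
        }
    }
    return $badLines;
}
```

    Because the loop already walks the input line by line, the line number comes for free from the array index.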

  3. #3
    SitePoint Zealot Bill Palmer's Avatar
    Join Date
    Oct 2005
    Location
    London, UK
    Posts
    148
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    In all the tests I've done, preg_* functions are only slightly faster than their str_* equivalents (e.g. preg_replace vs str_replace), but there is a noticeable improvement with preg_* when processing a LOT of content.

    I recommend taking a look at how an open source forum validates the bbcode tags it uses. If I recall correctly, they simply make sure that there are an even number of each tag.
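
    The counting idea could look roughly like this. It is a sketch rather than how any particular forum package does it; the function name and the tag names passed in are made up for the example.

```php
<?php
// Illustrative only: real forum software may check far more than tag counts.

/**
 * Returns the tags whose opening and closing counts differ,
 * e.g. ['i' => ['open' => 1, 'close' => 0]].
 */
function unbalancedTags(string $text, array $tagNames): array
{
    $lower = strtolower($text); // treat [B] and [b] alike
    $unbalanced = [];
    foreach ($tagNames as $name) {
        $opens  = substr_count($lower, '[' . $name . ']');
        $closes = substr_count($lower, '[/' . $name . ']');
        if ($opens !== $closes) {
            $unbalanced[$name] = ['open' => $opens, 'close' => $closes];
        }
    }
    return $unbalanced;
}
```

    Note that matching counts do not prove correct nesting or order; `[/b]text[b]` would pass this check.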

  4. #4
    . shoooo... silver trophy logic_earth's Avatar
    Join Date
    Oct 2005
    Location
    CA
    Posts
    9,013
    Mentioned
    8 Post(s)
    Tagged
    0 Thread(s)
    Why loop? Why not give the preg_* functions the full thing? There's no need to break it down into lines.
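
    One way to run a regex over the whole document and still report line numbers is to capture match offsets and count the newlines before each match. The sketch below assumes a hypothetical `<cms:content ...>` tag with allowed attributes `name` and `type`; none of these names come from the thread.

```php
<?php
// Hypothetical syntax: <cms:content name="..." type="...">.
// The tag name and the allowed attribute list are assumptions.

/**
 * Returns a list of [line number, message] pairs for invalid tags.
 */
function findTagErrors(string $html): array
{
    $allowed = ['name', 'type'];
    $errors = [];

    // Single pass over the full text; PREG_OFFSET_CAPTURE adds byte offsets.
    preg_match_all(
        '/<cms:content\b([^>]*)>/',
        $html,
        $tags,
        PREG_SET_ORDER | PREG_OFFSET_CAPTURE
    );

    foreach ($tags as $tag) {
        $offset   = $tag[0][1]; // where the whole tag starts
        $attrText = $tag[1][0]; // everything between the tag name and ">"
        // Line number = newlines before the tag, plus one.
        $line = substr_count(substr($html, 0, $offset), "\n") + 1;

        preg_match_all('/([\w-]+)\s*=\s*"[^"]*"/', $attrText, $attrs);
        foreach ($attrs[1] as $name) {
            if (!in_array($name, $allowed, true)) {
                $errors[] = [$line, "unknown attribute \"$name\""];
            }
        }
        if (!in_array('name', $attrs[1], true)) {
            $errors[] = [$line, 'missing required attribute "name"'];
        }
    }
    return $errors;
}
```

    This keeps the "which line is broken" report the original poster wanted without splitting the input first.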
    Logic without the fatal effects.
    All code snippets are licensed under WTFPL.


  5. #5
    SitePoint Wizard silver trophy kyberfabrikken's Avatar
    Join Date
    Jun 2004
    Location
    Copenhagen, Denmark
    Posts
    6,157
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Regular expressions aren't really suitable for validating HTML. This is because a tag may have different rules depending on its context (its parent element). You should use something like RelaxNG or a DTD instead.

  6. #6
    SitePoint Member
    Join Date
    Aug 2005
    Posts
    23
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by kyberfabrikken View Post
    Regular expressions aren't really suitable for validating HTML. This is because a tag may have different rules depending on its context (its parent element). You should use something like RelaxNG or a DTD instead.
    That's not so much of an issue, as they're special tags. They look like HTML tags but they're not, so parent elements aren't a problem. When the page is rendered, those tags are replaced by content.

  7. #7
    Non-Member
    Join Date
    Jan 2003
    Posts
    5,748
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Kyber's point is about proper, well-formed and valid markup, and I would agree that this would be the best route to take. You are free to create your own DTD and control exactly which rules go into it to govern the validation you require.
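
    As a rough sketch of the DTD route in PHP: the DOM extension can validate a document against its declared DTD, and libxml's error list carries a line number for each problem. The `<page>`/`<region>` vocabulary below is invented for the example, and the approach assumes the template can be parsed as well-formed XML, which plain HTML often cannot.

```php
<?php
// Sketch only: the <page>/<region> vocabulary and its DTD are invented here.
// Requires PHP's DOM extension.

/**
 * Returns messages such as "line 11: ..." for parse and validity errors.
 */
function validateAgainstDtd(string $xml): array
{
    // Collect libxml errors instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();

    $doc = new DOMDocument();
    if ($doc->loadXML($xml)) {
        $doc->validate(); // checks the document against its declared DTD
    }

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf('line %d: %s', $error->line, trim($error->message));
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

$xml = <<<'XML'
<!DOCTYPE page [
  <!ELEMENT page (region+)>
  <!ELEMENT region EMPTY>
  <!ATTLIST region
      name CDATA #REQUIRED
      type (text|image) "text">
]>
<page>
  <region name="header"/>
  <region type="video"/>
</page>
XML;

foreach (validateAgainstDtd($xml) as $message) {
    echo $message, PHP_EOL;
}
```

    The second `<region>` should be reported because it lacks the required `name` attribute and uses a `type` value outside the enumerated set; the exact wording of the messages comes from libxml.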

  8. #8
    SitePoint Member
    Join Date
    Aug 2005
    Posts
    23
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by Dr Livingston View Post
    Kyber's point is about proper, well-formed and valid markup, and I would agree that this would be the best route to take. You are free to create your own DTD and control exactly which rules go into it to govern the validation you require.
    Hmm, actually that's not too bad an idea. I'll look into how to create one of those. I doubt for a minute the W3C makes this easy, but here goes.

