  1. #1
    SitePoint Zealot imagize's Avatar
    Join Date
    Oct 2004
    Location
    Australia
    Posts
    197
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Article Intro Parser

    Hello,

    I have articles which I am extracting from a database. These articles may contain BBCode and HTML.

    On a certain page I wish to show only the first 200 characters of each article, followed by a "read more" link.

    No issues so far.

    The problem is that if the 200-character cut-off falls in the middle of an HTML tag, the output is left with a broken tag.

    Now I know the solution will use regular expressions; I'm just having trouble deciding what to look for. Basically, if the 200-character string ends like this:

    hello world <img src="http://

    I want to use regex to end the string at "hello world" instead. To make things more complex, I want this to work for any HTML tag, not just an image.

    Also, the end of the string could be something like this:

    <b>this is bold text

    Here the <b> has no closing tag, so I want to be able to append the missing closing tags to the end of the string.

    What I was thinking of doing is stripping all HTML from the entire text with strip_tags before taking the substring. This would solve the problem of partial tags, but it would also remove any legitimate tags. It could also put a heavy load on the server for a large body of text, so hopefully there is a better solution.
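
    For reference, the strip_tags idea described above could be sketched roughly as follows. The function name plain_intro and the sample string are made up for illustration, and the sketch assumes single-byte text (multibyte text would need mb_substr instead of substr). Note that strip_tags only removes HTML, so any BBCode would be left in place.

    ```php
    <?php
    // Hypothetical helper: returns a plain-text intro with all HTML discarded.
    function plain_intro(string $html, int $max = 200): string
    {
        // Remove every HTML tag first, so the cut can never land inside one.
        $text = trim(strip_tags($html));
        if (strlen($text) <= $max) {
            return $text;
        }
        // Assumes single-byte characters; substr counts bytes.
        return substr($text, 0, $max);
    }

    echo plain_intro('<b>Hello</b> world, this is <i>an article</i>.', 11); // Hello world
    ```

    The trade-off is the one already mentioned: bold, italics and links in the intro are lost along with the broken fragments.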

    Thanks

  2. #2
    Programming Team silver trophybronze trophy
    Mittineague's Avatar
    Join Date
    Jul 2005
    Location
    West Springfield, Massachusetts
    Posts
    17,290
    Mentioned
    198 Post(s)
    Tagged
    3 Thread(s)

    tags

    On my "wildflowers" feed, I use regex to see if the string cut-off is inside a tag and, if so, preg_replace to remove the fragment. Then I use preg_match_all to get the opening and closing b and i tags and count the resulting arrays. If the counts don't match, I add the closing tag to the string.
    You could do something similar, although you'd have to account for more tag pairs.
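
    This is not the actual feed code, only a minimal sketch of the approach described above. The function name regex_intro, the patterns and the tag list are assumptions, and it uses the match count returned by preg_match_all rather than counting the match arrays. It assumes single-byte text and does not track nesting order, so the closing tags are appended in a fixed order.

    ```php
    <?php
    // Hypothetical helper sketching the regex approach; names are illustrative.
    function regex_intro(string $html, int $max = 200): string
    {
        $cut = substr($html, 0, $max);

        // Drop a trailing tag fragment: a "<" with no ">" after it.
        $cut = preg_replace('/<[^>]*$/', '', $cut);

        // Compare opening and closing counts for each tag pair we care about.
        foreach (['i', 'b'] as $tag) {
            $opened = preg_match_all('/<' . $tag . '\b[^>]*>/i', $cut);
            $closed = preg_match_all('/<\/' . $tag . '>/i', $cut);
            if ($opened > $closed) {
                // Append one closing tag per unmatched opening tag.
                $cut .= str_repeat("</$tag>", $opened - $closed);
            }
        }
        return $cut;
    }

    echo regex_intro('<b>this is bold text'); // <b>this is bold text</b>
    ```

    Supporting more tag pairs would mean extending the list in the foreach loop.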

  3. #3
    SitePoint Zealot imagize's Avatar
    Thank you, that is a good solution. Would you be willing to share your code?

  4. #4
    Programming Team silver trophybronze trophy
    Mittineague's Avatar

    tag matching

    Actually, now that I've been thinking about it, maybe my code needs to be changed. The reason I use regex is that I've been doing this:
    1. truncate string at max length
    2. add "....."
    3. determine if string was cut mid-tag, if so remove fragment
    4. balance tags
    I think it would be easier and more efficient, and allow the appending of a "more" link, if I did things differently:
    1. use strrpos to find position of last "<" before offset of max length
    2. use strrpos to find position of last ">" before offset of max length
    3. if the index of ">" is before the index of "<", the cut was inside a tag, so substr to the index of the "<"; otherwise it was not inside a tag (maybe mid-word though; if that matters, substr to the index of the last space)
    4. substr_count the tags (less resource use than regex)
    5. balance tags if need to
    6. then add ".... more"
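
    The six steps above could be sketched along these lines. This is an untested-in-production illustration, not the feed's code: the function name strpos_intro, the default tag list and the mid-word handling are assumptions. It truncates first and then calls strrpos on the truncated text, counts only plain lowercase tags without attributes (such as <b> and <i>), and assumes single-byte text.

    ```php
    <?php
    // Hypothetical helper; the name, defaults and tag list are illustrative.
    function strpos_intro(string $html, int $max = 200, array $tags = ['i', 'b']): string
    {
        if (strlen($html) <= $max) {
            return $html; // short enough: nothing to cut, no "more" needed
        }
        $cut = substr($html, 0, $max);

        // Steps 1-3: locate the last "<" and ">" inside the truncated text.
        $lt = strrpos($cut, '<');
        $gt = strrpos($cut, '>');
        if ($lt !== false && ($gt === false || $gt < $lt)) {
            // The cut landed inside a tag, so drop the fragment.
            $cut = substr($cut, 0, $lt);
        } elseif ($html[$max] !== ' ') {
            // The cut may be mid-word: back up to the last space, but only
            // if that space lies outside any tag.
            $space = strrpos($cut, ' ');
            if ($space !== false && ($gt === false || $space > $gt)) {
                $cut = substr($cut, 0, $space);
            }
        }
        $cut = rtrim($cut);

        // Steps 4-5: count plain tags with substr_count and close the extras.
        foreach ($tags as $tag) {
            $missing = substr_count($cut, "<$tag>") - substr_count($cut, "</$tag>");
            if ($missing > 0) {
                $cut .= str_repeat("</$tag>", $missing);
            }
        }

        // Step 6: append the trailer.
        return $cut . '.... more';
    }

    echo strpos_intro('<b>this is bold text that runs on', 22);
    // <b>this is bold text</b>.... more
    ```

    Because substr_count matches literal strings, tags with attributes or uppercase names would be missed; those cases would still need a regex or a real HTML parser.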

