  1. #1
    SitePoint Enthusiast
    Join Date
    Nov 2007
    Posts
    54

    How do I prevent backtracking in this regular expression?

    Hi there Perl users!

    I am trying to do some parsing with PHP but figure you Perl coders are probably more adept at regular expressions than my PHP compatriots so...I am asking here.

    Here's the problem....

    The test data is as follows:

    </span> <a href="/someurl_here">Learn more</a> </div> <div id="pocs1"> Hi there. </div> <div id="pocs2">Press Enter.</div> </div> <div id="pets" style="color:#767676;display:none;font-size:9pt;margin:5px 0 0 8px">Press Enter.</div> </td> </tr> </table> </div> </div> </form> </div> <div id="asdfasdfsrchdsc"> </div> <div id="asdfsdb"> </div> <a href="http://www.domain.com/clubsinfo/cheese/cheeses_2/monthly_products.asp?itemid=30005&amp;year=2009" class=l onmousedown="return rwt(this"><div id="nossln"></div> <div id="subform_ctrl"> </div> </div> <div id="holiday"> </div> <div id="appbar"> <div id="ab_name"><span></span></div> <div><div id=asdf>Page 8 <nobr> (0.18 seconds)&nbsp;</nobr></div></div> <ol id="ab_ctls"><li class="ab_ctl" id="ab_ctl_ss"><div'
    Just a bunch of gibberish. But within it you will notice there are two links.

    <a href="/someurl_here"

    and

    <a href="http://www.domain.com/clubsinfo/cheese/cheeses_2/monthly_products.asp?itemid=30005&amp;year=2009"

    What I want to do is capture ONLY the last link inside the quotation marks.

    When I use the regular expression...

    <a href="(.*?)"\sclass=l

    I end up capturing everything from the first <a through to class=l, which includes both links.

    How do I prevent my regular expression from backtracking to the first <a?

    I have been beating my head against this for hours and have tried all kinds of ?!, ?=, ?>, ?<, and all manner of stuff and none of it works.

    Would really appreciate any insight or tips you all could give me. Thanks!

    Carlos

  2. #2
    SitePoint Zealot
    Join Date
    Apr 2005
    Location
    London
    Posts
    163
    hi Carlos,

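    You can change the '.' in your regexp, which means any character,
    to [^"], which means any character except a double quote: <a href="([^"]*)"\sclass=l
    The capture then cannot run past the closing quote of an href, so the match can only start at the link that is followed by class=l.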

    Jurn
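
To see the difference between the two patterns side by side, here is a minimal sketch. It uses Python's `re` module rather than PHP, on the assumption that both engines treat these two patterns the same way. The `html` string is a shortened stand-in for the test data in the question, and the names `lazy` and `strict` are only illustrative.

```python
import re

# Shortened stand-in for the test data in the question (trimmed for readability).
html = (
    '</span> <a href="/someurl_here">Learn more</a> </div> '
    '<div id="asdfsdb"> </div> '
    '<a href="http://www.domain.com/clubsinfo/cheese/cheeses_2/monthly_products.asp'
    '?itemid=30005&amp;year=2009" class=l onmousedown="return rwt(this">'
)

# Original pattern: the lazy .*? is still allowed to cross double quotes, so the
# match starts at the first <a href=" and stretches to the quote before class=l.
lazy = re.search(r'<a href="(.*?)"\sclass=l', html)

# Suggested pattern: [^"]* cannot cross a double quote, so the attempt at the
# first link fails and the engine moves on to the second link.
strict = re.search(r'<a href="([^"]*)"\sclass=l', html)

print(lazy.group(1).startswith('/someurl_here'))  # capture begins at the first link
print(strict.group(1))                            # capture is only the second URL
```

In PHP the same pattern string should work with `preg_match` once it is wrapped in delimiters, though that has not been tested here.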

  3. #3
    SitePoint Enthusiast
    Join Date
    Nov 2007
    Posts
    54
    Thanks very, very much! That did it! Much simpler than the gibberish one finds when searching the Internet for ways to extract links from HTML.

    Carlos

