Thread: regex help

  1. #1
    SitePoint Enthusiast
    Join Date
    Jan 2007
    Posts
    32

    regex help

    I am trying to extract hyperlinks and their anchor text from a page. For example:
    <A href="google.com" >Text</a>

    Any help extracting them in "google.com | Text" format, please?

  2. #2
    SitePoint Member
    Join Date
    Oct 2009
    Posts
    16
    Try this one:

    <a href="(.*?)">(.*?)</a>

    Then you can use $1 to get "google.com" and $2 to get "Text".
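
A minimal sketch of how that pattern could be applied in PHP, assuming a simplified sample (lowercase tag, href as the only attribute, no space before the closing bracket, everything on one line). The variable names are illustrative only. With preg_match(), the groups referred to as $1 and $2 end up in $matches[1] and $matches[2]:

```php
<?php
// Simplified sample: the pattern is literal, so "<A" or a space before ">" would not match.
$html = '<a href="google.com">Text</a>';

// '~' is used as the delimiter so the slash in </a> needs no escaping.
if (preg_match('~<a href="(.*?)">(.*?)</a>~', $html, $matches)) {
    // $matches[1] corresponds to $1 (the href), $matches[2] to $2 (the anchor text).
    $line = $matches[1] . ' | ' . $matches[2];
    echo $line . "\n";
}
```

Note that the sample in the first post uses an uppercase tag and a space before the closing bracket, so it would need a more tolerant pattern than this one.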

  3. #3
    SitePoint Enthusiast
    Join Date
    Jan 2007
    Posts
    32
    Not working. I am only getting $1.

    sample:

    <a href="/A-Better-Switchboard-for-MS-Access/3000-10254_4-10499883.html">

    A Better Switchboard for MS Access 2.1
    </a>

    preg_match_all('/<a href="(.*?)">(.*?)<\/a\>/', $result['EXE'], $matches);

  4. #4
    SitePoint Enthusiast nrg_alpha
    Join Date
    Dec 2008
    Posts
    81
    As you can see, there are pitfalls involved when using regex to parse HTML. By default the dot does not match newlines, so whenever a match has to span several lines you can add the 's' (dot-all) modifier after the closing delimiter. And if there is a chance of anchor tags using uppercase characters, you can also add the 'i' modifier to make the pattern case-insensitive. Additionally, what happens if the anchor tag contains more than the href attribute? Patterns like the one above will not catch it.
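
As a rough sketch of those points (not a robust HTML parser), the pattern below adds the 's' and 'i' modifiers and uses [^>]* to tolerate extra attributes or whitespace around href. It assumes double-quoted href values; the variable names are illustrative, and the input is the two samples from this thread:

```php
<?php
$html = <<<EOF
<A href="google.com" >Text</a>
Some garbage...
<a href="/A-Better-Switchboard-for-MS-Access/3000-10254_4-10499883.html">

A Better Switchboard for MS Access 2.1
</a>
EOF;

// s: the dot also matches newlines; i: case-insensitive, so <A ...> matches too.
// [^>]* skips anything else inside the opening tag, before or after href.
$pattern = '~<a\b[^>]*\bhref="([^"]*)"[^>]*>(.*?)</a>~si';

$lines = [];
if (preg_match_all($pattern, $html, $matches, PREG_SET_ORDER)) {
    foreach ($matches as $m) {
        // $m[1] is the href, $m[2] the anchor text (trimmed of surrounding newlines).
        $lines[] = $m[1] . ' | ' . trim($m[2]);
    }
}
echo implode("\n", $lines), "\n";
```

Even this version would miss single-quoted or unquoted href values and would keep any markup nested inside the anchor.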

    For parsing things like HTML, I would consider using DOM / XPath instead of regex:

    Example:
    PHP Code:
    $result['EXE'] = <<<EOF
    <A href="google.com" >Text</a>
    Some garbage...
    <a href="/A-Better-Switchboard-for-MS-Access/3000-10254_4-10499883.html">

    A Better Switchboard for MS Access 2.1
    </a>
    EOF;

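    $dom = new DOMDocument;
    // @ suppresses the warnings raised for malformed HTML
    @$dom->loadHTML($result['EXE']); // change loadHTML to loadHTMLFile and pass the full URL (in quotes) to load a site
    $xpath = new DOMXPath($dom);
    $aTag = $xpath->query('//a[@href]'); // every <a> element that has an href attribute
    foreach ($aTag as $val) {
        // trim() strips the newlines surrounding multi-line anchor text
        echo $val->getAttribute('href') . ' | ' . trim($val->nodeValue) . "<br />\n";
    }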

    Output:
    Code:
    google.com | Text
    /A-Better-Switchboard-for-MS-Access/3000-10254_4-10499883.html | A Better Switchboard for MS Access 2.1
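
If the DOM / XPath approach is needed in more than one place, it could be wrapped in a small helper along these lines. This is only a sketch: extractLinks() is a hypothetical name, and libxml_use_internal_errors() is used here as an alternative to the @ operator for silencing parser warnings:

```php
<?php
// Hypothetical helper: returns one "href | text" string per <a href> in the markup.
function extractLinks(string $html): array
{
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument;
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = [];
    $xpath = new DOMXPath($dom);
    // Only anchors that actually carry an href attribute are selected.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = $a->getAttribute('href') . ' | ' . trim($a->textContent);
    }
    return $links;
}

// The second anchor has no href, so the XPath query skips it.
$links = extractLinks('<p><A href="google.com" >Text</a> and <a name="top">no href</a></p>');
print_r($links);
```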
    Off Topic:


    In the event you are interested in learning more about regex, the following links can help you get started:
    http://www.phpfreaks.com/tutorial/re...--basic-syntax
    http://www.regular-expressions.info/
    http://weblogtoolscollection.com/regex/regex.php

    Obviously, there are plenty more learning resources out there. Googling regex tutorials will certainly reveal more than enough.

    EDIT - Same goes for DOM/XPath

  5. #5
    SitePoint Enthusiast
    Join Date
    Jan 2007
    Posts
    32
    Thanks alpha, it worked.

