Thread: regex help

  1. #1
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539

    regex help

    I'm trying to get both the link URL and the image URL from this string:

    Code:
    <a href="linkurl"><img src="imgurl" alt="alttext" width="346" height="260" border="1"></a>
    What regex would get all links and their images on a page? I would like to ignore the other attributes, both in front of and behind href and src, in the link and image tags.

  2. #2
    SitePoint Addict
    Join Date
    Dec 2004
    Posts
    240
    You need to use preg_match_all(). Here is a very simple working example; for real pages you will probably need something a little more complicated.
    PHP Code:
    <?php
    // Sample string:
    $str = <<< TEXT
    <a href="linkurl1"><img src="imgurl1" alt="alttext" width="346" height="260" border="1"></a>
    <a href="linkurl2"><img src="imgurl2" alt="alttext" width="346" height="260" border="1"></a>
    <a href="linkurl3"><img src="imgurl3" alt="alttext" width="346" height="260" border="1"></a>
    TEXT;
    // Extracting URLs into the array $m
    preg_match_all("/<a href=\"([^\"]*)\"><img src=\"([^\"]*)\".*?><\/a>/si", $str, $m);
    // Displaying the result
    echo '<pre>' . htmlspecialchars(print_r($m[1], true)) . '</pre>'; // link URLs
    echo '<pre>' . htmlspecialchars(print_r($m[2], true)) . '</pre>'; // image URLs
    ?>

  3. #3
    SitePoint Enthusiast BurakUeda's Avatar
    Join Date
    Apr 2005
    Posts
    81
    Another simple example here:
    PHP Code:
    $reg = "/(a +href ?= ?|img +src ?= ?)(\"|\')?(http:\/\/)?([\w-]*\.?)*[\/\w\.?]*/i";

    preg_match_all($reg, $string, $match);

    $urls = array();
    foreach ($match[0] as $key => $val) {
        // Strip the leading attribute name, "=" and optional quote so only the URL remains
        $urls[$key] = preg_replace("/^(a +href ?= ?|img +src ?= ?)(\"|\')?/i", "", $val);
    }
    echo "<pre>" . htmlspecialchars(print_r($urls, true)) . "</pre>";
    Works for:
    Code:
    <a href="http://www.someurl-website.com/subfolder/index.php"><img src="http://domain.something.com/images/imgurl.gif" alt="alttext" width="346" height="260" border="1"></a>
    <a href="www.someurl-website.com/subfolder/index.php"><img src="http://domain.something.com/images/imgurl.gif" alt="alttext" width="346" height="260" border="1"></a>
    <a href="someurl-website.com/subfolder/index.php"><img src="http://something.com/images/imgurl.gif" alt="alttext" width="346" height="260" border="1"></a>
    <a href="/subfolder/index.php"><img src="something.com/images/imgurl.gif" alt="alttext" width="346" height="260" border="1"></a>
    <a href="index.php"><img src="images/imgurl.gif" alt="alttext" width="346" height="260" border="1"></a>
    Won't work for:
    Code:
    <a target = "_blank" href = "linkhere"><img width = "" alt = "" src="linkhere" /></a>
    The link and img tags must begin with <a href and <img src.
    H u m o
    Uncensored Forums for Intelligent People

  4. #4
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539
    Thanks guys,

    If I use, for example, this one:

    "/<a href=\"([^\"]*)\"><img src=\"([^\"]*)\".*?><\/a>/si"

    how can I make it case-insensitive? Some of the links are written as, e.g., <a HREF

  5. #5
    SitePoint Addict
    Join Date
    Dec 2004
    Posts
    240
    This regexp is already case-insensitive because of the i modifier:
    Code:
    "/<a href=\"([^\"]*)\"><img src=\"([^\"]*)\".*?><\/a>/si"

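A quick way to check this, as a minimal sketch: the sample string below is made up for illustration and mixes upper and lower case in the tag and attribute names. The pattern is the one quoted in post #5, unchanged; the variable names are assumptions.

```php
<?php
// Made-up input: tag and attribute names in mixed case.
$str = '<A HREF="linkurl"><IMG SRC="imgurl" alt="alttext"></a>';

// The trailing "i" makes the literal letters in the pattern match in either case.
$pattern = "/<a href=\"([^\"]*)\"><img src=\"([^\"]*)\".*?><\/a>/si";

$count = preg_match_all($pattern, $str, $m);

echo $count, "\n"; // number of full matches
print_r($m[1]);    // link URLs
print_r($m[2]);    // image URLs
```

With the i modifier removed, the same call finds nothing for this input, because <A HREF no longer matches <a href.
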
  6. #6
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539
    Ah yes, actually the problem is when an example like this crops up:

    <a HREF="imglink"><img name="" src="imgurl" width="140" height="140" border="0"></a>

    Basically I need the regex to ignore everything other than the href of the a tag and the src of the img tag in between.

  7. #7
    SitePoint Addict
    Join Date
    Dec 2004
    Posts
    240
    Try this (not checked):
    Code:
    "/<a [^>]*?href=\"([^\"]*)\"><img [^>]*?src=\"([^\"]*)\".*?><\/a>/si"

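A minimal sketch of how the pattern from post #7 behaves. The first sample line is the one from post #6; the second line, with an attribute in front of href, and the variable names are assumptions added for illustration.

```php
<?php
$str = '<a HREF="imglink"><img name="" src="imgurl" width="140" height="140" border="0"></a>' . "\n"
     . '<a target="_blank" href="link2"><img alt="" src="img2"></a>';

// [^>]*? lazily skips any attributes that come before href / src,
// but never crosses the ">" that closes the tag.
$pattern = "/<a [^>]*?href=\"([^\"]*)\"><img [^>]*?src=\"([^\"]*)\".*?><\/a>/si";

preg_match_all($pattern, $str, $m);

print_r($m[1]); // link URLs
print_r($m[2]); // image URLs
```

Note that the pattern still expects href to be the last attribute of the a tag, because the closing "> follows the captured value directly.
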
  8. #8
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539
    OK, that was working brilliantly until I hit some examples where there was other stuff before the closing a tag:

    e.g.:

    Code:
    <a HREF="link"><img src="image" width="360" height="270" border="0"><br>
    
                      </a>
    In this case there is a <br> tag and also some line breaks (presumably \n and \r?).

    So how can I allow anything else to appear after the close of the img tag but before the closing a tag?

  9. #9
    SitePoint Addict
    Join Date
    Dec 2004
    Posts
    240
    E.g. you could try this:
    Code:
    "/<a [^>]*?href=\"([^\"]*)\"><img [^>]*?src=\"([^\"]*)\".*?>.*?<\/a>/si"

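As a sketch of why the pattern from post #9 works: the added .*? can only span the line breaks because of the s modifier, which lets the dot match newlines as well. The input below is reconstructed from the sample in post #8; the variable names are illustrative.

```php
<?php
// Sample from post #8: a <br> and line breaks before the closing </a>.
$str = "<a HREF=\"link\"><img src=\"image\" width=\"360\" height=\"270\" border=\"0\"><br>\n\n                  </a>";

// Identical patterns, except that the second one lacks the "s" modifier.
$withS    = "/<a [^>]*?href=\"([^\"]*)\"><img [^>]*?src=\"([^\"]*)\".*?>.*?<\/a>/si";
$withoutS = "/<a [^>]*?href=\"([^\"]*)\"><img [^>]*?src=\"([^\"]*)\".*?>.*?<\/a>/i";

$found  = preg_match_all($withS, $str, $m);
$missed = preg_match_all($withoutS, $str, $unused);

echo "with s: $found, without s: $missed\n";
print_r($m[1]); // link URLs
print_r($m[2]); // image URLs
```
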
  10. #10
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539
    ok thanks

  11. #11
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    Join Date
    May 2006
    Location
    Lancaster University, UK
    Posts
    7,062
    To be honest, when it comes to something like this where attributes can be in different orders, and tags can vary on the inside, shouldn't you be taking advantage of PHP's DOM capabilities?
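For illustration, a minimal sketch of what the DOM route could look like, assuming the standard DOM extension (DOMDocument and DOMXPath) is available. The sample markup and variable names are made up to cover the variations raised in this thread; this is one possible approach, not a tested drop-in replacement.

```php
<?php
// Made-up sample markup covering the variations seen in this thread.
$html = '<a target="_blank" href="link1"><img width="" alt="" src="img1" /></a>
<A HREF="link2">
    <IMG SRC="img2" WIDTH=163 HEIGHT=62 BORDER=0 ALT=""></A>
<a href="link3"><img name="" src="img3" border="0"><br>
</a>';

// Collect parser warnings internally instead of printing them for sloppy HTML.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$pairs = array();

// For every <a> with an href, pick up each <img> with a src nested inside it.
foreach ($xpath->query('//a[@href]') as $a) {
    foreach ($xpath->query('.//img[@src]', $a) as $img) {
        $pairs[] = array(
            'link' => $a->getAttribute('href'),
            'img'  => $img->getAttribute('src'),
        );
    }
}

print_r($pairs);
```

Attribute order, letter case, unquoted values and whitespace between the tags are all handled by the parser here, so none of them need special treatment in the query.
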
    Jake Arkinstall
    "Sometimes you don't need to reinvent the wheel;
    Sometimes its enough to make that wheel more rounded"-Molona

  12. #12
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539
    Quote: Originally Posted by arkinstall
    To be honest, when it comes to something like this where attributes can be in different orders, and tags can vary on the inside, shouldn't you be taking advantage of PHP's DOM capabilities?
    I did briefly look at that, but it's not something I have any experience with, so I thought I'd leave it! This seems to work at the moment at least...

  13. #13
    Theoretical Physics Student bronze trophy Jake Arkinstall's Avatar
    Join Date
    May 2006
    Location
    Lancaster University, UK
    Posts
    7,062
    Yeah, however if you do come across a different URL, you may run into problems.

  14. #14
    SitePoint Evangelist
    Join Date
    Jan 2005
    Location
    UK
    Posts
    539
    Another issue:

    Sometimes there are spaces and carriage returns between the <a> tag and the <img> tag...

    e.g.:

    Code:
    			<A HREF="link">
    				<IMG SRC="image" WIDTH=163 HEIGHT=62 BORDER=0 ALT=""></A>
    How can I allow just \n, \s and \r between the opening a tag and the img tag?

  15. #15
    SitePoint Addict
    Join Date
    Dec 2004
    Posts
    240
    I think the simplest would be to use \s*:
    Code:
    "/<a [^>]*?href=\"([^\"]*)\">\s*<img [^>]*?src=\"([^\"]*)\".*?>.*?<\/a>/si"
    It allows zero or more so-called whitespace characters (\r, \n, \t, space, form feed and, depending on the PCRE version, vertical tab) between the <a> and <img> opening tags.
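Putting the thread together, here is a minimal sketch that wraps this final pattern in a small helper. The function name extractLinkImagePairs and the sample input are hypothetical, added only for illustration; the input is modelled on the samples from posts #8 and #14.

```php
<?php
// Hypothetical helper (not from the thread): returns link/image URL pairs.
function extractLinkImagePairs($html)
{
    $pattern = "/<a [^>]*?href=\"([^\"]*)\">\s*<img [^>]*?src=\"([^\"]*)\".*?>.*?<\/a>/si";

    // PREG_SET_ORDER groups the captures per match instead of per group.
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $pairs = array();
    foreach ($matches as $m) {
        $pairs[] = array('link' => $m[1], 'img' => $m[2]);
    }
    return $pairs;
}

// Whitespace between <a> and <img>, then a <br> and line breaks before </a>.
$html = "\t\t\t<A HREF=\"link1\">\r\n\t\t\t\t<IMG SRC=\"image1\" WIDTH=163 HEIGHT=62 BORDER=0 ALT=\"\"></A>\n"
      . "<a HREF=\"link2\"><img name=\"\" src=\"image2\" border=\"0\"><br>\n\n   </a>";

$pairs = extractLinkImagePairs($html);
print_r($pairs);
```
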

