Find URLs in a string with preg_match_all

Hi All,

I found a short script for counting and displaying ONLY links from a string :


preg_match_all('/<a href="(.*)">/',$post_content,$a);

$count = count($a[1]);
echo "<b>Number of Urls</b> = " .$count."<p>";
for ($row = 0; $row < $count ; $row++) {
echo $a[1]["$row"]."<br />";
}

and the $post_content variable ‘looks’ like :


<p>From press release :</p> <p><a href="http://www.whateversitehere.com/" target="_blank">WhateverSiteHere.com</a> is a web-based marketing company specialized.... blah blah....

Now the script should say “Number of URLs = 1” and echo the url itself, right?
For some reason it doesn’t work: it says Number of URLs = 0… I’m not familiar with regular expressions, but I suppose that’s the place in the code where I should start debugging… Can someone help me please?
Thanks in advance!

I managed to find a regex that finds ALL URLs:


preg_match_all('@((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?(/([-\\w/_\\.]*(\\?\\S+)?)?)*)@',$post_content,$a);

This works for ALL the URLs, but what I want is to find ONLY the href attributes of <a> tags, and I simply couldn’t find a pattern that does exactly that! Can someone help me with this please?
Thanks

The following works for me:


$post_content = '<a href="http://www.test.com" title="test">test.com</a> there is another example <a href="http://www.test1.com">test1.com</a>
one more here too <a href="http://www.test2.com" title="test2">test2.com</a>';
preg_match_all('/<a href="(.*?)"/s', $post_content, $matches);
print_r($matches);

Though I am not great at regex either.
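
To show why the `?` after `.*` matters, here is a minimal sketch comparing the greedy pattern from the first post with the lazy one above. The sample string is made up for illustration and puts two links on one line so the difference is visible.

```php
<?php
// Two links on one line (hypothetical sample content).
$html = '<a href="http://www.test.com" title="test">test.com</a> and <a href="http://www.test1.com">test1.com</a>';

// Greedy: .* grabs as much as it can, so the capture runs on to the LAST "> on the line.
preg_match_all('/<a href="(.*)">/', $html, $greedy);

// Lazy: .*? stops at the first double quote after href=".
preg_match_all('/<a href="(.*?)"/', $html, $lazy);

print_r($greedy[1]); // one long, wrong capture spanning both tags
print_r($lazy[1]);   // the two URLs, one per element
```

With the greedy version the two links collapse into a single match, so the count is off as well as the captured text.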

Thanks rajug, unfortunately it’s not working for me.
I’ll try to explain the whole situation, maybe it will make sense to you guys…

There’s a tiny_mce form on my site which can be used to create a blog post. I want to limit the <a href… > tags in the created post to 1 (single ONE) so the members can’t overwhelm the post with links to their sites. However they are allowed to insert images, videos, or whatever embedded content that is not a clickable link.

This works for finding ALL URLs (including <img src='http://www.blah.us' />):


preg_match_all('@((https?://)?([-\\w]+\\.[-\\w\\.]+)+\\w(:\\d+)?(/([-\\w/_\\.]*(\\?\\S+)?)?)*)@',$post_content,$a);

BUT as I said I only want to check the <a href…> tags…

Have found this pattern, seems OK but doesn’t work either :


$post_content = stripslashes($_POST['content']);
$post_content = htmlspecialchars($post_content, ENT_QUOTES);

// CHECK URL's START
$href_regex ="<"; // 1 start of the tag
$href_regex .="\\s*"; // 2 zero or more whitespace
$href_regex .="a"; // 3 the a of the tag itself
$href_regex .="\\s+"; // 4 one or more whitespace
$href_regex .="[^>]*"; // 5 zero or more of any character that is _not_ the end of the tag
$href_regex .="href"; // 6 the href bit of the tag
$href_regex .="\\s*"; // 7 zero or more whitespace
$href_regex .="="; // 8 the = of the tag
$href_regex .="\\s*"; // 9 zero or more whitespace
$href_regex .="[\"']?"; // 10 none or one of " or '
$href_regex .="("; // 11 opening parenthesis, start of the bit we want to capture
$href_regex .="[^\"' >]+"; // 12 one or more of any character _except_ our closing characters
$href_regex .=")"; // 13 closing parenthesis, end of the bit we want to capture
$href_regex .="[\"' >]"; // 14 closing characters of the bit we want to capture
			
$regex = "/"; // regex start delimiter
$regex .= $href_regex; //
$regex .= "/"; // regex end delimiter
$regex .= "i"; // Pattern Modifier - makes regex case insensitive
$regex .= "s"; // Pattern Modifier - makes a dot metacharacter in the pattern
// match all characters, including newlines
$regex .= "U"; // Pattern Modifier - makes the regex ungreedy
			
preg_match_all($regex, $post_content, $a);
			
$count = count($a[1]);
			
echo $count;
			
if ($count > 1) { $error_flag = 1; }
// CHECK URLs END

Can you post the sample content that you are working on? Then I can try it here until the regex experts come and see the post.

OK, here is some sample content:


<div class="postContent"> <p>As I&rsquo;m a technical consultant for conveyor systems at Deme Ltd.,
we&rsquo;ve visited <a href="http://www.sajam.co.rs/active/en/home/details/_params/sajam_id/4833.html" target="_blank">7th PACKTECH EXPO BALKAN 2008</a> held from 17-09-2008 to 20-09-2008 in Belgrade, capital city of Serbia .
It&rsquo;s an exhibition of machines and technical packing equipment. There were a lot of printing machines as well so I couldn&rsquo;t resist taking photos of all those colorful surfaces.</p>
<p>Hope you&rsquo;ll enjoy the colors &ndash; onsite they looked much more live though&hellip;</p> <p><img src="http://www.blogofd.com/graphics/colors_at_packtec.gif" border="0" alt="Packtech 2008 Belgrade" /></p> </div>
<p>&#65279;A friend of mine (<a href="http://www.a-styledesign.com/" target="_blank">3D artist portfolio</a>) asked me to develop a software that can be used to track visitor clicks to external sites
(links that are pointing to 3rd party websites from your page). He wanted to better understand his visitors&rsquo; behavior and to find out if his links are at the best possible place.</p>

Ok


$post_content = '<div class="postContent"> <p>As I&rsquo;m a technical consultant for conveyor systems at Deme Ltd., 
we&rsquo;ve visited <a href="http://www.sajam.co.rs/active/en/home/details/_params/sajam_id/4833.html" target="_blank">7th PACKTECH EXPO BALKAN 2008</a> held from 17-09-2008 to 20-09-2008 in Belgrade, capital city of Serbia . 
It&rsquo;s an exhibition of machines and technical packing equipment. There were a lot of printing machines as well so I couldn&rsquo;t resist taking photos of all those colorful surfaces.</p> 
<p>Hope you&rsquo;ll enjoy the colors &ndash; onsite they looked much more live though&hellip;</p> <p><img src="http://www.blogofd.com/graphics/colors_at_packtec.gif" border="0" alt="Packtech 2008 Belgrade" /></p> </div> 
<p>&#65279;A friend of mine (<a href="http://www.a-styledesign.com/" target="_blank">3D artist portfolio</a>) asked me to develop a software that can be used to track visitor clicks to external sites 
(links that are pointing to 3rd party websites from your page). He wanted to better understand his visitors&rsquo; behavior and to find out if his links are at the best possible place.</p>';
preg_match_all('/<a href="(.*?)"/s', $post_content, $matches);
print_r($matches[1]);

This gives me the result:


Array
(
    [0] => http://www.sajam.co.rs/active/en/home/details/_params/sajam_id/4833.html
    [1] => http://www.a-styledesign.com/
)

And I think it worked with that sample code.

Yeah, it works for me that way too. But $post_content is a dynamic variable, and when I echo it, it looks the same as above. So why doesn’t it work with


$post_content = stripslashes($_POST['content']);
$post_content = htmlspecialchars($post_content, ENT_QUOTES);

??? Confused…

Then I am sure it is because of the following line:


$post_content = htmlspecialchars($post_content, ENT_QUOTES);

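because it encodes the double quotes that the regex pattern looks for, and it also converts < and > to &lt; and &gt;, so the literal <a href=" never appears in the string. It looks the same when you echo it because the browser renders the entities back as the original characters. Try commenting out the above line and it should work.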

Edit:
Manual says:

The optional second argument, flags, tells the function what to do with single and double quote characters and with invalid multi-byte sequences. The default mode, ENT_COMPAT, is the backwards compatible mode which only translates the double-quote character and leaves the single-quote untranslated. If ENT_QUOTES is set, both single and double quotes are translated and if ENT_NOQUOTES is set neither single nor double quotes are translated. In addition, since 5.3.0, these constants can be combined with ENT_IGNORE. In that case, strings that contain invalid code unit sequences have those invalid sequences discarded instead of having the function return an empty string. Avoid using it, as it may have introduce vulnerabilities.

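If you have to use that function, run the regex on the raw content first and escape afterwards. Leaving single and double quotes intact with ENT_NOQUOTES is not enough on its own, because < and > are still converted:


preg_match_all('/<a href="(.*?)"/s', $post_content, $matches);
$post_content = htmlspecialchars($post_content, ENT_NOQUOTES);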
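
To make the effect visible, here is a small sketch that runs the same pattern against the raw string and against two escaped copies. The `$raw` value is a hypothetical stand-in for the submitted form content.

```php
<?php
// Hypothetical stand-in for the content coming from the form.
$raw = '<p><a href="http://www.whateversitehere.com/" target="_blank">WhateverSiteHere.com</a></p>';

$pattern = '/<a href="(.*?)"/s';

// ENT_QUOTES: < > & and both quote types become entities.
$escaped = htmlspecialchars($raw, ENT_QUOTES);
// ENT_NOQUOTES: quotes survive, but < and > are still turned into entities.
$noquotes = htmlspecialchars($raw, ENT_NOQUOTES);

// preg_match_all() returns the number of full pattern matches.
$countRaw      = preg_match_all($pattern, $raw, $m1);
$countEscaped  = preg_match_all($pattern, $escaped, $m2);
$countNoQuotes = preg_match_all($pattern, $noquotes, $m3);

echo $escaped, "\n";
echo "raw: $countRaw, ENT_QUOTES: $countEscaped, ENT_NOQUOTES: $countNoQuotes\n";
// prints "raw: 1, ENT_QUOTES: 0, ENT_NOQUOTES: 0" as the last line
```

Only the raw string matches, which is why the check has to happen before any escaping.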

I should have guessed that the


$post_content = htmlspecialchars($post_content, ENT_QUOTES);

line was ‘responsible’ for my troubles all along…
I commented it out and now I have what I need!
Thank you so much for pointing this out!
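
For the "only one link per post" check itself, here is a rough sketch of a slightly more tolerant pattern. It assumes the raw, unescaped HTML from the editor. The function name `find_anchor_hrefs` and the sample content are made up for illustration. A regex is not a full HTML parser, so treat this as a starting point and not a guarantee.

```php
<?php
// Hypothetical helper: returns the href values of all <a> tags in $html.
// Accepts other attributes before href, single or double quotes, and any letter case.
function find_anchor_hrefs(string $html): array
{
    $pattern = '/<a\s[^>]*?\bhref\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);
    return $matches[2]; // group 2 holds the URL, group 1 the quote character
}

// Made-up sample: two clickable links and one image that must not be counted.
$post_content = '<a target="_blank" href=\'http://a.example/\'>x</a> '
    . '<img src="http://img.example/p.gif" /> '
    . '<A HREF="http://b.example/">y</A>';

$links = find_anchor_hrefs($post_content);

// Same rule as in the thread: more than one link sets the error flag.
$error_flag = count($links) > 1 ? 1 : 0;

print_r($links);
echo "error_flag = $error_flag\n";
```

The image URL is ignored because the pattern only starts at `<a` followed by whitespace.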

Hopefully this is the best place to post this rather than a new thread, as I’m doing something similar, except I want to replace each URL I find with a new randomly generated one.

I’ve got the find bit working, but I don’t know how to loop through the array and replace each match.

This is what I have so far:


<?php
$data1 = "This is one website <a href=\"example.com\">blah</a> oh but there is also this one <a href=\"example2.com\">blah</a>";

preg_match_all('/\\<a href="(.*?)\\">/', $data1, $matches);

//echo for testing whats in the array
print_r($matches[1]);


// here we would insert the url into the database and generate a random number for the url to replace it with.
//don't know how to do this bit????

// I'm making this up but something like foreach $matches[1] {
$new_link = 'http://www.mysite.com/redirect.php?id='.rand().date('His');
$org_link = $matches;


//mysql_query ("INSERT INTO redirect ( org_link, new_link) VALUES ('$org_link', '$new_link') ") or die(mysql_error());

//}



?>

Any help would be much appreciated. Hopefully I’m not too far off the mark.
Thanks

I hope I understood what you are trying to do…
Maybe you can try this:


// Loop over the captured URLs in $matches[1], not over $matches itself
foreach ($matches[1] as $org_link) {
    // date() needs a quoted format string; 'His' is hour, minute, second
    $new_link = 'http://www.mysite.com/redirect.php?id='.rand().date('His');
    echo "<p>Inside loop : <br />Matches : ".$org_link."</p>";
    //mysql_query ("INSERT INTO redirect ( org_link, new_link) VALUES ('$org_link', '$new_link') ") or die(mysql_error());
}

I’m not sure about inserting into the database in a loop (because of the potentially large number of queries), maybe a MySQL guru can give some advice on that…
Hope that helps…
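
For the replacing part, one option is `preg_replace_callback()`, which finds and replaces in a single pass. This is only a sketch: the `$redirects` array stands in for the redirect table, and the random ID scheme is an assumption, not something from the posts above.

```php
<?php
$data1 = 'This is one website <a href="example.com">blah</a> oh but there is also this one <a href="example2.com">blah</a>';

// Maps each generated link to the original URL it replaced.
// In a real setup these pairs would be written to the redirect table.
$redirects = [];

$result = preg_replace_callback(
    '/(<a href=")(.*?)(")/',
    function (array $m) use (&$redirects) {
        // Hypothetical ID scheme: 16 random hex characters.
        $id = bin2hex(random_bytes(8));
        $new_link = 'http://www.mysite.com/redirect.php?id=' . $id;
        $redirects[$new_link] = $m[2]; // $m[2] is the original URL
        // Rebuild the tag start with the new URL in place of the old one.
        return $m[1] . $new_link . $m[3];
    },
    $data1
);

echo $result, "\n";
print_r($redirects);
```

Collecting the pairs first also makes it possible to store them with one query after the loop instead of one query per link.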

Thanks, I’ll give it a go on Monday and post how I get on.
Thanks