Finding links in a string

ulthane · March 12, 2012, 1:40pm

Hey guys,
I’ve got the following line that finds links in a string and stores them in the urls array:

if (preg_match_all('/((ht|f)tps?:\/\/([\w\.]+\.)?[\w-]+(\.[a-zA-Z]{2,4})?[^\s\r\n\(\)"\'<>\,\!]+)/si', $text, $urls))

I don’t have much understanding of regex. The line above only works with links starting with http; what do I need to add to make it also work for links starting with www?

thanks.

gvre · March 12, 2012, 4:10pm

try this

$s = '<a href="http://www.example.com">example.com</a> test <a href="/foo">foo</a> test <a href="../bar">bar</a>';
$pattern = '#href="((?:(?:http|ftp)s?://)?[^"]+)"#si';
if (preg_match_all($pattern, $s, $m))
{
        $links = $m[1];
        print_r($links);
}

ulthane · March 12, 2012, 4:16pm

Sorry, I think there’s a small misunderstanding: the links are not contained within <a> tags in the string.
The string may look like this:

$text = "this string has a link starts with www.example.com and also a link starts with http://example.com or http://www.example.com, all these 3 links should be put into an array named urls."

gvre · March 12, 2012, 4:23pm

OK. Try this

$s = 'http://www.example.com test www.foo.com test test www.bar.com';
$pattern = '#\b((?:(?:http|ftp)s?://)?www\.[^\s]+)\b#si';
if (preg_match_all($pattern, $s, $m))
{
        $links = $m[1];
        print_r($links);
}

ulthane · March 12, 2012, 4:28pm

thanks that works.

ulthane · March 12, 2012, 10:47pm

Sorry, the above solution didn’t work perfectly. It works with links that start with ‘http://www’ and ‘www’ but not with plain ‘http://’, for example the following link:

http://bar.com

wonshikee · March 12, 2012, 11:13pm

Try this one:

$pattern = '~\b([a-z]+://)?([a-z-]+\.)+[^\s]+\b~si';

This one should work on ANY string that resembles a URI

ulthane · March 12, 2012, 11:36pm

It doesn’t seem to work for me with the following string:

$s = 'text http://www.example.com test www.foo.com test test http://bar.com text ';

it returns:

Array ( [0] => http:// [1] => [2] => http:// )

wonshikee · March 12, 2012, 11:44pm

You need to redo print_r($m); to see where the full strings you want are.

Also it should be:

$pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';
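
For anyone following along: preg_match_all fills its third argument with one sub-array per group, so $m[0] holds the complete matches while $m[1] and $m[2] hold only what each parenthesised group captured. That would explain the "http://" entries shown earlier. A minimal sketch using the corrected pattern and the test string from this thread (the variable name $urls is just for illustration):

```php
<?php
// Test string as posted earlier in the thread.
$s = 'text http://www.example.com test www.foo.com test test http://bar.com text ';
$pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';

if (preg_match_all($pattern, $s, $m)) {
    // $m[0] = full matches, $m[1] = scheme group, $m[2] = last "label." group.
    $urls = $m[0];
    print_r($urls);
}
```

Here print_r($urls) should list the three complete links, while print_r($m[1]) would show only the scheme part of each match.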

ulthane · March 12, 2012, 11:50pm

Right…it works, thanks

ulthane · March 13, 2012, 11:06am

Well, I ran into another problem on the same matter. I use the following function to replace all links in a string with <a> tags and to shorten them if they are longer than 35 characters.

function make_clickable($text)
{
	if (preg_match_all('~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si', $text, $urls))
	{
		foreach (array_unique($urls[0]) AS $url)
		{
			$urltext = strlen($url) > 35 ? substr($url, 0, 21).'...'.substr($url, -10) : $url;
			$text = $url[0]!='h' ? str_replace($url, '<a href="http://'.$url.'" target="_blank" rel="nofollow">'.$urltext.'</a>', $text) : str_replace($url, '<a href="'.$url.'" target="_blank" rel="nofollow">'.$urltext.'</a>', $text);
		}
	}
	return $text;
}

However, when the text contains more than one link it can get messed up, because the same link is found twice, like so:
http://www.example.com
www.example.com

It’ll find the same link twice and replace it twice, creating nested <a> tags that break the markup. Any idea how I can solve that?
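
The nesting most likely happens because str_replace runs once per found URL over the whole text, and www.example.com is also a substring of http://www.example.com, which by then already sits inside an <a> tag. One way around it, sketched here on the assumption that the pattern stays unchanged, is to replace in a single pass with preg_replace_callback so each match is rewritten exactly once. The function name make_clickable_once is made up for this example.

```php
<?php
// Sketch only: same pattern as above, but every match is replaced once,
// in a single pass, so text inside generated tags is never rescanned.
function make_clickable_once($text)
{
    $pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';

    return preg_replace_callback($pattern, function ($m) {
        $url = $m[0];
        // Shorten the visible text of long links, as in the original function.
        $urltext = strlen($url) > 35 ? substr($url, 0, 21) . '...' . substr($url, -10) : $url;
        // Add a scheme only when the match did not include one ($m[1] is empty).
        $href = empty($m[1]) ? 'http://' . $url : $url;

        return '<a href="' . $href . '" target="_blank" rel="nofollow">' . $urltext . '</a>';
    }, $text);
}

echo make_clickable_once("http://www.example.com\nwww.example.com");
```

Checking $m[1] instead of $url[0] != 'h' also avoids mislabelling schemeless hosts that happen to start with the letter h.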

AnthonySterling · March 14, 2012, 2:10pm

Try not to reinvent the wheel.

https://github.com/cakephp/cakephp/blob/master/lib/Cake/View/Helper/TextHelper.php#L100

ulthane · April 1, 2012, 11:30am

I doubt I need all of that for such a (simple) task. I managed to fix the above problem by using preg_replace instead of str_replace to replace only exact links.
However, now there is a new problem! (It just never stops.)
Links with parameters are not getting converted (like www.example.com/page.php?param=1).

~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si

What do I need to add to the above pattern to solve this?

Mittineague · April 1, 2012, 6:26pm

As you are finding out, it’s not as simple as it may seem at first glance. IMHO you should try AnthonySterling’s suggestion.

John_Betong · April 2, 2012, 7:11am

Here’s my New Wheel:



#============================
class string_to_urls
{

#============================
#
#============================
private function url_maker($text)
{
  $result = '';

  # remove http://
  $text = str_replace('http://', '', $text);

  # split into separate words
  $words   = explode(' ', $text);

  $item = array(); #required result
  foreach( $words as $word ):

    #assume URL if and only if has period - should trap tailing . here
    if( strpos( $word, '.'  ) )
    {
      $urltext = strlen($word) > 20 ? substr($word, 0, 17) .'...' : $word;
       $item[] = '<a href="http://'
                  .   $word
                  .   '" target="_blank" rel="nofollow">'
                  .   $urltext
                  . '</a>';
    }
    else
    {
      $item[] =  $word; # plain text
    }
  endforeach;

  #DEBUG
    echo '<pre>';
      #print_r($item);
    echo '</pre>';

  # separator first: the reversed argument order is no longer accepted in PHP 8
  $result = implode(' ', $item);

  return $result;
}

#============================
#
#============================
function index()
{
  $text = "this string  http://www.example.com/page.php?param=1  has a link starts with www.example.com and also a link starts with http://example.com or http://www.example.com, all these 3 links should be put into an array named urls.";

  echo '<dl style="width:42em; margin:0 auto; border:solid 1px #f00">';
    echo '<dt>Original $text</dt>';
    echo '<dd>' .$text  .'<br /><br /></dd>';

    echo '<dt>function url_maker($text)</dt>';
    echo '<dd>' .$this->url_maker($text)   .'<br /><br /></dd>';

  echo '</dl>';
}
} # end class string_to_urls

Output:


Original $text
    this string http://www.example.com/page.php?param=1 has a link starts with
    www.example.com and also a link starts with http://example.com or
    http://www.example.com, all these 3 links should be put into an array named
    urls.


function url_maker($text)
    this string www.example.com/p... has a link starts with www.example.com and also a link starts with example.com or www.example.com, all these 3 links should be put into an array named urls.

Only the final trailing period still requires some attention.
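
One possible way to handle that trailing period, shown only as a sketch: strip sentence punctuation from the end of the word before building the link and put it back after the closing tag. The helper name link_word and the set of stripped characters are assumptions for illustration and are not part of the class above.

```php
<?php
// Hypothetical helper: wraps one word in a link, keeping trailing
// punctuation such as "." or "," outside the <a> tag.
function link_word($word)
{
    $core = rtrim($word, '.,;:!?');        // word without trailing punctuation
    $tail = substr($word, strlen($core));  // the punctuation that was stripped

    // strpos() gives false (no period) or 0 (leading period); both are
    // treated as "not a URL", as in the original check.
    if (!strpos($core, '.')) {
        return $word;
    }

    $urltext = strlen($core) > 20 ? substr($core, 0, 17) . '...' : $core;

    return '<a href="http://' . $core . '" target="_blank" rel="nofollow">'
        . $urltext . '</a>' . $tail;
}

echo link_word('urls.'), "\n";             // stays plain text
echo link_word('www.example.com,'), "\n";  // comma ends up after </a>
```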

ulthane · April 2, 2012, 10:43am

Hey John, thanks for your solution. It looks like a nice way of solving this, but I can’t rely on only checking for dots, as many words end with a dot (like at the end of a sentence).
How can we check whether a certain word in a string starts with http or www, OR contains one of the following strings: .co, .org, .net, .gov? Then it’s a link for sure, I’d say… (Unless there is something I don’t know; if we check for the domain extension, there’s not even a need to check for www|http.)
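
A rough sketch of the check described here, for illustration only. The function name looks_like_link and the short list of endings are assumptions; real lists of domain extensions are far longer, so this will miss plenty of valid links.

```php
<?php
// Returns true if a single word looks like a link under the rules above.
function looks_like_link($word)
{
    // Rule 1: starts with http://, https:// or www.
    if (preg_match('~^(?:https?://|www\.)~i', $word)) {
        return true;
    }
    // Rule 2: contains ".co", ".com", ".org", ".net" or ".gov" followed by
    // the end of the word, a path, a port or trailing punctuation.
    return (bool) preg_match('~[a-z0-9-]\.(?:com?|org|net|gov)(?=$|[/:?#.,])~i', $word);
}

var_dump(looks_like_link('www.example.com'));   // bool(true)
var_dump(looks_like_link('example.org/page'));  // bool(true)
var_dump(looks_like_link('sentence.'));         // bool(false)
```

The lookahead after the extension is there so that a word like foo.community is not accepted just because it contains ".com".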

John_Betong · April 2, 2012, 1:33pm

@ulthane,

Try this:



    # Old line
    # if( strpos($word, '.') )

    #  replace with this line to eliminate  .' and ." and ...
    if( strpos($word, '.') && ( ! strpos($word, '."') )   && ( ! strpos($word, ".'") )  && ( ! strpos($word, "..") )  )
   {
      ...
      ...
   }

I cannot think of any other occurrences of the period except those eliminated. If you think of any, let me know.

ulthane · April 2, 2012, 2:45pm

Well, I tested your method a little more deeply, and it seems to break on line breaks. For example, with the following input:

this string
www.example.com
has links
example.com
with line breaks www.tests.com
in it tests.com and before it

I’m getting this HTML result:


this <a href="http://string<br />
www.example.com<br />
has" target="_blank" rel="nofollow">string<br />
www.example.com<br />
has</a> <a href="http://links<br />
example.com<br />
with" target="_blank" rel="nofollow">links<br />
example.com<br />
with</a> line breaks <a href="http://www.tests.com<br />
in" target="_blank" rel="nofollow">www.tests.com<br />
in</a> it <a href="http://tests.com" target="_blank" rel="nofollow">tests.com</a> and before it

I guess it doesn’t treat newlines as spaces and therefore isn’t splitting there…
Note: same issue even without nl2br involved.
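
That guess fits how explode(' ', $text) behaves: it splits on the space character only, so a word, a newline and the next word stay glued together as one item. A minimal sketch of one alternative, assuming the function runs before nl2br: split on any run of whitespace with preg_split and keep the whitespace itself as separate items, so the line breaks survive when the pieces are joined again.

```php
<?php
$text = "this string\nwww.example.com\nhas links";

// Split on runs of whitespace but keep the whitespace as separate items
// (PREG_SPLIT_DELIM_CAPTURE), so line breaks survive the round trip.
$parts = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    // Same "contains a period" test as in the class above; whitespace items
    // never contain a period, so they pass through unchanged.
    if (strpos($part, '.')) {
        $parts[$i] = '<a href="http://' . $part . '">' . $part . '</a>';
    }
}

$result = implode('', $parts);
echo $result;
```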

I’ve also made a little progress with my own regex attempt:

~\b([a-z0-9-]+\.)+[^\?\s]+\b~si

As a reminder, it converts all links correctly except links with parameters.
So for a link like:
test.com/index.php?param=1
it’ll return
test.com/index.php (all of this as a link) and then ?param=1 as normal text… anyone?

John_Betong · April 2, 2012, 3:30pm

@ulthane,

Post: #3

$text = "this string has a link starts with www.example.com and also a link starts with http://example.com or http://www.example.com, all these 3 links should be put into an array named urls."

The code supplied (Post #15 and #17) extracts relevant text from your original $text and makes the correct html links.

By adding line breaks the original specification has changed.

It is now quite late and if you or other posters are unable to offer a solution then tomorrow I will endeavour to create a new script.

PS Any chance of a later version having images?

ulthane · April 2, 2012, 6:41pm

Sorry if it wasn’t clear that the text could also contain line breaks.

Anyway, this thing is just driving me nuts! My aim is to get this done with preg_match, as the code looks much cleaner that way. I’m so close to the solution; the only problem is when parameters are in the URL! I bet it is because preg_match sees ‘?’ as a “reserved character” that somehow needs to be escaped when used in preg_match_all… anyone with any ideas?
Full code:

function make_clickable($text)
{
	// $text = str_replace('?', "\\?", $text); lol..., nah that didn't work ;)
	$text = str_replace('http://', '', $text);
	if (preg_match_all('~\b([a-z0-9-]+\.)+[^\s]+\b~si', $text, $urls))
	{
		foreach (array_unique($urls[0]) AS $url)
		{
			$urltext = strlen($url) > 35 ? substr($url, 0, 21).'...'.substr($url, -10) : $url;
			$text = preg_replace('~^'.$url.'~m',"<a href=\"http://$url\" target=\"_blank\" rel=\"nofollow\">$urltext</a>",$text);
		}
	}
	return $text;
}
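
A likely cause, offered as a reading of the code above and not a confirmed diagnosis: preg_match_all is not the problem, since its pattern contains no bare ?. The trouble seems to start when the found URL is pasted unescaped into the second pattern for preg_replace. There the ? in page.php?param=1 acts as a quantifier (making the preceding p optional) and each . matches any character, so the pattern no longer matches the literal URL. Escaping the URL with preg_quote, passing the delimiter as its second argument, avoids that. The snippet is a reduced sketch and not the full function.

```php
<?php
$text = 'www.example.com/page.php?param=1';
$url  = 'www.example.com/page.php?param=1';

// Unescaped: "?" makes the preceding "p" optional, so the literal "?" in
// the text is never matched.
var_dump(preg_match('~^' . $url . '~m', $text));                  // int(0)

// Escaped: preg_quote() backslash-escapes regex metacharacters; the second
// argument also escapes the "~" delimiter.
var_dump(preg_match('~^' . preg_quote($url, '~') . '~m', $text)); // int(1)
```

In make_clickable this would mean building the pattern as '~^'.preg_quote($url, '~').'~m'. Note that the ^ anchor still limits replacements to links at the start of a line.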
