Finding links in a string

Hey guys,
I've got the following line that finds links in a string and stores them in the $urls array:

if (preg_match_all('/((ht|f)tps?:\/\/([\w\.]+\.)?[\w-]+(\.[a-zA-Z]{2,4})?[^\s\r\n\(\)"\'<>\,\!]+)/si', $text, $urls))

I don’t have much understanding of regex. The line above only works with links starting with http; what do I need to add to make it also work for links starting with www?

thanks.

try this

$s = '<a href="http://www.example.com">example.com</a> test <a href="/foo">foo</a> test <a href="../bar">bar</a>';
$pattern = '#href="((?:(?:http|ftp)s?://)?[^"]+)"#si';
if (preg_match_all($pattern, $s, $m))
{
        $links = $m[1];
        print_r($links);
}

Sorry, I think there’s a small misunderstanding: the links are not contained within <a> tags in the string.
The string may look like this:

$text = "this string has a link starts with www.example.com and also a link starts with http://example.com or http://www.example.com, all these 3 links should be put into an array named urls.";

OK. Try this

$s = 'http://www.example.com test www.foo.com test test www.bar.com';
$pattern = '#\\b((?:(?:http|ftp)s?://)?www\\.[^\\s]+)\\b#si';
if (preg_match_all($pattern, $s, $m))
{
        $links = $m[1];
        print_r($links);
}

thanks that works.

Sorry, the above solution didn’t work perfectly :stuck_out_tongue: It works with links that start with ‘http://www’ and ‘www’ but not with plain ‘http://’, for example the following link:

http://bar.com

Try this one:

$pattern = '~\b([a-z]+://)?([a-z-]+\.)+[^\s]+\b~si';

This one should work on ANY string that resembles a URI

doesn’t seem to work for me, for the following string:

$s = 'text http://www.example.com test www.foo.com test test http://bar.com text ';

it returns:

Array ( [0] => http:// [1] => [2] => http:// )

You need to run print_r($m); again to see where the full strings you want are: the whole matches are in $m[0], while $m[1] only holds the first capture group (the scheme).

Also it should be:

$pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';
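
To make the print_r($m) advice concrete, here is a minimal sketch. It assumes the corrected pattern and the test string posted earlier in the thread; the variable name $urls is just the one the original question asked for.

```php
<?php
// Corrected pattern from the thread: optional scheme, dotted host labels, then non-space characters.
$pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';
$s = 'text http://www.example.com test www.foo.com test test http://bar.com text ';

$urls = array();
if (preg_match_all($pattern, $s, $m)) {
    // Index 0 holds the whole matches.
    // $m[1] holds only the scheme group ("http://" or an empty string),
    // which is what the earlier print_r($m[1]) output was showing.
    $urls = $m[0];
}
print_r($urls);
```

With the default flags, preg_match_all groups results by capture group, so $m[0] is the list to keep.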

Right…it works, thanks :slight_smile:

Well, I ran into another problem on this matter. I use the following function to replace all links in a string with <a> tags, and to shorten their text if they are longer than 35 characters.

function make_clickable($text)
{
	if (preg_match_all('~\\b([a-z]+://)?([a-z0-9-]+\\.)+[^\\s]+\\b~si', $text, $urls))
	{
		foreach (array_unique($urls[0]) AS $url)
		{
			$urltext = strlen($url) > 35 ? substr($url, 0, 21).'...'.substr($url, -10) : $url;
			$text = $url[0]!='h' ? str_replace($url, '<a href="http://'.$url.'" target="_blank" rel="nofollow">'.$urltext.'</a>', $text) : str_replace($url, '<a href="'.$url.'" target="_blank" rel="nofollow">'.$urltext.'</a>', $text);
		}
	}
	return $text;
}

However, with more than one link the output can get messed up, because the same link text is found twice, like so:
http://www.example.com
www.example.com

It finds the same link twice and replaces it twice, producing nested <a> tags that mess up the string. Any idea how I can solve that?
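
One way around the double replacement is to do the matching and the replacing in a single pass, so text that has already become a link is never scanned again. The sketch below uses preg_replace_callback for that; the function name make_clickable_once is hypothetical, and the htmlspecialchars calls are an addition that the original function does not have.

```php
<?php
// Hypothetical single-pass variant of make_clickable().
function make_clickable_once($text)
{
    $pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';
    return preg_replace_callback($pattern, function ($m) {
        $url = $m[0];
        // Shorten long URLs for display only, as in the original function
        $label = strlen($url) > 35 ? substr($url, 0, 21) . '...' . substr($url, -10) : $url;
        // $m[1] is the scheme group; prepend http:// only when it is empty
        $href = $m[1] === '' ? 'http://' . $url : $url;
        return '<a href="' . htmlspecialchars($href) . '" target="_blank" rel="nofollow">'
             . htmlspecialchars($label) . '</a>';
    }, $text);
}

echo make_clickable_once("http://www.example.com\nwww.example.com");
```

Each occurrence is replaced exactly where it was matched, so www.example.com inside an already generated href is left alone.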

Try not to reinvent the wheel. :slight_smile:

https://github.com/cakephp/cakephp/blob/master/lib/Cake/View/Helper/TextHelper.php#L100

I doubt I need all of that for such a (simple) task. I managed to fix the above problem by using preg_replace instead of str_replace, so that only exact links are replaced.
However, now there’s a new problem! (it just never stops) :slight_smile:
Links with parameters are not being converted (like www.example.com/page.php?param=1)

~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si

What do I need to add to the above pattern to solve this?
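
For what it’s worth, a quick check suggests the pattern on its own already captures the query string, so the matching step may not be where the parameters get lost. A minimal sketch, with a made-up sentence around the example URL:

```php
<?php
$pattern = '~\b([a-z]+://)?([a-z0-9-]+\.)+[^\s]+\b~si';

// The surrounding words are illustrative; only the URL comes from the thread.
preg_match_all($pattern, 'see www.example.com/page.php?param=1 for details', $m);
print_r($m[0]);
```

One caveat: because of the trailing \b, a match cannot end in a non-word character, so a final slash as in http://example.com/ is left out of the match.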

As you are finding out, it’s not as simple as it may seem at first glance. IMHO you should try AnthonySterling’s suggestion.

Here’s my New Wheel:



#============================
class string_to_urls
{

#============================
#
#============================
private function url_maker($text)
{
  $result = '';

  # remove http://
  $text = str_replace('http://', '', $text);

  # split into separate words
  $words   = explode(' ', $text);

  $item = array(); #required result
  foreach( $words as $word ):

    #assume URL if and only if has period - should trap tailing . here
    if( strpos( $word, '.'  ) )
    {
      $urltext = strlen($word) > 20 ? substr($word, 0, 17) .'...' : $word;
       $item[] = '<a href="http://'
                  .   $word
                  .   '" target="_blank" rel="nofollow">'
                  .   $urltext
                  . '</a>';
    }
    else
    {
      $item[] =  $word; # plain text
    }
  endforeach;

  #DEBUG
    echo '<pre>';
      #print_r($item);
    echo '</pre>';

  $result = implode(' ', $item); # separator first; the reversed argument order is rejected by PHP 8

  return $result;
}

#============================
#
#============================
function index()
{
  $text = "this string  http://www.example.com/page.php?param=1  has a link starts with www.example.com and also a link starts with http://example.com or http://www.example.com, all these 3 links should be put into an array named urls.";

  echo '<dl style="width:42em; margin:0 auto; border:solid 1px #f00">';
    echo '<dt>Original $text</dt>';
    echo '<dd>' .$text  .'<br /><br /></dd>';

    echo '<dt>function url_maker($text)</dt>';
    echo '<dd>' .$this->url_maker($text)   .'<br /><br /></dd>';

  echo '</dl>';
}

} # end class string_to_urls

Output:


Original $text
    this string http://www.example.com/page.php?param=1 has a link starts with
    www.example.com and also a link starts with http://example.com or
    http://www.example.com, all these 3 links should be put into an array named
    urls.


function url_maker($text)
    this string www.example.com/p... has a link starts with www.example.com and also a link starts with example.com or www.example.com, all these 3 links should be put into an array named urls.

Only the last trailing period requires some attention :slight_smile:

Hey John, thanks for your solution. It looks like a nice way of solving this; however, I can’t rely on only checking for dots, as many words end with a dot (like at the end of a sentence).
How can we just check whether a word in a string starts with http or www, OR contains one of the following strings (.co, .org, .net, .gov)? Then it’s a link for sure, I’d say… (unless there is something I don’t know; if we check for the domain extension, there’s no need to even check for www|http)
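
A rough sketch of that check, in case it helps. The function name looks_like_link and the list of extensions are assumptions for illustration; a real list of domain extensions is much longer, so this will miss plenty of valid links.

```php
<?php
// Hypothetical helper: a word counts as a link if it starts with http(s):// or www.,
// or if its host part ends in one of a few assumed extensions.
function looks_like_link($word)
{
    // Ignore sentence punctuation at the end of the word
    $word = rtrim($word, '.,;:!?');

    if (preg_match('~^(https?://|www\.)\S~i', $word)) {
        return true;
    }

    // Host labels, an assumed extension, an optional two-letter country part,
    // then either the end of the word or the start of a path, query, port or fragment
    return (bool) preg_match('~^([a-z0-9-]+\.)+(com|org|net|gov|co)(\.[a-z]{2})?([/?:#]|$)~i', $word);
}

var_dump(looks_like_link('example.org/page.php?x=1'));
var_dump(looks_like_link('sentence.'));
```

Requiring the extension to be followed by the end of the word or a path character is what keeps something like sentence.Then from being treated as a link.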

@ulthane,

Try this:



    # Old line
    # if( strpos($word, '.') )

    #  replace with this line to eliminate  .' and ." and ...
    #  ($word is the loop variable in url_maker; $item is the result array)
    if( strpos($word, '.') && ( ! strpos($word, '."') )   && ( ! strpos($word, ".'") )  && ( ! strpos($word, "..") )  ) 
   {
      ...
      ...
   }


I cannot think of any other occurrences of the period except those eliminated, if you think of any let me know.
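
An alternative to listing the unwanted combinations is to peel the trailing punctuation off the word first and test what is left. This is only a sketch: split_trailing_punct is a hypothetical helper, and the set of characters it strips is an assumption.

```php
<?php
// Hypothetical helper: separates a word from trailing sentence punctuation,
// so "example.com." can link to example.com while the final period stays plain text.
function split_trailing_punct($word)
{
    $core = rtrim($word, ".,;:!?\"'");
    // The cast covers older PHP versions, where substr() returns false at the end of the string
    $tail = (string) substr($word, strlen($core));
    return array($core, $tail);
}

list($core, $tail) = split_trailing_punct('urls.');
// No dot is left in $core, so the last word of a sentence is not mistaken for a link
var_dump(strpos($core, '.') !== false);
```

The link would then be built from $core, with $tail appended after the closing tag.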

Well, I tested your method a little more deeply, and it seems to screw up line breaks. For example, with the following input:

this string
www.example.com
has links
example.com
with line breaks www.tests.com
in it tests.com and before it

I’m getting this HTML result:


this <a href="http://string<br />
www.example.com<br />
has" target="_blank" rel="nofollow">string<br />
www.example.com<br />
has</a> <a href="http://links<br />
example.com<br />
with" target="_blank" rel="nofollow">links<br />
example.com<br />
with</a> line breaks <a href="http://www.tests.com<br />
in" target="_blank" rel="nofollow">www.tests.com<br />
in</a> it <a href="http://tests.com" target="_blank" rel="nofollow">tests.com</a> and before it

I guess it doesn’t treat newlines as spaces and therefore isn’t splitting on them…
Note: same issue even without nl2br getting involved.
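
That guess matches how explode(' ', $text) works: it splits on the space character only, so words separated by a newline stay glued together. A possible sketch of the fix is to split on any whitespace while keeping the separators, so the line breaks survive when the pieces are joined again. The sample text and the simplified link markup are assumptions for illustration.

```php
<?php
$text = "this string\nwww.example.com\nhas links";

// The capturing group plus PREG_SPLIT_DELIM_CAPTURE keeps the whitespace chunks in the result
$parts = preg_split('/(\s+)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

foreach ($parts as $i => $part) {
    // Same dot heuristic as url_maker(); whitespace chunks never contain a dot
    if (strpos($part, '.') !== false) {
        $parts[$i] = '<a href="http://' . $part . '">' . $part . '</a>';
    }
}

// Join with an empty string, because the original separators are still in $parts
$result = implode('', $parts);
echo $result;
```

If nl2br is used, it would go after this step, since the <br /> tag it inserts contains a space of its own.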

And I’ve made a little progress with my regex attempt as well:

~\b([a-z0-9-]+\.)+[^\?\s]+\b~si

Just a reminder: it converts all links correctly except links with parameters.
So for a link like:
test.com/index.php?param=1
It’ll return
test.com/index.php (all of this as a link) and then ?param=1 as normal text… Anyone?

@ulthane,

Post: #3

$text = "this string has a link starts with www.example.com and also a link starts with http://example.com or http://www.example.com, all these 3 links should be put into an array named urls.";

The code supplied (Post #15 and #17) extracts relevant text from your original $text and makes the correct html links.

By adding line breaks the original specification has changed.

It is now quite late and if you or other posters are unable to offer a solution then tomorrow I will endeavour to create a new script.

PS Any chance of a later version having images :slight_smile:

Sorry if it wasn’t clear that the text could also contain line breaks :stuck_out_tongue:

Anyway, this thing is just driving me nuts! My aim is to get this done with preg_match_all, as the code looks much cleaner that way. I’m so close to the solution; the only problem is when there are parameters in the URL! I bet it is because ‘?’ is seen as a “reserved character” and it somehow needs to be escaped when used in preg_match_all… Anyone with any ideas?
full code:

function make_clickable($text)
{
	// $text = str_replace('?', "\\?", $text); lol..., nah that didn't work ;)
	$text = str_replace('http://', '', $text);
	if (preg_match_all('~\\b([a-z0-9-]+\\.)+[^\\s]+\\b~si', $text, $urls))
	{
		foreach (array_unique($urls[0]) AS $url)
		{
			$urltext = strlen($url) > 35 ? substr($url, 0, 21).'...'.substr($url, -10) : $url;
			$text = preg_replace('~^'.$url.'~m',"<a href=\"http://$url\" target=\"_blank\" rel=\"nofollow\">$urltext</a>",$text);
		}
	}
	return $text;
}
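
The hunch about ‘?’ looks right, although the place it bites is probably the preg_replace call rather than preg_match_all: the found URL is pasted into a new pattern unescaped, so ‘?’ makes the preceding character optional and ‘.’ matches any character. A minimal sketch of one possible fix, using preg_quote on the URL first. The function name make_clickable_quoted is hypothetical, and everything else, including the ^ anchor that only replaces links at the start of a line, is kept as in the function above.

```php
<?php
// Hypothetical variant of make_clickable() that escapes the URL before reusing it as a pattern.
function make_clickable_quoted($text)
{
    $text = str_replace('http://', '', $text);
    if (preg_match_all('~\b([a-z0-9-]+\.)+[^\s]+\b~si', $text, $urls)) {
        foreach (array_unique($urls[0]) as $url) {
            $urltext = strlen($url) > 35 ? substr($url, 0, 21) . '...' . substr($url, -10) : $url;
            // preg_quote() escapes ?, . and other regex characters;
            // the second argument also escapes the ~ delimiter
            $quoted = preg_quote($url, '~');
            $link = "<a href=\"http://$url\" target=\"_blank\" rel=\"nofollow\">$urltext</a>";
            $text = preg_replace('~^' . $quoted . '~m', $link, $text);
        }
    }
    return $text;
}

echo make_clickable_quoted('test.com/index.php?param=1');
```

A URL containing a dollar sign or backslash could still be misread in the replacement string, so this is a sketch rather than a complete solution.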