Detect www

I need a regular expression that detects a web address in a string of text.

I need it to find any web address that starts with http:// or www.

Any domain (.co.uk, .com, anything).

All these would be picked up:

www.domain.com
http://domain.co.uk
http://www.domain.co.uk
www.domain.co.uk

Also, it must pick up all folders and other URL variables (www.site.com/page1?a=123 etc.).

Also, most importantly:
It must NOT pick up web addresses that are already inside an <a href="">xxx</a> link, only ones that are plain text and not embedded in this HTML.

I have tried, but my attempts only handle bits of the above.

I can do the PHP code; I just need the regular expression to drop into my preg_match_all call.

Thanks in advance.

Wouldn't it just be "/(https?:\/\/)|(www\.\s+)/" with the proper DOMDocument walking?

Thanks for the response.

How would I ensure that any web address already used inside the HTML <a> tag isn’t picked up?

I imagine preg_match_all with "/(https?:\/\/)|(www\.\s+)/" will be enough to match the http:// and www. addresses, but I don't want it to pick up <a href="http://www.domain.com">link</a>.

Hence 'the proper DOMDocument walking'. You need to step through the document node by node, replace text content where appropriate, and skip anchor nodes.

Thanks for your help, but I don't understand this.

I understand if you can't help further though.

Surely there is a regular expression that ignores anything that has <a href=" before the web address??

Sure, but what about <a style='moo' href="? Now your regex doesn't work again. And you can't do <a Anything here href= because there exists <a name='stuff'>, which would cause your regex to fail miserably…
Much easier to learn the DOMDocument class and use it.
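
For anyone unsure what "DOMDocument walking" looks like in practice, here is a minimal sketch. It assumes the goal is to extract the addresses into an array (replacing them would mean rewriting the text nodes instead). The function name findPlainTextUrls and the URL pattern are illustrative assumptions, not something given in this thread. The XPath query selects only text nodes that have no <a> ancestor, so it does not matter what attributes the link carries.

```php
<?php
// Hypothetical helper: collect URLs found in text nodes that are NOT inside an <a> element.
function findPlainTextUrls(string $html): array
{
    // Illustrative pattern (an assumption): http://, https:// or www. prefix,
    // a dotted host name, then an optional path, query or fragment.
    $urlPattern = '~\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?#][^\s<>"\']*)?~i';

    // Collect libxml warnings internally so sloppy markup does not emit notices.
    libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    // Wrapping in a <div> guarantees loadHTML() never receives an empty string.
    $doc->loadHTML('<div>' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $urls = [];
    // Every text node without an <a> ancestor; attribute values such as href are never text nodes.
    foreach ($xpath->query('//text()[not(ancestor::a)]') as $textNode) {
        if (preg_match_all($urlPattern, $textNode->nodeValue, $m)) {
            $urls = array_merge($urls, $m[0]);
        }
    }
    return $urls;
}

$html = 'Visit www.domain.com or <a style="moo" href="http://www.linked.com">http://www.linked.com</a> today.';
print_r(findPlainTextUrls($html));
```

Only the address in the plain text should be reported here; both the href value and the link text sit inside the anchor and are skipped. If other containers must be excluded as well (script or style blocks, say), the XPath predicate is the place to add them.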

And also, the regex you gave doesn't work.

/(www\.[a-zA-Z0-9-]+\.[a-zA-Z\.]{2,})/

…is the only one I can get to half work, but it doesn't pick everything up.
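
Two observations may explain the partial results. In the earlier suggestion, \s matches whitespace, so (www\.\s+) looks for "www." followed by spaces; \S (non-whitespace) was probably intended. The pattern above, meanwhile, requires a leading www. and stops at the host name, so it misses http://domain.co.uk and drops any path or query string. Below is a rough sketch of a broader pattern; the pattern itself is an assumption for illustration and is not a complete URL grammar. Note that a full stop directly after a path (end of a sentence) would be swallowed into the match.

```php
<?php
// Illustrative pattern (an assumption, not a full URL grammar):
//   (?:https?://|www\.)        either a scheme or a leading "www."
//   [a-z0-9-]+(?:\.[a-z0-9-]+)+  host name with at least one dot (.com, .co.uk, ...)
//   (?:[/?#][^\s<>"\']*)?      optional path, query or fragment up to whitespace, quote or angle bracket
$urlPattern = '~\b(?:https?://|www\.)[a-z0-9-]+(?:\.[a-z0-9-]+)+(?:[/?#][^\s<>"\']*)?~i';

$text = 'See www.domain.com, http://domain.co.uk, http://www.domain.co.uk and www.site.com/page1?a=123 for details.';

// $matches[0] holds the full matches in the order they appear.
preg_match_all($urlPattern, $text, $matches);
print_r($matches[0]);
```

On its own this still matches addresses inside existing <a> tags, so it only solves the matching half of the problem; it needs to be applied to text nodes outside anchors rather than to the raw HTML string.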

Are you wanting to extract the URLs into an array, or replace them with something? Are there other HTML tags that they can't be inside of (e.g. img tags, style tags, etc.)?