Need help with regex (spaces)

Hi, I have the following code to scan text for URLs and return all of them in an array.
I’ve been testing this for days but I’m not having any success handling leading and trailing spaces.

var result = text.match("[^\\s\(\)\"\'<>\,\!]+(((ht|f)tp(s?)\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+(\.[a-zA-Z]{2,4})?[^\\s]+)", "gi");

With this regex, it always fails when I try the following input:

http://www.domain.com/test?id=dsd sakads http://www.xample.com/test?id=dsd

It returns:

http://www.domain.com/test?id=dsd sakads, http://www.xample.com/test?id=dsd

The “sakads” is included. How do I remove any text and spaces in between the URLs?

Thanks in advance.

By the way, are regular expressions in JavaScript the same as in PHP?

I tried to use [^\s] to detect spaces but it always seems to remove the character “s” :(
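
A likely cause, assuming the pattern is passed as a quoted string: inside a JavaScript string literal `\s` is not a recognised escape, so the backslash is dropped and the regex engine sees `[^s]`, which excludes the letter "s" rather than whitespace. Doubling the backslash, or using a regex literal, keeps `\s` intact. A small sketch (the variable names and sample text are made up for illustration):

```javascript
// In a string literal, "\s" collapses to "s" before the regex is built.
const fromString = new RegExp("[^\s]+", "g");  // really /[^s]+/g
const escaped = new RegExp("[^\\s]+", "g");    // really /[^\s]+/g
const literal = /[^\s]+/g;                     // regex literal, no doubling needed

const sample = "test spaces";
const broken = sample.match(fromString); // splits on the letter "s"
const fixed = sample.match(escaped);     // splits on whitespace
const same = sample.match(literal);      // same result as `fixed`

console.log(fromString.source); // [^s]+
console.log(broken);            // [ 'te', 't ', 'pace' ]
console.log(fixed);             // [ 'test', 'spaces' ]
```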

var result = text.match("https?\\://[-a-zA-Z0-9\\.]+(/[-a-zA-Z0-9\\?\\=]+)?", "ig")

give that a try…

That gave me null result. I tested using this string:

var text = "http://www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";

Ok, well in order to test out the regExp, I used a quick PHP script, and it works fine. So I’m pretty sure the regExp works, but perhaps there’s an issue with syntax.

Do you have a test page online I can have a look at?

I haven’t had this kinda issue using PHP. Seems like regex in javascript really is a pain :(.

I only have this simple page for quickly testing patterns. Currently it even matches a URL with an invalid domain such as youtube (without the extension), which I do not want:

<style>
body {
    background:#000;
}
</style>
<script src="//ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js" type="text/javascript"></script>
<script>
$(document).ready( function( ) {
    var text  = "http://www.youtube/watch?v=KOFqBrwld3c  Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    alert(text.match( "https?://([a-z0-9].+)(\\.[a-z0-9\\-]+)(.[a-z]{2,4})+", "igm" ))
});
</script>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<p>&nbsp;</p>
<table width="100%"><tr><td align="center">
<input type="submit" onclick="window.location.reload( true );" style="padding:100px;" value="submit" />
</td></tr></table>

OK, so string.match will return an array of matches so this:


var text  = "http://www.youtube/watch?v=KOFqBrwld3c  Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
var matches = text.match( "https?\\://[-a-zA-Z0-9\\.]+(/[-a-zA-Z0-9\\?\\=]+)?", "ig" );
alert(matches);

will return:

http://www.youtube/watch?v=KOFqBrwld3c, http://www.youtube2.com/watch?v=KOFqBrwld3c

So to return the first url matched you use this


alert(matches[0]);

the second would be:


alert(matches[1]);

and so on…

Is that any help?

Thanks for your help micky, that’s kinda close, but the first match is not a valid domain; it has no extension.

Also I need it to match:

www.youtube.com/watch?v=KOFqBrwld3c
youtube.com/watch?v=KOFqBrwld3c

Do I need to separate these checks or is there a pattern that covers these as well? Really appreciate your help micky.

Additionally, if you want to iterate through the array and output the URLs, you can do this:


$.each(matches, function(){
	$('body').append('<p>' + this + '</p>');
});

But I’ve just noticed it’ll only match the first occurrence… the second array entry is the stuff in the parentheses. As far as I’m aware the g modifier should make it global, but it isn’t in this case.
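
One explanation that fits this symptom: in standard JavaScript `String.prototype.match` takes a single argument, so the `"ig"` string is ignored (as far as I know, a second flags argument was only ever a non-standard extension in older Firefox versions). A string pattern is then compiled without the `g` flag, and the result is the first match followed by its capture groups. Building the RegExp explicitly with its flags avoids this. A minimal sketch reusing the pattern and test string from above:

```javascript
const text = "http://www.youtube/watch?v=KOFqBrwld3c  Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
const pattern = "https?\\://[-a-zA-Z0-9\\.]+(/[-a-zA-Z0-9\\?\\=]+)?";

// The second argument is ignored: no "g" flag, so only the first match
// comes back, followed by the contents of the capture group.
const firstOnly = text.match(pattern, "ig");

// Flags passed to the RegExp constructor are honoured.
const allMatches = text.match(new RegExp(pattern, "ig"));

console.log(firstOnly[0]); // http://www.youtube/watch?v=KOFqBrwld3c
console.log(firstOnly[1]); // /watch?v=KOFqBrwld3c
console.log(allMatches);   // both URLs, no capture groups
```

With the `g` flag set, `match` returns only the full matches, which is why the parenthesised part disappears from the array.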

Hi micky, I think I can live with the “without the extension” issue, because even when I tried it on Facebook the result was the same. For now the only check I need to add is for the case below.

I tried replacing the https at the front of your pattern with ((ht|f)tp(s?)\://),
but it allows plain text to pass through the check… :(

I cannot understand why this wouldn’t match :

var text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
var matches = text.match( "((ht|f)tp(s?)\://)?[\-a-z0-9\.]+(/[\-a-z0-9\?\=]+)?", "ig" );
alert(matches);

Result: wer,www.youtube/watch?v=KOFqBrwld3c,Longest,Word,http://www.youtube2.com/watch?v=KOFqBrwld3c

Shouldn’t the period inside the square brackets in [\-a-z0-9\.] force the string to contain a period? Not sure why plain text is allowed through… argh :(
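
On the question above: a character class such as `[\-a-z0-9\.]` matches any single character from the set, so a period is allowed but never required, and a run of plain letters satisfies `[...]+` on its own. To require a dot, it has to appear outside the class. A rough sketch under that assumption (the stricter `needsDot` pattern is illustrative only, not a complete URL validator):

```javascript
const text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";

// Period only inside the class: it is optional, so bare words match too.
const loose = /((ht|f)tp(s?):\/\/)?[-a-z0-9.]+(\/[-a-z0-9?=]+)?/gi;

// Literal "\." outside the class: at least one dot is mandatory.
const needsDot = /((ht|f)tp(s?):\/\/)?[-a-z0-9]+(\.[-a-z0-9]+)+(\/[-a-z0-9?=]+)?/gi;

const looseMatches = text.match(loose);
const dottedMatches = text.match(needsDot);

console.log(looseMatches);  // includes "wer", "Longest" and "Word"
console.log(dottedMatches); // only the two URL-like strings
```

Note that `needsDot` still accepts `www.youtube/...`, since that string does contain a dot; checking for a real extension needs a separate rule.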

OK, the first is not a valid domain cos I forgot to put the ‘.com’ in the string, but you’re right, the regex should not have matched it in that case.

To match either http(s)://www.etc.com/blah, or www[.]etc[.]com , you need to make the http(s):// part optional like so:


(https?\\://)?[-a-zA-Z0-9\\.]+(/[-a-zA-Z0-9\\?\\=]+)?

But I’m struggling with the .com, .co.uk part atm.

I’m thinking:


"(https?\\://)?www\\.[-a-zA-Z0-9_]+\\.([a-zA-Z\\.]{2,4})+(/[-a-zA-Z0-9\\?\\=]+)?", "g"

It will only match stuff with www. at the start, but it doesn’t seem to care if the 2nd dot’s there or not. Not really sure why.
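
For the `.com` / `.co.uk` part, one option is to repeat a group made of a literal dot plus two to four letters, instead of putting the dot inside the character class. This is only a sketch: the sample string is made up from the `www.etc.com/blah` example above, and the `{2,4}` limit (taken from the patterns in this thread) would reject longer extensions.

```javascript
// One or more ".xx" to ".xxxx" suffixes: covers ".com" as well as ".co.uk".
const urlPattern = /(https?:\/\/)?www\.[-a-zA-Z0-9_]+(\.[a-zA-Z]{2,4})+(\/[-a-zA-Z0-9?=]+)?/g;

const samples = "www.etc.com/blah https://www.etc.co.uk/blah?id=1 www.youtube/watch?v=KOFqBrwld3c";
const found = samples.match(urlPattern);

console.log(found); // [ 'www.etc.com/blah', 'https://www.etc.co.uk/blah?id=1' ]
```

The third sample is skipped because nothing after `www.youtube` starts with a dot.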

If my text is like this:

var text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";

It fails :(

I think for this part I can make do with not checking because it will definitely cause more headache.

Yeah, for some reason that second dot is not matching, and I don’t have a clue why. If I use text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c", it should match the second url, but it’s matching ‘www.youtube/watc’.

Right, so here’s my regex:
(https?\://)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})+(/[-a-zA-Z0-9\?\=]+)?

The first array element is the whole match, which is ‘www.youtube/watc’, the second array element is the first matched parenthesis which is ‘watc’.

Now for some reason, it’s including the ‘/’ as being matched by [-a-zA-Z0-9_], which as far as I’m aware it shouldn’t do. Then, it’s matching ‘watc’ as the 4 characters after the first forward slash with ‘([a-zA-Z\.]{2,4})’. Which is also confusing, since there’s no forward slash featured in the preceding part of the RegExp.
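
A possible explanation, assuming the pattern above was passed to `match` as a quoted string with single backslashes: the string literal turns `\.` into a bare `.`, which in a regex matches any character. The "dot" after `[-a-zA-Z0-9_]+` then matches the `/`, and `[a-zA-Z.]{2,4}` takes the next four letters, `watc`. The sketch below reproduces that result and shows the doubled-backslash form for comparison (the variable names are illustrative):

```javascript
const text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";

// Single backslashes: "\." becomes ".", i.e. "any character".
const single = "(https?\://)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})+(/[-a-zA-Z0-9\?\=]+)?";
// Doubled backslashes: the regex engine receives a real "\.".
const doubled = "(https?\\://)?www\\.[-a-zA-Z0-9_]+\\.([a-zA-Z\\.]{2,4})+(/[-a-zA-Z0-9\\?\\=]+)?";

const wrong = text.match(new RegExp(single));
const right = text.match(new RegExp(doubled));

console.log(wrong[0]); // www.youtube/watc
console.log(wrong[2]); // watc
console.log(right[0]); // http://www.youtube2.com/watch?v=KOFqBrwld3c
```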

Now in PHP (and most probably Perl), I think preg_match behaves a little differently. So I’m afraid I’m gonna have to duck out here, cos I’m at a loss as to why this isn’t working. Hopefully someone who knows a little more about javascript RegExp will be able to pick up the thread from here.

Yes micky, this is just weird, but thanks for your great help anyway. If anyone has any advice, I would greatly appreciate it.

Meanwhile I’ll just keep trying…

Hey there,

So a friend of mine, who is much more intelligent than I am, has pointed out that javascript behaves differently depending on the way you use a regex. So if you use this method:


var text  = "wer www.youtube.com/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
var re = /(https?\:\/\/)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})(\/[-a-zA-Z0-9\?\=]+)?/ig;
var matches = text.match(re);
alert(matches);

will operate ‘properly’. In that case the regex I gave you works correctly, and will return an array of matches. Note how the regex is no longer in inverted commas, and the modifiers come after the last forward slash. The forward slash acts as the delimiter of the regex, so you need to escape any other forward slashes inside the pattern with a backslash. Because it is not a string, the other backslashes do not need to be doubled.

Hope that helps.

:)