  1. #1
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)

    Need help with regex (spaces)

    Hi, I have the following code to scan text for URLs and return all the URLs in an array.
    I've been testing this for days but keep failing to handle leading and trailing spaces.

    var result = text.match("[^\\s\(\)\"\'<>\,\!]+(((ht|f)tp(s?)\://)?(www.|[a-zA-Z].)[a-zA-Z0-9\-\.]+(\.[a-zA-Z]{2,4})?[^\\s]+)", "gi");
    With this regex, it always fails when I type the following:
    It returns:

    http://www.domain.com/test?id=dsd sakads, http://www.xample.com/test?id=dsd

    The result includes the "sakads". How do I exclude any text and spaces in between URLs?

    Thanks in advance.
    I Dunno LOL \(_o)/

  2. #2
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    By the way, may I ask whether JavaScript regular expressions are the same as PHP's?

    I tried to use [^\s] to detect spaces but it always seems to remove the character "s"
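A likely cause of the disappearing "s", sketched below: when the pattern is written as a quoted string, JavaScript processes string escapes before the regex engine sees anything. `\s` is not a string escape, so the backslash is dropped and the class becomes `[^s]`. The variable names and the sample text are made up for illustration.

```javascript
// "\s" inside a string literal is just "s", so this compiles as /[^s]+/.
const fromString = new RegExp("[^\s]+");
// Doubling the backslash keeps it: this compiles as /[^\s]+/.
const fromDoubled = new RegExp("[^\\s]+");
// A regex literal is not a string, so no doubling is needed.
const fromLiteral = /[^\s]+/;

const sample = "two words";

console.log(fromString.source);             // [^s]+
console.log(sample.match(fromString)[0]);   // "two word" (the space passes, the "s" stops it)
console.log(sample.match(fromDoubled)[0]);  // "two"
console.log(sample.match(fromLiteral)[0]);  // "two"
```
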
    I Dunno LOL \(_o)/

  3. #3
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    var result = text.match("https?\://[-a-zA-Z0-9\.]+(/[-a-zA-Z0-9\?\=]+)?", "ig")

    give that a try...

  4. #4
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by mickyginger View Post
    var result = text.match("https?\://[-a-zA-Z0-9\.]+(/[-a-zA-Z0-9\?\=]+)?", "ig")

    give that a try...
    That gave me a null result. I tested using this string:

    var text = "http://www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    I Dunno LOL \(_o)/

  5. #5
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Ok, well in order to test out the regExp, I used a quick PHP script, and it works fine. So I'm pretty sure the regExp works, but perhaps there's an issue with syntax.

    Do you have a test page online I can have a look at?

  6. #6
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by mickyginger View Post
    Ok, well in order to test out the regExp, I used a quick PHP script, and it works fine. So I'm pretty sure the regExp works, but perhaps there's an issue with syntax.

    Do you have a test page online I can have a look at?
    I haven't had this kind of issue using PHP. Seems like regex in JavaScript really is a pain.

    I only have this simple page for quickly testing patterns. Currently it even matches an invalid domain URL like youtube (without the extension), which I do not want:

    PHP Code:
    <style>
    body {
        background:#000;
    }
    </style>
    <script src="//ajax.googleapis.com/ajax/libs/jquery/1.6.1/jquery.min.js" type="text/javascript"></script>
    <script>
    $(document).ready( function( ) {
        var text  = "http://www.youtube/watch?v=KOFqBrwld3c  Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
        alert(text.match( "https?://([a-z0-9].+)(\.[a-z0-9\-]+)(.[a-z]{2,4})+", "igm" ))
    });
    </script>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <p>&nbsp;</p>
    <table width="100%"><tr><td align="center">
    <input type="submit" onclick="window.location.reload( true );" style="padding:100px;" value="submit" />
    </td></tr></table> 
    I Dunno LOL \(_o)/

  7. #7
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    OK, so string.match will return an array of matches so this:
    Code:
    var text  = "http://www.youtube/watch?v=KOFqBrwld3c  Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    var matches = text.match( "https?\://[-a-zA-Z0-9\.]+(/[-a-zA-Z0-9\?\=]+)?", "ig" );
    alert(matches);
    will return:

    So to return the first url matched you use this
    Code:
    alert(matches[0]);
    the second would be:
    Code:
    alert(matches[1]);
    and so on...

    Is that any help?

  8. #8
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by mickyginger View Post
    OK, so string.match will return an array of matches so this:
    Code:
    var text  = "http://www.youtube/watch?v=KOFqBrwld3c  Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    var matches = text.match( "https?\://[-a-zA-Z0-9\.]+(/[-a-zA-Z0-9\?\=]+)?", "ig" );
    alert(matches);
    will return:

    So to return the first url matched you use this
    Code:
    alert(matches[0]);
    the second would be:
    Code:
    alert(matches[1]);
    and so on...

    Is that any help?
    Thanks for your help, micky. That's kind of close, but the first match is not a valid domain: it has no extension.

    Also, I need it to match these as well:

    www.youtube.com/watch?v=KOFqBrwld3c
    youtube.com/watch?v=KOFqBrwld3c
    Do I need to separate these checks, or is there a pattern that covers them as well? Really appreciate your help, micky.
    I Dunno LOL \(_o)/

  9. #9
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Additionally, if you want to iterate through the array and output the urls, you can do this:

    Code:
    $.each(matches, function(){
    	// append each match to the page as its own paragraph
    	$('body').append('<p>' + this + '</p>');
    });
    But I've just noticed, it'll only match the first occurrence... the second array entry is the stuff in parenthesis. As far as I'm aware the /g modifier should make it global, but it's not in this case.
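One explanation that fits this behaviour: in current JavaScript engines `String.prototype.match` takes a single argument, so the `"ig"` passed as a second argument is ignored and the string is compiled with no flags at all. Without `g`, `match` returns the first match followed by its capture groups. A minimal sketch, assuming the test string from post #4 and a pattern with the redundant backslashes removed:

```javascript
const text = "http://www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
const pattern = "https?://[-a-zA-Z0-9.]+(/[-a-zA-Z0-9?=]+)?";

// The second argument is ignored, so this is a non-global match:
// element 0 is the first URL, element 1 is the first capture group.
const single = text.match(pattern, "ig");
console.log(single[0]); // "http://www.youtube/watch?v=KOFqBrwld3c"
console.log(single[1]); // "/watch?v=KOFqBrwld3c"

// Passing the flags to the RegExp constructor makes the match global:
// the result is a flat array of every full match, with no groups.
const all = text.match(new RegExp(pattern, "ig"));
console.log(all.length); // 2
console.log(all[1]);     // "http://www.youtube2.com/watch?v=KOFqBrwld3c"
```
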

  10. #10
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Hi micky, I think I can live with the "without the extension" issue, because I get the same result even when I try it on Facebook. For now the only check I still need to add is for the URLs below.

    www.youtube.com/watch?v=KOFqBrwld3c
    youtube.com/watch?v=KOFqBrwld3c
    I tried replacing the https at the front of your pattern with ((ht|f)tp(s?)\://)
    but it allows plain text to pass through the check...
    I Dunno LOL \(_o)/

  11. #11
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    I cannot understand why this wouldn't match :

    var text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    var matches = text.match( "((ht|f)tp(s?)\://)?[\-a-z0-9\.]+(/[\-a-z0-9\?\=]+)?", "ig" );
    alert(matches);
    Shouldn't the period inside the square brackets in [\-a-z0-9\.] force the string to contain a period? Not sure why plain text is allowed through...argh
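On the square brackets question: a character class matches any one of the characters listed in it, so putting a period inside the class only allows a period, it does not require one. To require a dot, the `\.` has to sit outside the class. A small sketch of the difference, using anchored patterns and names chosen purely for illustration (they are not the patterns from the thread):

```javascript
// Any mix of letters, digits, hyphens and dots; a dot is allowed but optional.
const anyOfClass = /^[-a-z0-9.]+$/i;
// A label, then at least one ".label"; here a dot is mandatory.
const needsDot = /^[-a-z0-9]+(\.[-a-z0-9]+)+$/i;

console.log(anyOfClass.test("wer"));           // true: plain text gets through
console.log(needsDot.test("wer"));             // false
console.log(needsDot.test("youtube.com"));     // true
console.log(needsDot.test("www.youtube.com")); // true
```
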
    I Dunno LOL \(_o)/

  12. #12
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    OK, the first is not a valid domain because I forgot to put the '.com' in the string, but you're right, the regex should not have matched it in that case.

    To match either http(s)://www.etc.com/blah, or www[.]etc[.]com , you need to make the http(s):// part optional like so:

    Code:
    (https?\://)?[-a-zA-Z0-9\.]+(/[-a-zA-Z0-9\?\=]+)?
    But I'm struggling with the .com, .co.uk part atm.

    I'm thinking:
    Code:
    "(https?\://)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})+(/[-a-zA-Z0-9\?\=]+)?", "g"
    It will only match stuff with www. at the start, but it doesn't seem to care if the 2nd dot's there or not. Not really sure why.

  13. #13
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Quote Originally Posted by mickyginger View Post
    OK, the first is not a valid domain because I forgot to put the '.com' in the string, but you're right, the regex should not have matched it in that case.

    To match either http(s)://www.etc.com/blah, or www[.]etc[.]com , you need to make the http(s):// part optional like so:

    Code:
    (https?\://)?[-a-zA-Z0-9\.]+(/[-a-zA-Z0-9\?\=]+)?
    If my text is like this:

    var text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    It fails.

    Quote Originally Posted by mickyginger View Post
    But I'm struggling with the .com, .co.uk part atm.

    I'm thinking:
    Code:
    "(https?\://)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})+(/[-a-zA-Z0-9\?\=]+)?", "g"
    But it doesn't seem to care if the 2nd dot's there or not. Not really sure why.
    I think for this part I can make do with not checking because it will definitely cause more headache.
    I Dunno LOL \(_o)/

  14. #14
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Yeah, for some reason that second dot is not matching, and I don't have a clue why. If I use text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c", it should match the second url, but it's matching 'www.youtube/watc'.

    Right, so here's my regex:
    (https?\://)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})+(/[-a-zA-Z0-9\?\=]+)?

    The first array element is the whole match, which is 'www.youtube/watc'. The second array element is the first matched parenthesis, which is 'watc'.

    Now for some reason, it's including the '/' as being matched by [-a-zA-Z0-9_], which as far as I'm aware it shouldn't do. Then, it's matching 'watc' as the last 4 characters after the first forward slash with '([a-zA-Z\.]{2,4})'. Which is also confusing, since there's no forward slash featured in the preceding part of the RegExp.
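A possible explanation for the stray '/', sketched here as a diagnosis only: because the pattern is passed as a quoted string, each `\.` loses its backslash before the regex is compiled, and a bare `.` matches any character, including '/'. So the '/' is consumed by that dot rather than by `[-a-zA-Z0-9_]`. The snippet uses the pattern and test string from this post; the variable names are illustrative.

```javascript
const text = "wer www.youtube/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";

// The same quoted pattern as in the thread, single backslashes and all.
const asString = "(https?\://)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})+(/[-a-zA-Z0-9\?\=]+)?";

// Printing the string shows what the regex engine actually receives:
// every backslash is gone, so both "escaped" dots are now wildcards.
console.log(asString);
// (https?://)?www.[-a-zA-Z0-9_]+.([a-zA-Z.]{2,4})+(/[-a-zA-Z0-9?=]+)?

const loose = text.match(asString);
console.log(loose[0]); // "www.youtube/watc" (the second "." swallowed the "/")
console.log(loose[2]); // "watc"

// An unescaped dot is happy to match a forward slash.
console.log(new RegExp(".").test("/")); // true
```
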

    Now in PHP (and most probably Perl), I think preg_match behaves a little differently. So I'm afraid I'm gonna have to duck out here, cos I'm at a loss as to why this isn't working. Hopefully someone who knows a little more about JavaScript RegExp will be able to pick up the thread from here.

  15. #15
    SitePoint Zealot
    Join Date
    Jul 2007
    Posts
    170
    Mentioned
    0 Post(s)
    Tagged
    0 Thread(s)
    Yes micky, this is just weird, but thanks for your great help anyway. If anyone has any advice, I would greatly appreciate it.

    Meanwhile I'll just keep trying...
    I Dunno LOL \(_o)/

  16. #16
    SitePoint Addict
    Join Date
    Oct 2009
    Location
    London, UK
    Posts
    382
    Mentioned
    1 Post(s)
    Tagged
    0 Thread(s)
    Hey there,

    So a friend of mine, who is much more intelligent than I, has pointed out that JavaScript behaves differently depending on the way you use a regex. So if you use this method:
    Code:
    var text  = "wer www.youtube.com/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";
    var re = /(https?\:\/\/)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z\.]{2,4})(\/[-a-zA-Z0-9\?\=]+)?/ig;
    var matches = text.match(re);
    alert(matches);
    it will operate 'properly'. In that case the regex I gave you works correctly and returns an array of matches. Note how the regex is no longer in inverted commas, and the modifiers come after the closing forward slash. The forward slash acts as the delimiter of the regex, so you need to escape every other forward slash inside it.
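For comparison, here is a sketch of the same pattern written both ways. It assumes the test string from this post; the variable names are illustrative. In the literal form the forward slashes are escaped and the flags follow the closing slash. In the constructor form the forward slashes need no escaping, but every backslash must be doubled so it survives the string literal, and the flags go in the second argument.

```javascript
const text = "wer www.youtube.com/watch?v=KOFqBrwld3c Longest Word http://www.youtube2.com/watch?v=KOFqBrwld3c";

// Regex literal: "/" is the delimiter, so inner slashes are written as "\/".
const literal = /(https?:\/\/)?www\.[-a-zA-Z0-9_]+\.([a-zA-Z.]{2,4})(\/[-a-zA-Z0-9?=]+)?/ig;

// RegExp constructor: plain "/" is fine, but "\." must be typed as "\\.".
const constructed = new RegExp(
  "(https?://)?www\\.[-a-zA-Z0-9_]+\\.([a-zA-Z.]{2,4})(/[-a-zA-Z0-9?=]+)?",
  "ig"
);

const fromLiteral = text.match(literal);
const fromConstructor = text.match(constructed);

console.log(fromLiteral);
// [ "www.youtube.com/watch?v=KOFqBrwld3c", "http://www.youtube2.com/watch?v=KOFqBrwld3c" ]
console.log(fromConstructor.join() === fromLiteral.join()); // true
```
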

    Hope that helps.


