Help with JavaScript regex

Hello,
I want to use JavaScript to parse links from HTML such as the example below:

'test <a href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a> adsd'

I wrote the following:


var str = 'test <a href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a> adsd';

var arrMatch = str.match(/href=(.*)>(.*)<\/a>/g);
alert(arrMatch);

But it returns

href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a>

not

href=http://link1.com>asdf</a>
href=http://link2.com>asdf</a>

as expected.

Please help me with this.

Thank you.

You need to use a non-greedy expression, like this:


match(/href=(.*?)>(.*?)<\/a>/g);
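To see the difference, here is a minimal sketch that runs both patterns against the string from the first post. The variable names `greedy` and `lazy` are only for illustration.

```javascript
var str = 'test <a href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a> adsd';

// Greedy: (.*) grabs as much as it can, so both links end up in one match.
var greedy = str.match(/href=(.*)>(.*)<\/a>/g);

// Non-greedy: (.*?) stops at the first place the rest of the pattern can match.
var lazy = str.match(/href=(.*?)>(.*?)<\/a>/g);

console.log(greedy.length); // 1
console.log(lazy.length);   // 2
console.log(lazy[0]);       // href=http://link1.com>asdf</a>
console.log(lazy[1]);       // href=http://link2.com>asdf</a>
```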

Got it, thank you so much

match(/href=(.*?)>(.*?)<\/a>/g);

It won’t match the following:


<a href=test0.php>this is
test0</a>
<a href=test1.php class="test">test1</a>
<a href='test2.php'>test2</a>
<a href="test3.php">test3</a>
<a HREF=test4.php>test4</a>

If you want to support cases above, use this regex:


match(/href=['"]?([^\s'"<>]*)['"]?[^<>]*>([\s\S]{1,100}?)<\/a>/gi);

['"]? - matches a single or double quote, or no quote
[^\s'"<>]* - the href value: zero or more characters that are not whitespace, quotes, or angle brackets
[^<>]* - any additional attributes after href, for example target=_blank class=smth
[\s\S]{1,100}? - link text, 1-100 characters (newlines included); "?" makes it non-greedy, so it ends at the first closing tag
/i - case-insensitive
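As a rough sketch of how this pattern could be used to pull out the URL and the link text separately, the snippet below runs it over the five test cases. It assumes an environment that has `String.prototype.matchAll` (ES2020); `html`, `linkPattern`, and `links` are names made up for this example.

```javascript
// The five test cases, one per line; test0 has a newline inside the link text.
const html = [
  '<a href=test0.php>this is\ntest0</a>',
  '<a href=test1.php class="test">test1</a>',
  "<a href='test2.php'>test2</a>",
  '<a href="test3.php">test3</a>',
  '<a HREF=test4.php>test4</a>'
].join('\n');

const linkPattern = /href=['"]?([^\s'"<>]*)['"]?[^<>]*>([\s\S]{1,100}?)<\/a>/gi;

// Group 1 is the bare URL (no quotes), group 2 is the link text.
const links = Array.from(html.matchAll(linkPattern), (m) => ({ url: m[1], text: m[2] }));

console.log(links.map((l) => l.url));
// [ 'test0.php', 'test1.php', 'test2.php', 'test3.php', 'test4.php' ]
```

Note that the `{1,100}` cap means link text longer than 100 characters will not match.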

Actually, it detects 3 out of 5 of your tests without any changes:


var str = '<a href=test0.php>this is\ntest0</a>' +
'<a href=test1.php class="test">test1</a>' +
'<a href=\'test2.php\'>test2</a>' +
'<a href="test3.php">test3</a>' +
'<a HREF=test4.php>test4</a>';

var arrMatch = str.match(/href=(.*?)>(.*?)<\/a>/g);
alert(arrMatch);

returns:


 ["href=test1.php class="test">test1</a>", "href='test2.php'>test2</a>", "href="test3.php">test3</a>"]

So if only the 'i' modifier is added, it will match all except the multi-line example:


match(/href=(.*?)>(.*?)<\/a>/gi);
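A quick way to check this claim is to count the matches with and without the flag. This is only a sketch; `withoutI` and `withI` are illustrative names, and the test links are joined with newlines here for readability.

```javascript
var str = [
  '<a href=test0.php>this is\ntest0</a>',
  '<a href=test1.php class="test">test1</a>',
  "<a href='test2.php'>test2</a>",
  '<a href="test3.php">test3</a>',
  '<a HREF=test4.php>test4</a>'
].join('\n');

var withoutI = str.match(/href=(.*?)>(.*?)<\/a>/g);
var withI = str.match(/href=(.*?)>(.*?)<\/a>/gi);

console.log(withoutI.length); // 3
console.log(withI.length);    // 4 (HREF now matches; the multi-line link still does not)
console.log(withI[3]);        // HREF=test4.php>test4</a>
```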

Actually, it detects none of my tests. We want to match links, not garbage like 'test2.php' with the quotes included. Your pattern matches 0 of the 5 cases from my previous post.

Let's examine the examples in detail:


<a href=test0.php>this is
test0</a>

0 : 1, does not match


<a href=test1.php class="test">test1</a>

0 : 2, garbage match: test1.php class="test" - this is not a correct link


<a href='test2.php'>test2</a>

0 : 3, garbage match: 'test2.php' - this is not a correct link


<a href="test3.php">test3</a>

0 : 4, garbage match: "test3.php" - this is not a correct link


<a HREF=test4.php>test4</a>

0 : 5, does not match
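The "garbage" in question is what ends up in the first capture group. A minimal sketch that prints group 1 for the simple pattern (with the 'i' flag added) makes this visible; `re`, `captured`, and `m` are illustrative names.

```javascript
var str = [
  '<a href=test0.php>this is\ntest0</a>',
  '<a href=test1.php class="test">test1</a>',
  "<a href='test2.php'>test2</a>",
  '<a href="test3.php">test3</a>',
  '<a HREF=test4.php>test4</a>'
].join('\n');

var re = /href=(.*?)>(.*?)<\/a>/gi;
var captured = [];
var m;

// exec() with the g flag walks through the matches one at a time.
while ((m = re.exec(str)) !== null) {
  captured.push(m[1]); // everything between "href=" and the next ">"
}

console.log(captured);
// [ 'test1.php class="test"', "'test2.php'", '"test3.php"', 'test4.php' ]
```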

Summary:

You lose 0 : 5!

GG.
GL next time.

Look at the original post, expected result should be:


href=http://link1.com>asdf</a>
href=http://link2.com>asdf</a>

which is what my script was detecting.
It was not specified that only the URL should be detected.
So let's not go beyond this topic.

His approach is wrong; this is not the way to match links. It wasn't specified? It also wasn't specified that we shouldn't point him in the right direction if he was making wrong assumptions.

He did say that he wants to parse links, not URLs, as stated in the first post. And a link is not necessarily just what is inside the href attribute; from an HTML perspective it could be the whole 'a' tag. It depends on what it was needed for.