Help on Javascript regex

hothandao · January 8, 2010, 3:06pm

Hello,
I want to use JavaScript to parse links from HTML, as in the example below:

'test <a href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a> adsd'

I wrote the following code:


var str = 'test <a href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a> adsd';

var arrMatch = str.match(/href=(.*)>(.*)<\/a>/g);
alert(arrMatch);

But it returns

href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a>

not

href=http://link1.com>asdf</a>

href=http://link2.com>asdf</a>

as expected.

Please help me with this.

Thank you,

igv · January 8, 2010, 3:46pm

You need to use a non-greedy expression, like this:


match(/href=(.*?)>(.*?)<\/a>/g);
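
For anyone comparing the two behaviours, here is a small sketch run against the string from the first post. The variable names `greedy` and `lazy` are only illustrative.

```javascript
// Sample input taken from the original question.
var str = 'test <a href=http://link1.com>asdf</a> sdios <a title="test2" href=http://link2.com>asdf</a> adsd';

// Greedy: .* takes as much as it can, so one match runs through to the last </a>.
var greedy = str.match(/href=(.*)>(.*)<\/a>/g);

// Non-greedy: .*? takes as little as it can, so each link is matched on its own.
var lazy = str.match(/href=(.*?)>(.*?)<\/a>/g);

console.log(greedy.length); // 1
console.log(lazy);          // ["href=http://link1.com>asdf</a>", "href=http://link2.com>asdf</a>"]
```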

hothandao · January 8, 2010, 3:51pm

Got it, thank you so much

BooBooGotU · January 8, 2010, 7:32pm

match(/href=(.*?)>(.*?)<\/a>/g);

It won’t match the following:


<a href=test0.php>this is
test0</a>
<a href=test1.php class="test">test1</a>
<a href='test2.php'>test2</a>
<a href="test3.php">test3</a>
<a HREF=test4.php>test4</a>

If you want to support cases above, use this regex:


match(/href=['"]?([^\s'"<>]*)['"]?[^<>]*>([\s\S]{1,100}?)<\/a>/gi);

['"]? - matches a single or double quote, or no quote at all
[^\s'"<>]* - the href value: zero or more characters other than whitespace, quotes, and angle brackets
[^<>]* - any additional attributes after href, for example target=_blank class=smth
[\s\S]{1,100}? - the link text, 1-100 characters; the "?" makes it non-greedy, so it ends at the first closing tag
/i - case insensitive

igv · January 8, 2010, 7:43pm

Actually, it detects 3 out of 5 of your tests without any changes:


var str = '<a href=test0.php>this is\ntest0</a>' +
'<a href=test1.php class="test">test1</a>' +
'<a href=\'test2.php\'>test2</a>' +
'<a href="test3.php">test3</a>' +
'<a HREF=test4.php>test4</a>';

var arrMatch = str.match(/href=(.*?)>(.*?)<\/a>/g);
alert(arrMatch);

returns:


 ["href=test1.php class="test">test1</a>", "href='test2.php'>test2</a>", "href="test3.php">test3</a>"]

So if only the 'i' modifier is added, it will match all except the multi-line example:


match(/href=(.*?)>(.*?)<\/a>/gi);
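
A quick way to check this claim is the sketch below. It assumes the first test link contains a real newline (written as \n here). The `withAnyChar` variant, which swaps the second "." for [\s\S], is only an illustration of why the multi-line case fails and is not part of the pattern above.

```javascript
var str = '<a href=test0.php>this is\ntest0</a>' +
  '<a href=test1.php class="test">test1</a>' +
  '<a href=\'test2.php\'>test2</a>' +
  '<a href="test3.php">test3</a>' +
  '<a HREF=test4.php>test4</a>';

// "." does not match a newline, so the multi-line link is skipped.
var withDot = str.match(/href=(.*?)>(.*?)<\/a>/gi);

// [\s\S] matches any character, newlines included.
var withAnyChar = str.match(/href=(.*?)>([\s\S]*?)<\/a>/gi);

console.log(withDot.length);     // 4
console.log(withAnyChar.length); // 5
```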

BooBooGotU · January 8, 2010, 9:06pm

Actually, it detects none of my tests. We want to match links, not garbage like 'test2.php' with the quotes included. Your pattern matches 0 of the 5 cases from my previous post.

Let’s examine the examples in detail:


<a href=test0.php>this is
test0</a>

0 : 1, does not match


<a href=test1.php class="test">test1</a>

0 : 2, garbage match: test1.php class="test" - this is not a correct link


<a href='test2.php'>test2</a>

0 : 3, garbage match: 'test2.php' - this is not a correct link


<a href="test3.php">test3</a>

0 : 4, garbage match: "test3.php" - this is not a correct link


<a HREF=test4.php>test4</a>

0 : 5, does not match
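
To see what the first capture group actually holds in each case, here is a rough sketch that loops with exec() over the same five links, using the non-greedy pattern from the earlier reply. The names `pattern` and `captured` are only illustrative, and the \n in the first link is an assumption about how the multi-line case is written.

```javascript
var str = '<a href=test0.php>this is\ntest0</a>' +
  '<a href=test1.php class="test">test1</a>' +
  '<a href=\'test2.php\'>test2</a>' +
  '<a href="test3.php">test3</a>' +
  '<a HREF=test4.php>test4</a>';

var pattern = /href=(.*?)>(.*?)<\/a>/g;

// Collect group 1 (everything between "href=" and the next ">") for each match.
var captured = [];
var m;
while ((m = pattern.exec(str)) !== null) {
  captured.push(m[1]);
}

console.log(captured);
// ['test1.php class="test"', "'test2.php'", '"test3.php"']
```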

Summary:

You lose 0 : 5!

GG.
GL next time.

igv · January 8, 2010, 9:16pm

Look at the original post. The expected result is:


href=http://link1.com>asdf</a>
href=http://link2.com>asdf</a>

which is what my script was detecting.
It was not specified that only the URL should be detected.
So let's not go beyond this topic.

BooBooGotU · January 8, 2010, 9:49pm

His approach is wrong; this is not the way to match links. It wasn’t specified? It also wasn’t specified that we shouldn’t point him in the right direction if he was making wrong assumptions.

igv · January 8, 2010, 9:52pm

He did say that he wants to parse links, not URLs, as stated in the first post. And a link is not specifically what is inside the href attribute; from an HTML perspective it could be the whole ‘a’ tag. It depends on what it is needed for.
