preg_match_all matches too much text

tomnewego · April 12, 2011, 7:36am

I have a problem with the following:

I’m trying to edit H2 links to add id attributes to them with the following code:

preg_match_all("/\<h2 id=\"(.*)\">(.*)\<\/h2\>/i", $content, $matches);

This code works fine with most of my texts, but not when I have a text like:

<h2>Title</h2>more text without space

It won’t stop at the 2nd boundary and matches the whole string up to the next </h2> tag. When I have a \r (newline) in place after the 2nd </h2>, the script works perfectly. Does anyone have an idea how to fix this? I think I’m missing some kind of limiter. (I’ve tried \b and \B without success.)

Your help is greatly appreciated,

Tom

rpkamp · April 12, 2011, 8:56am

The problem is that (.*) is greedy; that is, it will eat anything and everything it sees, and sometimes even eats up what we think it shouldn’t because it comes later in our regex (the </h2> in this case).
There are two things you could do:

Replace (.*) with an atom that says exactly what to match, so something like ([a-zA-Z0-9\s]+) to match any letter, digit, or whitespace character, OR
if you don’t know what you will be matching, make the (.*) lazy by adding a question mark: (.*?)

Method one is preferred, but method two also works.
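
To make the difference concrete, here is a minimal sketch comparing the greedy pattern with both methods. The $content string is a made-up sample (two headings with no whitespace between them), not the original poster's data, and the variable names $greedy, $strict and $lazy are only for illustration.

```php
<?php
// Hypothetical sample input: two headings with no newline between them.
$content = '<h2 id="a">First</h2>more text<h2 id="b">Second</h2>';

// Greedy: (.*) grabs as much as it can, so a single match spans both headings.
preg_match_all('/<h2 id="(.*)">(.*)<\/h2>/i', $content, $greedy);

// Method one: explicit character classes that cannot cross a quote or a tag.
preg_match_all('/<h2 id="([a-zA-Z0-9\s]+)">([a-zA-Z0-9\s]+)<\/h2>/i', $content, $strict);

// Method two: lazy quantifiers stop at the first place the rest of the pattern fits.
preg_match_all('/<h2 id="(.*?)">(.*?)<\/h2>/i', $content, $lazy);

echo count($greedy[0]), "\n"; // 1 (one oversized match)
print_r($strict[2]);          // First, Second
print_r($lazy[2]);            // First, Second
```

Note that the character class in method one only works if the ids and titles really contain nothing but letters, digits and whitespace; titles with punctuation would need a wider class.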