Why can't "?" regex quantifier be used to achieve this result?

I’m trying to match everything up till “REGEX” ((if it’s present)) or to the end of the subject if it’s not. Variables involved are

$str = 'init text
				REGEX
are you the last?
				
STOP-HERE!!!					

continue';

$pattern = '/REGEX([\s\S]*)(?:STOP-HERE)?/';

Now, it just matches the entire string even after encountering STOP-HERE, whereas, if I remove the “?” quantifier from the pattern, it stops exactly before STOP-HERE. However, that expression will fail in the absence of STOP-HERE. How do I capture both scenarios? I’ve attempted to explicitly exclude it from the match with this pattern

'/REGEX([\s\S]+(?!STOP-HERE))*(?:STOP-HERE)?/'

But it just returns pretty much the same result.
Isn’t my opening question the very description of the “?” quantifier: one or zero matches? I’m effectively saying “if you see it once, return the current match”. Why doesn’t that work?!

It’s not clear from your description what matches you want to get exactly. You say ‘match everything up till “REGEX”’ but in your pattern you don’t provide any capturing parentheses before REGEX so obviously that pattern cannot match anything up till “REGEX” - or is it your omission? Next, what is the end of the subject? Is it the end of the string or just before “STOP-HERE”? It would be best if you provided examples of exact matches of the strings you are expecting to capture.

Don’t overwork your problem. I assume from your attempted pattern you actually mean “Between REGEX and STOP-HERE”

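 if (preg_match("/REGEX(.*)STOP-HERE/", $str, $matches) === 1) {
   $result = $matches[1];
 } else {
   // end() takes its argument by reference, so store the pieces in a variable first
   $parts = explode("REGEX", $str);
   $result = end($parts);
 }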

Oh, yes. You’re correct. My description was misleading because I swapped out the actual variables and mixed things up. The opening sentence should have been

…match everything from “REGEX” until “STOP-HERE” ((if it’s present)), or to the end of the subject string i.e. “continue” if “STOP-HERE” is absent.

The pattern reflects this goal, to the best of my knowledge that is.

Yes, you are correct. I was in haste and authored that misleading spec.

Your capture group doesn’t account for newlines, which I expect to encounter. Secondly, I didn’t think to use explode because I may have multiple occurrences of “REGEX”, and you can imagine how unwieldy tracking each index, substringing and exploding would be. I’m sure you didn’t account for that since I didn’t mention this possibility.

I’m currently using my pattern without the trailing “?” quantifier and it works for most use cases in my project. I was just curious why including it doesn’t work.

Okay, now I see what you mean. This is a tricky one. With the trailing ?, the first subpattern matches everything: the greedy [\s\S]* runs to the end of the string first, and because “STOP-HERE” is optional the engine can accept zero occurrences of it right there, so it never has to backtrack. It chooses the “not” case just because it can. I suppose this comes down to how the two subpatterns compete for the same text, and such cases are probably documented in the regex library’s notes on backtracking. Sometimes the lazy modifier can solve this kind of problem, but not here: a lazy [\s\S]*? followed by an optional group is satisfied by matching nothing at all.
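A small sketch of the three variants may make the behaviour easier to see. The subject string is a shortened form of the one in the question, and the variable names are only illustrative:

```php
<?php
// Shortened version of the subject from the question (tabs removed for brevity)
$str = "init text\nREGEX\nare you the last?\nSTOP-HERE!!!\n\ncontinue";

// Greedy capture + optional terminator: the capture swallows the rest of the string
preg_match('/REGEX([\s\S]*)(?:STOP-HERE)?/', $str, $greedy);

// Lazy capture + optional terminator: both parts are happy to match nothing
preg_match('/REGEX([\s\S]*?)(?:STOP-HERE)?/', $str, $lazy);

// Greedy capture + required terminator: the engine must backtrack to STOP-HERE
preg_match('/REGEX([\s\S]*)STOP-HERE/', $str, $required);

var_dump($greedy[1]);   // everything after REGEX, including STOP-HERE and "continue"
var_dump($lazy[1]);     // empty string
var_dump($required[1]); // only the text between REGEX and STOP-HERE
```

Only the third pattern forces the engine to give characters back, which is why removing the “?” appears to fix the match.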

Therefore we might need to look for an alternative strategy. For example:

$pattern = '/REGEX(?|(.*)(?:STOP-HERE)|(.*))/s';

Here, we match one of the two alternative strings: either one that is followed by “STOP-HERE” or another one that isn’t (and effectively doesn’t contain it).
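As a rough check of that pattern, here it is run against two shortened subjects, one with the terminator and one without. The subject strings and variable names are made up for the illustration:

```php
<?php
// Branch reset (?|...) makes both alternatives store their capture in group 1
$pattern = '/REGEX(?|(.*)(?:STOP-HERE)|(.*))/s';

$withStop    = "init text\nREGEX\nare you the last?\nSTOP-HERE!!!\n\ncontinue";
$withoutStop = "init text\nREGEX\nare you the last?\n\ncontinue";

preg_match($pattern, $withStop, $a);
preg_match($pattern, $withoutStop, $b);

var_dump($a[1]); // text between REGEX and STOP-HERE
var_dump($b[1]); // text from REGEX to the end of the string
```

Because the first alternative uses a greedy (.*), a subject containing several STOP-HERE markers would be captured up to the last one; (.*?) would stop at the first instead.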

(BTW, you don’t need to use [\s\S] to match newlines because it’s clearer to use a dot and the s modifier.)

And @m_hutley’s solution might not be a bad one, either, and you can solve the problem of multiple REGEX strings by using the third limiting argument to explode().
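For reference, this is roughly what the limit argument does. The subject string is invented for the example:

```php
<?php
$str = "Stuff\nREGEX\nMore Stuff\nREGEX\nDo This\nSTOP-HERE\nExtra stuff";

// A limit of 2 splits only at the first REGEX; later occurrences stay in the second piece
$parts = explode('REGEX', $str, 2);

var_dump(count($parts)); // 2
var_dump($parts[1]);     // everything after the first REGEX, second REGEX included
```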


So how would you expect the code to handle the following?

$str = "Stuff
REGEX
More Stuff
REGEX
Do This
STOP HERE
Extra stuff";

?

Also, isn’t your pattern just
/REGEX(?:(?!STOP-HERE).)*/gms

?
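A sketch of how that pattern could be tried in PHP. Two things here are assumptions rather than anything stated in the thread: the sample uses “STOP-HERE” with a hyphen so that it matches the pattern, and REGEX is added to the negative lookahead so that each match also ends at the next REGEX. PHP’s preg functions do not accept a g modifier, so preg_match_all does the repeated matching instead:

```php
<?php
$str = "Stuff\nREGEX\nMore Stuff\nREGEX\nDo This\nSTOP-HERE\nExtra stuff";

// Consume one character at a time, as long as neither marker starts at that position
$pattern = '/REGEX((?:(?!REGEX|STOP-HERE).)*)/s';

$count = preg_match_all($pattern, $str, $matches);

var_dump($count);      // number of REGEX blocks found
var_dump($matches[1]); // captured text of each block
```

With the lookahead left as (?!STOP-HERE) only, the first match would run straight through the second REGEX until STOP-HERE or the end of the string.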

Aha! I appreciate your efforts.

This is the pattern I would’ve used, building upon your last example: '/(REGEX(?|(.*)(?:STOP-HERE)|(.*)))+?/s'
with preg_match_all. Not too surprisingly, it doesn’t seem especially effective for the task of multi matching. I’ve just realized the problem lies with this “*” quantifier, which cannot be slowed down here; it’s always giving.

The remaining option could be sequential matching or altering the subject string. For instance, we could start with a pattern that catches all text between REGEX markers, like '/(REGEX(.+?)(?:REGEX))/gs', then loop through the result set, checking each unit for the presence of STOP-HERE using your aforementioned expression.
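A minimal sketch of that sequential idea, using explode and strstr instead of a second regex. That swap is a simplification chosen for this example, and the subject string is invented:

```php
<?php
$str = "Stuff\nREGEX\nMore Stuff\nREGEX\nDo This\nSTOP-HERE\nExtra stuff";

// Split at every REGEX and discard whatever precedes the first one
$chunks = explode('REGEX', $str);
array_shift($chunks);

$results = [];
foreach ($chunks as $chunk) {
    // strstr(..., true) returns the part before the needle, or false if it is absent
    $before = strstr($chunk, 'STOP-HERE', true);
    $results[] = ($before === false) ? $chunk : $before;
}

var_dump($results);
```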

In my own case, I’m writing a templating engine where each REGEX is punctuated by an opposite STOP-HERE tag. So, I can afford to start out with '/(REGEX(.+?)STOP-HERE)/gs'. The context that warranted this question is a trailing STOP-HERE that occasionally appears at the tail end of parsing. This happens when I want to hop from one set of matches to another. See a high-level illustration of a use case at https://github.com/nmeri17/Tilwa/blob/4820a2fbf8c7ef3bc7942719e81b330cf112efcc/lib/Templating/TemplateEngine.php#L288; this specific use case is at https://github.com/nmeri17/Tilwa/blob/master/views/dashboard.tmpl#L245

In the absence of a more elegant solution, I can live with resorting to explodes and substringing.

As far as I know, parsing nested markup is too complex a task for regular expressions. Popular template engines have their own complex lexers for this task. Regex can become really difficult both to write and maintain when you get into the realm of multiple occurrences, nesting levels and then perhaps attempts to behave appropriately in the case of bad markup and to generate useful debugging messages.

This would seem easier to manage by changing the template tags into xml tags and everything between them into CDATA sections and parse the content with DOM. Just an idea, I haven’t tried it…

Excellent. But this is what you get when one person unilaterally embarks on a project.

I don’t see any advantage a special markup syntax has over using XML. In some engines like Smarty, the markup can be written to simulate the behavior it represents, such as alternate markup, loop blocks etc. Maybe a semantic upper hand? What other reason would they all have for not using XML CDATA?
Or tight decoupling? I imagine an engine of this nature can receive any markup that conforms to its syntax and return an expected output. With CDATA, however, the controllers have to dictate how they want each particular block displayed.
I’m looking at behavior definition here, on a per-tag basis. Can I, for instance, give class attributes to CDATA that I can lift from the DOM reader in order to decide what to do with that block (considering the class name won’t get parsed into an attribute)?

I’ve tested this engine in 3 projects so far and can confirm that it handles alternation, nesting, looping, template includes, grouping, and pretty much most templating needs the average Dev is likely to come across.

Still, this new revelation will haunt me for a long time until I overhaul the engine altogether or find substantial advantages it has over using CDATA.

Maintenance? If you can find the time, please open an issue illustrating fail cases i.e. after adhering to the markup structure and supplying valid data. Learning curve amounts to less than 7 rules, so you can jump right in and start breaking things.

@Lemon_Juice still eagerly awaiting to hear your thoughts on my last comment. Please find the time to share.

Sorry for not replying but I’m too busy right now and can’t get deeper into this subject, I just shared a couple of loose thoughts.

I thought that cdata sections could contain text without any template markup so those wouldn’t need to be parsed in any way. This is roughly what I had in mind:

<text><![CDATA[<ol>]]></text>
<foreach from="$cars" as="$car">
  <text><![CDATA[<li>]]></text>
  <var>$car</var>
  <text><![CDATA[</li>]]></text>
</foreach>
<text><![CDATA[</ol>]]></text>

The text element could represent the textual sections with cdata. You can of course add attributes to those elements and invent any number of elements for any use case.

The disadvantage is you’d need to do a translation from your tpl syntax to tpl syntax, but I imagine that would not be difficult, and then it might be fairly easy to traverse the template with DOM functions.

This is just an idea and I can’t foresee if there won’t be any serious issues if it was to be used for a complex template engine.

If I understand properly, I can then do a string/Dom node replace for all var tags ( maybe using variable variables or extract within a closed scope), then liberate the container tags stored inside cdata.

It does make sense, and I’ll say it’s worth giving a shot. Thanks for stopping by

What did you intend to say when you wrote “a translation from your tpl syntax to tpl syntax”?

Yes, I think this is what I had in mind. You could also use container tags to store all kinds of attributes for special stuff.

Sorry, I meant “from your tpl syntax to xml syntax”, this forum unfortunately won’t allow me to edit my post now. Then I imagine, at the final stage, you’d have to translate the xml syntax to php syntax - what template engines call compilation to php. Without compilation any kind of template parsing will end up very slow.

Compilation to PHP? You mean I can’t call toString() on the Dom object of the XML container or some variant of grabbing its contents? I didn’t see that coming. Mine contains no such construct. It just regex replaces placeholders with their values and outputs the master string when done.

When you say template parsing is slow, I wonder how slow that could get. I don’t imagine it’ll be significant enough for the user to notice. Benchmarking with thousands of requests may yield different results, though. That’s probably another advantage low-level templating may have above abstract/high level stuff like Dom objects. I think anything involving reference types is likely to drag. Could be wrong.

Today, I found out about https://github.com/Level-2/Transphporm and while it is truly a brilliant concept they’re accomplishing, given your comment, one can only imagine how long such a syntax-less, placeholder-less solution is likely to take.

Thanks for your time.

You can call toString() on the DOM (saveXML() in PHP’s DOM extension), but this will not get you executable php code. It gives you xml crafted to your own rules, and you still have to do something with it. You can have a class that traverses the DOM and, upon finding each subsequent element, executes it: run a foreach loop, inject a value from a variable or inject plain text. This is what I call parsing the DOM.
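A rough sketch of such a traversal, assuming the element names from the XML example earlier in the thread. The tpl root element, the renderNode() function and the sample data are all invented for this illustration:

```php
<?php
// Hypothetical XML form of the template, wrapped in an assumed <tpl> root element
$xml = <<<'XML'
<tpl>
<text><![CDATA[<ol>]]></text>
<foreach from="$cars" as="$car">
  <text><![CDATA[<li>]]></text>
  <var>$car</var>
  <text><![CDATA[</li>]]></text>
</foreach>
<text><![CDATA[</ol>]]></text>
</tpl>
XML;

// renderNode() is an illustrative helper: it walks the children and "executes" each element
function renderNode(DOMNode $node, array $vars): string
{
    $out = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip the whitespace between elements
        }
        switch ($child->nodeName) {
            case 'text':
                $out .= $child->textContent; // plain text from the CDATA section
                break;
            case 'var':
                $name = ltrim(trim($child->textContent), '$');
                $out .= htmlspecialchars((string) ($vars[$name] ?? ''));
                break;
            case 'foreach':
                $from = ltrim($child->getAttribute('from'), '$');
                $as   = ltrim($child->getAttribute('as'), '$');
                foreach ($vars[$from] ?? [] as $item) {
                    // The loop variable shadows any outer variable of the same name
                    $out .= renderNode($child, [$as => $item] + $vars);
                }
                break;
        }
    }
    return $out;
}

$dom = new DOMDocument();
$dom->loadXML($xml);
$html = renderNode($dom->documentElement, ['cars' => ['Audi', 'Volvo']]);

echo $html;
```

Every request repeats the whole walk, which is the cost that compilation is meant to avoid.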

Another option is to parse the DOM and instead of running (outputting) it convert it to compiled php code, which in effect is just a template in pure php. Then on each page request you don’t have to traverse the xml file, nor do you have to parse your initial template with your syntax - you can just include the php template and it will execute very fast.
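The compilation step could look something like the sketch below. It assumes the same XML element names as the example earlier in the thread; the tpl root element, the compileNode() function and the temporary cache file are assumptions made for the illustration, and it only makes sense for trusted templates because attribute values are copied into PHP source as-is:

```php
<?php
// Hypothetical XML form of the template, wrapped in an assumed <tpl> root element
$xml = <<<'XML'
<tpl>
<text><![CDATA[<ol>]]></text>
<foreach from="$cars" as="$car">
  <text><![CDATA[<li>]]></text>
  <var>$car</var>
  <text><![CDATA[</li>]]></text>
</foreach>
<text><![CDATA[</ol>]]></text>
</tpl>
XML;

// compileNode() is an illustrative helper: it emits PHP source instead of output
function compileNode(DOMNode $node): string
{
    $php = '';
    foreach ($node->childNodes as $child) {
        if ($child->nodeType !== XML_ELEMENT_NODE) {
            continue; // skip the whitespace between elements
        }
        switch ($child->nodeName) {
            case 'text':
                $php .= $child->textContent; // literal markup is copied through
                break;
            case 'var':
                $php .= '<?= htmlspecialchars(' . trim($child->textContent) . ') ?>';
                break;
            case 'foreach':
                $php .= '<?php foreach (' . $child->getAttribute('from')
                    . ' as ' . $child->getAttribute('as') . '): ?>'
                    . compileNode($child)
                    . '<?php endforeach; ?>';
                break;
        }
    }
    return $php;
}

$dom = new DOMDocument();
$dom->loadXML($xml);
$compiled = compileNode($dom->documentElement);

// Store the compiled template once; later requests could include the file directly
$cacheFile = tempnam(sys_get_temp_dir(), 'tpl');
file_put_contents($cacheFile, $compiled);

$cars = ['Audi', 'Volvo'];
ob_start();
include $cacheFile;
$html = ob_get_clean();

echo $html;
```

After the first run, only the include is needed, so neither the original template syntax nor the XML has to be parsed again.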

It depends how complex your template engine is. If it’s fairly simple and you can parse it with regex then it will not take much time. However, large template engines like Smarty or Twig parse the templates into php code and then that code is executed. With the number of features they provide if they didn’t compile the templates then the speed would become unacceptably slow, really.

I’ve read about Transphporm but as you read the docs there is a cache mechanism that will prevent parsing the templates/sheets on every request.

BTW, why are you creating your own template engine? What advantages does it have above existing libraries like Smarty or Twig?
