The "my regexp questions" thread

Hello,

I’m working on learning regular expressions, and I will put all my questions in this thread (if that’s OK).

Let’s consider the following text:


Hello(some characters may appear or not) {
  foo = var;
}

Hello{
  foo = var;
}

I’m trying to catch “foo = var;”. So far so good: /.\s=\s*.*;/

Now the trick is: I don’t want the first occurrence, so I need a way to tell my expression NOT to take into account a match that occurs between brackets IF there were parentheses involved.

Thanks in advance.

Regards,

-jj. :slight_smile:

So you mean you want to capture a subpattern of
/Hello\{(.*?);/s

Yep. :tup:

What about now, if I have a third occurrence of “foo = var;”, as in


Hello(some characters may appear or not) {
  foo = var;
}

Hello{
  foo = var;
}

foo = var;

and I want only that third occurrence.

:slight_smile:

What language are you attempting to parse?

That would be close to CSS.

Do you mean you only want the last one?

How about you analyse the text file one line at a time?

Cups, foo and var are dynamically generated. So all I know is that there is a string, an equal sign, and another string followed by semi-colon. And that it doesn’t occur within brackets.

So is that only within curly brackets?

So if it’s after an opening curly bracket that is never closed, should it be found?

What if there is more than one, return the first one, or the last one?

One opening, one closing. I don’t want anything that is within curly brackets. Only the occurrences outside.

I think I should use “[^{]” but I don’t really know how to elaborate my pattern

Given these cases:


$case[0] = "foo0=var";
$case[1] = "foo1=var;";
$case[2] = "foo = var;";
$case[3] = "foo = bar";
$case[4] = "foo= bar;";
$case[5] = "{foo=bar";
$case[6] = "{foo=bar}";

which should return positive?

I’d just like to point out that what you are after cannot be done purely in regex without a lot of mucking about and filtering.

(:

So the third occurrence of “foo = var;” in post #3 cannot be found via regexp?

It can be found, but ignoring the ones that are in braces is another issue.

Thanks for your helpful answers :slight_smile: I solved my problems by using some string functions.
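The post doesn’t say which string functions were used, so the following is only a guess at what such a solution could look like: walk the text one character at a time, keep a brace depth counter, and only collect statements while the depth is zero. The function name `statementsOutsideBraces` is made up for this sketch, and it assumes braces are balanced and statements end with a semicolon.

```php
<?php
// Hypothetical helper: collect "name = value;" statements that sit
// outside any { } block. Assumes balanced braces.
function statementsOutsideBraces($source)
{
    $depth  = 0;       // number of "{" currently open
    $buffer = '';      // text seen at depth 0 since the last ";" or "}"
    $found  = array();

    foreach (str_split($source) as $char) {
        if ($char === '{') {
            $depth++;
            $buffer = '';              // "Hello(...)" before a block is not a statement
        } elseif ($char === '}') {
            $depth = max(0, $depth - 1);
        } elseif ($depth === 0) {
            $buffer .= $char;
            if ($char === ';') {
                $statement = trim($buffer);
                if (strpos($statement, '=') !== false) {
                    $found[] = $statement;
                }
                $buffer = '';
            }
        }
    }
    return $found;
}

$text = "Hello(some characters may appear or not) {\n  foo = var;\n}\n\n"
      . "Hello{\n  foo = var;\n}\n\nfoo = var;\n";

print_r(statementsOutsideBraces($text));
```

For the three-occurrence sample earlier in the thread this returns only the last “foo = var;”, since the first two sit at depth 1.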

Here’s another question:


Hello(some characters may appear or not) {
  foo = var;
}

Hello(some characters may appear or not) {
  foo = var;
  boo = bar;
  hoo = jar;
}

I’d like to capture both “Hello … }” blocks. I can get the first one, but not the second one.

Here’s my current approach:


preg_match_all("/.*\\s*\\(\\s*.*\\s*\\)\\s*\\{\\s*.*\\s*\\}/",$string,$array);

Please don’t hesitate to post a more elegant pattern. I think mine is quite…heavy-nooby.
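One reason the pattern above stops at the first block: `.` does not match a newline by default, so `\{\s*.*\s*\}` can cover only a single line of content between the braces. A possible alternative, sketched here under the assumption that braces and parentheses are never nested and that the block name consists of word characters, is to use negated character classes, which do match newlines:

```php
<?php
$string = "Hello(some characters may appear or not) {\n  foo = var;\n}\n\n"
        . "Hello(some characters may appear or not) {\n  foo = var;\n  boo = bar;\n  hoo = jar;\n}\n";

// \w+        the block name ("Hello" here; assumed to be word characters)
// \([^)]*\)  parentheses and anything inside them except ")"
// \{[^}]*\}  braces and anything inside them except "}", newlines included
preg_match_all('/\w+\s*\([^)]*\)\s*\{[^}]*\}/', $string, $array);

print_r($array[0]);
```

With the sample text, `$array[0]` holds two entries, one per block. A nested `{` or a `)` inside the parentheses would break this, which is where a real parser starts to pay off.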

Have you thought about parsing this properly?


<?php
// Token type names used by the tokenizer.
class Token
{
  const
    ALPHA       = 'Token::ALPHA',
    NUMERIC     = 'Token::NUMERIC',
    SPACE       = 'Token::SPACE',
    BRACELEFT   = 'Token::BRACELEFT',
    BRACERIGHT  = 'Token::BRACERIGHT',
    PARENLEFT   = 'Token::PARENLEFT',
    PARENRIGHT  = 'Token::PARENRIGHT',
    EQUAL       = 'Token::EQUAL',
    SEMICOLON   = 'Token::SEMICOLON',
    NEWLINE     = 'Token::NEWLINE',
    UNKNOWN     = 'Token::UNKNOWN';
}

Anthony: sounds interesting, but I have to admit that I don’t see how I would retrieve what I’m asking for using your code.

And just for the sake of interest, how would I capture the occurrence with more than one line between brackets?

:slight_smile:

It’s a little more involved than I’m hinting at, but at least it’s bulletproof, extendable and easy to manipulate later - worth the effort IMO.

You essentially break the “source code” into tokens of a particular type; you can then use this sequence of tokens to parse the code properly.

Let’s take this:


Hello(some characters may appear or not) {
  foo = var;
  boo = bar;
  hoo = jar;
}

If we break this down into tokens, we’d get a stream which looks like this:


Position: 0, Character:  , Type: Token::NEWLINE
Position: 1, Character: H, Type: Token::ALPHA
Position: 2, Character: e, Type: Token::ALPHA
Position: 3, Character: l, Type: Token::ALPHA
Position: 4, Character: l, Type: Token::ALPHA
Position: 5, Character: o, Type: Token::ALPHA
Position: 6, Character: (, Type: Token::PARENLEFT
Position: 7, Character: s, Type: Token::ALPHA
Position: 8, Character: o, Type: Token::ALPHA
Position: 9, Character: m, Type: Token::ALPHA
Position: 10, Character: e, Type: Token::ALPHA
Position: 11, Character:  , Type: Token::SPACE
Position: 12, Character: c, Type: Token::ALPHA
Position: 13, Character: h, Type: Token::ALPHA
Position: 14, Character: a, Type: Token::ALPHA
Position: 15, Character: r, Type: Token::ALPHA
Position: 16, Character: a, Type: Token::ALPHA
Position: 17, Character: c, Type: Token::ALPHA
Position: 18, Character: t, Type: Token::ALPHA
Position: 19, Character: e, Type: Token::ALPHA
Position: 20, Character: r, Type: Token::ALPHA
Position: 21, Character: s, Type: Token::ALPHA
Position: 22, Character:  , Type: Token::SPACE
Position: 23, Character: m, Type: Token::ALPHA
Position: 24, Character: a, Type: Token::ALPHA
Position: 25, Character: y, Type: Token::ALPHA
Position: 26, Character:  , Type: Token::SPACE
Position: 27, Character: a, Type: Token::ALPHA
Position: 28, Character: p, Type: Token::ALPHA
Position: 29, Character: p, Type: Token::ALPHA
Position: 30, Character: e, Type: Token::ALPHA
Position: 31, Character: a, Type: Token::ALPHA
Position: 32, Character: r, Type: Token::ALPHA
Position: 33, Character:  , Type: Token::SPACE
Position: 34, Character: o, Type: Token::ALPHA
Position: 35, Character: r, Type: Token::ALPHA
Position: 36, Character:  , Type: Token::SPACE
Position: 37, Character: n, Type: Token::ALPHA
Position: 38, Character: o, Type: Token::ALPHA
Position: 39, Character: t, Type: Token::ALPHA
Position: 40, Character: ), Type: Token::PARENRIGHT
Position: 41, Character:  , Type: Token::SPACE
Position: 42, Character: {, Type: Token::BRACELEFT
Position: 43, Character:  , Type: Token::NEWLINE
Position: 44, Character:  , Type: Token::SPACE
Position: 45, Character:  , Type: Token::SPACE
Position: 46, Character: f, Type: Token::ALPHA
Position: 47, Character: o, Type: Token::ALPHA
Position: 48, Character: o, Type: Token::ALPHA
Position: 49, Character:  , Type: Token::SPACE
Position: 50, Character: =, Type: Token::EQUAL
Position: 51, Character:  , Type: Token::SPACE
Position: 52, Character: v, Type: Token::ALPHA
Position: 53, Character: a, Type: Token::ALPHA
Position: 54, Character: r, Type: Token::ALPHA
Position: 55, Character: ;, Type: Token::SEMICOLON
Position: 56, Character:  , Type: Token::NEWLINE
Position: 57, Character:  , Type: Token::SPACE
Position: 58, Character:  , Type: Token::SPACE
Position: 59, Character: b, Type: Token::ALPHA
Position: 60, Character: o, Type: Token::ALPHA
Position: 61, Character: o, Type: Token::ALPHA
Position: 62, Character:  , Type: Token::SPACE
Position: 63, Character: =, Type: Token::EQUAL
Position: 64, Character:  , Type: Token::SPACE
Position: 65, Character: b, Type: Token::ALPHA
Position: 66, Character: a, Type: Token::ALPHA
Position: 67, Character: r, Type: Token::ALPHA
Position: 68, Character: ;, Type: Token::SEMICOLON
Position: 69, Character:  , Type: Token::NEWLINE
Position: 70, Character:  , Type: Token::SPACE
Position: 71, Character:  , Type: Token::SPACE
Position: 72, Character: h, Type: Token::ALPHA
Position: 73, Character: o, Type: Token::ALPHA
Position: 74, Character: o, Type: Token::ALPHA
Position: 75, Character:  , Type: Token::SPACE
Position: 76, Character: =, Type: Token::EQUAL
Position: 77, Character:  , Type: Token::SPACE
Position: 78, Character: j, Type: Token::ALPHA
Position: 79, Character: a, Type: Token::ALPHA
Position: 80, Character: r, Type: Token::ALPHA
Position: 81, Character: ;, Type: Token::SEMICOLON
Position: 82, Character:  , Type: Token::NEWLINE
Position: 83, Character: }, Type: Token::BRACERIGHT
Position: 84, Character:  , Type: Token::NEWLINE

You can now build something to read this stream of tokens and obtain the data you need.
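As a rough idea of what such a reader might look like, here is a sketch that walks the stream, tracks brace depth, and collects each assignment along with a flag saying whether it was inside a block. `readAssignments` and `quickTokens` are made-up names; `quickTokens` is only a stand-in so the example runs on its own, and the token types are compared as plain strings because each constant’s value is the label shown in the stream. It assumes single-word names and values.

```php
<?php
// Stand-in tokenizer (hypothetical) producing the same token shape as the stream above.
function quickTokens($source)
{
    $map = array(
        '{' => 'Token::BRACELEFT', '}' => 'Token::BRACERIGHT',
        '(' => 'Token::PARENLEFT', ')' => 'Token::PARENRIGHT',
        '=' => 'Token::EQUAL',     ';' => 'Token::SEMICOLON',
        ' ' => 'Token::SPACE',     "\n" => 'Token::NEWLINE',
    );
    $tokens = array();
    foreach (str_split($source) as $char) {
        if (isset($map[$char]))      { $name = $map[$char]; }
        elseif (ctype_alpha($char))  { $name = 'Token::ALPHA'; }
        elseif (ctype_digit($char))  { $name = 'Token::NUMERIC'; }
        else                         { $name = 'Token::UNKNOWN'; }
        $tokens[] = array('value' => $char, 'name' => $name);
    }
    return $tokens;
}

// Hypothetical reader: returns every "name = value;" found in the stream.
function readAssignments(array $tokens)
{
    $depth = 0;      // current brace nesting level
    $word  = '';     // ALPHA/NUMERIC characters collected so far
    $name  = null;   // left-hand side, set once "=" has been seen
    $assignments = array();

    foreach ($tokens as $token) {
        switch ($token['name']) {
            case 'Token::ALPHA':
            case 'Token::NUMERIC':
                $word .= $token['value'];
                break;
            case 'Token::EQUAL':
                $name = $word;
                $word = '';
                break;
            case 'Token::SEMICOLON':
                if ($name !== null) {
                    $assignments[] = array(
                        'name' => $name, 'value' => $word, 'inBlock' => $depth > 0
                    );
                }
                $name = null;
                $word = '';
                break;
            case 'Token::BRACELEFT':
                $depth++;
                $word = '';
                break;
            case 'Token::BRACERIGHT':
                $depth--;
                $word = '';
                break;
            case 'Token::SPACE':
            case 'Token::NEWLINE':
                break;               // whitespace is ignored
            default:
                $word = '';          // parentheses and unknown characters reset the word
        }
    }
    return $assignments;
}

$result = readAssignments(quickTokens("Hello(x) {\n  foo = var;\n}\nboo = bar;\n"));
print_r($result);
```

Filtering the result on `inBlock` then gives either the assignments inside braces or the ones outside, without any further pattern matching.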

Woah.
How would I break the source into tokens in the first place?

A quick and dirty way (which I used above) can be as simple as:


function tokenize($source){
  $patterns = array(
    Token::ALPHA      => '~[a-z]~i',
    Token::NUMERIC    => '~[0-9]~',
    Token::SPACE      => '~ ~',
    Token::BRACELEFT  => '~\\{~',
    Token::BRACERIGHT => '~\\}~',
    Token::PARENLEFT  => '~\\(~',
    Token::PARENRIGHT => '~\\)~',
    Token::EQUAL      => '~=~',
    Token::SEMICOLON  => '~;~',
    Token::NEWLINE    => "~\\r?\\n~"
  );
  $tokens = array();
  foreach(str_split($source) as $position => $char){
    foreach($patterns as $name => $pattern){
      if(preg_match($pattern, $char)){
        $tokens[$position] = array(
          'value'     => $char,
          'name'      => $name
        );
        break;
      }
    }
  }
  return $tokens;
}

This doesn’t take into account instances of Token::UNKNOWN though.
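As written, a character that matches none of the patterns is silently skipped, which leaves a gap in the positions. One possible way to handle it, shown as a sketch only: start each character off as unknown and overwrite that when a pattern matches. `tokenizeWithFallback` is a made-up name, `'Token::UNKNOWN'` is written as a plain string following the naming of the other types, and the pattern list is passed in (and shortened) so the example runs on its own.

```php
<?php
// Hypothetical variant of the tokenizer loop with a fallback type.
function tokenizeWithFallback($source, array $patterns)
{
    $tokens = array();
    foreach (str_split($source) as $position => $char) {
        $name = 'Token::UNKNOWN';            // kept if no pattern matches
        foreach ($patterns as $candidate => $pattern) {
            if (preg_match($pattern, $char)) {
                $name = $candidate;
                break;
            }
        }
        $tokens[$position] = array('value' => $char, 'name' => $name);
    }
    return $tokens;
}

// Shortened pattern list, just for the demonstration.
$patterns = array(
    'Token::ALPHA' => '~[a-z]~i',
    'Token::SPACE' => '~ ~',
    'Token::EQUAL' => '~=~',
);

$tokens = tokenizeWithFallback('a = #', $patterns);
echo $tokens[4]['name'], "\n";   // the "#" has no pattern of its own
```

Every position now has a token, so whatever reads the stream can decide whether an unknown character is an error or something to ignore.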