RegEx Question

Trying to highlight code within a <code> block. Now I am completely dumb to RegEx, and I’m trying my best to LEARN and not copy/paste, as this has held me back for a few years now. I simply must learn it. But here’s what I tried, and I didn’t get any result.


<?php

$fulltext = preg_replace('~^.*(<code>.*\\s</code>).*?~i', highlight_string('<code>$1</code>'), $fulltext);

?>

I don’t think I quite understand the $1, and $2 thing much. :goof:

Hello again. I tried that code, and it screwed up my page. It also just printed “1” for every article. Not sure why. So I tried this (using the above code in an IF condition):


			if ( preg_match('~^.*(<code>.*</code>).*?~', $fulltext) == true ):
				$fulltext = '<code>' . highlight_string(preg_replace('~^.*(<code>.*\\s</code>).*?~i', '$1', $fulltext)) . '</code>';
			endif;

When I ran that code, my paragraph text came back to normal, but when I got down to a <code> block, it did nothing for it. I’m also having problems getting jQuery to parse URLs (another post), and I’m wondering if something deep in the code, or even the CSS, is screwing this up. It’s frustrating. I’m frustrated lol.

Thanks for your help though, I hope I can get this figured out.

It doesn’t work because you used the code I said wouldn’t work …
You should use this code:


<?php
$fulltext =
  '<code>'.
  highlight_string(
    preg_replace('~^.*(<code>.*\\s</code>).*?~i', '$1', $fulltext)
  ).
  '</code>';
?>

:wink:

$1, $2 and so on denote text from the capture groups from your expression.
A capture group is a set of terms in your expression encapsulated inside ()'s.

If my pattern were this: “/<code>(.*?)</code>/is” then $1 would be all the text inside your <code> block, as it’s the first capturing group in the expression.
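To make the capture group idea concrete, here is a minimal sketch. The sample string and the bracket replacement are made up for illustration and are not from the original post:

```php
<?php
// Hypothetical sample input, not taken from the thread.
$text = 'Intro <code>echo 1;</code> outro';

// In the replacement string, $1 stands for whatever the first (...) group matched.
$swapped = preg_replace('~<code>(.*?)</code>~is', '[$1]', $text);

// preg_match() exposes the same groups as array entries:
// $m[0] is the whole match, $m[1] is the first capture group.
preg_match('~<code>(.*?)</code>~is', $text, $m);
```

Here $swapped keeps the text around the block and swaps the tags for brackets, while $m[1] holds only what sat between the tags.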

I’m no regex expert, but I think you’re looking for this:

$fulltext = preg_replace('/<code>(.*?)</code>/isg', highlight_string('<code>$1</code>'), $fulltext);

I tried this and it destroyed my content area. The <code></code> blocks are inside this:

div#layout-wrapper div#content-wrapper p.article-fulltext

(I know, this is probably wrong to do…)

The content is stored in the database, and when it’s rendered I have lines of code like this:



<?php
#begin =========================================================
$sql = $rsql['fetch_latest_news'];
$rs = mysql_query(safeQuery($sql), $db)
	or die(mysqlDebug('rs'));
if ( mysql_num_rows($rs) >= 1 ):
	$sql2 = $rsql['fetch_tags_by_article'];
	while ( ( $row = mysql_fetch_object($rs) ) == true ):
		$rs2 = mysql_query(safeQuery($sql2, $row->TextID), $db)
			or die(mysqlDebug('rs2'));
		if ( mysql_num_rows($rs2) >= 1 ):
			while ( ( $row2 = mysql_fetch_object($rs2) ) == true ):
				if ( $x == 1 ):
					$tag .= '<li class="second-item"><a href="';
				else:
					$tag .= '<li><a href="';
				endif;
				$tag .= makeURL('tags?name='.urlencode($row2->TagName));
				$tag .= '">';
				$tag .= $row2->TagName;
				$tag .= '</a></li>';
				$x++;
			endwhile;

			$fulltext = $row->TextFull;
			#$fulltext = preg_replace_callback('~((?:https?://|www\\d*\\.)\\S+[-\\w+&@#/%=\\~|])~', 'parse_links_v2', $fulltext);
			#$fulltext = preg_replace('~^.*(<code>.*</code>)\\s.*?~i', '<code>'.highlight_string($1).'</code>, $fulltext);
			$fulltext = ( strlen($fulltext) >= $truncate['latest_news'] ) ? nl2br(substr($fulltext, 0, ($truncate['latest_news']-1))) . '... <a href="'.makeURL('entry', 'id='.$row->TextID).'" class="disblock"><span class="rarr">&#0187;</span> Read More</a>' : nl2br($fulltext);
# break ========================================================
?>									
	<h3>
		<a href="<?php echo makeURL('entry', 'id='.$row->TextID); ?>">
			<?php echo $row->TextTitle; ?>
		</a>
	</h3>
	<h4>Word Count: <em><?php echo str_word_count($row->TextFull); ?></em> | <a href="#top">Top Of Page</a></h4>
	<p class="article-abstract hide">
		<?php echo $row->TextAbstract; ?>
	</p>
	<p class="article-fulltext">
		<?php echo $fulltext; ?>
	</p>
	<ul class="article-auth">
		<li class="label-tag">Posted on</li>
		<li class="second-item"><?php echo date_format(date_create($row->TextTimestamp, timezone_open("America/Indianapolis")), DATE_RFC850); ?></li>
	</ul>
	<ul class="article-auth">
		<li class="label-tag">Category:</li>
		<li class="second-item"><a href="<?php echo makeURL('category', 'id='.$row->CategoryID); ?>"><?php echo $row->CategoryName; ?></a></li>
	</ul>
	<ul class="article-auth">
		<li class="label-tag">Author:</li>
		<li class="second-item"><a href="<?php echo makeURL('user', 'id='.$row->UserID); ?>"><?php echo $row->Username; ?></a> <span class="rarr">&#0187;</span> <a href="<?php echo makeURL('user', 'id='.$row->UserID); ?>">View User Profile</a></li>
		<li><a href="<?php echo makeURL('entry', 'id='.$row->TextID); ?>">Permalink</a></li>
	</ul>
	<ul class="article-auth">
		<li class="label-tag">Options</li>
		<li class="second-item"><a href="#top">Top Of Page</a></li>
		<li><a href="#">Post Comment</a></li>
		<li><a href="#">View Comments (0)</a></li>
	</ul>
	<ul class="article-tags">
		<li class="label-tag">Filed Under</li>
		<?php echo $tag; ?>
	</ul>
	<hr />
<?php
# continue =====================================================
			$tag = null;
			$x = 1;
		endif;
	endwhile;
	mysql_free_result($rs);
	mysql_free_result($rs2);
else:
#break =========================================================
?>
	<h3>Whoops!</h3>
	<p>Sorry, but I cannot process your request at this time.</p>
<?php
#continue ======================================================
endif;
#break - final =================================================
?>




I know, I know... it’s a mess :stuck_out_tongue:

Ah. I was just working off the poster’s function, I wasn’t aware it would cause an issue, but thanks for the tip!
I’ve never seen a pattern using ~'s instead of slashes. What purpose does that serve?

I’m not sure why you’re anchoring your pattern to the beginning of your string. It would make sense if you were capturing the part at the beginning of your pattern, since this is preg_replace we’re talking about and anything matched but not captured is unavailable to the replacement. (It is still available in $0, but only as part of the entire match, which puts you right back where you were before matching and makes it pointless.)

Don’t anchor this pattern. Start and end your pattern with <code> and </code>. Before you run this search, the contents between your <code/> wrapper should have any HTML special characters replaced with their entity counterparts (< with &lt;, > with &gt;). This keeps an ungreedy match (one that ends on the first occurrence of the marker instead of the last) from being closed early by a </code> inside your code, for instance if you were to post code showing someone how to post code.
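As a rough sketch of that escaping step, assuming the text inside the block is run through htmlspecialchars() before it is wrapped and stored (the sample text and variable names are invented):

```php
<?php
// Hypothetical content that itself mentions a closing </code> tag.
$inner = 'Use <code>...</code> to mark code';

// Escape the angle brackets so the inner tags are no longer literal tags.
$escaped = htmlspecialchars($inner);

// Only the outer wrapper is a real <code> element now.
$stored = '<code>' . $escaped . '</code>';

// The ungreedy match runs to the real closing tag, not the escaped one.
preg_match('#<code>(.+?)</code>#s', $stored, $m);
```

Without the escaping, the same ungreedy pattern would stop at the first </code> it finds, which sits in the middle of the example text.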

If you leave the <code/> portions outside of your sub-match parenthesis, you can add them back manually and be left with only the contents of that wrapper inside the $1 back reference.

The i flag is fine, though if you’re positive your <code/> elements will all be lowercase, as is common when modern coding standards are used, you can leave that i flag off so the engine doesn’t have to test twice for each letter (once for upper case and once for lower case).

You’ll probably want to include the s flag in this case though. If there’s no chance of a newline in your code blocks you can leave it out, but if there are newlines in your code you need the s flag to tell the engine that it can include newlines in the dot (anything) metacharacter’s list of things to match.
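A quick way to see the difference the s flag makes, using a made-up two-line block:

```php
<?php
// Hypothetical code block that spans two lines.
$block = "<code>line one\nline two</code>";

// Without s, the dot refuses to cross the newline, so nothing matches.
$withoutS = preg_match('~<code>(.+?)</code>~', $block);

// With s, the dot matches newlines too and the whole block is captured.
$withS = preg_match('~<code>(.+?)</code>~s', $block, $m);
```

preg_match() returns 0 for the first call and 1 for the second, and $m[1] then holds both lines.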

In the replacement, you don’t want to include the <code/> portions inside your call to highlight_string. They’re not actually part of the code and shouldn’t be processed as if they were.

With all of that out of the way, you get this.

$html = preg_replace('#<code>(.+?)</code>#s',
	'<code>' . highlight_string('$1') . '</code>',
	$str
);
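One catch with the snippet above: highlight_string('$1') runs before preg_replace() ever sees a match, so it is handed the literal text $1 rather than the captured code. A possible way around that, offered as a sketch rather than the definitive fix, is preg_replace_callback() together with the second argument of highlight_string(), which makes it return the markup instead of printing it. The sample input is invented:

```php
<?php
// Hypothetical input; $str stands in for the article text from the snippet above.
$str = 'Before <code>$a = 1;</code> after';

$html = preg_replace_callback(
    '#<code>(.+?)</code>#s',
    function ($m) {
        // $m[1] is the text between the tags. Passing true makes
        // highlight_string() return the markup instead of echoing it.
        return '<code>' . highlight_string($m[1], true) . '</code>';
    },
    $str
);
```

Note that highlight_string() adds its own wrapper markup, so the outer <code> tags may turn out to be redundant, and as far as I know it only colors text that follows an opening <?php tag.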

I noticed another pattern use the following bit, which is wasteful. Use the one-or-more (+) metacharacter instead of the zero-or-more metacharacter. If there’s nothing inside the <code/> wrapper, we want to ignore that wrapper. Otherwise it’s like opening an empty glass jug to see if there’s anything to drink inside of it. :slight_smile:

(.*?)

I noticed the use of a g flag out there somewhere as well. My best guess is that this was an artifact from someone doing JavaScript development lately. JavaScript regular expressions have a g flag to tell the engine that things should be replaced globally in the string, instead of just replacing the first found occurrence. PHP doesn’t have a g flag though. preg_replace already replaces every match unless you pass it a limit, and the engine rejects g as an unknown modifier. :slight_smile:
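To illustrate the point about g, here is a small sketch with a made-up string. The @ is only there to silence the warning PHP raises for the bad modifier:

```php
<?php
// Hypothetical input.
$text = 'a-b-c';

// preg_replace() is already "global": every match is replaced.
$all = preg_replace('~-~', '+', $text);

// The optional fourth argument limits the number of replacements.
$first = preg_replace('~-~', '+', $text, 1);

// An unknown modifier such as g makes the call fail and return null.
$broken = @preg_replace('/-/g', '+', $text);
```

Assigning that null back to the variable that holds the article text would wipe it out, which may be what emptied the content area earlier in the thread.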

You can use any delimiter you choose. Most commonly used is the / (because that’s how it works in Perl, JavaScript, and probably some other languages as well), but it has the drawback that you need to escape any / occurring in the regex itself (e.g. preg_match('/<\/form>/')). The ~ is used less frequently inside regex, so the probability is smaller that you’d ever need to escape it (e.g. preg_match('~</form>~')).

So, ~ is just the same as using / but avoids having to escape slashes in the rest of the regex.
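A short sketch of the same pattern written with three different delimiters, matched against a made-up string:

```php
<?php
// Hypothetical input.
$html = '<form></form>';

// With / as the delimiter, the slash inside the pattern must be escaped.
$slash = preg_match('/<\/form>/', $html);

// With ~ or # as the delimiter, the slash can stay as it is.
$tilde = preg_match('~</form>~', $html);
$hash  = preg_match('#</form>#', $html);
```

All three calls find the closing tag; only the amount of escaping differs.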

Does that make sense? :slight_smile:

That makes perfect sense, thank you!

Zarin Denatrose, your explanation on $1 etc (commonly known as back references in regex) is entirely correct, however the PHP you provided is not.

PHP will first evaluate the highlight_string('<code>$1</code>') call in the 2nd parameter, such that your code is equivalent to


<?php
$fulltext = preg_replace('~^.*(<code>.*\\s</code>).*?~i', '<code>$1</code>', $fulltext);
?>

To do what you want, you could use the /e modifier in the regex, but since that is based on eval() it’s very evil and should be avoided at all costs. A correct way to do it:


<?php
$fulltext =
  '<code>'.
  highlight_string(
    preg_replace('~^.*(<code>.*\\s</code>).*?~i', '$1', $fulltext)
  ).
  '</code>';
?>

:slight_smile:
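One more thing that may be worth checking with either snippet: when highlight_string() is called with a single argument, it prints the highlighted markup straight away and returns true, and true turns into "1" when used as a string. That would fit the stray "1" reported earlier. A small sketch with a made-up sample string:

```php
<?php
// Hypothetical code sample to highlight.
$sample = '<?php echo "hi"; ?>';

// One argument: the markup is echoed (captured here with output
// buffering) and the return value is just true.
ob_start();
$printed = highlight_string($sample);
$echoed = ob_get_clean();

// Second argument true: nothing is echoed, the markup is returned.
$returned = highlight_string($sample, true);

// This is what ends up in a string when the boolean is concatenated.
$asText = '<code>' . $printed . '</code>';
```

So when the result is meant to be stored in $fulltext, passing true as the second argument looks like the safer choice.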