Help with preg_replace syntax for numbers (alphanumeric)?

Hi all,

I have a WordPress site and I’ve created a custom 404 page using Joost de Valk’s code. In this code he has this line:


$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","$1",$s);

This strips unwanted stuff from the URLs that visitors arrive from, and he uses the result to perform a site search later in the code. I am trying to also strip out any alphanumeric characters from these URLs and have been reading (and trying to add in the correct code) for hours, but I cannot get it.
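For context, here is a minimal sketch of what that line does on its own. The function name and the sample slugs are hypothetical and used only for illustration; the pattern is the one quoted above.

```php
<?php
// Hypothetical helper wrapping the preg_replace line quoted above.
function strip_extension_suffix($slug) {
    // Greedy (.*) captures everything before a final "-html", "-php", etc.
    // Single quotes keep "$1" and the trailing "$" literal.
    return preg_replace('/(.*)-(html|htm|php|asp|aspx)$/', '$1', $slug);
}

// Made-up slugs, for illustration only.
foreach (array('my-old-page-html', 'contact-us-php', 'plain-slug') as $slug) {
    echo $slug, ' => ', strip_extension_suffix($slug), "\n";
}
// my-old-page-html => my-old-page
// contact-us-php => contact-us
// plain-slug => plain-slug
```

Note that digits are left untouched by this pattern, which is why a leading number survives into the search.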

I wonder if someone here can help?

I think I need to add this in there:


^([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$.

I think I am close, but in all honesty, I have no idea.


$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","^([0-9]|[1-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$.","$1",$wp_query->query_vars['name']);

Can anyone please advise?

I’d love to help you Adam, but I’m not really clear what you are trying to achieve.

Can you give some examples of before and after your desired operation, so I can see exactly what you are trying to match and remove?

Thanks,
Frank.

Hi Frank,

Thanks for your interest. In short, the following code is stripping the specified characters from urls that hit the page the code is on. I also need alphanumeric characters to be stripped from these urls. This code is in a 404 page on my WP install.


$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","$1",$s);

I’m trying to get a url that looks like this:


http://mysite.com/show-podcasts/49-green-patriot-radio-with-david-steinman/4804

to look like this:


http://mysite.com/show-podcasts/green-patriot-radio-with-david-steinman/

I think this can be accomplished by telling the code to also strip alphanumeric characters, but I haven’t been able to get it right yet.

Does this help explain my goal a bit better?

The code I’m using is found on this page (sitepoint won’t let me add urls yet):


http://yoast.com/404-error-pages-wordpress/

This is working great for me, but the numbers are causing a problem. When someone hits the 404 page this code strips all the funky characters then uses the results to perform a search through the WP posts and displays some suggested links for the visitor. If the links aren’t helpful, the visitor also has the choice to perform a search through a search form. This search form is pre-populated with the results captured earlier.

The numbers are causing a problem in the searches. In the example URL above, the code reduces the name to “49 green patriot radio with david steinman”. This returns zero suggested links. The same text is pre-populated in the search box, but when you hit search… no results. I believe the numbers are the problem because if I simply remove “49” from the search box, I get very good results.

So, again, the goal is to strip alphanumeric characters by adding to that preg_replace code.

Any help would be greatly appreciated.

OK. I think I get it. From your example you want to remove any series of numbers including a following dash (-) from the start of the article name, and any slash (/) and trailing numbers from the end of the article name.

Try something like the following:


$s = preg_replace("/([0-9]+-)?(.*)-(html|htm|php|asp|aspx)(/[0-9]+)$/","$2",$s);

What I have done is add an optional match against 1 or more digits and a dash to the start of your expression, and an optional match against a slash and 1 or more digits to the end of your expression.

The tricky bit is that adding the bracketed group to the start has created an extra “capture group”. I have bumped the replacement pattern from $1 to $2 so that it still finds the correct part of the text. It is possible to mark groups as non-capturing with (?:...), but that syntax is a little harder to read, so simply renumbering seems simpler in this case.
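To illustrate the capture-group numbering, here is a small sketch of the non-capturing alternative. The function name and sample slugs are hypothetical, and the pattern is a simplified variant (anchored at the start, no trailing slash-and-digits part), not the expression above.

```php
<?php
// Hypothetical helper showing (?: ... ) non-capturing groups.
function strip_number_and_extension($slug) {
    // (?:[0-9]+-)? matches an optional leading "49-" without capturing it,
    // so the part we want to keep is still group 1.
    return preg_replace('/^(?:[0-9]+-)?(.*)-(?:html|htm|php|asp|aspx)$/', '$1', $slug);
}

echo strip_number_and_extension('49-green-patriot-radio-html'), "\n"; // green-patriot-radio
echo strip_number_and_extension('green-patriot-radio-html'), "\n";    // green-patriot-radio
```

With plain brackets instead of (?:...), the same slug text would land in $2, which is the renumbering described above.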

Let me know how you get on (or if I have misinterpreted your needs!)

Frank.

Thanks so much Frank! A couple questions:

This only strips numbers from the beginning and end, correct?

Second, your example didn’t include the end of the original code. Can you advise whether this is correct?

$s = preg_replace("/([0-9]+-)?(.*)-(html|htm|php|asp|aspx)(/[0-9]+)$/","$2",$wp_query->query_vars['name']);

I tried and am getting an error:


Warning: preg_replace() [function.preg-replace]: Unknown modifier '[' in /home/info/public_html/wp-content/themes/rt_solarsentinel_wp/404.php on line 70
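For anyone hitting the same warning: it typically means the pattern contains an unescaped copy of its own delimiter. Here the pattern is wrapped in / characters, so the / inside (/[0-9]+) ends the pattern early and PHP tries to read the following [ as a modifier. Escaping the slash as \/ or choosing a different delimiter avoids this. The sketch below uses # as the delimiter; the function name is hypothetical, and the pattern is an illustrative variant that assumes the input is a path fragment such as 49-some-name/4804 rather than a full URL.

```php
<?php
// Hypothetical helper; a sketch, not the thread's final answer.
function strip_numbers_around_slug($path) {
    // "#" delimiters let "/" appear inside the pattern unescaped.
    // The leading "49-", the "-html" style suffix and the trailing "/4804"
    // are all optional; the lazy (.*?) keeps only the name in between.
    return preg_replace(
        '#^([0-9]+-)?(.*?)(-(html|htm|php|asp|aspx))?(/[0-9]+)?$#',
        '$2',
        $path
    );
}

echo strip_numbers_around_slug('49-green-patriot-radio-with-david-steinman/4804'), "\n";
// green-patriot-radio-with-david-steinman
```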

Grr. Sitepoint threw away a relatively long answer because I had a URL in it, then I had to go out so now I am trying to remember what I wrote a few hours ago.

I’m not sure what was causing your error report, but I have another candidate regular expression for you. As for the missing $wp_query->query_vars['name'], I based my suggestion on your snippet from Joost de Valk’s code, which did not have that bit. Is this a change you have made?

Anyway, I have been using “regex powertoy”, an online tool at regex dot powertoy dot org, to try things out, and have come up with the following regular expression:

\/([0-9]+-)?([^/]*/?)(\.(html|htm|php|asp|aspx)|([0-9]+))$

in your php this would look like

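$s = preg_replace("#/([0-9]+-)?([^/]*/?)(\\.(html|htm|php|asp|aspx)|([0-9]+))$#","/$2",$wp_query->query_vars['name']);

Note the # delimiters: with / as the delimiter, the slashes inside the pattern (including the one in [^/]) would end it early. The replacement is /$2 so that the leading slash consumed by the match is put back.

given

http://mysite.com/show-podcasts/49-green-patriot-radio-with-david-steinman/4804

this produces

http://mysite.com/show-podcasts/green-patriot-radio-with-david-steinman/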

which seems to be what you wanted.

To have a play for yourself, go to “regex powertoy”, click to edit the bottom panel and add your URL. Then enter the following into the top panel

s!\/([0-9]+-)?([^/]*/?)(\.(html|htm|php|asp|aspx)|([0-9]+))$!/$2!

Note that the s!..!/$2! is the equivalent of your call to preg_replace.

You can choose “highlight matches” or “show replacements” from the drop-down menu in the middle to see the effect.

If you wanted to remove all digits and leave everything else, you might be better off doing it in two passes. The first is as originally supplied; the second eats all the digits:


$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","$1",$wp_query->query_vars['name']);
$s = preg_replace("/[0-9]+/","",$s);

This will potentially leave some odd dashes hanging around. If you want to convert those to spaces at the same time, you could do the following instead:


$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","$1",$wp_query->query_vars['name']);
$s = preg_replace(array("/[0-9]+/","/-/"),array(""," "),$s);
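As a standalone sketch of the two-pass idea, the snippet below wraps the same two calls in a function so it can be tried outside WordPress. The function name and sample slugs are hypothetical, and the final whitespace cleanup is an extra step added here, not part of the code above.

```php
<?php
// Hypothetical helper: slug in, search phrase out.
function slug_to_search_terms($slug) {
    // Pass 1: drop a trailing "-html", "-php", etc.
    $s = preg_replace('/(.*)-(html|htm|php|asp|aspx)$/', '$1', $slug);
    // Pass 2: remove runs of digits, then turn every dash into a space.
    // With arrays, each pattern is applied in order to the running result.
    $s = preg_replace(array('/[0-9]+/', '/-/'), array('', ' '), $s);
    // Extra step (an addition): collapse repeated spaces and trim the ends,
    // since removing "49" from "49-green" leaves a leading space behind.
    return trim(preg_replace('/\s+/', ' ', $s));
}

echo slug_to_search_terms('49-green-patriot-radio-with-david-steinman'), "\n";
// green patriot radio with david steinman
```

One trade-off of this approach is that it removes digits everywhere, so a name that legitimately contains a number would lose it as well.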

I hope this gives you some approaches to think about.

Frank.

Oh man, I hate that the no url rule doesn’t give any kind of warning either!!!

I do see that the code on Joost’s post is a bit different from the code that is in the example 404 page he makes available, but…

It works! It works! Hit the following link to check it out:


http://webtalkradio.net/show-podcasts/49-green-patriot-radio-with-david-steinman/4804

I used the last bit of code that you suggested.


$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","$1",$wp_query->query_vars['name']);
$s = preg_replace(array("/[0-9]+/","/-/"),array(""," "),$s);

The full code I’m using now inside the 404.php page is:


<h2>We&#8217;re having trouble finding what you were looking for...</h2>

		<p>Maybe we can help you find what you came here for:</p>
		<?php 
			$s = preg_replace("/(.*)-(html|htm|php|asp|aspx)$/","$1",$wp_query->query_vars['name']);
			$s = preg_replace(array("/[0-9]+/","/-/"),array(""," "),$s);
			$posts = query_posts('post_type=any&name='.$s);
			$s = str_replace("-"," ",$s);
			if (count($posts) == 0) {
				$posts = query_posts('post_type=any&s='.$s);
			}
			if (count($posts) > 0) {
				echo "<ol><li>";
				echo "<p>Were you looking for <strong>one of the following</strong> episodes or shows?</p>";
				echo "<ul>";
				foreach ($posts as $post) {
					echo '<li><a href="'.get_permalink($post->ID).'">'.$post->post_title.'</a></li>';
				}
				echo "</ul>";
				echo "<p>If not, don't worry, We've got a few more tips to help you find what you need:</p></li>";
			} else {
				echo "<p><strong>Don't worry though!</strong> We've got a few tips to help you find what you need:</p>";
				echo "<ol>";
			}
		?>
			<li>
				<label for="s"><strong>Search</strong> for it:</label>
				<form style="display:inline;" action="<?php bloginfo('siteurl');?>">
					<input type="text" value="<?php echo esc_attr($s); ?>" id="s" name="s"/> <input type="submit" value="Search"/>
				</form>
			</li>
			<li>
				<strong>If you typed in a URL...</strong> make sure the spelling, cApitALiZaTiOn, and punctuation are correct. Then try reloading the page.
				
			</li>
			<!--<li>
				<strong>Look</strong> for it in the <a href="<?php bloginfo('siteurl');?>/sitemap/">sitemap</a>.
				
			</li>-->
			<li>
				<strong>Start over again</strong> at our <a href="<?php bloginfo('siteurl');?>">homepage</a> (and please contact us to let us know what went wrong so we can fix it).
			</li>
		</ol>

Frank, what can I say. You have my deepest thanks and admiration, and please let me know if there’s anything I can do to return the favor…seriously.