Examples of using the PHP Trim Function

Hi,

I have a string like this:

<a href=“http://domain.com/members/admin/” title=“Admin”>Admin</a> posted on the forum topic <a href=“http://domain.com/groups/introduce-yourself-232181584/forum/topic/hi-everyone-im-bill/”>Hi Everyone, I’m Bill</a> in the group <a href=“http://domain.com/groups/introduce-yourself-232181584/”>Introduce Yourself</a>

What I’m trying to do is strip out everything except for this:

<a href=“http://domain.com/groups/introduce-yourself-232181584/forum/topic/hi-everyone-im-bill/”>Hi Everyone, I’m Bill</a>

(This is the forum topic for BuddyPress)

Something that’s going to be the same for every string is that the forum topic link comes right after the text:

“posted on the forum topic”

and precedes:

“in the group”

What would be the proper syntax to select just that part?

Thanks for any help.

The trim() function is designed to strip whitespace from the ends of a string. You want to get information from the middle of a string that is surrounded with other data. In other words, it’s not what you want.
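To make that concrete, here is a minimal sketch of what trim() does and does not do. The sample strings are made up for illustration and are not the exact string from the question.

```php
<?php
// Hypothetical sample string with stray whitespace at both ends.
$string = "  Admin posted on the forum topic Hi Everyone in the group Introduce Yourself \n";

// trim() removes whitespace from the two ends only; the middle is untouched.
$trimmed = trim($string);

// With a second argument it strips the listed characters instead,
// but still only from the ends of the string.
$stripped = trim('xxHi Everyone, I am Billxx', 'x');

echo $trimmed, "\n";  // Admin posted on the forum topic Hi Everyone in the group Introduce Yourself
echo $stripped, "\n"; // Hi Everyone, I am Bill
```

Either way, trim() has no notion of "the text between two phrases", which is why it cannot isolate the forum topic link.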

What you need is to use regular expressions via preg_match(). This code will achieve your desired effect:

<?php
	$string = '<a href="http://domain.com/members/admin/" title="Admin">Admin</a> posted on the forum topic <a href="http://domain.com/groups/introduce-yourself-232181584/forum/topic/hi-everyone-i-am-bill/">Hi Everyone, I am Bill</a> in the group <a href="http://domain.com/groups/introduce-yourself-232181584/">Introduce Yourself</a>';
	$matches = array();
	if (preg_match('/posted on the forum topic\\s+([\\s\\S]*?)\\s+in the group/', $string, $matches)) {
		//We found what we are looking for
		echo $matches[1];
	}
?>

The above code outputs:

<a href="http://domain.com/groups/introduce-yourself-232181584/forum/topic/hi-everyone-i-am-bill/">Hi Everyone, I am Bill</a>

You can follow the link to the regular expressions tutorial website above to learn more about regular expressions and how they can help you. I will elaborate on this example in particular:

/posted on the forum topic\\s+([\\s\\S]*?)\\s+in the group/

Let’s break this apart into its individual pieces:

  • /
    PHP needs to wrap the expression in delimiters. The expression must begin and end with this character. It can be any character you want, as long as it doesn’t appear in the expression. I made the standard selection of the forward slash.
  • posted on the forum topic
    This is literal text. Any substring that matches the pattern must begin with this phrase.
  • \s
    This is a shorthand character class which matches any whitespace character (e.g. spaces and tabs).
  • +
    This is a repetition character that says the previous token (the whitespace character class) can be repeated any number of times, as long as it appears at least once.
  • (
    This character indicates the start of a grouping. In this example, the group represents the information that you would like to extract.
  • [
    This begins a character set; it means that the next character in the string can match any of the characters defined in this character set.
  • \s
    This is another whitespace character class.
  • \S
    This is another shorthand character class that matches anything EXCEPT whitespace.
  • ]
    This indicates the end of the character set. When the two character classes in this character set appear together, they form a tautology. In other words, any character at all will match this character set (which states “this matches any character which is whitespace or is not whitespace”). This notation is often used to avoid ambiguity with the dot character (see http://www.regular-expressions.info/dot.html).
  • *
    This is another repetition character which states that the previous token (the character set) can be repeated any number of times, or not appear at all.
  • ?
    In this context, the question mark indicates that the asterisk repetition character should be lazy rather than greedy.
  • )
    This character indicates that the previously opened grouping ends here. Characters after this point will not be captured by PHP (unless we started another grouping).
  • \s
    This is another whitespace character.
  • +
    This is another “one or more” repetition character.
  • in the group
    This is a second literal string that anchors the end of the expression.
  • /
    Since this is the delimiter, it indicates that the expression is finished and any following characters are pattern modifiers.
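The lazy question mark is the easiest piece to overlook, so here is a small sketch of the difference it makes. The sample string is invented for illustration: it repeats the phrase "in the group" inside the group name so that the lazy and greedy versions give different results.

```php
<?php
// Hypothetical string in which "in the group" occurs twice.
$string = 'Admin posted on the forum topic <a href="#t">Hello</a> in the group <a href="#g">Best dogs in the group stage</a>';

// Lazy: the capture stops at the FIRST place where "in the group" can follow.
preg_match('/posted on the forum topic\\s+([\\s\\S]*?)\\s+in the group/', $string, $lazy);

// Greedy: the capture runs on to the LAST place where "in the group" can follow.
preg_match('/posted on the forum topic\\s+([\\s\\S]*)\\s+in the group/', $string, $greedy);

echo $lazy[1], "\n";   // <a href="#t">Hello</a>
echo $greedy[1], "\n"; // <a href="#t">Hello</a> in the group <a href="#g">Best dogs
```

With the lazy version the capture ends at the first "in the group", which is the behaviour wanted here.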

Hopefully that helps to demystify the regex voodoo and give you a solid understanding of this solution. :)

Also read about strpos, which gives you the positions of your standard strings “posted on the forum topic” and “in the group” so you could use substr to cut out the part between them. Though Tarh’s solution is more elegant.

$marker = 'posted on the forum topic';
$start = strpos($string, $marker) + strlen($marker); // first character after the opening phrase
$end = strpos($string, 'in the group', $start); // position where the closing phrase starts
$length = $end - $start;
$mystring = trim(substr($string, $start, $length)); // trim() drops the spaces around the link
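One thing the strpos approach has to deal with is that strpos returns false when a phrase is missing. A possible way to wrap it up is sketched below. The function name extractBetween and the choice to return null on failure are my own assumptions, and the nullable return type needs PHP 7.1 or later.

```php
<?php
// Hypothetical helper: returns the text between two phrases, or null if either is missing.
function extractBetween(string $haystack, string $open, string $close): ?string
{
    $start = strpos($haystack, $open);
    if ($start === false) {
        return null; // opening phrase not found
    }
    $start += strlen($open); // move past the opening phrase

    $end = strpos($haystack, $close, $start); // search only after the opening phrase
    if ($end === false) {
        return null; // closing phrase not found
    }

    // Cut out the middle and drop the surrounding spaces.
    return trim(substr($haystack, $start, $end - $start));
}

// Shortened sample string with placeholder URLs.
$string = 'Admin posted on the forum topic <a href="#topic">Hi Everyone</a> in the group <a href="#group">Introduce Yourself</a>';

$topic = extractBetween($string, 'posted on the forum topic', 'in the group');
$missing = extractBetween($string, 'no such phrase', 'in the group');

echo $topic, "\n";   // <a href="#topic">Hi Everyone</a>
var_dump($missing);  // NULL
```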

I’m interested in finding out more about this. How and why does [\s\S]* work, as opposed to using .* instead?

I’ve used [\s\S] before because . doesn’t match a newline character unless you set a flag.

That’s interesting, does it work differently across PHP and JavaScript?
http://www.regular-expressions.info/dot.html says that [\s\S] is used in JavaScript to ensure that break characters are also selected.

JavaScript and VBScript do not have an option to make the dot match line break characters. In those languages, you can use a character class such as [\s\S] to match any character. This character matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
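In PHP the difference only shows up when the text contains line breaks. Here is a quick sketch with a made-up multi-line string:

```php
<?php
// Hypothetical text that spans several lines.
$text = "start\nline one\nline two\nend";

// Without modifiers the dot does not match a line break, so this finds nothing.
$plainDot = preg_match('/start(.*)end/', $text, $m1);

// The catch-all character class matches line breaks as well.
$catchAll = preg_match('/start([\\s\\S]*)end/', $text, $m2);

// The s (PCRE_DOTALL) modifier makes the dot match line breaks too.
$dotAll = preg_match('/start(.*)end/s', $text, $m3);

var_dump($plainDot);         // int(0)
var_dump($catchAll);         // int(1)
var_dump($m2[1] === $m3[1]); // bool(true)
```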

@tarh Thanks a ton for the detailed explanation. I always wanted to understand regular expressions better, and I’ll study that fairly closely as I’ll be needing to do it quite a bit in my next project.

@esearing - I think I’ll be using strpos a lot as well, so thanks for the reminder.

I guess you could say so. PHP has more modifiers:
http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php

But even then, you might still use [\s\S] if you want the pattern to be able to use both the original semantic of . in one place, and in another place you need to match anything.

There is of course also the option to use the PCRE_DOTALL (s) modifier as part of the regex (as opposed to a modifier outside of the delimiters) if you need the dot metacharacter (.) to work differently in different spots. For example: ((?s).*) to capture the first line and then all other lines, or the option can be specified in a non-capturing group.
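As a rough sketch of both inline forms: the patterns and the sample text below are my own illustrations and not necessarily the ones the poster had in mind. (?s) switches the option on for the rest of the enclosing group, and (?s:...) is the PCRE syntax for a non-capturing group with the option set inside it.

```php
<?php
// Hypothetical three-line text.
$text = "first line\nsecond line\nthird line";

// Group 1 uses the normal dot and stops at the first line break.
// Group 2 turns on DOTALL with (?s), so its dot also matches line breaks.
preg_match('/^(.*)\n((?s).*)$/', $text, $inline);

// Same idea, with the option scoped to a non-capturing group.
preg_match('/^(.*)\n(?s:(.*))$/', $text, $scoped);

echo $inline[1], "\n";                // first line
var_dump($inline[2]);                 // string(22) "second line\nthird line" with a real line break
var_dump($inline[2] === $scoped[2]);  // bool(true)
```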

That said, I often fall back on the catch-all character class too. (:

@Tarh

I just implemented the method you suggested and it worked perfectly for the forum topic. But when I tried to pull the group name out of the string with a second pattern, it didn’t work. Here is the string:

<a href=“http://domain.com/members/admin/” title=“Admin”>Admin</a> posted on the forum topic <a href=“http://domain.com/groups/introduce-yourself-232181584/forum/topic/hi-everyone-i-am-bill/”>Hi Everyone, I am Bill</a> in the group <a href=“http://domain.com/groups/introduce-yourself-232181584/”>Introduce Yourself</a>:

See this one:

$activityGroup = array();

if (preg_match('/in the group\s+([\s\S]*?)\s+:/', $activityString, $activityGroup)) {

echo 'In the Group: ' . $activityGroup[1];

}

But it doesn’t echo out anything. Any ideas?

The problem here is that in your string, the link that you want to extract ends simply with a colon. Your “catch-all” character class that is capturing the link data must be anchored by a colon at the end. In the code that you gave, it is being anchored by “one or more whitespace characters followed by a colon” by the sub-expression \s+:. You might be tempted to simply remove the \s+, but this does not solve the problem entirely since the “href” attribute of the link contains a colon in the URL. This means that your capturing group would terminate prematurely.
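Both failure modes can be reproduced with a short sketch. The string is a shortened version of the one from the question, keeping only the part from "in the group" onwards.

```php
<?php
// Shortened sample: only the tail of the activity string.
$activityString = 'in the group <a href="http://domain.com/groups/introduce-yourself-232181584/">Introduce Yourself</a>:';

// Original attempt: requires whitespace before the colon, which never occurs here.
$withWhitespace = preg_match('/in the group\\s+([\\s\\S]*?)\\s+:/', $activityString, $m1);

// Simply dropping \s+ matches, but the lazy capture stops at the colon in "http:".
$withoutWhitespace = preg_match('/in the group\\s+([\\s\\S]*?):/', $activityString, $m2);

var_dump($withWhitespace); // int(0)
echo $m2[1], "\n";         // <a href="http
```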

The solution is to include the closing “a” tag in your anchor text:

$activityGroup = array();

if (preg_match('~in the group\\s+([\\s\\S]*?</a>):~', $activityString, $activityGroup)) {
	echo 'In the Group: ' . $activityGroup[1];
}

Note that the </a> literal text appears inside the round brackets rather than outside with the colon. This is because, although it is known literal text, you also want to include it in the captured substring. Also note that the delimiters were changed to the tilde character. This is because the forward slash is included in literal text in the expression.

@Tarh

Thanks, but that didn’t seem to work. I have this:


/* If activity is associated with a group/category, print that out */
if (preg_match('/in the group\\s+([\\s\\S]*?</a>):/', $activityString, $activityGroup)) {
	echo 'Group: ' . $activityGroup[1];
}

It produces the following error:


Warning: preg_match() [function.preg-match]: Unknown modifier 'a' in

I think it might have something to do with the /a being an escape character or something, because I tried a /: before and it gave the same error message.

You may have caught an earlier version of my post; it was edited to solve this issue shortly after posting. :blush:
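For anyone who hits the same warning: with the forward slash as the delimiter, the slash inside </a> ends the pattern early, and PHP then tries to read the following "a" as a pattern modifier. A small sketch of two ways around it, using a shortened version of the string from the question:

```php
<?php
// Shortened sample: only the tail of the activity string.
$activityString = 'in the group <a href="http://domain.com/groups/introduce-yourself-232181584/">Introduce Yourself</a>:';

// Option 1: use a delimiter that does not occur in the pattern (tilde here).
preg_match('~in the group\\s+([\\s\\S]*?</a>):~', $activityString, $viaTilde);

// Option 2: keep the slash delimiter and escape the slash inside the pattern.
preg_match('/in the group\\s+([\\s\\S]*?<\\/a>):/', $activityString, $viaEscape);

echo $viaTilde[1], "\n"; // <a href="http://domain.com/groups/introduce-yourself-232181584/">Introduce Yourself</a>
var_dump($viaTilde[1] === $viaEscape[1]); // bool(true)
```

Switching the delimiter tends to be easier to read when the pattern contains HTML or URLs, since those are full of slashes.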

You’re right, it works perfectly now! Thanks a ton for your help. I’ll keep reading up on regular expressions; they’re pretty complicated but very powerful.