Help on string extraction

Let’s say I have the following breadcrumb generated by Drupal.

<div class="breadcrumb">
  <a href="/some/domain/">Home</a>

I want to extract the link “/some/domain/” and the text Home into an array like $bc = array('link' => $link, 'title' => $title).

Should I be using a regular expression for this kind of extraction? Or should I be using string manipulation functions?

I have been trying to find a regular expression pattern that fits this, but I’m not really sure how the engine works and couldn’t find a solution.

I was thinking of a pattern that matches anything between href=" and the closing ", plus anything between <a href…> and </a>.

Any help??

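Regex can be very tricky. The empty values come from the “*” quantifier: “[\w\s]*” is allowed to match zero characters, so it also matches the empty string sitting right before each “</a>” in the mark-up.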
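
A minimal sketch of where the empty entries come from, assuming the title pattern is run with preg_match_all() over breadcrumb markup. The $html string below is an illustrative stand-in, not Drupal's exact output. Swapping "*" for "+" requires at least one character per match, which should drop the empty entries:

```php
<?php
// Illustrative stand-in for the breadcrumb markup (an assumption).
$html = '<div class="breadcrumb">
<a href="/example/domain/">Home</a>
<a href="/example/domain/gallery">Photo Gallery</a>
<a href="/example/domain/image/tid/78">Cambodia</a>
</div>';

// "*" allows a zero-length match, which succeeds right before every "</a>".
preg_match_all('/[\w\s]*(?=<\/a>)/', $html, $withStar);

// "+" needs at least one character, so the zero-length matches disappear.
preg_match_all('/[\w\s]+(?=<\/a>)/', $html, $withPlus);

print_r($withStar[0]); // titles interleaved with empty strings
print_r($withPlus[0]); // titles only
```

With "+" there is no need to step through the results two at a time.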

Have you tried using the DOM instead?

I’m not 100% sure what your main goal is, but a rough string manipulation method would be:

$string = '<a href="/some/domain/">Home</a>';
$break = explode('"', $string);

$link = $break[1]; // $link == /some/domain/

$title = str_replace('</a>', '', $break[2]); // $title == >Home
$title = substr($title, 1); // $title == Home

$bc = array('link' => $link, 'title' => $title);

It’s not the prettiest, but I think it does what you’re talking about. Care to give us the big picture?

Just to clear the doubts.
Yup, the manipulation should be done before the system starts printing all the output to screen.
Although technically, it could also be done with JavaScript.

I’m trying to extract all the information (the link and title) from the breadcrumb that Drupal generated, and rebuild it. The reason for rebuilding is simply to add a class name to each <a> tag, and to append some text to the end of the breadcrumb.

I finally found the pattern to get the link: /(?<=href="\/)[\w\d\/]/
and the pattern to get the title: /[\w\s]*(?=<\/a>)/

The only irritating thing is that /[\w\s]*(?=<\/a>)/ generates empty values like this:

    [0] => Array
            [0] => Home
            [1] => 
            [2] => Photo Gallery
            [3] => 
            [4] => Cambodia
            [5] => 

The markup being matched is:

<div class="breadcrumb">
<a href="/example/domain/">Home</a> 
<a href="/example/domain/gallery">Photo Gallery</a>
<a href="/example/domain/image/tid/78">Cambodia</a>

Why is there an empty value??
Just curious. I simply increment my counter by 2 to overcome this.

That’s weird, there’s no space between the words and the tags.

That’s a cool idea. I’ve never played with the DOM in PHP before. I’ll check that out. Thanks for pointing out the invalid HTML entity too :)

Why not just override the breadcrumb theme function in the template file to render the breadcrumbs as you would like?

Using this function: drupal_set_breadcrumb()

An array of stored breadcrumbs will be returned.

So then you can loop through the links in your own theme implementation of theme_breadcrumb() to display them as you would like, using the workflow of the system rather than hacking it.
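
As a rough sketch of what such an override could look like: this assumes the Drupal 6-style convention where the theme function receives $breadcrumb as an array of already-rendered <a> tags (other Drupal versions pass it differently). The name mytheme_breadcrumb, the breadcrumblink class, and the "New Text" crumb are placeholders, and the function is written so it can be tried outside Drupal.

```php
<?php
/**
 * Hypothetical theme override; "mytheme" is a placeholder theme name.
 * Assumes $breadcrumb is an array of rendered <a> tags (Drupal 6 style).
 */
function mytheme_breadcrumb($breadcrumb) {
    if (empty($breadcrumb)) {
        return '';
    }
    // Add a class to each link; assumes the links carry no class attribute yet.
    foreach ($breadcrumb as $i => $link) {
        $breadcrumb[$i] = str_replace('<a ', '<a class="breadcrumblink" ', $link);
    }
    // Append the extra trailing text as a final crumb.
    $breadcrumb[] = 'New Text';
    return '<div class="breadcrumb">' . implode(' &raquo; ', $breadcrumb) . '</div>';
}

// Stand-alone usage with a hand-written link in place of Drupal's stored breadcrumb.
echo mytheme_breadcrumb(array('<a href="/example/domain/">Home</a>'));
```

Doing it this way avoids parsing the generated HTML at all, since the links are still available as separate array items.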

You could force it through the XML parser too, I suppose. (It would treat <div> as the wrapper element, and then <a> would be the first child element.)
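
A minimal sketch of the XML route using SimpleXML, assuming the markup is well-formed XML (every tag closed, no HTML-only entities such as &raquo;). The $snippet string is an illustrative stand-in for the real breadcrumb:

```php
<?php
// Illustrative markup; it must be well-formed XML for this to work.
$snippet = '<div class="breadcrumb">
<a href="/example/domain/">Home</a>
<a href="/example/domain/gallery">Photo Gallery</a>
</div>';

// The <div> becomes the root element; its <a> children are iterable.
$div = simplexml_load_string($snippet);
if ($div === false) {
    exit('The snippet is not well-formed XML.');
}

$bc = array();
foreach ($div->a as $a) {
    $bc[] = array('link' => (string) $a['href'], 'title' => (string) $a);
}

print_r($bc);
```

If the markup is not valid XML, the DOM extension's HTML loader is the more forgiving choice.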

A basic example using the DOM extension could look something like the following. Note: I “fixed” your broken HTML snippet (the &raquo; entities didn’t have semi-colons) in my example.

$snippet = '<div class="breadcrumb">
<a href="/example/domain/">Home</a> &raquo;
<a href="/example/domain/gallery">Photo Gallery</a> &raquo;
<a href="/example/domain/image/tid/78">Cambodia</a>
</div>';

$doc = new DOMDocument;
$doc->loadHTML($snippet); // parse the snippet as HTML

$wrapper = $doc->getElementsByTagName('div')->item(0);

// Add the class name to every anchor inside the breadcrumb
foreach ($wrapper->getElementsByTagName('a') as $anchor) {
    $anchor->setAttribute('class', 'breadcrumblink');
}

// Append "» New Text" after the last anchor ($anchor still holds it after the loop)
$anchor->parentNode->appendChild($doc->createTextNode('» New Text'));

echo $doc->saveXML($wrapper);

Which outputs (the entities might look strange, but they’re just an alternate XML-friendly representation of the » character):


Home » Photo Gallery » Cambodia » New Text

In an ideal world, we would use a DOMDocumentFragment (rather than a document) and be able to use saveHTML (rather than saveXML) but the above gets the job done.

I’m not sure what you’re after, but I would probably use the DOM, i.e.
getAttribute('href') and nodeValue
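
Something along these lines, as a rough sketch: it assumes the breadcrumb HTML is available as a string (the $snippet here is a stand-in) and collects one link/title pair per anchor.

```php
<?php
// Stand-in for the breadcrumb HTML (an assumption for this example).
$snippet = '<div class="breadcrumb"><a href="/some/domain/">Home</a></div>';

$doc = new DOMDocument;
$doc->loadHTML($snippet);

$bc = array();
foreach ($doc->getElementsByTagName('a') as $anchor) {
    $bc[] = array(
        'link'  => $anchor->getAttribute('href'), // value of the href attribute
        'title' => $anchor->nodeValue,            // text inside the anchor
    );
}

print_r($bc);
```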

I’m not sure exactly what you are after either.

All the links are already in a DOM array


from which you can extract whatever attributes you like and put them into other arrays or whatever.

Does it have to be done server side?