Detect RSS feeds in a web page?

Hi guys!

I’m trying to figure out the best way to detect an RSS feed in a web page, or more specifically, for a website.

Assuming a feed is discovered, I want to recover the URI for that feed, which I can then do things with.

First I tried regular expressions, but they’re slow, resource-intensive, and incredibly difficult to make accurate.

I’ve since seen that DOMDocument offers some hope, but I have no idea where to start, and I can’t find examples of what I’m trying to do.

So if anyone can help out here, that’d be excellent!

Here’s a script I wrote a few months ago that uses DOMDocument and XPath to grab feeds from a page.


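function feedSearch($url) {

    // Fetch the page; bail out if the request fails or returns nothing.
    $source = @file_get_contents($url);
    if ($source === false || $source === '') {
        return false;
    }

    // loadHTML() is an instance method, so create the document first.
    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $html = new DOMDocument();
    $loaded = $html->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    // Select the href attribute of every RSS <link> in the <head>.
    $xpath = new DOMXPath($html);
    $feeds = $xpath->query("//head/link[@href][@type='application/rss+xml']/@href");

    $results = array();

    foreach ($feeds as $feed) {
        $results[] = $feed->nodeValue;
    }

    return $results;

}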

print_r(feedSearch('http://www.flickr.com/photos/tags/bristol/'));

/*
Array
(
    [0] => http://api.flickr.com/services/feeds/photos_public.gne?tags=bristol&lang=en-us&format=rss_200
    [1] => http://api.flickr.com/services/feeds/geo/?tags=bristol&lang=en-us
)
*/

Sam, that’s absolutely spot on, mate!

Exactly what I needed.

I was just trying out a few other examples, based on the documentation, but not having any luck.

Thanks again.

Sam, I have one question — is it possible to retrieve the title along with the URI?

Provided the title attribute is set on the <link> tag, grabbing the feed title is fairly trivial.


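function feedSearch($url) {

    // Fetch the page; bail out if the request fails or returns nothing.
    $source = @file_get_contents($url);
    if ($source === false || $source === '') {
        return false;
    }

    // loadHTML() is an instance method, so create the document first.
    // Collect markup warnings internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $html = new DOMDocument();
    $loaded = $html->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return false;
    }

    // Select the RSS <link> elements themselves so both attributes can be read.
    $xpath = new DOMXPath($html);
    $feeds = $xpath->query("//head/link[@href][@type='application/rss+xml']");

    $results = array();

    foreach ($feeds as $feed) {
        $results[] = array(
            // getAttribute() returns an empty string when title is missing.
            'title' => $feed->getAttribute('title'),
            'href' => $feed->getAttribute('href'),
        );
    }

    return $results;

}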

print_r(feedSearch('http://www.flickr.com/photos/tags/bristol/'));

/*
Array
(
    [0] => Array
        (
            [title] => Flickr: "bristol" RSS feed
            [href] => http://api.flickr.com/services/feeds/photos_public.gne?tags=bristol&lang=en-us&format=rss_200
        )

    [1] => Array
        (
            [title] => Flickr: "bristol" Geo feed
            [href] => http://api.flickr.com/services/feeds/geo/?tags=bristol&lang=en-us
        )

)
*/

However, you may find the title attribute is not specified for every feed you come across.

When this is the case, it will be necessary to parse each individual feed using SimpleXML and extract the title from the feed itself.
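
As a rough sketch of that fallback, something like the following could work. The helper names feedTitleFromXml() and feedTitle() are made up for this example, and it assumes an RSS 2.0 feed, where the title lives at <rss><channel><title>; other feed formats would need their own handling.

```php
<?php
// Hypothetical helper (not part of the script above): read the title from feed XML.
function feedTitleFromXml($xml) {
    // Keep parse warnings out of the output, then restore the previous setting.
    $previous = libxml_use_internal_errors(true);
    $feed = simplexml_load_string($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if ($feed === false) {
        return false; // not well-formed XML
    }

    // Assumption: RSS 2.0 layout, i.e. <rss><channel><title>.
    if (isset($feed->channel->title)) {
        return trim((string) $feed->channel->title);
    }

    return false; // no channel title found
}

// Hypothetical wrapper: fetch the feed by URL, then read its title.
function feedTitle($href) {
    $xml = @file_get_contents($href);
    if ($xml === false || $xml === '') {
        return false;
    }
    return feedTitleFromXml($xml);
}

// Offline check against an inline sample feed.
$sample = '<rss version="2.0"><channel><title>Example feed</title></channel></rss>';
echo feedTitleFromXml($sample), "\n"; // prints "Example feed"
```

You would only call feedTitle($result['href']) for results where 'title' came back empty, since it costs an extra HTTP request per feed.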

Sam, that’s brilliant.

I’d sort of figured out a solution, but it was using two loops to pick out the title and href attributes.

Your solution is much more elegant.

Thanks again!

No problem, glad I could be of help :)