Extract RSS link from page source

Hi all. Long time reader - first time poster.

I’ve been teaching myself PHP over the last month or so and have encountered something that is currently beyond my limited knowledge.

What I have to do is get the page source of another one of my websites and strip the RSS URL.

The RSS URL is structured as follows:

<link rel="alternate" type="application/rss+xml" title="Some photo feed:  search: xx-xxx " 
href="/search/photo.look/rss/JnJlZ3NlYXJjaD1WSC12ZWQmZGlzdGluY3RfZW50cnk9dHJ1ZQ/">

There is a line break after xx-xxx "

I have been trying to teach myself preg_match_all to extract the data, and have literally tried about 1000 combinations while looking at other examples. I’ve managed to pull sections of the page but never what I’m after.

I’ve read a lot of tutorials etc but can’t find my problem.

Any help would be appreciated.

This has no bearing on your existing issue but this is where I love Yahoo Pipes. They make things like this very simple.

You know, you really shouldn’t say things like that to me - I may just begin to believe it. :stuck_out_tongue:
I’m happy I could be of assistance. Feel free to send me a PM or email me if you need anything else.
I’m here to help!

Together with what I’ve read in the manual, that now makes perfect sense. Thanks. You’re a legend!

If you mean you only want to match the first rss link, using my pattern with preg_match should do it. If you mean stopping at the first /"> once it’s matching the string, that’s done using the ‘?’ operator. It makes the previous quantifier ungreedy. As an example, let’s look at rss\/(.+?)\/">

This says: after matching ‘rss/’, open a capture group with (, take one or more characters of any type with .+, but no more than necessary to reach the next part of the pattern because of the ?, which in this case is /">

So this is basically saying: from rss/ (get everything here until you reach) /">
The key is to be ungreedy, using ?, so you get only what is needed to continue the pattern.
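
To make the greedy versus ungreedy difference concrete, here is a minimal sketch. The sample string is made up for illustration (two links on one line, so /"> appears twice); only the pattern fragment comes from the post above.

```php
<?php
// Hypothetical sample: two links on one line, so /"> occurs twice.
$html = '<a href="/rss/first/"></a><a href="/rss/second/">';

// Greedy: .+ runs to the end of the string, then backs up to the LAST /">
preg_match('/rss\/(.+)\/">/', $html, $greedy);

// Ungreedy: .+? grows one character at a time and stops at the FIRST /">
preg_match('/rss\/(.+?)\/">/', $html, $lazy);

echo $greedy[1], "\n"; // first/"></a><a href="/rss/second
echo $lazy[1], "\n";   // first
```

The only change between the two patterns is the ? after .+, which is why the greedy version swallows the second link as well.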

Sorry, one more question if I may. How do you state the expression as " end the match at the first occurrence of /">… where /"> is the end of the string but repeats itself multiple times on the page " ? I can’t seem to find that in the expression (partly because I’m somewhat of a moron when it comes to these things).

Edit. Will preg_match_all always match the first occurrence, or does that have to be built into the expression? Or does it match the first occurrence (more the end of the expression) by default?
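
On the preg_match_all question, a small sketch may help. The two-link sample string is an assumption for illustration. preg_match_all collects every non-overlapping match, while preg_match stops after the first one; where each individual match ends is decided by the pattern itself (the ungreedy ?), not by which function is called.

```php
<?php
// Hypothetical sample containing two RSS-style links.
$html = '<link href="/rss/AAA/"> <link href="/rss/BBB/">';
$pattern = '/rss\/(.+?)\/">/';

// preg_match_all returns the number of matches and fills $all with every one.
$count = preg_match_all($pattern, $html, $all);

// preg_match stops after the first match.
preg_match($pattern, $html, $first);

echo $count, "\n";     // 2
print_r($all[1]);      // captured groups: AAA and BBB
echo $first[1], "\n";  // AAA
```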

It works nicely on the website now but I’m using your code to help understand how they work.

Alright, so if I understand you, you want everything between rss/ and /">?

If so, this should do it:


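<?php

// Single-quoted so the double quotes in the pattern need no escaping.
// The /s flag lets . match the line break inside the tag.
$pattern = '/<link.+?type="application\/rss\+xml".+?href=".+?rss\/(.+?)\/">/s';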
$text = '<link rel="alternate" type="application/rss+xml" title="Some photo feed:  search: xx-xxx " 
href="/search/photo.look/rss/JnJlZ3NlYXJjaD1WSC12ZWQmZGlzdGluY3RfZW50cnk9dHJ1ZQ/">';

preg_match_all($pattern, $text, $matches);

print_r($matches);

If that’s your desired result, I’d be happy to walk you through it so you can understand how to reproduce this in the future.

Zarin. You’re brilliant. Thank you so much for your help. You know, I literally spent 2 days trying to get it right and refrained from posting a question because I wanted to work it out myself. As it turns out, I was close, but not close enough. I made a few errors at the end of the string.

I can’t begin to explain how much I appreciate your help. DM a postal address and I’ll send you a t-shirt and cap from my website :slight_smile:

I’ve managed to write out your expression in plain English and it’s gone a long way in helping me work out these things.

Thank you again.

Thank you for the quick reply.

I should have been more specific in my question.

First thing I’m doing is getting the content of the entire remote page.

$url = "http://xxxxxxxx.net/search/photo.look?id=xx-xxx";
$content = @file_get_contents($url) or die("Could not access file: $url");

The last string of characters at the end of the RSS feed (from my first post) is random (as is everything else marked as ‘x’) and it all changes based on the specific page I’m visiting.

I’m then trying to match the RSS pattern and extract only that string of text.

Having a look at it now, how would I retrieve everything prior to the first occurrence of /"> - which is at the end of the RSS string? Having a look at the random nonsense I’m getting now, I think that’s where the expression is falling apart.

Thanks so much for the reply. I forgot how hard it was to learn something and you’ve given me some good material to reference.

Try this out:


<?php

// Single-quoted so the double quotes in the pattern need no escaping.
$pattern = '/<link.+?type="application\/rss\+xml".+?href="(.+?)">/s';
$text = '<link rel="alternate" type="application/rss+xml" title="Some photo feed:  search: xx-xxx " 
href="/search/photo.look/rss/JnJlZ3NlYXJjaD1WSC12ZWQmZGlzdGluY3RfZW50cnk9dHJ1ZQ/">';

preg_match_all($pattern, $text, $matches);

print_r($matches);
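
If only the first feed link is wanted from a full page, the same pattern can be wrapped in a small helper. This is a rough sketch under a few assumptions: first_rss_href() is a hypothetical name, the page source below is made up to stand in for $content, and the pattern only works when href comes after type inside the tag, as it does in the example from the question.

```php
<?php
// Return the href of the first RSS <link> in $html, or null when none is found.
// first_rss_href() is a hypothetical helper name, not part of PHP.
function first_rss_href(string $html): ?string
{
    // Same pattern as above, single-quoted; /s lets . cross line breaks.
    $pattern = '/<link.+?type="application\/rss\+xml".+?href="(.+?)">/s';

    // preg_match stops after the first match and returns 1, 0, or false.
    if (preg_match($pattern, $html, $m) === 1) {
        return $m[1];
    }
    return null;
}

// Made-up page source standing in for $content from file_get_contents().
$content = '<head>
<link rel="alternate" type="application/rss+xml" title="Feed A" 
href="/search/photo.look/rss/AAA/">
<link rel="alternate" type="application/rss+xml" title="Feed B" href="/search/photo.look/rss/BBB/">
</head>';

var_dump(first_rss_href($content));          // href of Feed A only
var_dump(first_rss_href('<p>no feed</p>'));  // NULL
```

Returning null on a miss keeps the caller from reading $matches[1] when nothing matched.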