Advice on Updating Links Each Week

Good afternoon,

I am in the process of updating my church’s website. My goal is to automate as much of the updating process as possible. One of the complaints is that the group responsible for updating the links to the latest bulletins would often forget. We currently link out to the publisher’s page, which lists the most recent bulletins.

What I would like to do is to get those links to display right on our main site. I talked to the publishing company, and they have no problem with us linking directly to the bulletins on their server.

What would be the best way to get these automatically updating on our site? Should I use screen scraping? The links are always going to be in a format that we know. Should I enter all of the future names in a database and only display them based on today’s date? What is the best way to do that?

I am really open to any suggestions.

Thanks!

Here are the links to the bulletins. I would like to replicate that exact box on our main site.

http://www.thecatholicdirectory.com/directory.cfm?fuseaction=display_site_info&siteid=58870

If you have permission from the publisher, scraping should be fine.

This is the HTML which is responsible for the bulletins:

<fieldset style="width:150px; padding-top:5px; clear:both; text-align:center;">
    <legend align="center"><span class="PageTitleLight">Bulletins</span></legend>
    <a href="http://www.catholicweb.com/bulletins/58870/Mar-28-2010.pdf" target="newwin" rel="nofollow">Mar&nbsp;28,&nbsp;2010</a><br />
    <a href="http://www.catholicweb.com/bulletins/58870/Mar-21-2010.pdf" target="newwin" rel="nofollow">Mar&nbsp;21,&nbsp;2010</a><br />
    <a href="http://www.catholicweb.com/bulletins/58870/Mar-14-2010.pdf" target="newwin" rel="nofollow">Mar&nbsp;14,&nbsp;2010</a><br />
    <a href="http://www.catholicweb.com/bulletins/58870/Mar-07-2010.pdf" target="newwin" rel="nofollow">Mar&nbsp;07,&nbsp;2010</a><br />
    <a href="http://www.catholicweb.com/bulletins/58870/Feb-28-2010.pdf" target="newwin" rel="nofollow">Feb&nbsp;28,&nbsp;2010</a><br />
    <a href="http://www.catholicweb.com/user_home.cfm?fuseaction=add_bulletin&bulletin=58870" rel="nofollow"><img src="http://www.catholicweb.com/images/layout/email_delivery.gif" alt="Get this bulletin in your email" border="0" /></a>
</fieldset>

Normally, I would suggest an XML-based approach: parse the HTML file, find the fieldset with the legend “Bulletins”, and then loop through its links.

However, as all of the bulletins use the same URL structure, and only bulletins have this URL structure, you can get away with using regular expressions:
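For comparison, the parsing approach described above could look roughly like the sketch below. It uses PHP's built-in DOMDocument and DOMXPath classes; the function name `extractBulletinLinks` and the shape of the returned array are my own assumptions, not anything the publisher provides, and it assumes the markup keeps the "Bulletins" legend and the `/bulletins/` path shown in the HTML above.

```php
<?php
// Hypothetical helper: returns a list of ['url' => ..., 'label' => ...] entries
// for every bulletin link found inside the "Bulletins" fieldset.
function extractBulletinLinks(string $html): array
{
    if ($html === '') {
        return []; // nothing to parse
    }

    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // Anchors inside the fieldset whose legend reads "Bulletins",
    // limited to links that point into the /bulletins/ path.
    $query = '//fieldset[normalize-space(legend) = "Bulletins"]'
           . '//a[contains(@href, "/bulletins/")]';

    $links = [];
    foreach ($xpath->query($query) as $anchor) {
        $links[] = [
            'url'   => $anchor->getAttribute('href'),
            // &nbsp; is decoded to U+00A0; turn it back into a plain space.
            'label' => trim(str_replace("\u{00A0}", ' ', $anchor->textContent)),
        ];
    }
    return $links;
}
```

This is more forgiving of attribute order and whitespace changes than a regular expression, at the cost of a little more code.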

<?php
// Fetch the publisher's page that lists the latest bulletins.
$content = file_get_contents('http://www.thecatholicdirectory.com/directory.cfm?fuseaction=display_site_info&siteid=58870');
if ($content === false) {
    exit('Could not load the bulletin page.');
}
// Capture each bulletin's URL (group 1) and its link text (group 2).
preg_match_all('~<a href="(http://www\.catholicweb\.com/bulletins/58870/[A-Za-z]+-\d+-\d+\.pdf)" target="newwin" rel="nofollow">([^<]+)</a><br />~', $content, $matches);
// To output all of the links as they are:
echo '<fieldset><legend>Bulletins</legend>' . implode(PHP_EOL, $matches[0]) . '</fieldset>';
// To output them in a format you'd prefer:
echo '<fieldset>' . PHP_EOL;
echo '<legend>Bulletins</legend>' . PHP_EOL;
echo '<ul>' . PHP_EOL;
foreach ($matches[1] as $key => $url) {
    printf('<li><a href="%s">%s</a></li>', $url, $matches[2][$key]);
    echo PHP_EOL;
}
echo '</ul>' . PHP_EOL;
echo '</fieldset>';
?>

I’d recommend caching your results rather than requesting the external webpage on every page load. By caching the results, you are less likely to aggravate the site owner and, more importantly, your visitors no longer have to wait on the external request each time, which saves a large portion of the loading time.
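A minimal sketch of that caching idea, assuming a simple file on disk is good enough for a small site. The function name `cachedFetch`, the cache file path, and the six-hour lifetime in the usage comment are all assumptions of mine; pick whatever suits your hosting.

```php
<?php
// Hypothetical helper: return the cached copy while it is still fresh,
// otherwise call $fetch, store its result, and return that.
function cachedFetch(string $cacheFile, int $maxAgeSeconds, callable $fetch): string
{
    // Serve from the cache if the file exists and is younger than the limit.
    if (is_file($cacheFile) && (time() - filemtime($cacheFile)) < $maxAgeSeconds) {
        return (string) file_get_contents($cacheFile);
    }

    $fresh = $fetch();
    if ($fresh === false || $fresh === '') {
        // The fetch failed: fall back to a stale copy if one exists.
        return is_file($cacheFile) ? (string) file_get_contents($cacheFile) : '';
    }

    // LOCK_EX avoids two simultaneous requests writing a garbled file.
    file_put_contents($cacheFile, $fresh, LOCK_EX);
    return $fresh;
}

// Example usage (assumed path and lifetime), left commented out:
// $content = cachedFetch(sys_get_temp_dir() . '/bulletins.html', 6 * 3600, function () {
//     return @file_get_contents('http://www.thecatholicdirectory.com/directory.cfm?fuseaction=display_site_info&siteid=58870');
// });
```

Since the bulletins only change weekly, a lifetime of several hours should be plenty. Falling back to the stale copy means your page still shows the last known links if the publisher's site is briefly down.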

Thank you very much for all of the help! I am working on this right now! Much appreciated.