I am trying to find when a URL or site has been last modified…
I was under the impression that using PHP curl(…) would be an easy task but unfortunately I am slowly coming to the conclusion it is remarkably difficult because the remote site may not have a Last-Modified date
I have also tried saving the strlen of the downloaded web-page and to compare with the current strlen web-page but curl appears to give conflicting results when I know the page has not been modified because I have tested on my own sites?
Any suggestions to test if a site has been modified?
Well #1 here is always going to ‘beware the cache’.
strlen would be insufficient - if i fix a typo and change a k to a c, I’ve changed the content but not the length.
Only real way (and even thats not 100% foolproof, though it is ignorably close) would be to hash the contents. But any dynamic element on the page will result in false positives…
I tried the syrlen and even though the page had not changed and there is no dynamic adverts on the page the total number of bytes varied by an appreciable amount!
I also unsuccessfully tried Hashing the content.
Looks like I will have to continue with my research
I was able to consistently retrieve the last-modified date of an image and it’s etag reference but not the main site URL.
I was wanting to use curl_multi_exec($rsc, $running); because there are quite a few URLs to test and best to run simultaneously rather than one after the other.
I assume the most common way to do this is to look at the modified date of each file. This can be done by recursively or iteratively walking the files. You need access to the file system in the server to do that. That should satisfy the requirement to get the last-modifed date. Your original question does not state a requirement to know what has changed within any file.
One problem with the previous is if a file exists in the file system but not used in the website. Solving that problem is likely much more complex. You probably must parse every HTML file and get all referenced files; not just HTML files (as in links) and image files but also stylesheets and JavaScript files.
Comparing files to determine changes can be quite complex.
However, there are ways to get an approximation of a page’s last modified date , even if you’re not the web site owner.
You can check by using two method:
Using RSS Feeds to check Published Dates of Articles
Using Google Cache to Check the Last Crawl of a Page
Yeah John, just take those potentially thousands of rows in your database table and manually compare them. Every day. I’m sure it won’t take you long. Or be a repetitive menial task that we definitely didn’t create computers to do such things for.
When I last tried to do something similar, to have an automatic routine to check whether a specific URL had changed, I found that even something as trivial as the page containing the current date or time as an ASP variable would cause the server to give the current date/time as the “last modified”. Frustrating, as I wasted ages learning about how to send specific HTTP headers to only retrieve if modified-since. If you can call learning wasted, that is.