I am trying to find when a URL or site has been last modified…
I was under the impression that this would be an easy task using PHP curl(…), but unfortunately I am slowly coming to the conclusion that it is remarkably difficult, because the remote site may not send a Last-Modified date.
I have also tried saving the strlen of the downloaded web page and comparing it with the strlen of the current page, but curl appears to give conflicting results even when I know the page has not been modified, because I have tested on my own sites.
Any suggestions to test if a site has been modified?
Well, rule #1 here is always going to be ‘beware the cache’.
strlen would be insufficient - if I fix a typo and change a k to a c, I’ve changed the content but not the length.
The only real way (and even that’s not 100% foolproof, though it is ignorably close) would be to hash the contents. But any dynamic element on the page will result in false positives…
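A minimal sketch of the hashing idea, in Python rather than PHP, purely for illustration. The function names and the `volatile_patterns` parameter are my own assumptions, not anything from a library: the idea is to strip the parts of the page you know are dynamic (a clock, a session token) before hashing, so they don't trigger false positives. You would have to work out those patterns per site.

```python
import hashlib
import re

def content_fingerprint(html, volatile_patterns=()):
    """Return a SHA-256 hex digest of the page text.

    volatile_patterns is a hypothetical list of regular expressions
    matching page fragments that change on every request; they are
    removed before hashing.
    """
    for pattern in volatile_patterns:
        html = re.sub(pattern, "", html, flags=re.DOTALL)
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

def has_changed(old_fingerprint, html, volatile_patterns=()):
    """Compare a stored fingerprint against freshly downloaded content."""
    return content_fingerprint(html, volatile_patterns) != old_fingerprint

# Same length, different content: strlen misses this, the hash does not.
old_page = "<p>Check the kat.</p>"
new_page = "<p>Check the cat.</p>"
stored = content_fingerprint(old_page)
print(len(old_page) == len(new_page), has_changed(stored, new_page))
```

Store the fingerprint from the previous run and compare it on the next one. If the page carries something like a timestamp in a known element, pass a pattern for it so only the real content is hashed.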
I assume the most common way to do this is to look at the modified date of each file. This can be done by recursively or iteratively walking the files. You need access to the file system on the server to do that. That should satisfy the requirement to get the last-modified date. Your original question does not state a requirement to know what has changed within any file.
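Walking the files could look roughly like this, sketched in Python. It assumes you can run code on the server and that the site is served from files under one directory; `latest_modification` is just a name I made up for the example.

```python
import os
from datetime import datetime, timezone

def latest_modification(root):
    """Walk root recursively and return (path, UTC datetime) of the
    most recently modified file, or None if there are no files."""
    latest_path, latest_mtime = None, 0.0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(path)
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
            if latest_path is None or mtime > latest_mtime:
                latest_path, latest_mtime = path, mtime
    if latest_path is None:
        return None
    return latest_path, datetime.fromtimestamp(latest_mtime, tz=timezone.utc)
```

Note this only tells you when a file on disk last changed. Content that comes out of a database will not show up here.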
Comparing files to determine changes can be quite complex.
Yeah John, just take those potentially thousands of rows in your database table and manually compare them. Every day. I’m sure it won’t take you long. Or be the kind of repetitive menial task that we definitely didn’t create computers to do.
When I last tried to do something similar, to have an automatic routine to check whether a specific URL had changed, I found that even something as trivial as the page containing the current date or time as an ASP variable would cause the server to give the current date/time as the “last modified”. Frustrating, as I wasted ages learning how to send specific HTTP headers to retrieve a page only if it had been modified since a given date (If-Modified-Since). If you can call learning wasted, that is.
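For anyone who wants to try the conditional request anyway, here is a rough sketch in Python using only the standard library. The helper names are mine, and it assumes you saved the Last-Modified and/or ETag values from a previous response. A server that supports this answers 304 Not Modified when nothing changed; as described above, a dynamically generated page may simply answer 200 with a fresh date every time, in which case this tells you nothing.

```python
import urllib.error
import urllib.request

def conditional_headers(last_modified=None, etag=None):
    """Build request headers from values saved on a previous fetch."""
    headers = {}
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    if etag:
        headers["If-None-Match"] = etag
    return headers

def fetch_if_changed(url, last_modified=None, etag=None, timeout=10):
    """Fetch url only if the server reports a change.

    Returns a dict with 'changed' plus the validators to store for
    the next check. urllib raises HTTPError for a 304 response, so
    that case is handled in the except branch.
    """
    request = urllib.request.Request(
        url, headers=conditional_headers(last_modified, etag)
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return {
                "changed": True,
                "body": response.read(),
                "last_modified": response.headers.get("Last-Modified"),
                "etag": response.headers.get("ETag"),
            }
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return {
                "changed": False,
                "body": None,
                "last_modified": last_modified,
                "etag": etag,
            }
        raise
```

If the first response comes back with neither a Last-Modified nor an ETag header, there is nothing to send next time and you are back to comparing the content yourself.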