How to get a site's Last-Modified date?

#1

I am trying to find when a URL or site has been last modified…

I was under the impression that using PHP curl(…) would be an easy task but unfortunately I am slowly coming to the conclusion it is remarkably difficult because the remote site may not have a Last-Modified date :frowning:

I have also tried saving the strlen of the downloaded web page and comparing it with the strlen of the current page, but curl appears to give conflicting results even when I know the page has not been modified, because I tested on my own sites.

Any suggestions to test if a site has been modified?

#2

Well, rule #1 here is always going to be ‘beware the cache’.
strlen would be insufficient: if I fix a typo and change a k to a c, I’ve changed the content but not the length.
The only real way (and even that’s not 100% foolproof, though it is ignorably close) would be to hash the contents. But any dynamic element on the page will result in false positives…
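One way to reduce those false positives is to strip the parts of the markup that tend to change on every request before hashing. The following is only a minimal sketch: the function name `stable_hash` and the three patterns are assumptions for illustration, and a real page may need different rules.

```php
<?php
// Hypothetical helper (not from the thread): removes markup that often
// differs between requests, then hashes what is left.
function stable_hash(string $html): string
{
    $patterns = [
        '~<script\b[^>]*>.*?</script>~is',             // inline and external scripts
        '~<!--.*?-->~s',                               // HTML comments, which often carry timestamps
        '~<input\b[^>]*type=["\']hidden["\'][^>]*>~i', // hidden fields such as CSRF tokens
    ];
    // preg_replace() returns null on a regex error; fall back to the raw markup.
    $clean = preg_replace($patterns, '', $html) ?? $html;
    // Collapse whitespace so formatting-only differences are ignored.
    $clean = trim(preg_replace('~\s+~', ' ', $clean) ?? $clean);
    return hash('sha256', $clean);
}
```

Two downloads that differ only in a comment timestamp or a script variable then produce the same hash, while a one-letter change in the visible text still produces a different one.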

#3

Save Markup, so all JS and all CSS.

#4

I tried the strlen and even though the page had not changed and there are no dynamic adverts on the page, the total number of bytes varied by an appreciable amount!

I also unsuccessfully tried Hashing the content.

Looks like I will have to continue with my research :frowning:

I was able to consistently retrieve the Last-Modified date of an image and its ETag reference but not those of the main site URL.
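For reference, a sketch of how those two headers can be read with a HEAD request. The function names `parse_validators` and `fetch_validators` are made up for this example, and the sketch assumes the server answers HEAD requests at all; when the server sends neither header, both values simply come back as null.

```php
<?php
// Extracts Last-Modified (as a Unix timestamp) and ETag from raw HTTP headers.
// When redirects are followed, curl returns one header block per hop, so the
// values are reset at each status line and only the final response counts.
function parse_validators(string $rawHeaders): array
{
    $found = ['last_modified' => null, 'etag' => null];
    foreach (preg_split('~\r?\n~', $rawHeaders) as $line) {
        if (strpos($line, 'HTTP/') === 0) {
            $found = ['last_modified' => null, 'etag' => null];
        } elseif (stripos($line, 'Last-Modified:') === 0) {
            $ts = strtotime(trim(substr($line, 14)));
            $found['last_modified'] = ($ts === false) ? null : $ts;
        } elseif (stripos($line, 'ETag:') === 0) {
            $found['etag'] = trim(substr($line, 5));
        }
    }
    return $found;
}

// Sends a HEAD request and returns whatever validators the server offers.
function fetch_validators(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // HEAD request: no body is downloaded
        CURLOPT_HEADER         => true, // include the headers in the returned string
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 8,
    ]);
    $raw = curl_exec($ch);
    curl_close($ch);
    return parse_validators($raw === false ? '' : $raw);
}
```

Usage would be along the lines of `print_r(fetch_validators($url));`.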

#5

Any recommendations for saving the markup? curl’s results do not appear to be consistent.

#6

Then more investigation is required because that should not be the case unless there is some dynamic element to the code.

How’re you performing the curl? Why can you not simply capture the output?

#7

OK, after more investigation I discovered that:

  1. it was essential to use curl_close($ch)
  2. certain curl_setopt(…) options were essential, otherwise different values were returned.

TEST-003

// test
$url = 'https://sitepoint.com/community/';
echo '<pre>';  // prettify output - preserves linefeeds
echo "\$url = '$url';\n";
print_r( $tests->test_003( $url ) );  // print the returned array
echo '</pre>';

Output:

$url = 'https://sitepoint.com/community/';
Array
(
    [sizeof] => 38,533
    [md5] => 92ad137c2ed63959df5a8e891924297e
    [sha1] => 4a02b57a5717ed7296754ff2282315567e67e576
    [crypt] => san3MWNiy7wSU
    [md5_file] => 92ad137c2ed63959df5a8e891924297e
    [sha1_file] => 4a02b57a5717ed7296754ff2282315567e67e576
)
 

Test function:

# ============================================================
public function test_003
(
  string $url = ''
)
:array 
{
  $result = [];
  $ch = curl_init();

  $aOpts = [
    CURLOPT_URL             => $url,
    CURLOPT_RETURNTRANSFER  => TRUE,     // DO NOT DISPLAY CONTENT IMMEDIATELY
    CURLOPT_CONNECTTIMEOUT  => 8,        // whole seconds, the option takes an integer
  # CURLOPT_AUTOREFERER     => true,     // set referer on redirect
  # CURLOPT_ENCODING        => "",       // handle all encodings
  # CURLOPT_MAXREDIRS       => 10,       // stop after 10 redirects
    CURLOPT_FOLLOWLOCATION  => true,     // follow redirects
    CURLOPT_USERAGENT       => "spider", // who am i // spider
    CURLOPT_SSL_VERIFYPEER  => TRUE,     // verify the certificate (an earlier duplicate FALSE entry was overridden anyway)
    CURLOPT_SSL_VERIFYHOST  => 2,        // verify the host name (as above)
    CURLOPT_HEADER          => FALSE,    // FALSE === HEADERS NOT INCLUDED IN OUTPUT
    CURLOPT_NOBODY          => FALSE,    // FALSE === HAS BODY
  ];
  curl_setopt_array($ch, $aOpts); 

  $ch1 = curl_exec($ch); // 'https://supiet2.tk');
  curl_close($ch);

// FILE STUFF
  $fff = '/tmp/kill.html'; // clear space - delete all  kill*.*
  $ptr = fopen($fff, 'w');
    fwrite($ptr, $ch1);
  fclose($ptr);  

// size and hashes of the downloaded content
  $result['sizeof'] = number_format( (float) strlen( file_get_contents($fff) ) );

  $result['md5'] = md5($ch1);

  $result['sha1'] = sha1($ch1);

  // with this salt crypt() uses traditional DES, which only reads the first
  // eight characters of its input, so it will miss most changes to the page
  $result['crypt'] = crypt($ch1, 'salt string goes here');

  $result['md5_file'] = md5_file($fff);

  $result['sha1_file'] = sha1_file($fff);

  return $result;
}//

Now on to the next hurdle :slight_smile:

#8

I want to use curl_multi_exec($rsc, $running); because there are quite a few URLs to test and it is best to run them simultaneously rather than one after the other.
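A rough sketch of what that could look like with the curl_multi functions. The function name `fetch_all` and the option values are assumptions for illustration, not a tested production routine; failed transfers are reported as null.

```php
<?php
// Fetches several URLs concurrently and returns [url => body, or null on failure].
function fetch_all(array $urls): array
{
    if ($urls === []) {
        return [];
    }
    $mh = curl_multi_init();
    $handles = [];
    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt_array($ch, [
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_CONNECTTIMEOUT => 8,
            CURLOPT_USERAGENT      => 'spider',
        ]);
        curl_multi_add_handle($mh, $ch);
        $handles[$url] = $ch;
    }

    // Drive all transfers until none are still running.
    do {
        $status = curl_multi_exec($mh, $running);
        if ($running > 0) {
            curl_multi_select($mh, 1.0); // wait for activity instead of busy-looping
        }
    } while ($running > 0 && $status === CURLM_OK);

    // Per-transfer results are reported through curl_multi_info_read().
    $failed = [];
    while ($info = curl_multi_info_read($mh)) {
        if ($info['result'] !== CURLE_OK) {
            $failed[] = $info['handle'];
        }
    }

    $bodies = [];
    foreach ($handles as $url => $ch) {
        $bodies[$url] = in_array($ch, $failed, true) ? null : curl_multi_getcontent($ch);
        curl_multi_remove_handle($mh, $ch);
        curl_close($ch);
    }
    curl_multi_close($mh);
    return $bodies;
}
```

Each returned body can then be hashed exactly as in the single-URL test.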

#9

I assume the most common way to do this is to look at the modified date of each file. This can be done by recursively or iteratively walking the files. You need access to the file system on the server to do that. That should satisfy the requirement to get the last-modified date. Your original question does not state a requirement to know what has changed within any file.

One problem with that approach is a file that exists in the file system but is not used in the website. Solving that problem is likely much more complex. You probably must parse every HTML file and get all referenced files; not just HTML files (as in links) and image files but also stylesheets and JavaScript files.

Comparing files to determine changes can be quite complex.

#10

The modified date is required on third-party sites and I do not have access to their sites.

The kludge is to save a hash of the site URL contents and later compare the saved hash with a hash of the current contents.
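Roughly, the kludge could be sketched like this. The name `has_changed` and the JSON file used as storage are assumptions made for the example; a database column would serve the same purpose.

```php
<?php
// Compares the hash of $content with the hash stored for $url, then stores
// the new hash. Returns true only when an earlier hash exists and differs,
// so the first check of a URL reports "no change" and just records a baseline.
// $storeFile is a hypothetical JSON file mapping URLs to hashes.
function has_changed(string $url, string $content, string $storeFile): bool
{
    $hashes = [];
    if (is_file($storeFile)) {
        // json_decode() yields null for an empty or corrupt file; start afresh then.
        $hashes = json_decode((string) file_get_contents($storeFile), true) ?: [];
    }
    $new = hash('sha256', $content);
    $old = $hashes[$url] ?? null;
    $hashes[$url] = $new;
    file_put_contents($storeFile, json_encode($hashes), LOCK_EX);
    return $old !== null && $old !== $new;
}
```

A call such as `has_changed($url, $html, '/tmp/hashes.json')` would then be made once per URL on each run, with `$html` being the downloaded page.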

It would be a lot easier if a modified date was available.

#11

Have you looked into using ETag?

#12

Curl gets the file, right? You need HTTP header data, such as what Mittineague refers to. See HTTP headers - HTTP | MDN. You could use either of:

#13

@SamuelCalifornia

Yes, and as with the Last-Modified header, there are numerous sites that do not send these headers.

If they are not available it looks as though I will have to stick with downloading the complete web-page and saving a hash of the contents.

#14

However, there are ways to get an approximation of a page’s last modified date, even if you’re not the website owner.
You can check by using two methods:

  1. Using RSS Feeds to check Published Dates of Articles
  2. Using Google Cache to Check the Last Crawl of a Page
#15

I want to select URLs from a MySQL database and use PHP to check for updates rather than manually checking web pages.

#16

Of course. We all understand how labor-intensive it would be to manually compare pages.

#17

Yeah John, just take those potentially thousands of rows in your database table and manually compare them. Every day. I’m sure it won’t take you long. Or be a repetitive menial task that we definitely didn’t create computers to do such things for.

#18

When I last tried to do something similar, to have an automatic routine to check whether a specific URL had changed, I found that even something as trivial as the page containing the current date or time as an ASP variable would cause the server to give the current date/time as the “last modified”. Frustrating, as I wasted ages learning about how to send specific HTTP headers to only retrieve if modified-since. If you can call learning wasted, that is.
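For anyone wanting to try the conditional request anyway, here is a small sketch using If-Modified-Since and, optionally, If-None-Match. The function names `conditional_headers` and `modified_since` are invented for this example. As described above, a server that reports the current time as Last-Modified for dynamic pages will never answer 304, so this only helps on sites that send meaningful validators.

```php
<?php
// Builds conditional request headers from validators saved on an earlier visit.
function conditional_headers(?int $lastModified, ?string $etag): array
{
    $headers = [];
    if ($lastModified !== null) {
        // HTTP dates are always expressed in GMT.
        $headers[] = 'If-Modified-Since: ' . gmdate('D, d M Y H:i:s', $lastModified) . ' GMT';
    }
    if ($etag !== null && $etag !== '') {
        $headers[] = 'If-None-Match: ' . $etag;
    }
    return $headers;
}

// Returns false when the server answers 304 Not Modified, true for any
// other status, and null when the request itself fails.
function modified_since(string $url, ?int $lastModified, ?string $etag = null): ?bool
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_CONNECTTIMEOUT => 8,
        CURLOPT_HTTPHEADER     => conditional_headers($lastModified, $etag),
    ]);
    $body   = curl_exec($ch);
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);
    if ($body === false) {
        return null;
    }
    return $status !== 304;
}
```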

#19

Did you manage to retrieve the Last-Modified date for every URL tested? If so, I would be interested in some hints on how to extract the date.

I get the distinct impression that Servers can set flags to prevent the date from showing.

#20

They certainly can; the HTTP spec does not require servers to send it:

HTTP/1.1 servers SHOULD send Last-Modified whenever feasible.

See https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.29
