Last-Modified

I used get_headers to get the Last-Modified date of a URL, but the returned array does not have a Last-Modified key. Is there a trick to get the last-modified date of a URL?

Are you sure the server provides that information? I suspect that if the server isn’t sending it, there’s not a lot you can do about it.

Yes, going by the RFC it looks like it’s only a “usually”:

https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

14.29 Last-Modified
…
HTTP/1.1 servers SHOULD send Last-Modified whenever feasible.
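
Since the header is optional, it may be worth checking for it defensively. Below is a minimal sketch, assuming get_headers is called with its second argument set to true so that it returns an associative array. The helper name findHeader and the example.com URL are made up for illustration. Header names are case-insensitive, so a server that sends `last-modified` in lowercase would be missed by a plain `$headers['Last-Modified']` lookup.

```php
<?php
// Hypothetical helper: case-insensitive lookup in the array returned by
// get_headers($url, true). Returns null when the header is absent.
function findHeader(array $headers, string $name): ?string
{
    foreach ($headers as $key => $value) {
        // Index 0 holds the status line, so skip non-string keys.
        if (is_string($key) && strcasecmp($key, $name) === 0) {
            // After redirects the same header can appear several times,
            // in which case get_headers() gives an array of values.
            return is_array($value) ? (string) end($value) : (string) $value;
        }
    }
    return null;
}

// Usage (needs network access; example.com is a placeholder):
// $headers = get_headers('https://example.com/', true);
// if ($headers !== false) {
//     $lastModified = findHeader($headers, 'Last-Modified');
//     var_dump($lastModified); // null means the server did not send it
// }
```

If this returns null, the server simply does not send the header and another approach is needed.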

Last-Modified is not always present. Browsers send Last-Modified header only when you have set it before for the given URL. If you haven’t sent it before in your earlier response then the header is simply not there.

Do you mean I have to set Last-Modified manually as a header of the page so that I can get it with get_headers from another URL?

Actually, I want to detect whether the text of a page has changed. As Last-Modified is not present, I thought of this trick: fetch the page with file_get_contents, run strip_tags to get the text only, then md5 it. Whenever the md5 is different, the page has changed. Is there a better way to detect it?
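
The trick described above could be sketched roughly like this. The function name textFingerprint and the whitespace normalisation step are my own additions, not something from the thread; collapsing whitespace is there so that pure reformatting of the markup does not count as a change.

```php
<?php
// Hypothetical helper: reduce a page to its text and hash it.
function textFingerprint(string $html): string
{
    $text = strip_tags($html);                         // drop the markup
    $text = preg_replace('/\s+/', ' ', $text) ?? $text; // collapse whitespace
    return md5(trim($text));
}

// Usage (needs network access; the URL and the stored hash are placeholders):
// $html = file_get_contents('https://example.com/page.html');
// if ($html !== false && textFingerprint($html) !== $previouslyStoredHash) {
//     // the text has changed since the last check
// }
```

The previous hash has to be stored somewhere between checks (a file or a database row would do).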

Manually or not, you have to send it first, and not from another URL but from the same URL. Actually, I was wrong in my previous answer: Last-Modified is a response header, not a request header! Browsers never send it, so a script on the server will never find it among the incoming request headers. (get_headers works the other way round: it requests a URL and returns the headers that server responded with, so Last-Modified shows up there only if that server sends it.)

But browsers will send the If-Modified-Since header once you have sent a Last-Modified header in an earlier response.

Suppose you have a pdf file that you want to send but only when it’s not in the browser’s cache. The scenario is like this:

  1. A user (via a browser) makes a request for http://example.com/mydoc.pdf. Because it’s the first request for this document, the browser doesn’t send If-Modified-Since and you won’t find it among the request headers (in PHP, $_SERVER['HTTP_IF_MODIFIED_SINCE'] is not set). Therefore, you must send the whole file in your response, and in this response you also send the Last-Modified header.

  2. A few minutes later the user makes another request for http://example.com/mydoc.pdf. This time the browser sends If-Modified-Since with the time from your previous Last-Modified header. You read it from the request headers (in PHP, from $_SERVER['HTTP_IF_MODIFIED_SINCE']; get_headers is for fetching the response headers of another URL) and compare the time with the actual modification time. If the document has changed since that time, you go back to point 1 and send it again in whole. If it hasn’t changed, you only send an empty 304 response and the browser will load the document from its cache.
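
The two steps above can be sketched on the server side as follows. This is only an illustration under some assumptions: the function names isNotModified and serveFile are hypothetical, the file is assumed to be a PDF on local disk, and the incoming header is read from `$_SERVER['HTTP_IF_MODIFIED_SINCE']`.

```php
<?php
// True when the client's cached copy (dated by If-Modified-Since)
// is at least as new as the document, i.e. a 304 is appropriate.
function isNotModified(?string $ifModifiedSince, int $lastModified): bool
{
    if ($ifModifiedSince === null || $ifModifiedSince === '') {
        return false; // first request: no conditional header was sent
    }
    $since = strtotime($ifModifiedSince);
    return $since !== false && $lastModified <= $since;
}

// Hypothetical handler for a static file; defined here but not called.
function serveFile(string $path): void
{
    $mtime = filemtime($path);
    if ($mtime === false) {
        http_response_code(404);
        return;
    }
    // Always announce the modification time so the browser can echo it back.
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s', $mtime) . ' GMT');
    if (isNotModified($_SERVER['HTTP_IF_MODIFIED_SINCE'] ?? null, $mtime)) {
        http_response_code(304); // empty body, the browser uses its cache
        return;
    }
    header('Content-Type: application/pdf');
    readfile($path);
}
```

Keeping the comparison in its own function makes it easy to check without a web server.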

What about my latest reply above?

You have two options:

  1. If you keep the last modified time of a page somewhere (e.g. in a database) then you can use it with Last-Modified header.

  2. You can use an md5 hash of the content in the ETag header - it’s used analogously to Last-Modified. This is also a good solution.
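
A rough sketch of the second option, assuming you control the page being served. The function names makeEtag and etagMatches are hypothetical; the browser sends the tag back in the If-None-Match request header, which PHP exposes as `$_SERVER['HTTP_IF_NONE_MATCH']`.

```php
<?php
// Build an ETag value from the content; ETag values are quoted strings.
function makeEtag(string $content): string
{
    return '"' . md5($content) . '"';
}

// Check the client's If-None-Match value against the current ETag.
// The header may hold several comma-separated tags, a weak "W/" prefix, or "*".
function etagMatches(?string $ifNoneMatch, string $etag): bool
{
    if ($ifNoneMatch === null) {
        return false;
    }
    foreach (explode(',', $ifNoneMatch) as $candidate) {
        $candidate = trim($candidate);
        if (strncmp($candidate, 'W/', 2) === 0) {
            $candidate = substr($candidate, 2);
        }
        if ($candidate === '*' || $candidate === $etag) {
            return true;
        }
    }
    return false;
}

// Usage inside a page script ($body is the generated output, a placeholder):
// $etag = makeEtag($body);
// header('ETag: ' . $etag);
// if (etagMatches($_SERVER['HTTP_IF_NONE_MATCH'] ?? null, $etag)) {
//     http_response_code(304);
//     exit;
// }
// echo $body;
```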

That URL is not mine, so I cannot customize its ETag or Last-Modified header. That is why I came up with my trick for change detection. Is there any better way than what I described, assuming we cannot customize any of its headers?

Track down what exactly you see as a change. If the website contains any dynamic data, like a clock or a click counter, you will always detect a “change”.
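
One way to deal with that, sketched under assumptions: strip the parts you know to be volatile before hashing. The function name stableFingerprint and the patterns in the usage example (a `<span id="clock">` element and a "Visits: N" counter) are invented for illustration; the real patterns depend entirely on the page being watched.

```php
<?php
// Hypothetical helper: remove known-volatile fragments, then hash the text.
// $volatilePatterns is a list of PCRE patterns matching the dynamic parts.
function stableFingerprint(string $html, array $volatilePatterns): string
{
    $html = preg_replace($volatilePatterns, '', $html) ?? $html;
    $text = strip_tags($html);
    $text = preg_replace('/\s+/', ' ', $text) ?? $text;
    return md5(trim($text));
}

// Usage with made-up patterns for a clock and a visit counter:
// $patterns = ['/<span id="clock">.*?<\/span>/s', '/Visits:\s*\d+/'];
// $hash = stableFingerprint($html, $patterns);
```

Regular expressions are fragile against markup changes, so this is only a starting point.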

And even if the server delivers the “last-modified” header, it’s worth checking whether it decides to always output “now” if the page contains any dynamic data. I once spent some time writing code to connect to a web page and check the last-modified date, and if it was later than my locally-stored date I’d download and index the page. Turned out, though, that because the page showed current date and time on the top, the server would notice this as a dynamic element and set last-modified to that date/time. Hence every time I checked, it looked as if it had changed.

So, analyse the content you’re interested in rather than assuming the headers tell you what you need.
