Saved webpages unreadable

UpstateLeafPeeper · November 24, 2018, 10:41pm

Whenever I find something interesting online, I save it as a complete HTML webpage so I have it for reference later on.

Today I went to open one of these pages, and while it opened, it was basically blank. (The webpage was from an online newspaper, and all that I saw was a basic header and footer.)

There is both a .html file and a folder for the article full of various files, so why can't I read the page I saved?

coothead · November 24, 2018, 11:06pm

That is an impossible question to answer without the benefit
of having the files that are causing your problems to hand.

coothead

UpstateLeafPeeper · November 24, 2018, 11:26pm

And what do you propose?

The only thing I can think is that some stupid Javascript thing is disabling the pages when they are not actively connected to the source’s server.

coothead · November 24, 2018, 11:32pm

I propose that you give the members here access
to the files that are the cause of your grief.

Or would you rather that they ran around like headless
chickens trying to guess a possible cause?

coothead

UpstateLeafPeeper · November 24, 2018, 11:46pm

How do I attach all of those files?

Would it suffice to post a link to imgur and at least show you what I see?

John_Betong · November 24, 2018, 11:55pm

@UpstateLeafPeeper
“Whenever I find something interesting online, I save the webpage as complete HTML webpage so I have it for reference later on.”

I prefer copying and pasting the link which can be opened at a later date.

This is especially good if the web page has comments which would show the latest updates.

Edit:
Web pages can also be saved as PDF files which may be useful.

Try searching for other methods and utilities because a lot depends on which browser is being used to save the files.

coothead · November 24, 2018, 11:56pm

Have you not considered putting them all in a
zip file and then attaching it to your post?

coothead

Mittineague · November 24, 2018, 11:59pm

My guess is the browser is saving the HTML source and the browser is displaying the computed HTML.

If you visit the page URL with JavaScript disabled, does it look like the local “save as” version?

UpstateLeafPeeper · November 25, 2018, 12:21am

Except most online newspapers pull articles after a very short period of time, which is why I save things - so I have them forever!

That is material for another thread… I used to be able to PDF things just fine, but there is so much Javascript crap online that most webpages I try to PDF turn out to be gobbledygook.

Just try to save a webpage on this website!

UpstateLeafPeeper · November 25, 2018, 12:21am

I didn’t know you could attach things on SitePoint.

UpstateLeafPeeper · November 25, 2018, 12:25am

I’ve got several issues going on here with the Internet going farther down the toilet each day.

In my OP, the issue is that when I save a webpage, there is a .html file and an HTML folder on my hard drive. Then, later on - maybe a year later - when I double-click on the .html file, I either get a blank page, or I see the webpage load for a second and then get a white page.

After supper let me see if I can post some examples.

To think I have been saving things for the last decade, and apparently in the last year or two something changed online, and now what I thought was saved is total gibberish!

UpstateLeafPeeper · November 25, 2018, 1:06am

Here is an article I just read online, and saved to my computer…

The Website That Shows How a Free Press Can Die - The New York Times.html (1.9 MB)

When I went back to the saved file and opened it, I got a New York Times heading and “Page Not Found” in the center of the page.

How can a saved, offline page not be found?

By the way, the .html file and associated folder were about 2MB, and when I zipped them it became 8MB, so I can’t upload the .zip, which is probably what you need.

UpstateLeafPeeper · November 25, 2018, 1:25am

Here is what I see… https://i.imgur.com/5ToWi6K.png

Mittineague · November 25, 2018, 2:24am

I’m sure the issue for that site is that it’s The New York Times.

There are at least a few things in place used to protect content that’s under copyright protection.

Do you have the same problem saving pages that aren’t as well protected?

Do you have a subscription / API for the NYT site?

UpstateLeafPeeper · November 25, 2018, 2:37am

When I “save” a webpage, am I just saving a bunch of links instead?

It’s about 50/50.

Yes, I have an annual PAID subscription.
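
On the question of whether a saved page is mostly links: one rough way to check is to list which resources the saved .html file still loads from a remote server rather than from its companion folder. The following is only a minimal sketch using Python's standard library. The names `RemoteRefFinder` and `find_refs` and the sample markup are hypothetical, and the sketch only looks at `script`, `link`, `img`, and `iframe` tags, so it will not catch everything a real page requests.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class RemoteRefFinder(HTMLParser):
    """Sort src/href values into remote URLs and local (saved) paths."""

    TAGS = ("script", "link", "img", "iframe")

    def __init__(self):
        super().__init__()
        self.remote = []  # (tag, url) pairs that still need the network
        self.local = []   # (tag, url) pairs that point at saved files

    def handle_starttag(self, tag, attrs):
        if tag not in self.TAGS:
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                # "//host/path" is a protocol-relative URL, so it is remote too.
                is_remote = (value.startswith("//")
                             or urlparse(value).scheme in ("http", "https"))
                (self.remote if is_remote else self.local).append((tag, value))


def find_refs(html_text):
    """Return (remote, local) reference lists for one HTML document."""
    finder = RemoteRefFinder()
    finder.feed(html_text)
    finder.close()
    return finder.remote, finder.local


if __name__ == "__main__":
    # Hypothetical stand-in for the contents of a saved page.
    sample = (
        '<html><head><script src="https://example.com/app.js"></script>'
        '<link href="page_files/style.css" rel="stylesheet"></head>'
        '<body><img src="page_files/photo.jpg"></body></html>'
    )
    remote, local = find_refs(sample)
    print("remote:", remote)
    print("local:", local)
```

If the remote list contains scripts, the saved copy may still depend on the original server to build or keep the page content, which would fit the blank-page behavior described above.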

Mittineague · November 25, 2018, 3:11am

That would be consistent with my “source vs. computed” idea. That is, a lot of sites are “JavaScript required” instead of “JavaScript enhanced”: the browser has JavaScript enabled but the “save as” doesn’t.

As for revisiting older NYT articles, I could find a policy that seems to be about “using” them. Though I wouldn’t consider saving a local copy as “using” it, maybe they do? In any case it wouldn’t hurt to ask. Maybe there is some sort of “bookmarked favorites” feature.

UpstateLeafPeeper · November 25, 2018, 3:18am

Can you explain that a little more?

Well, regardless of what the NYT thinks, it is “Fair Use” for me to save local copies for my own personal consumption again in the future.

I have some more questions around my disappearing files…

Will it confuse the matter if I ask them here or should I start another thread?

Mittineague · November 25, 2018, 3:27am

Did you try what I posted earlier?

You could do that to confirm, but basically there would be a “no JavaScript” page, and a “with JavaScript” page. You could also compare how “view-source” looks with how the dev tool’s page DOM looks.
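
The “view-source” versus DOM comparison can also be approximated offline. The sketch below is only an illustration under assumptions: it pulls out the text that sits in the raw HTML source outside `script`, `style`, and `title` elements, which is roughly what is left when no JavaScript runs. The names `SourceText` and `source_text` and both sample strings are hypothetical. If a saved file yields almost no text, the article body was probably built by JavaScript and never existed in the source.

```python
from html.parser import HTMLParser


class SourceText(HTMLParser):
    """Gather text found outside <script>, <style>, and <title> elements."""

    SKIP = {"script", "style", "title"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # greater than 0 while inside a skipped element
        self.chunks = []  # non-empty text runs, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


def source_text(html_text):
    """Return the text present in the raw source, joined by spaces."""
    parser = SourceText()
    parser.feed(html_text)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    # A "JavaScript required" page: the source is an empty shell.
    shell = ('<html><head><style>p{color:red}</style></head><body>'
             '<div id="app"></div><script>render("article")</script>'
             '</body></html>')
    # A "JavaScript enhanced" page: the text is already in the source.
    full = ('<html><body><h1>Headline</h1><p>First paragraph.</p>'
            '<script>track()</script></body></html>')
    print(len(source_text(shell)), "characters of text in the shell page")
    print(len(source_text(full)), "characters of text in the full page")
```

Running `source_text` on the contents of a saved .html file gives a quick hint as to which of the two kinds of page it is.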

UpstateLeafPeeper · November 25, 2018, 3:57am

This is getting complicated…

So I turned off Javascript in Firefox and when I went back to the original article on the NYT, it loaded for a second and then I got a blank page. Then I searched around and found some site claiming to have a non-JS version, although it looked like some foreign site. When I saved that page with Javascript still turned off, it seems like I can read the page as I would expect. I was also able to create a decent PDF of it.

Are you saying that many modern websites are set up so that if you don’t have active Javascript enabled that the page breaks, including saved copies?

If so, is this by design? Or is it just horrible web design?

You know another thing I just discovered is this…

So I have saved all of these news articles, and in my folder I see the .html file and the corresponding folder. But when I double-click on the .html file to re-read an article I saved weeks ago, the .html file suddenly disappears and all I am left with is the html folder, which renders the webpage unviewable since there is no longer an .html file to click on?!

Is this another conspiracy by media outlets? Or is it more sloppy programming?

Most importantly, how can I easily save web pages like I did in the past?

The reason I save stuff is for reference, and hoping that a webpage will still be around in a week, month or year is dreaming. Plus I want a way to easily access a web page offline when I need it. (For instance, tonight I was trying to read a webpage I saved about configuring Apache, and sadly it fell victim to the issues above, and so what was once a great reference to help me out is now apparently lost forever.)

Like I said, I have tried PDFing webpages from the get-go, but often there is so much Javascript nonsense going on - sorta like on this website - that you can never get a legible web page when you PDF things. (5-10 years ago I just relied on the “print version” and would save that and/or PDF it and things were golden.)

Now it is like companies don’t want anything to be permanent. You read it once and it is gone forever… Maybe we should move back into the cave and just tell stories by the firelight and hope we can remember things to tell our children?!

John_Betong · November 25, 2018, 4:06am

Here is what I see…

https://johns-jokes.com/downloads/sp-d/johnyboy-curl-test/crazycatcoder.php

Source:

Generated from:

https://johns-jokes.com/downloads/sp-d/johnyboy-curl-test/

SP Forum Topic:

https://www.sitepoint.com/community/t/internal-error-on-target-site-using-curl-but-not-browser/102574