How can I capture an entire news article, including the comments, when the developers used stupid JavaScript to dynamically display only part of the content at a time?
I was hoping to use SnagIt or PDF to capture this article and the comments, but can’t get it to work because of stupid JavaScript!!
Programmatically, you don’t. Or, rather, you shouldn’t. I assume you are trying to “screen-scrape” someone else’s site for news content, plus comments. It’s a question of ethics. Are you cool with someone else screen-scraping your site, for their use?
If I’m wrong about the screen-scraping, I apologise. In which context are you referring?
I’ll leave it to you to determine the ethics of what you want to do.
But the practicality is that you need a screen scraper which supports JavaScript – one that executes the scripts and builds the full DOM, which you can then save externally.
I started building a WordPress plugin for a client that does just this (capture a page to PDF). The client wanted their team-member pages to be downloadable as PDFs, but it would work for any URL. And since it renders the PDF using WebKit, you shouldn’t have a problem grabbing Disqus (and similar) comments. In fact, the reason I went with http://wkhtmltopdf.org/ as the engine is that their site was JS-heavy.
You don’t need WordPress, but you do need a PHP server with X settings. This was 6+ months ago so I forget what X is, but you can read about it here: http://wkhtmltopdf.org/
Just replace the above variables as follows:
plugin_dir_path(__FILE__) - this just points to that wkhtmltopdf-i386 file (note the double underscores in __FILE__)
$header['flag'] - These are the header flags to send to the renderer. Look at my script to get an idea of what the header should be. It’s pretty gross, so let me know if you run into problems.
$templateFilename.''.$pdfFilename - This is just the path where you want the file saved
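If you don’t want the whole PHP/WordPress wrapper, here’s a rough sketch in Python of driving wkhtmltopdf directly from the command line. The flags shown (`--javascript-delay`, `--no-stop-slow-scripts`) are the ones that matter for JS-heavy pages like Disqus threads; the URL and filenames are placeholders.

```python
import shutil
import subprocess

def build_wkhtmltopdf_cmd(url, output_pdf, binary="wkhtmltopdf"):
    """Build the command line for rendering a URL to PDF.

    --javascript-delay gives JS-heavy pages (Disqus comments etc.)
    time to finish loading before the snapshot is taken.
    """
    return [
        binary,
        "--javascript-delay", "5000",   # wait 5s for scripts to run
        "--no-stop-slow-scripts",       # don't abort long-running JS
        url,
        output_pdf,
    ]

cmd = build_wkhtmltopdf_cmd("http://example.com/article", "article.pdf")

# Only invoke the renderer if the binary is actually installed.
if shutil.which(cmd[0]):
    subprocess.run(cmd, check=True)
else:
    print("wkhtmltopdf not found; get it from http://wkhtmltopdf.org/")
```

You may need to bump the delay (or use `--window-status` with a page that sets one) if the comments still come out empty.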
It’s been a while since I’ve used it, and you might need to make a local copy of the target HTML first. That can be done with curl.
EDIT
Oh, and you may be able to export to an image too (is this what you wanted with SnagIt?). You might need another library to convert the PDF to PNG, I forget.
I want to read that article but don’t have time now, plus I find all of this responsive, smart-phone design to be crap!
So like I have done for decades, I was hoping to PDF - or use SnagIt to make a .png - of the entire article plus all of the comments so I can read it later at my leisure.
Since the article is entitled: “Computer Security Expert Bruce Schneier Is Here to Answer Questions”, it sorta follows that reading the answers is sort of important!!
I don’t understand why most web developers have turned into such JERKS?!
Every web page I build can easily be printed out, or screen-captured or PDF’ed.
(Why publish content if it is hard to read, print out and share?)
And since all of this falls under FAIR USE, it doesn’t make me a “scraper” or “pirate” or “thief”…
I used to have a bookmark to a website that stripped out all of the ads and stuff so you could just read the content all of one page, but I have long since lost the link.
Thanks for trying to help, but you totally lost me in what I need to do - other than lots of programming!
SnagIt usually does a pretty good job capturing things; however, it sucks at Disqus comments or whatever they are called. (The Mac version is also somewhat lame compared to the Windows version.)
You would think what I want to do is extremely common…
Maybe it is a generational thing, but when I read articles online, I prefer them to be text-only and like a book. 15 years ago, I used to PDF everything on my PC and then read it as PDFs.
And I often store useful articles for later reference, and a PDF is much easier to access than an HTML file.
[quote=“mikey_w, post:8, topic:115352”]
Now that I have been downgraded from “criminal”
[/quote]If I had a nickel for every post I’ve read by a user who asks questions like this so that they can put stuff on their own site, I’d be driving a Lamborghini. I wasn’t accusing you. Like my responses to them, I was merely pointing out that it can be considered unethical. (I also apologised if I was not correct about the screen-scraping, and asked for clarification.)
[quote=“mikey_w, post:8, topic:115352”]
I find all of this responsive, smart-phone design to be crap!
[/quote]No arguments, there. I agree; however, I can also see why it has progressed to this point. Mostly because it’s easier to update one codeset than it is to update two or more. (Remember when mobile phone browsers would be redirected to “http://m.domainname.com”? Responsive design eliminates that.)
It’s also (generally) more secure. Ever try to figure out how Google does something? I’ve never been able to.
[quote=“mikey_w, post:8, topic:115352”]
So like I have done for decades, I was hoping to PDF - or use SnagIt to make a .png - of the entire article plus all of the comments so I can read it later at my leisure.
[/quote]So, just “printing” the page to PDF doesn’t work?
[quote=“mikey_w, post:8, topic:115352”]
Since the article is entitled: “Computer Security Expert Bruce Schneier Is Here to Answer Questions”, it sorta follows that reading the answers is sort of important!!
[/quote]Schneier and Gibson both have my utmost respect. I don’t think anyone has done more in the name of internet security and privacy than those two. So, yeah, critical reading for developers.
Up until about 5 years ago, sure, that worked fine.
But now the article doesn’t start until page 4, and all of the hyperlinks for the Disqus crap are expanded out into full URLs, and it looks like a disaster.
I definitely misunderstood you lol. Give Pocket a try.
All you do is click the pocket button when you’re on a page and it saves it for later in straight-up text (like an ebook). It’s seriously just one click, I use it all the time. You can even tag the Pockets to sift through later.
It doesn’t do comments, but it provides a link at the bottom of the article to the original page.
I was able to remember the add-on I had before… Readability.
Both work okay, but if there is a way I could capture comments like those on the article in my OP, that would be best…
I find software like Disqus so fricking frustrating!!!
Half of the time I am as interested in people’s comments as in the original article, and who wants to keep clicking “Load More” to see the next block of 10 comments? Plus, when you try to go back or move farther forward, things fall apart.
One of the stupidest creations on the Internet!!!
Anyways, I would like to believe I am not the only frustrated reader online, and that someone has created an add-on to fix this nonsense.
I have been Googling the topic, but haven’t made any progress.
Yeah I dislike JS comment plugins as much as you do. Though for me it’s not so much about saving and printing, as it is about progressive enhancement.
Sites use these systems for optimization – it makes the page more cacheable, and produces faster initial rendering. That’s the reason, but it doesn’t cut much ice with me; it’s like removing half the cargo from a truck so that it uses less fuel.
The problem with making an add-on for this is that it would have to support loads of different comment systems. In theory, it shouldn’t be too hard to make one that automatically expands a Disqus thread so you can view the whole thing in one go. But then it would have to support Facebook comments, and all the other systems that different sites use. Not to mention that lots of sites use custom solutions.
Making a single add-on to cater for all of that would be quite a challenge, and maybe that’s why there isn’t one (afaik).
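The core logic of the Disqus case is at least simple: keep triggering the “Load More” control until it stops appearing. Here’s a language-agnostic sketch in Python, where `click_load_more` is a stand-in for whatever a real browser driver (Selenium, say) would do to find and click the button:

```python
def expand_all_comments(click_load_more, max_clicks=200):
    """Repeatedly trigger a 'Load More' control until it disappears.

    click_load_more() is a stand-in for a real browser action
    (e.g. locating and clicking the button via Selenium). It should
    return True if a button was found and clicked, False otherwise.
    max_clicks guards against pages that never stop loading.
    """
    clicks = 0
    while clicks < max_clicks and click_load_more():
        clicks += 1
    return clicks

# Simulate a comment thread that needs 7 "Load More" clicks.
remaining = [True] * 7
fake_click = lambda: remaining.pop() if remaining else False
print(expand_all_comments(fake_click))  # prints 7
```

Once the thread is fully expanded, the print-to-PDF or capture step finally sees all the comments, which is the part the per-site button selectors would have to handle individually.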
There’s a nice bookmarklet called Printliminator. It adds a button to your browser toolbar. Click it, and you can delete any items on the page you don’t want, like ads, sidebars, headers and footers. Then print to PDF, and you get a nice document with just the content you want. I use it all the time.