I am working on a project nominating government URLs for harvesting and web crawling by the EOT (End of Term) project, and flagging for special handling any URLs whose content or links make crawling impossible.
We have been told that links and images that are embedded in javascript need to be flagged as “interactive”, and wild horses cannot drag out more specific information.
The End of Term project is run in part by archive.org, and neither group provides more specific information or is willing to answer questions.
I have some programming background and know old-style HTML, but know neither javascript nor CSS. Often one can easily tell the two apart, but not always.
It matters because the Wayback Machine contains federal government web pages from the 2012 End of Term project that contain interactive CSS features, and these were harvested and display properly. “Elevators”, for instance, where you click on a topic and text appears underneath it.
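To make the “elevator” idea concrete, here is a minimal sketch of one way a click-to-expand feature can be built with CSS alone. The class names, the id `topic-1`, and the text are hypothetical, and I do not know whether the 2012 pages actually used this technique.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>CSS-only elevator (hypothetical example)</title>
<style>
  /* Hide the checkbox itself; the label acts as the clickable topic. */
  .elevator-toggle { position: absolute; opacity: 0; }
  /* The panel stays hidden until the checkbox is checked. */
  .elevator-panel { display: none; }
  .elevator-toggle:checked ~ .elevator-panel { display: block; }
</style>
</head>
<body>
  <div class="elevator">
    <input type="checkbox" id="topic-1" class="elevator-toggle">
    <!-- for="topic-1" ties this label to the checkbox with that id -->
    <label for="topic-1">Topic heading (click me)</label>
    <div class="elevator-panel">Text that appears underneath the topic.</div>
  </div>
</body>
</html>
```

Nothing here runs javascript: the text is already present in the HTML, and the stylesheet only decides whether it is shown.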
I am especially having trouble knowing what to make of “ID=” appearing in an HTML tag.
In the following line, as far as I can tell, “ID=” calls a javascript function from the javascript file loaded at the top of the page. <div class="form-message" id="url-warning">This URL has already been archived in the last 30 days</div> The line invisibly queries the home database on whether the page has been archived in the past 30 days and, if it has, prints the message in a specified format. (This is from the NominationTool that we install from the Google Chrome store.)
However, most of the time when I can eventually make out what it does, “ID=” is part of CSS.
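For reference, here is a minimal sketch of how one `id` can serve both roles at once. In standard HTML, `id` is just a unique name for an element: a stylesheet can select it with `#name`, and a script can look it up with `document.getElementById`. The id `url-warning` comes from the line quoted earlier; the styling, the `showWarning` function, and the `check-button` button are hypothetical and are not taken from the NominationTool.

```html
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>id used by CSS and by javascript (hypothetical example)</title>
<style>
  /* CSS use: "#url-warning" selects the one element with id="url-warning". */
  #url-warning { display: none; color: #a00; }
</style>
</head>
<body>
  <div class="form-message" id="url-warning">This URL has already been archived in the last 30 days</div>
  <button type="button" id="check-button">Check URL</button>
  <script>
    // Javascript use: look the element up by its id, then change it.
    // showWarning is a made-up name; no database is queried in this sketch.
    function showWarning() {
      var warning = document.getElementById("url-warning");
      warning.style.display = "block";
    }
    document.getElementById("check-button").addEventListener("click", showWarning);
  </script>
</body>
</html>
```

The `id` attribute does nothing by itself. Searching the page's stylesheets for `#url-warning` and its scripts for `url-warning` shows which of the two is using it.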
Here is a mess from a Google search page where I simply have no idea. This could not be a better example of taking 200 lines to do with CSS what could have been done in five lines in HTML. (Don’t even get me started on the why behind this nonsense.) It violates every standard of efficient use of time and space I learned in computer science.
I want to know what “ID=” means.
I know that the page itself is interactively generated and could not be crawled, even if it made sense to crawl a search engine search results page. I want to know how to tell what “ID=” means.
I know you like URLs but it wouldn’t let me post with them here!
I’m looking at a specific url:
Ars Technica
Op-ed: Mark Zuckerberg’s manifesto is a political trainwreck
Ars Technica - 2 hours ago
Enlarge / Can Facebook’s AIs travel back in time to help with this boiler explosion? Probably. Eventually. Courtesy of De Forest Douglas Diver Railroad Photographs, ca.
An image thumbnail loads to the left.
Here is the code for this. I did some reformatting and marked key elements in bold and/or italics. The bold and italics didn’t survive copy and paste, so here is the main tag I’m interested in.
<a target="_blank" class="article usg-AFQjCNGSvjd0HMNm1Sdq8BdquP_OY5jmMA sig2-iRkiaotguLcWD_dRcONzqA did-1028502305395840773 esc-thumbnail-link" href="https://arstechnica.com/staff/2017/02/op-ed-mark-zuckerbergs-manifesto-is-a-political-trainwreck/" url="https://arstechnica.com/staff/2017/02/op-ed-mark-zuckerbergs-manifesto-is-a-political-trainwreck/" id="MAA4BUgBUABgAWoCdXM" ssid="tc" style="visibility: visible;">
I want to know what “ID=” and “SSID=” mean.
It’s in a big block of stuff that isn’t posting, and neither did the a target tag until I took the < out from in front of it.
<div class="media-strip"></div></div>
</td></tr></tbody></table>
</div></div></div></div></div>
<div class="esc-separator"></div><div class="blended-wrapper esc-wrapper">
<div cid="52779385955753" class="story anchorman-blended-story esc esc-has-thumbnail " id=":4d">
<div class="esc-inner esc-collapsed">
<div class="esc-body">
<div class="goog-inline-block jfk-button jfk-button-standard esc-toggle-button" role="button" style="user-select: none;" tabindex="0">
<div class="jfk-button-img icon esc-toggle-icon"></div></div><div class="esc-default-layout-wrapper esc-expandable-wrapper">
<table class="esc-layout-table" cellspacing="0" cellpadding="0"><tbody>
<tr><td class="esc-layout-thumbnail-cell"><div class="esc-thumbnail-wrapper">
<div class="esc-thumbnail-state"><div class="esc-thumbnail esc-thumbnail-hidden" title="Ars Technica">
<a target="_blank" class="article usg-AFQjCNGSvjd0HMNm1Sdq8BdquP_OY5jmMA sig2-iRkiaotguLcWD_dRcONzqA did-1028502305395840773 esc-thumbnail-link" href="https://arstechnica.com/staff/2017/02/op-ed-mark-zuckerbergs-manifesto-is-a-political-trainwreck/" url="https://arstechnica.com/staff/2017/02/op-ed-mark-zuckerbergs-manifesto-is-a-political-trainwreck/" id="MAA4BUgBUABgAWoCdXM" ssid="tc" style="visibility: visible;">
<div class="esc-thumbnail-image-wrapper " style="">
<img class="esc-thumbnail-image late-tbn" imgsrc="//t1.gstatic.com/images?q=tbn:ANd9GcTKSzGLwUWhvbRlGraJZQNUkTEDeCJE1RJEvMR1pKniEWH2QqFi8ImbKf5b8NLz9rgu95W2MTFL-zc" style="width: 100%; visibility: visible;" alt="" src="./Google News_files/images(29)"></div><div class="esc-thumbnail-image-source-wrapper">
<label class="esc-thumbnail-image-source">Ars Technica</label></div></a>
</div></div></div><a class="goog-inline-block jfk-button jfk-button-action esc-fullcoverage-button" href="https://news.google.com/news/rtc?ncl=dBRfKw2X5RQ7voMWS-2KPTle01gLM&authuser=0&topic=tc" title="Click to see realtime coverage of this story" style="2" value="/news/rtc?ncl=dBRfKw2X5RQ7voMWS-2KPTle01gLM&authuser=0&topic=tc">See realtime coverage</a></td><td class="esc-layout-article-cell">
<div class="esc-lead-article-title-wrapper">
<h2 class="esc-lead-article-title">
<a target="_blank" class="article usg-AFQjCNGSvjd0HMNm1Sdq8BdquP_OY5jmMA sig2-iRkiaotguLcWD_dRcONzqA did-1028502305395840773" href="https://arstechnica.com/staff/2017/02/op-ed-mark-zuckerbergs-manifesto-is-a-political-trainwreck/" url="https://arstechnica.com/staff/2017/02/op-ed-mark-zuckerbergs-manifesto-is-a-political-trainwreck/" id="MAA4BUgBUABgAWoCdXM" ssid="tc"><span class="titletext">Op-ed: Mark Zuckerberg's manifesto is a political trainwreck</span></a>
</h2></div>
<div class="esc-lead-article-source-wrapper">
<table class="al-attribution single-line-height" cellspacing="0" cellpadding="0">
<tbody><tr><td class="al-attribution-cell source-cell"><span class="al-attribution-source">Ars Technica</span></td><td class="al-attribution-cell timestamp-cell"><span class="dash-separator"> - </span><span class="al-attribution-timestamp">58 minutes ago</span></td><td class="al-attribution-separator-cell separator-before-share-bar"><div class="separator"></div></td><td class="al-attribution-cell sharebar-cell"><table id="52779385955753-sharebar" class="share-bar-table yesscript" cellspacing="0" cellpadding="0"><tbody><tr><td class="share-bar-cell sharebox-cell"><div class="share-button-wrapper" buttontype="share" sharetype="s-gplus" title="Share on Google+"><div class="share-button-state"><div class="icon-fc gplus-share-icon share-button"></div></div></div></td><td class="share-bar-cell"><div class="share-button-wrapper" buttontype="share" sharetype="s-twitter" title="Share on Twitter"><div class="share-button-state"><div class="icon-fc share-icon-twitter share-button"></div></div></div></td><td class="share-bar-cell"><div class="share-button-wrapper" buttontype="share" sharetype="s-fb" title="Share on Facebook"><div class="share-button-state"><div class="icon share-icon-facebook2 share-button"></div></div></div></td><td class="share-bar-cell"><div class="share-button-wrapper" buttontype="share" sharetype="s-email" title="Share via Email"><div class="share-button-state"><div class="icon email-icon2 share-button"><a target="_blank" class="mailto-share-link"></a></div></div></div></td></tr></tbody></table></td></tr></tbody>
Thanks!
Yours,
Dora