schwim
June 13, 2016, 12:36am
1
Hi there everyone!
I’m helping a guy who accidentally lost his informational website while hospitalized. It’s a site that categorically stored about 35,000 links pointing to technical sites and pages.
I’ve got the entire site snapshot from the Wayback Machine and have begun writing a script to go through the files, but I’m stuck at the point where I need to extract the links. My main issue is that the original site creator was a minimalist when it came to tags, classes and whatnot, so content I don’t want often looks just like the content I do want. Here’s a representation of what I will be working with:
[code]
Select A Category:
Lubricant Leaking from Rear Hub Seals Full Float Hub w/10.25" FULL-FLOATER Rear Axle TSB 94-19-24 for 86-94 F-250, F-350
[/code]
I need only the links, not the categories. Since he used the same table markup for the categories section, I need to retrieve information only after this cell:
<td width="/100%.jpg" align=left style="font-weight: bold; font-size: 14px; color: #CD3301; font-family: Arial;">Select A Link:</td>
Then once I’m there, I need to do the following with a table row:
[code]
[/code]
The URL inside the a href needs to become my $link, the text between the opening and closing a tags needs to become my $title, and the content on the line after "Source: by " needs to become my $submitter.
My googling is leading me towards DOM parsers but there’s a lot to choose from and I’m afraid to tie my horse to the wrong cart and invest a lot of time learning something that isn’t properly suited to do what I need.
Could someone suggest a class, function, script, method, etc. for me to begin working toward solving this issue? I really would like to be able to help him but I just don’t know in which direction to go and Googling is just offering too many directions.
Thanks for your time!
Select A Link:
Identification Based on VIN, Door Jamb Label, Build Sheet (Ford 999 Report), Paint Color Code, VECI Label, Transmission/Differential Pan & Gasket Sizes/Shapes, etc.; "... made a mistake 15 years ago by telling someone to use the Driver's side label to ID their Rear Differential (axle, pumpkin type, etc.); turned out that a previous owner had swapped a Dana 60 in place of the stock 8.8..."
Source: by miesk5 at Ford Bronco Zone Forums
"...Ford built our Broncos & other 4x4 trucks & vans with a numerically lower front gear ratio in the front Dana 44 than the rear so that off-road steering is enhanced. A Bronco built with 3.55 rear ratio would have a 3.54 ration in the front Dana 44; or; 3.08 in the 8.8 & 3.07 in the Dana 44; or 4.11 in the 8.8 & 4.10 in the Dana 44, etc..."; Following was in my MS WORD Notes and the source, Randy's Ring & Pinion has removed it from their current web site; The gear ratio in the front of a four wheel drive has to be different from the front so the front wheels will pull more. There have been many different ratio combinations used in four-wheel drive vehicles, but not so that the front will pull more. Gear manufactures use different ratios for many different reasons. Some of those reasons are: strength, gear life, noise (or lack of it), geometric constraints, or simply because of the tooling they have available. I have seen Ford use a 3.50 ratio in the rear with a 3.54 in the front, or a 4.11 in the rear with a 4.09 in the front. As long as the front and rear ratios are within 1%, the vehicle works just fine on the road, and can even be as different as 2% for off-road use with no side effects. point difference in ratio is equal to 1%. To find the percentage difference in ratios it is necessary to divide, not subtract. In order to find the difference, divide one ratio by the other and look at the numbers to the right of the decimal point to see how far they vary from 1.00. For example: 3.54 ÷ 3.50 = 1.01, or 1%, not 4% different. And likewise 4.11 ÷ 4.09 = 1.005, or only a 1/2% difference. These differences are about the same as a 1/3" variation in front to rear tire height, which probably happens more often than we realize. A difference in the ratio will damage the transfer case. Any extreme difference in front and rear ratios or front and rear tire height will put undue force on the drive train. 
However, any difference will put strain on all parts of the drivetrain. The forces generated from the difference have to travel through the axle assemblies and the driveshafts to get to the transfer case. These excessive forces can just as easily break a front u-joint or rear spider gear as well as parts in the transfer case.
Source: by miesk5 at Ford Bronco Zone Forums
chorn
June 13, 2016, 6:12am
2
Try SimpleXML first. It gives good results on valid HTML, and you can use XPath for a first round of filtering.
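A minimal sketch of the XPath idea, not code from this thread. The row markup in `$sample` is an assumption (the real rows are not shown here), and the function name `extractLinks` is hypothetical. It also swaps in `DOMDocument::loadHTML()` plus `DOMXPath` for plain SimpleXML, because archived pages are often not well-formed. The XPath `following::` axis keeps only anchors that come after the "Select A Link:" cell, which skips the category links.

```php
<?php
// Hypothetical helper: returns link/title/submitter for every anchor that
// appears after the "Select A Link:" cell. The row layout is assumed, not known.
function extractLinks(string $html): array
{
    // Collect parser warnings instead of printing them; old pages are messy.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // following:: selects nodes later in document order than the marker cell.
    $anchors = $xpath->query(
        '//td[normalize-space(.)="Select A Link:"]/following::a[@href]'
    );

    $links = [];
    foreach ($anchors as $a) {
        // Assumption: the "Source: by ..." text sits in the same cell as the anchor.
        $cellText = $a->parentNode->textContent;
        $submitter = null;
        if (preg_match('/Source:\s*by\s+([^\r\n]+)/', $cellText, $m)) {
            $submitter = trim($m[1]);
        }
        $links[] = [
            'link' => $a->getAttribute('href'),
            'title' => trim($a->textContent),
            'submitter' => $submitter,
        ];
    }
    return $links;
}

// Made-up sample markup for illustration only.
$sample = <<<'HTML'
<table>
<tr><td>Select A Category:</td></tr>
<tr><td><a href="http://example.com/category">Axles</a></td></tr>
<tr><td>Select A Link:</td></tr>
<tr><td><a href="http://example.com/tsb">Rear Hub Seal TSB</a><br>
Source: by miesk5 at Ford Bronco Zone Forums</td></tr>
</table>
HTML;

print_r(extractLinks($sample));
```

If SimpleXML is preferred for the rest of the script, `simplexml_import_dom($doc)` converts the parsed document. The regular expression takes everything after "Source: by " up to the end of that line as the submitter, so it would need adjusting if the real markup splits that text differently.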
oddz
June 13, 2016, 12:58pm
3
In my opinion there is only one option: the Symfony DomCrawler component. When I need to do web scraping, that is what I turn to. Coupled with the CssSelector component, it makes scraping pages very easy.