Webcrawling with SQL


#1

Hello, I am brand new to the community here on SitePoint. I am also fairly new to SQL.
As of right now, I have about 90% of a webcrawler for a database done. The only hiccup I have at the moment is that I don't know how to exclude a URL link a certain way. I want to exclude a link if it does not start with the URL I am crawling.
For example, if I crawl YouTube, I want to exclude any links that might pop up that aren't YouTube related. So if it didn't start with https://www.YouTube.com, I don't want it in the database. I know how to exclude a link if it contains something, but not if it does not contain something.


#2

WHERE link NOT LIKE 'https://www.YouTube.com%'

This will not return any YouTube links.


#3

Thank you so much! That really helped narrow down my restrictions on what links it will crawl!! I just have two more hitches for the spider that I am not quite sure how to handle.

  1. How do I get the spider to compare the keywords found for a link against a table of keywords I want stored?
    For example, my spider crawls YouTube, but I don't want URLs stored for everything they have. I want it to check whether the URL for the video has any keywords and, if it does, grab them and see if at least one matches my keyword table.

  2. Is there an easier way to get the spider to crawl a whole site? Right now it is on a loop:
    While a <= 99 (for example)
    Is there an easier way than setting this up to an extremely high number?
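
On the second question, one common alternative to a fixed counter is to keep a queue of URLs still to visit and loop until it is empty, so the crawl stops on its own once no new links turn up. The sketch below is only an illustration in Python, since the thread does not say what language the spider is written in. The function `crawl_site`, the `fetch_links` callback, and the `FAKE_SITE` data are all hypothetical stand-ins, not part of the original crawler.

```python
from collections import deque


def crawl_site(start_url, fetch_links, prefix):
    """Visit every reachable URL that starts with `prefix`, each exactly once."""
    to_visit = deque([start_url])  # URLs discovered but not yet crawled
    seen = {start_url}             # URLs already queued, so none is crawled twice
    visited = []                   # crawl order, returned for inspection

    while to_visit:                # ends by itself when nothing is left to crawl
        url = to_visit.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Same rule as in the thread: keep only links under the crawled site.
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# Hypothetical in-memory "site" used instead of real HTTP requests.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}

pages = crawl_site(
    "https://example.com/",
    lambda url: FAKE_SITE.get(url, []),
    "https://example.com",
)
print(pages)
```

In a real spider, `fetch_links` would download the page and return the links found on it. The `seen` set is what keeps the loop from running forever on pages that link back to each other.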

Thank you in advance for anything!!!!


#4

Hello, I just saw a couple typos in my original message that might get the wrong answer lol. I just corrected them below. Thank you for any information!!
As of right now, I have about 90% of a webcrawler for a database done. The only hiccup I have at the moment is that I don't know how to exclude a URL link a certain way. I want to exclude a link if it does not start with the URL I am crawling.
For example, if I crawl YouTube, I want to exclude any links that might pop up that aren't YouTube related. So if it didn't start with https://www.YouTube.com, I don't want it in the database. I know how to exclude a link if it contains something, but not if it does not contain something.


#5

WHERE url LIKE 'https://www.YouTube.com%'

will exclude any URLs that do not start with https://www.YouTube.com
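
To see the difference between the two conditions in this thread side by side, here is a small sketch that runs both against an in-memory SQLite database from Python. The table name `links` and the sample rows are made up for illustration, and SQLite is only an assumption since the thread does not name the database in use. Note that whether LIKE is case-sensitive depends on the database and its collation (SQLite's LIKE ignores case for ASCII letters by default).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT)")  # hypothetical table for the demo
conn.executemany(
    "INSERT INTO links (url) VALUES (?)",
    [
        ("https://www.YouTube.com/watch?v=abc",),
        ("https://www.YouTube.com/channel/xyz",),
        ("https://www.example.com/page",),
    ],
)

prefix = "https://www.YouTube.com"

# LIKE 'prefix%' keeps only the URLs that start with the prefix.
kept = [row[0] for row in conn.execute(
    "SELECT url FROM links WHERE url LIKE ? || '%' ORDER BY url", (prefix,)
)]

# NOT LIKE 'prefix%' returns everything else, i.e. the links to throw away.
dropped = [row[0] for row in conn.execute(
    "SELECT url FROM links WHERE url NOT LIKE ? || '%' ORDER BY url", (prefix,)
)]

print(kept)
print(dropped)
```

One caveat: `%` and `_` are wildcards in LIKE patterns, so a prefix that itself contains either character would need escaping.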


#6

Thank you for the clarification.
I am finally in the home stretch on this webcrawler. My last question is about pulling the keywords out of the page's HTML and putting them into my database. This is the code I get from the website for the keywords:

a href="/youtube/bobcat" class="js-pop">Bobcat ,
a href="/youtube/cat" class="js-pop">,
a href="/youtube/catinpool" class="js-pop">,
a href="/youtube/catattack" class="js-pop">

I am just not sure how to pick individual ones to put into the database.
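
One possible approach, sketched here in Python because the thread does not say what the spider is written in, is to parse the anchors, take the last path segment of each `href` as the keyword, and then check those keywords against the keyword table mentioned earlier in the thread. Everything named below is an assumption for illustration: the `KeywordLinkParser` class, the `wanted_keywords` table and its rows, and the sample HTML, which is a well-formed guess at the snippet above (the closing tags and most of the link text are filled in as placeholders).

```python
import sqlite3
from html.parser import HTMLParser


class KeywordLinkParser(HTMLParser):
    """Collect the last path segment of every <a class="js-pop"> href."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        if "js-pop" in (attr.get("class") or "").split():
            href = attr.get("href") or ""
            keyword = href.rstrip("/").rsplit("/", 1)[-1].lower()
            if keyword:
                self.keywords.append(keyword)


# Assumed well-formed version of the snippet; link text is placeholder only.
SAMPLE_HTML = """
<a href="/youtube/bobcat" class="js-pop">Bobcat</a>,
<a href="/youtube/cat" class="js-pop">Cat</a>,
<a href="/youtube/catinpool" class="js-pop">Cat in pool</a>,
<a href="/youtube/catattack" class="js-pop">Cat attack</a>
"""

parser = KeywordLinkParser()
parser.feed(SAMPLE_HTML)
found = parser.keywords

# Hypothetical keyword table holding the keywords worth storing.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wanted_keywords (keyword TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO wanted_keywords VALUES (?)", [("cat",), ("dog",)])

# One "?" per extracted keyword, so the values are passed as bound parameters.
placeholders = ",".join("?" for _ in found)
matches = [row[0] for row in conn.execute(
    f"SELECT keyword FROM wanted_keywords WHERE keyword IN ({placeholders})",
    found,
)]

store_url = bool(matches)  # store the page's URL only if at least one keyword matched
print(found, matches, store_url)
```

Each entry in `found` is one individual keyword, so any of them can be inserted into the database on its own row.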

