Hello, I am brand new to the community here on SitePoint. I am also fairly new to SQL.
As of right now, I have about 90% of a webcrawler for a database done. The only hiccup I have at the moment is that I don’t know how to exclude a URL link a certain way. I want to exclude a link if it does not start with the URL I am crawling.
For example, if I crawl YouTube, I want to exclude any links that might pop up that aren’t YouTube related. So if it didn’t start with https://www.YouTube.com, I don’t want it in the database. I know how to exclude a link if it contains something, but not if it does not contain something.
WHERE link NOT LIKE 'https://www.YouTube.com%'
This will not return any YouTube links.
Thank you so much! That really helped narrow down my restrictions on what links it will crawl!! I just have two more hitches for the spider that I am not quite sure how to do.
- How do I get the spider to compare the keywords found for a link to a table of keywords I want stored?
For example, my spider crawls YouTube, but I don’t want URLs stored for everything they have. I want it to check whether the URL for the video has any keywords and, if it does, grab them and see if at least one matches my keyword table.
- Is there an easier way to get the spider to crawl a whole site? Right now it is on a loop:
While a <= 99 (for example)
Is there an easier way than setting this to an extremely high number?
Thank you in advance for anything!!!
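On the loop question, one common alternative to a fixed counter is to keep a queue of pages still to visit and a set of pages already seen, and stop when the queue is empty. Below is a minimal sketch of that idea in Python. The `crawl` function, the `get_links` callback, and the `FAKE_SITE` dictionary are all hypothetical names made up for this example. A real spider would fetch each page and extract its links where the dictionary lookup is.

```python
from collections import deque


def crawl(start_url, get_links, prefix):
    """Visit every page reachable from start_url whose URL starts with prefix."""
    to_visit = deque([start_url])  # pages discovered but not yet processed
    seen = {start_url}             # every URL ever queued, so nothing repeats
    visited = []                   # order in which pages were processed
    while to_visit:                # ends on its own when no new links remain
        url = to_visit.popleft()
        visited.append(url)
        for link in get_links(url):
            # Only follow links on the site being crawled, and only once.
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited


# Stand-in for real fetching: a tiny hypothetical site as page -> links.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://other.net/x"],
    "https://example.com/a": ["https://example.com/", "https://example.com/b"],
    "https://example.com/b": [],
}

pages = crawl(
    "https://example.com/",
    lambda url: FAKE_SITE.get(url, []),
    "https://example.com/",
)
print(pages)
```

Because the loop runs until the queue is empty, there is no number to guess. On a very large site you would probably still want a cap on pages or depth as a safety limit.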
Hello, I just saw a couple typos in my original message that might get the wrong answer lol. I just corrected them below. Thank you for any information!!
As of right now, I have about 90% of a webcrawler for a database done. The only hiccup I have at the moment is that I don’t know how to exclude a URL link a certain way. I want to exclude a link if it does not start with the URL I am crawling.
For example, if I crawl YouTube, I want to exclude any links that might pop up that aren’t YouTube related. So if it didn’t start with https://www.YouTube.com, I don’t want it in the database. I know how to exclude if a link contains something, but not if it does not contain something.
WHERE url LIKE 'https://www.YouTube.com%'
This will exclude any URLs that do not start with https://www.YouTube.com
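To see the difference between `LIKE` and `NOT LIKE` with a prefix pattern, here is a small sketch using Python's built-in `sqlite3` module with an in-memory database. The table name `links` and the sample rows are assumptions made up for the example. Your own database engine and schema may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE links (url TEXT)")  # hypothetical table for the demo
conn.executemany(
    "INSERT INTO links (url) VALUES (?)",
    [
        ("https://www.YouTube.com/watch?v=abc",),
        ("https://www.YouTube.com/feed",),
        ("https://example.org/page",),
    ],
)

prefix = "https://www.YouTube.com"

# LIKE 'prefix%' keeps only the rows that start with the prefix.
kept = [row[0] for row in conn.execute(
    "SELECT url FROM links WHERE url LIKE ? || '%' ORDER BY url", (prefix,)
)]

# NOT LIKE 'prefix%' returns the opposite set: everything else.
dropped = [row[0] for row in conn.execute(
    "SELECT url FROM links WHERE url NOT LIKE ? || '%' ORDER BY url", (prefix,)
)]

print(kept)
print(dropped)
```

Note that in SQLite, `LIKE` ignores case for ASCII letters by default, so `youtube.com` and `YouTube.com` match the same rows. Other databases may behave differently depending on collation.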
Thank you for the clarification.
I am finally on the home stretch with this webcrawler. My last question is about pulling the keywords out of the page's HTML and putting them into my database. This is the code I get from the website for the keywords:
I am just not sure how to pick individual ones to put into the database.
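The actual markup isn't shown in the post, so the following is only a sketch. It assumes the keywords arrive in a standard `<meta name="keywords" content="...">` tag as one comma-separated string. It splits that string into individual keywords and keeps those that appear in a set of wanted keywords, which stands in for the keyword table. `MetaKeywordParser`, `matching_keywords`, `SAMPLE_HTML`, and `WANTED` are hypothetical names for this example.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects the individual keywords from <meta name="keywords"> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "keywords":
            content = attr.get("content") or ""
            # Split the comma-separated list and normalise each keyword.
            self.keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]


def matching_keywords(html, wanted):
    """Return the page keywords that also appear in the wanted set."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return [k for k in parser.keywords if k in wanted]


# Assumed sample page and keyword list, for illustration only.
SAMPLE_HTML = (
    '<html><head>'
    '<meta name="keywords" content="Music, Guitar Lesson, tutorial">'
    '</head></html>'
)
WANTED = {"guitar lesson", "piano"}  # would come from the keyword table

matches = matching_keywords(SAMPLE_HTML, WANTED)
print(matches)
```

If `matches` is non-empty, the spider could store the URL and insert each matched keyword as its own row. If it is empty, the page can be skipped.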