How to scrape keywords from website's source code with python 3?

How do I build a spider that reads URLs from a CSV file and searches each URL’s source code for a specific keyword or string (like meta name="viewport", for example, to find (non-)responsive websites)?

I’m using Python 3.6.

Spider? You mean a script? Or a spider as in scrapy spider?

A csv file contains lines of text, each line consisting of two or more columns separated by a comma (usually). Normally I’d use a regular expression for this, but here you can get away with plain string splitting (without even using the csv module):

urls = []

with open("path to csv file") as cf:
   contentList = cf.readlines()  # note the parentheses: readlines is a method

for row in contentList:
   columns = row.split(",")  # if the delimiter isn't a "," modify accordingly
   for column in columns:
      column = column.strip()  # drop whitespace and the trailing newline
      if ("https://" in column) or ("www." in column):
         urls.append(column)
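If your csv has quoted fields or an unusual delimiter, the standard library’s csv module handles that for you. Here’s a rough equivalent of the snippet above (using the same placeholder path):

import csv

urls = []

with open("path to csv file", newline="") as cf:
   for row in csv.reader(cf):  # csv.reader deals with quoting and delimiters properly
      for column in row:
         column = column.strip()
         if ("https://" in column) or ("www." in column):
            urls.append(column)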

Now you have a list containing every URL in your csv file. Time to use Beautiful Soup to extract the keywords you want:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen  # urllib2 is Python 2 only; in 3.6 use urllib.request

for url in urls:
   response = urlopen(url)
   page = response.read()
   soup = bs(page, "lxml")

Now you can use Beautiful Soup to extract anything you want from the pages. For example, soup.find_all("meta") will return a list of all the meta tags. The official documentation should help you extract whatever you want in one or two lines of code:

https://www.crummy.com/software/BeautifulSoup/bs4/doc/
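For your viewport example specifically, here’s a minimal sketch of what the inside of that loop could look like; the "responsive" interpretation is just my reading of what a viewport tag implies:

from bs4 import BeautifulSoup as bs
from urllib.request import urlopen

for url in urls:
   soup = bs(urlopen(url).read(), "lxml")
   # look up the exact tag you mentioned: <meta name="viewport" ...>
   viewport = soup.find("meta", attrs={"name": "viewport"})
   if viewport is not None:
      print(url, "has a viewport meta tag (likely responsive):", viewport.get("content"))
   else:
      print(url, "has no viewport meta tag (possibly not responsive)")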

If you don’t have Beautiful Soup or the lxml parser, just install them with pip:

pip install beautifulsoup4
pip install lxml

scrapy is too complicated and inefficient for my taste, so I don’t use it, and I haven’t met a single page that I couldn’t scrape with urllib (or requests) and Beautiful Soup.
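In case you go the requests route (pip install requests), the fetch step is a rough drop-in for the urllib loop above:

import requests
from bs4 import BeautifulSoup as bs

for url in urls:
   response = requests.get(url, timeout=10)  # timeout so one dead site doesn't hang the script
   soup = bs(response.text, "lxml")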

Good luck! And you’re welcome to ask again should you get an error.
