Comparing data from two documents

I have a list of under 100 surnames which I need to compare regularly (at least monthly) with a second list, to see if any of them also appear there. The second list has almost 10,000 entries, and I can obtain the data in various formats - HTML, .csv, Excel, PDF or plain text.

I can search for each name individually, which is tedious and time-consuming. I’m sure there must be a better way, but I’ve no idea what. Any help would be greatly appreciated.

Ideally, I want exact matches only. “Baker” should not trigger matches with “Abubaker”, for example.

What OS?

Ubuntu 19.04.

So lets put it in order-of-annoyance:

.csv
Plain Text
.html
.xls
.pdf

Stage 1: Parsing

This is actually where most of your effort will lie. The actual comparing is a dawdle, but getting the data prepped for comparing.

CSV is designed for parsing, is a structured format.
Plain Text, obviously, is pretty easy to parse out. Not a structured format necessarily, so it may need some tweaking.
.html can be a bollockache if you’re getting a lot of random/different HTML. If its a standardized HTML, it’ll be easier.
.xls is a bit of a pain because Microsoft don’t like their data to be easily read. It may be worth pushing XLS files through excel into CSV.
PDF. Oh lordy. PDF is going to be a pain as well. Probably best to push this one through your reader of choice, and output it as plain text.

Basically, you’re going to want to mangle all your data into its more basic forms: CSV being the preferred input. A script can parse it down into a list.

Stage 2: Comparing Two Lists

Python can do this pretty easily (and i’m probably not simplifying/shorthanding this enough for python experts to be happy): [also i’m spitballing this, so it may need tweaking]

for i, word in enumerate(secondlist):
  for fword in (fword for fword in firstlist if fword == word):
      print ("Match found for {0} on line {1}".format(fword,i))

Output

The benefit of using something command line based like python would be

  1. OS ‘independance’ (since python is not tied to any OS specifically),
  2. Pipability. complists.py list1.csv list2 > todayslist

OK, thanks for that.

Sorry if I wasn’t clear earlier; I can obtain the same data in any of the given formats. I’ve been using the Excel version, but I could just as easily download the CSV version. I can also convert my surname list to CSV without problem.

Unfortunately, my knowledge of Python is practically nil, so I’d need more help with that. Having a solution which is not tied to an OS would be an advantage, although in practical terms, I’m pretty sure nobody but me is ever going to attempt this. (My skills may be limited, but they’re the best we have. redface This is for a non-profit, and we’re all volunteers.)

:slight_smile: Glad to help you always.

If we assume both of your lists are CSV’s…
okay. Let me pause there and ask some sanitization questions.

  1. Are these single-word lists. You mentioned the first list is Surnames, but is the second list also just a list of surnames? Or is it a line of text you’re searching for surnames in?
  2. Assuming 1) was answered with Surnames, is the second list considered ‘sanitized’? (no trailing spaces, no oddball escaping slashes, etc)

First list (the one I produce) is just surnames, all single word, no hyphens, spaces or anything else.

The second document is multi-column, although I’m fairly sure I only need to check against the first column. In the unlikely event of a match of any sort, I would need to check further manually, which is fine. The document I’m checking against can be found here.

Have you looked into using a Diff Comparison tool? I have used DiffMerge and Meld on Linux, along with KDiff3 and Kompare (but those are more geared for KDE).

You’d want to make sure your two files are sorted the same way before using either tool, and it will highlight the lines it sees a difference on and then you can quickly scan it to see if the difference matters.

1 Like

I thought about using Meld, but the second document contains multiple columns, so I didn’t think it would work. Although thinking about it further, I could just save a copy with all other columns deleted and compare that … Both documents are sorted alphabetically by surname.

So, for the observance of doubt, I pulled down this file in CSV format, and pulled a couple of names at random from the first column.

pondering… saving time… saving effort… makes some tweaksmakes some foolproofing while he’s at it

So here’s my result:

complists.py:

import csv
import sys
if (len(sys.argv) != 4):
    print("Expected format: complists.py inputfile1 inputfile2 outputfilename")
    exit()
with open(sys.argv[1],'rb') as ff:
    reader = csv.reader(ff)
    firstlist = list(reader)
with open(sys.argv[2],'rb') as sf:
    reader = csv.reader(sf)
    secondlist = list(reader)	
with open(sys.argv[3],'wb') as out:
    writer = csv.writer(out)
    for i, word in enumerate(secondlist):
        for fword in (fword for fword in firstlist if fword[0] == word[0]):
            print("Matched line {0} for {1}".format(i,word[0]))
            writer.writerow(word)

testfile1:

ABDULLAH
LOUKA
NOTINTHELIST

(The sanctionslist.csv comes from the site you provided)

Command Line Format:
complists.py firstlistfilename secondlistfilename outputfilename

Command Line Example:

E:\>python complists.py testfile1 sanctionsconlist.csv output.csv
Matched line 177 for ABDULLAH
Matched line 178 for ABDULLAH
Matched line 179 for ABDULLAH
Matched line 180 for ABDULLAH
Matched line 181 for ABDULLAH
Matched line 182 for ABDULLAH
Matched line 183 for ABDULLAH
Matched line 184 for ABDULLAH
Matched line 185 for ABDULLAH
Matched line 186 for ABDULLAH
Matched line 187 for ABDULLAH
Matched line 188 for ABDULLAH
Matched line 189 for ABDULLAH
Matched line 6253 for LOUKA
Matched line 6254 for LOUKA
Matched line 6255 for LOUKA
Matched line 6256 for LOUKA

This script will create a CSV file (named whatever you put into the last argument) that contains only the rows it matched. Should make for easier review of results for you.

1 Like

That’s excellent - thank you. smile

1 Like