date 2020-01-09
So let's put it in order-of-annoyance:
.csv
Plain Text
.html
.xls
.pdf
Stage 1: Parsing
This is actually where most of your effort will lie. The actual comparing is a doddle; getting the data prepped for comparing is the hard part.
CSV is designed for parsing; it's a structured format.
Plain Text, obviously, is pretty easy to parse out. Not a structured format necessarily, so it may need some tweaking.
.html can be a bollockache if you’re getting a lot of random/different HTML. If it’s standardized HTML, it’ll be easier.
.xls is a bit of a pain because Microsoft don’t like their data to be easily read. It may be worth pushing XLS files through Excel and saving them as CSV.
PDF. Oh lordy. PDF is going to be a pain as well. Probably best to push this one through your reader of choice, and output it as plain text.
Basically, you’re going to want to mangle all your data into its most basic form, with CSV being the preferred input. A script can then parse it down into a list.
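As a rough sketch of that last step: assuming each file has one item per row and the item sits in the first column (the function name load_list is made up for illustration), Python's built-in csv module can flatten a file into a list.

```python
import csv

def load_list(path, column=0):
    """Read one column of a CSV file into a list of stripped strings."""
    items = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle):
            # Skip blank rows, rows too short to have the column, and empty cells.
            if len(row) > column and row[column].strip():
                items.append(row[column].strip())
    return items
```

A plain text file with one item per line should go through the same function, as long as the lines don't contain commas or quote characters of their own.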
Stage 2: Comparing Two Lists
Python can do this pretty easily (I’m probably not simplifying/shorthanding this enough for Python experts to be happy, and I’m spitballing this, so it may need tweaking):
for i, word in enumerate(secondlist):
    # check every entry of the first list against the current word
    for fword in (fword for fword in firstlist if fword == word):
        print("Match found for {0} on line {1}".format(fword, i))
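The nested loop rescans the first list for every word in the second, which gets slow on big lists. One possible variant, assuming the items are plain strings and you only care about exact matches (find_matches is just an illustrative name), puts the first list in a set so each lookup is quick:

```python
def find_matches(firstlist, secondlist):
    """Return (line_number, word) pairs for items of secondlist found in firstlist."""
    seen = set(firstlist)  # set membership avoids rescanning firstlist each time
    return [(i, word) for i, word in enumerate(secondlist) if word in seen]

# Made-up sample data, purely for illustration.
firstlist = ["apple", "banana", "cherry"]
secondlist = ["banana", "durian", "apple"]
for i, word in find_matches(firstlist, secondlist):
    print("Match found for {0} on line {1}".format(word, i))
```

Two things to be aware of: line numbers start at 0, same as in the loop version, and a word that appears several times in the first list is reported once per line of the second list rather than once per duplicate.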
Output
The benefit of using something command-line based like Python would be
- OS ‘independence’ (since Python is not tied to any OS specifically),
- Pipability.
complists.py list1.csv list2 > todayslist
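For what it's worth, a script that fits that command line might look roughly like this. It's only a sketch under assumptions: both files hold one item per line, the first argument is the reference list, and matches go to stdout so the shell redirect can catch them. The helper names read_items and compare_files are invented.

```python
import sys

def read_items(path):
    """Return the stripped lines of a text file, blank lines included."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle]

def compare_files(first_path, second_path, out=sys.stdout):
    """Print every non-blank line of the second file that also appears in the first."""
    first = set(read_items(first_path))
    first.discard("")  # blank lines should never count as a match
    for i, word in enumerate(read_items(second_path)):
        if word in first:
            print("Match found for {0} on line {1}".format(word, i), file=out)

if __name__ == "__main__":
    if len(sys.argv) == 3:
        compare_files(sys.argv[1], sys.argv[2])
    else:
        # Usage goes to stderr so it never ends up in the redirected output.
        print("usage: complists.py FIRST SECOND", file=sys.stderr)
```

Blank lines are kept while reading so the reported line numbers (counted from 0) still line up with the second file.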