Methods for data de-duplication

I am writing an application that will take in thousands of leads a week from various providers. I need to figure out an efficient method of determining duplicate data.

First, the list will be de-duplicated against itself. Second, it will be de-duplicated against other leads from the same provider. Third, it will be de-duplicated against ALL leads in the database. I will have to check for duplicates across up to two phone numbers as well as email addresses.

I’m not asking for code; I’m just wondering what the most logical and efficient method of doing this would be.

Any resources on data de-duplication techniques would be wonderful. Thank you, community!

How many databases are involved? Are there multiple databases (one per provider), each having entries added and then all of them checked against each other for duplicates, or is there just one database that every provider adds entries to?

Why not just insert all leads into the database and make use of a unique key? For example, if you make email_address unique, the database will prevent insertion of any new row that has the same email_address value.

You may also make other fields or keys unique. It really depends on what you consider a duplicate. Obviously two different people may have the same first and last names, so you should not assume that the FirstName.LastName combination is unique.
But an email address is a pretty good indication of the uniqueness of a lead.
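
To make the unique-key idea concrete, here is a minimal sketch assuming MySQL and the mysql-connector-python driver; the leads table, its columns and the connection details are made up for illustration.

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="crm")
    cur = conn.cursor()

    # The UNIQUE KEY on email_address makes the database itself reject any
    # second row carrying the same address.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS leads (
            id            INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
            first_name    VARCHAR(100),
            last_name     VARCHAR(100),
            email_address VARCHAR(255) NOT NULL,
            phone_1       VARCHAR(32),
            phone_2       VARCHAR(32),
            UNIQUE KEY uq_leads_email (email_address)
        )
    """)
    conn.commit()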

Avoiding duplicate data in the first place is obviously preferable, but if you’re in a situation where that isn’t possible, I would suggest you create a moderation queue that all new entries go into. If they look unique (determined by some function you’ll have to construct), verify them right away. If they look even remotely like they might be duplicates, have a person verify or reject the entry. The critical part is then to create a duplicate-detection function that is good at guessing. You will probably prefer to err on the side of caution, e.g. if there is even the slightest hint that an entry is a duplicate, have it moderated. But that depends on your particular situation: maybe you can accept some duplicates if that means less manual intervention.
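
As a hedged sketch of that routing idea only (the thresholds and the score_duplicate callback are placeholders, not anything prescribed above):

    # Placeholder thresholds; tune them against your own data.
    CERTAIN_DUPLICATE = 0.95
    NEEDS_REVIEW = 0.60

    def route_lead(lead, score_duplicate):
        """Decide what happens to an incoming lead.

        score_duplicate(lead) is assumed to return 0.0 (clearly new)
        through 1.0 (clearly a duplicate).
        """
        score = score_duplicate(lead)
        if score >= CERTAIN_DUPLICATE:
            return "discard"            # obvious duplicate
        if score >= NEEDS_REVIEW:
            return "moderation_queue"   # a person verifies or rejects it
        return "accept"                 # looks unique, verify right away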

The simplest solution would be using unique keys and INSERT … ON DUPLICATE KEY UPDATE or INSERT IGNORE.
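
For example, a sketch assuming the leads table from the earlier post, MySQL and mysql-connector-python:

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="crm")
    cur = conn.cursor()

    lead = ("Jane", "Doe", "jane@example.com")

    # INSERT IGNORE silently skips any row that violates the unique key.
    cur.execute(
        "INSERT IGNORE INTO leads (first_name, last_name, email_address) "
        "VALUES (%s, %s, %s)", lead)

    # INSERT ... ON DUPLICATE KEY UPDATE keeps one row but refreshes its fields.
    cur.execute(
        "INSERT INTO leads (first_name, last_name, email_address) "
        "VALUES (%s, %s, %s) "
        "ON DUPLICATE KEY UPDATE first_name = VALUES(first_name), "
        "last_name = VALUES(last_name)", lead)

    conn.commit()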

As I said before, use a unique key in the database. The way I would do it is NOT to use INSERT IGNORE but instead to wrap the INSERT statement inside a try/catch block.
That way I would be able to catch the duplicate-key exception and then do something with that data, like logging it to another table to keep track of all duplicate errors, maybe for later analysis and statistics.
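
A sketch of that approach, assuming mysql-connector-python and a made-up duplicate_leads log table:

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="crm")
    cur = conn.cursor()

    def insert_lead(first, last, email):
        try:
            cur.execute(
                "INSERT INTO leads (first_name, last_name, email_address) "
                "VALUES (%s, %s, %s)", (first, last, email))
        except mysql.connector.IntegrityError:
            # Duplicate key: log the rejected lead for later analysis/statistics.
            cur.execute(
                "INSERT INTO duplicate_leads (first_name, last_name, email_address) "
                "VALUES (%s, %s, %s)", (first, last, email))
        conn.commit()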

After INSERT IGNORE you can get the affected row count to know whether the insert succeeded or not, so try/catch is not the only way.
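
For example (again just a sketch with mysql-connector-python and the same assumed leads table):

    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="crm")
    cur = conn.cursor()

    cur.execute(
        "INSERT IGNORE INTO leads (first_name, last_name, email_address) "
        "VALUES (%s, %s, %s)", ("Jane", "Doe", "jane@example.com"))
    conn.commit()

    # rowcount is 1 when the row went in and 0 when the unique key caused
    # the insert to be ignored.
    if cur.rowcount == 0:
        print("duplicate skipped")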

Fifth Street and 5. Street are also duplicates - those wouldn’t be caught by a unique key - so that suggestion is kind of useless in terms of data de-duplication.

Ah, if they’re going to differ like that, then you’ll probably have to remember that 5. Street is Fifth Street, so any future records can be matched without intervention.

Maybe Google geocoding could help if it’s an actual physical address?

I’ve done something similar (although on many different applications/data sources) and wrapped it up as a web service which would be passed two snippets of content and return a value based on their similarity. The script calling the web service would then remove the record if an exact match was found, or flag it for manual review if it passed a certain threshold.

One such application that used it was a contact management system that generated around 3,000 new contacts a week, which had to be compared against the existing 220,000 contacts. Batch jobs were run overnight for imports, or direct calls were made while users were creating new contacts.

The similarity calculation was based on a combination of Levenshtein distance, metaphone values and swapping of common words (st to street, etc.).

Sounds interesting.
Are you willing to share some code?

Heh, I would if I still had a copy, but I’ve left the company I was at when it was developed (plus it was written in C# :D). The idea is pretty simple: do the st/street type replacements so the text is semi-normalised, compare the metaphone values of each string and count the differences, compare the Levenshtein distance between the two, and then return those numbers combined; the higher the number, the less chance of a match. The important part is the st/street replacements, otherwise your numbers will be seriously off.
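
The original C# is gone, so here is only a rough reconstruction of that idea in Python as a hedged sketch. It assumes the third-party jellyfish package for the metaphone and Levenshtein functions, and the replacement table is just an example.

    import jellyfish  # third-party; provides metaphone() and levenshtein_distance()

    # Example replacements so "5th St" and "Fifth Street" normalise closer together.
    REPLACEMENTS = {"st": "street", "rd": "road", "ave": "avenue", "5th": "fifth"}

    def normalise(text):
        words = text.lower().replace(".", "").replace(",", "").split()
        return " ".join(REPLACEMENTS.get(w, w) for w in words)

    def dissimilarity(a, b):
        """Higher number = less chance of a match, as described above."""
        a, b = normalise(a), normalise(b)
        # Differences between the metaphone (phonetic) encodings...
        phonetic = jellyfish.levenshtein_distance(
            jellyfish.metaphone(a), jellyfish.metaphone(b))
        # ...plus the plain Levenshtein distance between the strings.
        textual = jellyfish.levenshtein_distance(a, b)
        return phonetic + textual

    print(dissimilarity("5th St", "Fifth Street"))    # small number -> likely duplicate
    print(dissimilarity("Fifth Street", "Oak Road"))  # larger number -> unlikely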

With several other replacements like umlauts (and other German specials…) it would be a pretty good way to do a duplicate check (and an address-proposal service via AJAX).
Thanks so far, I’ll keep it in mind for the next time I need it.

Yeah, we were trying to get it to work in real time for new addresses in the contact management system but couldn’t get it fast enough. With metaphones you can store the results in the DB, but for Levenshtein distance you need to compare each existing entry against the entry the user just made (and for each address line, postcode, etc.), which ended up being too costly.
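
One common way to keep that cost down (not what was done in the project above, just a sketch) is to store the metaphone key in an indexed column and only run Levenshtein against rows whose key matches; the contacts table and column names here are made up.

    import jellyfish
    import mysql.connector

    conn = mysql.connector.connect(host="localhost", user="app",
                                   password="secret", database="crm")
    cur = conn.cursor()

    def candidate_duplicates(new_address, max_distance=3):
        # Assumes address_metaphone was populated with jellyfish.metaphone()
        # when each contact was stored, and is indexed.
        key = jellyfish.metaphone(new_address)
        cur.execute(
            "SELECT id, address_line FROM contacts WHERE address_metaphone = %s",
            (key,))
        return [
            (row_id, line)
            for row_id, line in cur.fetchall()
            if jellyfish.levenshtein_distance(new_address.lower(), line.lower())
               <= max_distance
        ]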

Is there anything wrong with creating a unique key on a field that holds a binary MD5 hash of the normalized address/phone number? If there are multiple phone numbers per record, then the phone numbers should probably be stored in a second database table, which can use the same normalize/hash method to identify duplicates.

If duplicates are needed in the database for some reason, such as reporting how many leads were duplicates and therefore not chargeable, then the fields don’t need a unique key; instead you can GROUP BY the hash field and COUNT(*) the number of records per group. These are all tasks which can easily be performed on the virtual web platform, which is a system I developed for producing web-based business applications. If you’re really pressed for speed you can always split the tables up based on the MD5 hashes so your tables aren’t holding millions of records.
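
A sketch of the hashing side in Python; the normalisation rule and the lead_phones/phone_hash names are assumptions for illustration.

    import hashlib
    import re

    def phone_hash(raw_phone):
        """MD5 of a normalised phone number, suitable for a BINARY(16) unique column."""
        digits = re.sub(r"\D", "", raw_phone)         # normalise: keep digits only
        return hashlib.md5(digits.encode()).digest()  # 16-byte binary hash

    # Two formattings of the same number hash identically.
    assert phone_hash("(555) 010-2345") == phone_hash("555.010.2345")

    # If duplicates are kept for reporting instead of being blocked by a unique
    # key, count them per hash:
    duplicate_report_sql = """
        SELECT phone_hash, COUNT(*) AS copies
        FROM lead_phones
        GROUP BY phone_hash
        HAVING COUNT(*) > 1
    """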