I have a database that has a name of a business and a number of contacts. As you can imagine data entry can not be as accurate as one may like at times so sometimes the names of business names are not provided in the same format or spelling. Here is a sample of the data:
Name | Contact
Steel City | 5
Steele City | 6
City of Angel | 3
City of Angels | 1
AB Corp | 7
Ab Corporation | 2
AB Incorporated | 8
AB INC | 9
Adult Beverage INC | 4
Vally System | 2
Vally System Assoc | 3
I need a way of matching this data so it combines these tables to look like:
Name | Contact
Steel City | 11
City of Angels | 4
AB Corp | 30
Vally System Assoc | 5
Does anyone have any idea on how to match these entries when the name is similar but not 100% the same?
Edit the data? Or change the data entry process to have them enter a corporation once, then adding contacts to that corporation.
Seriously. There is ABSOLUTELY no way to do what you’re asking. Full Text Searching could come close if you didn’t have the AB disaster there. But if you have anything close, you could end up with a LOT of false positives regardless. I would highly recommend a simplification of the data entry process…
You could set it up for future entries so that as the business name is being entered, it uses an AJaX call to get similar business names from the database and have them appear in a “drop down” fashion as the user types. That way, if the business already exists, they can either abort and add something else, or choose that business name so that it matches all the others.
Normalization should be utilized, here, IMHO. One table for businesses, another for contacts, keeping the bloat down and cutting down on network. Just a thought.
This is a great idea. Unfortunately, we have no control over how the data is being entered. This is another company so we can’t fix how they are collecting data. They simply provide us a dump of there data which has 60K+ rows. Out of those 60k rows maybe only 16k are unique companies. The more matching we can do the better. I recognize the impossibilities presented in the AB example but there has to be a way to do matching for the others or at least create a score of suggestable matches.
Perhaps using the against functionality? If we seperate the name into seperate words such as:
SELECT * FROM table WHERE MATCH(name) AGAINST (‘+ab ~inc’ IN BOOLEAN MODE)
Is there a way to score this data and rank it in the results based on highest possible matches?
Again, you can with full text searching, but you’re going to spend as much time with parsing through that looking for false positives (which there will be a LOT depending on how generic you make the search terms).
This data you’re getting from another company…is this something you’ve contracted out for, or are you buying a mailing list from them? If it’s the first, then you most certainly do have recourse - make them fix the data. If it’s the second, then you’re on your own…
The closest that I think you can get would be to make Solr or Verity collections from the database dumps; and even then, it will be a huge pain in the arse.
If there is another company from which to get your data, one that actually pays attention to details… drop whomever you’re currently using and go with that. Honestly, if this company can’t keep consistency in THEIR data entry, you’re paying too much.