
Getting Personal with Surname Level Matching

We’ve been improving data quality for our clients for over 15 years, and our software has evolved constantly over that time, incorporating the latest developments in fuzzy logic. We have also looked at further ways to refine this logic, such as matching levels.

Matching levels address the question: “what level of name match are we prepared to accept?”

For example, let’s assume that we have a data source that is reporting the following person as deceased:

“Mr John Smith”

In your data we find a record at the exact same premise with the following name:

“Mr J Smith”

What are the chances that your J is the same as our deceased John?

We know the answer to be around 95%. In other words, about 1 in 20 matches where we can only confirm the first initial turns out to be wrong. Perhaps your J was James, who is not deceased at all but alive and well, and a potential future customer.

Initial Level Matching means that we are prepared to accept this match, along with the risk that it may be wrong, because it is much more likely that the match is correct.

Forename Level Matching means that we would only be prepared to make the match if you gave us a John as the Forename – so this example would not match.

The other two options are:

Surname Level Matching, which means that we don’t care about the Forename and just want to have the same Surname (there are not many use cases for this); and

Premise Level Matching, which only really has a use in deduplication jobs where you are looking to get one mailing per household.
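To make the four levels concrete, here is a minimal sketch in Python. It is an illustration of the rules described above, not our production logic. The `Person` fields, the `matches` function and the level names are hypothetical, and we assume that a forename of one character is an initial and that records are only compared when they share a premise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    forename: str  # full forename, a single initial, or "" if unknown
    surname: str
    premise: str   # hypothetical identifier for the address


def matches(a: Person, b: Person, level: str) -> bool:
    """Return True if a and b match at the given level (illustrative only)."""
    if level not in {"premise", "surname", "initial", "forename"}:
        raise ValueError(f"unknown level: {level}")
    if a.premise != b.premise:
        return False
    if level == "premise":
        return True
    if a.surname.lower() != b.surname.lower():
        return False
    if level == "surname":
        return True
    fa, fb = a.forename.lower(), b.forename.lower()
    if not fa or not fb:
        return False  # no forename information to compare
    both_full = len(fa) > 1 and len(fb) > 1
    if level == "forename":
        return both_full and fa == fb
    # Initial level: two full forenames must still agree;
    # otherwise only the first letter is compared.
    return fa == fb if both_full else fa[0] == fb[0]


deceased = Person("John", "Smith", "P1")
yours = Person("J", "Smith", "P1")
print(matches(deceased, yours, "initial"))   # True
print(matches(deceased, yours, "forename"))  # False
```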

This all works very well. The choice between Forename and Initial level matching comes down to the number of additional matches you would make and the relative harm of a bad match versus a missed match. Would you prefer to make more matches, reducing the cost of your direct marketing, but with an increased risk of removing some records that should not be removed? Or would you prefer to keep more records, accepting that some of them should really have been removed but we could not be sure enough to do so?
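One rough way to weigh that decision is to put numbers on it. The sketch below takes the 95% figure for initial-only matches from earlier in this article. Everything else is a hypothetical input you would supply yourself: the function names, the saving per correct removal and the cost of wrongly removing a live customer.

```python
def expected_split(extra_matches: int, accuracy: float = 0.95) -> tuple[float, float]:
    """Split extra Initial-level matches into expected correct and wrong removals."""
    correct = extra_matches * accuracy
    wrong = extra_matches - correct
    return correct, wrong


def net_saving(extra_matches: int, saving_per_correct: float,
               cost_per_wrong: float, accuracy: float = 0.95) -> float:
    """Expected saving from the extra matches, minus the expected cost of bad ones."""
    correct, wrong = expected_split(extra_matches, accuracy)
    return correct * saving_per_correct - wrong * cost_per_wrong
```

For instance, 1,000 extra matches at 95% implies roughly 950 correct removals and 50 wrong ones. Whether that is worthwhile depends entirely on the two values you assign.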

It is useful here to refer to the Data Quality Audit to check out what proportion of your file contains full forenames, initials or even no forename information at all. Our account managers are always happy to help discuss these decisions.

The difficulty comes when we occasionally get a job where the name data varies widely. Sometimes a significant proportion of the file contains a name that is just “Mr Smith”.

In a file where we get the following records (all at the same address):

“Mr James Smith”

“Mr J Smith”

“Mr David Smith”

“Mr Smith”

How should we match these examples against our “Mr John Smith” who we know to be deceased at the same address?

At Forename Level, none of them would match. At Initial Level, only “Mr J Smith” would match. At Surname Level, all of them would match. Even David.

If we were considering how to deduplicate them, we would have a dilemma, as none of those options gives us a perfect answer. In fact, arguably there is no perfect answer. But one good answer is to allow only what we call non-conflicting matches. That is, James will match with J, as they do not conflict. James and J can also match with the last Mr Smith, as again there is no conflict between those three. But David is clearly a different person.

And this is our new option: Personal Level.

It allows a match if there is no conflicting forename information. It is useful when we want to deduplicate but the source data populates forenames inconsistently. In your file, three records with the names “Mr John Smith”, “Mr J Smith” and “Mr Smith”, all of whom have enjoyed your product before, are very likely to be the same person.
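The non-conflicting rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than our actual implementation. The function name `forenames_conflict` is hypothetical, a one-character forename is treated as an initial, an empty string means no forename was supplied, and the surname and address are assumed to have matched already.

```python
def forenames_conflict(a: str, b: str) -> bool:
    """Return True if two forename values cannot belong to the same person."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False          # a missing forename conflicts with nothing
    if len(a) == 1 or len(b) == 1:
        return a[0] != b[0]   # an initial only conflicts on the first letter
    return a != b             # two full forenames must be identical


print(forenames_conflict("James", "J"))      # False
print(forenames_conflict("James", ""))       # False
print(forenames_conflict("David", "James"))  # True
```

A Personal Level match is then simply a pair for which this returns False.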


Of course, as with the example above, things can get complicated. “Mr Smith” could match with either James/J or David. If you want to know about these complicated cases, we can identify all of them for you so that you can decide on the correct answer manually. Please speak to us if you want to know more.
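As a sketch of how such cases might be flagged, the Python below marks any record that is compatible with two other records which conflict with each other. The names `forenames_conflict` and `ambiguous_records` are hypothetical, and the input is assumed to be the forename values of records that already share a surname and address.

```python
from itertools import combinations


def forenames_conflict(a: str, b: str) -> bool:
    """True if two forename values cannot belong to the same person."""
    a, b = a.strip().lower(), b.strip().lower()
    if not a or not b:
        return False
    if len(a) == 1 or len(b) == 1:
        return a[0] != b[0]
    return a != b


def ambiguous_records(forenames: list[str]) -> list[str]:
    """Return forenames that could belong to two mutually conflicting people."""
    flagged = []
    for i, name in enumerate(forenames):
        compatible = [other for j, other in enumerate(forenames)
                      if j != i and not forenames_conflict(name, other)]
        if any(forenames_conflict(x, y) for x, y in combinations(compatible, 2)):
            flagged.append(name)
    return flagged


# "Mr Smith" (empty forename) fits both James/J and David, so it is flagged.
print(ambiguous_records(["James", "J", "David", ""]))  # ['']
```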

You can find the new option under the Matching Level option for Deduplicate on your Data cleansing jobs.

Of course, the real issue here is not the level of matching at all, but rather the quality of the data we have to work with in the first place. Given the chance, some people would prefer to give their name as “Mr Smith” or, even worse, “Mr and Mrs Smith”. How are we supposed to work out if that ‘person’ is deceased? Our advice is to review your data capture touch points and work out whether they can be improved, or whether further training would help. Splitting a generic Name field on your website into Title, Forename and Surname will get you better quality data. Requiring a forename of more than 1 character will stop people entering the bare minimum. Ensuring staff understand the consequences of not asking for a first name will help when you next come to use the data for marketing.
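As an illustration of the capture rules suggested above, here is a minimal Python sketch of a check you might run on a web form that already collects Title, Forename and Surname separately. The function name, field names and messages are hypothetical and this is not one of our validation products.

```python
def validate_name_fields(title: str, forename: str, surname: str) -> list[str]:
    """Return a list of problems with the captured name; empty means acceptable."""
    problems = []
    if not surname.strip():
        problems.append("surname is required")
    if len(forename.strip()) <= 1:
        # Rejects blank forenames and bare initials such as "J"
        problems.append("forename must be more than 1 character")
    if " and " in f" {title.strip().lower()} ":
        # Catches joint titles such as "Mr and Mrs"
        problems.append("please enter one person per record")
    return problems


print(validate_name_fields("Mr", "J", "Smith"))
# ['forename must be more than 1 character']
```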

We also have a full range of data validation tools to help you capture the best possible quality data as efficiently as possible.

Next Steps

Find out more about how Data8 can help you

About the author

Beth Beggs