Lorna Backhouse is a Data Scientist at VisionWare.
This can be translated into the question: what is the probability that someone shares the last 4 digits of their Social Security Number (SSN) with someone else? It is far higher than one would expect.
- Out of 119 people, there is a 50% chance that two of them will have the same last 4 digits in their SSN.
- Out of 180 people, there is an 80% chance that two of them will have the same last 4 digits in their SSN.
- Out of 300 people, there is roughly a 99% chance that two of them will have the same last 4 digits in their SSN.
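These figures can be checked empirically. The following is a minimal simulation sketch, not a definitive implementation: it assumes every person's last 4 digits are independent and uniformly distributed over 0000 to 9999, and the function name `simulate_shared_last4`, the trial count, and the seed are illustrative choices rather than anything from a specific tool.

```python
import random

def simulate_shared_last4(n_people, trials=5_000, seed=0):
    """Estimate the chance that at least two of n_people share the same last 4 digits."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Assumption: last 4 digits are independent and uniform on 0000-9999.
        digits = rng.choices(range(10_000), k=n_people)
        if len(set(digits)) < n_people:  # a repeated value means a match
            hits += 1
    return hits / trials

for n in (119, 180, 300):
    print(f"{n} people: estimated chance of a match = {simulate_shared_last4(n):.3f}")
```

Because the result is an estimate, it will wobble by a percentage point or so around the exact values; increasing `trials` narrows that spread.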
The probability that two people share the same last four digits depends on how many people are being compared, because every pair in the group is a chance for a match. The chance that two particular numbers do not match is 9999/10000. With 4 people in the room (A, B, C and D), take each person in turn and require their number to differ from everyone before them:
- Chance B doesn't match A = 9999/10000
- Chance C doesn't match A or B, given that A and B differ = 9998/10000
- Chance D doesn't match A, B or C, given that all three differ = 9997/10000
Hence the total chance that no two of A, B, C and D match is 9999/10000 × 9998/10000 × 9997/10000 ≈ 0.9994.
The chance that at least two of them do match is 1 − 0.9994 ≈ 0.0006, or about 0.06%.
As you increase the number of people from 4 upwards, the probability rises rapidly, because the number of pairs being compared grows roughly with the square of the group size. With more than 10,000 people a match is certain, since there are only 10,000 possible 4 digit endings.
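The 4-person calculation generalises to any group size by multiplying one factor per additional person. Below is a short sketch of that product, assuming all values are equally likely and independent; `prob_shared` is a hypothetical helper name used only for illustration.

```python
def prob_shared(n_people, n_values=10_000):
    """Probability that at least two of n_people share a value,
    assuming each of n_values is equally likely and independent."""
    if n_people > n_values:
        return 1.0  # more people than possible values forces a match
    p_all_different = 1.0
    for k in range(n_people):
        # The (k+1)-th person must avoid the k values already taken.
        p_all_different *= (n_values - k) / n_values
    return 1.0 - p_all_different

for n in (4, 119, 180, 300):
    print(f"{n} people: {prob_shared(n):.6f}")
```

Passing a different `n_values` applies the same reasoning to any other attribute with a known number of possible values.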
So how does this affect the score assigned to the 4 digit SSN in record linkage? The best approach is to think of the score relative to other attributes.
- 9 digit SSN – Out of 10,000 people, the chance of any two sharing a 9 digit SSN is about 0.05. This is purely mathematical: it treats all 1,000,000,000 combinations as equally likely and ignores the fact that SSNs are issued without replacement, i.e. uniquely. Out of the same 10,000 people, it is almost certain that 2 people will share the same 4 digit SSN.
Hence, the score given to a 4 digit SSN should be significantly lower than that given to a 9 digit SSN. I would recommend that it is at least 20x lower than the 9 digit SSN score.
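The comparison between the two SSN lengths can be reproduced with the same product, computed in log space so that it stays accurate when the number of possible values is very large. This is a sketch under the purely mathematical assumption that every digit combination is equally likely; the function and variable names are illustrative.

```python
import math

def prob_shared(n_people, n_values):
    """Probability that at least two of n_people share one of n_values
    equally likely values, computed via logs for numerical stability."""
    if n_people > n_values:
        return 1.0
    # log of the probability that everyone holds a different value
    log_all_different = sum(math.log1p(-k / n_values) for k in range(n_people))
    return -math.expm1(log_all_different)  # 1 - exp(log_all_different)

people = 10_000
p_full = prob_shared(people, 10**9)   # 9 digit SSN, all combinations assumed valid
p_last4 = prob_shared(people, 10**4)  # last 4 digits only

print(f"9 digit SSN: {p_full:.4f}")
print(f"4 digit SSN: {p_last4:.4f}")
print(f"ratio:       {p_last4 / p_full:.1f}")
```

The ratio of the two probabilities is one rough way to sanity-check how far apart the two scores should sit; it is not a substitute for weights estimated from your own data.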
- Date of Birth – If we consider all dates of birth over a range of 100 years then there are 36,525 possible birth dates. If we consider 10,000 people then the chance of any two of them having the same Date of Birth is > 0.99, which is very similar to that of the 4 digit SSN.
Hence in terms of scores, it is advisable to score the 4 digit SSN the same as, if not slightly lower than, Date of Birth. If your population consists only of economically active adults, whose birth dates span a narrower range, then the same score should be applied to both Date of Birth and the 4 digit SSN.
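One way to see how close the two attributes are is to ask how large a group must be before a shared value becomes more likely than not. The sketch below assumes uniformly distributed values, and the 50-year adult birth-date range is a hypothetical figure chosen for illustration only, not one taken from the text.

```python
def smallest_group_for(target, n_values):
    """Smallest group size whose chance of at least one shared value reaches target,
    assuming n_values equally likely, independent values."""
    p_all_different = 1.0
    n_people = 1
    while 1.0 - p_all_different < target:
        # One more person joins and must avoid the n_people values already present.
        p_all_different *= (n_values - n_people) / n_values
        n_people += 1
    return n_people

attributes = {
    "4 digit SSN": 10_000,
    "Date of Birth, 100 year range": 36_525,
    # Hypothetical narrower range for an adult-only population.
    "Date of Birth, 50 year range (assumed)": int(50 * 365.25),
}

for name, n_values in attributes.items():
    print(f"{name}: {smallest_group_for(0.5, n_values)} people for a 50% chance")
```

The narrower the range of birth dates in the population, the closer Date of Birth gets to the 4 digit SSN in how easily it matches by chance.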
Although there are numerous articles warning you not to give out your 4 digit SSN, understand that in these cases the 4 digits are given alongside additional identifying information such as Date of Birth and ZIP code. From these 3 pieces it is possible to determine the likely 9 digit SSN, but only in combination.
For reference and further explanations I would recommend the following readings: