I have a dataset which contains pairs of names, it looks like this:
ID; name1; name2
1; Mike Miller; Mike Miler
2; John Doe; Pete McGillen
3; Sara Johnson; Edita Johnson
4; John Lemond-Lee Peter; John LL. Peter
5; Marta Sunz; Martha Sund
6; John Peter; Johanna Petera
7; Joanna Nemzik; Joanna Niemczik
I have some cases, which are labelled. So I check them manually and decide if these are duplicates or not. The manual judgement in these cases would be:
1: Is a duplicate
2: Is not a duplicate
3: Is not a duplicate
4: Is a duplicate
5: Is not a duplicate
6: Is not a duplicate
7: Is a duplicate
(The 7th case is a specific case, because here phonetics come into the game too. However, this is not the main problem, I am ok with ignoring phonetics.)
A first approach would be to calculate the Levenshtein-distance for each pair and mark those as a duplicate, where the Levenshtein-distance is for example less or equal than 2. This would lead to the following output:
1: Levenshtein distance: 2 => duplicate
2: Levenshtein distance: 11 => not a duplicate
3: Levenshtein distance: 4 => not a duplicate
4: Levenshtein distance: 8 => not a duplicate
5: Levenshtein distance: 2 => duplicate
6: Levenshtein distance: 4 => not a duplicate
7: Levenshtein distance: 2 => duplicate
This would be an approach which uses a "fixed" algorithm based on the Levinshtein distance.
Now, I would like to do this task with using a neural network / machine learning:
I do not need the neural network to detect semantic similarity, like "hospital" and "clininc". However, I would like to avoid the Levenshtein-distance, as I would like the ML algorithm to be able to detect "John Lemond-Lee Peter" and "John LL. Peter" as a potential duplicate, also not with a 100% certainty. The Levenshtein distance would lead to a relative high number in this case (8), as there are quite some characters to be added. In a case like "John Peter" and "Johanna Petera" the Levenshtein-distance would lead to a smaller number (4), however this is in fact no duplicate and for this case I would hope that the ML algorithm would be able to detect that this is likely not a duplicate. So I need the ML algorithm to "learn the way I need the duplicates to be checked". With my labelling I would give as an input I would give the ML algorithm the direction, of what I want.
I actually thought that this should be an easy task for a ML algorithm / neural network, but I am not sure.
How can I implement a neural network to compare the pairs of names and identify duplicates without using an explicit distance metric like the Levenshtein distance?
I thought that it would be possible to convert the strings to numbers and a neural network can work with this and learn to detect duplicates according to my labelling style. So without having to specify a distance metric. I thought about an human: I would give this task to a person and this person would judge and make a decision. This person has no clue about a Levenshtein-distance or any other mathematical concept. So I just want to train the neural network to learn to do what the human is doing. Of course, every human is different and it also depends on my labelling.
source https://stackoverflow.com/questions/72914328/compare-similarity-of-two-names-and-identify-duplicates-with-neural-network
Comments
Post a Comment