Often we want to search for, find, or remove duplicate and similar content from a database containing articles, text, messages, etc.
This is easy to do manually if you have only 100 or so records in your database. With a large number of records it becomes time-consuming, and beyond a certain size manual processing is simply not possible. I have tried many ways of doing this and would like to share what I have done. Hopefully this helps people in the same situation as me. 🙂
- starting with the basics – try removing exact duplicates first. This gets rid of records that have identical values. A tool like MS Excel can do this with its Remove Duplicates feature.
- sorting and removing – coming back to MS Excel, sort all the records. You will now be able to see which records start with the same characters and might be similar, and you can delete them by hand. (This method is not recommended when you have a large number of records.)
- introducing PHP for checking – the step above can be automated using PHP and MySQL. Select your records from the database and sort them. For each record, store its first 20 characters or so (the length is up to you) in a variable such as $record_old_name_initial and compare it with the next record's name using a LIKE pattern with a trailing % (that is, the prefix followed by %). If it matches, you can remove the current record. You can also choose to delete whichever of the two is newer or older, depending on which one you want to keep. Many records can be filtered out this way.
- continuing with PHP – the same method can also be applied to the last 20 characters or so.
- combining front and back – after checking the front and the back individually, check them in combination, so that a record is removed only when both its start and its end match. This is preferable to either of the two methods above, because each of those on its own may filter out more records than you want (including some you would rather keep).
- the center text – after the front and the back, we are left with the middle of the content. This seems to work better than the others and doesn't require sorting. Take a substring from a record, starting at around the 20th character and running for the next 20 characters or so (again, the numbers are up to you). Check whether this substring is present in other records, and if you find a match, remove those records.
- the center text multiplied – take 2-3 chunks of 20 characters or so from the current record and search for all of them together using AND. Remove the records that are returned.
- the center text modified – when using the two methods above, instead of requiring an exact match, accept a match percentage above 80% or so (the threshold is up to you) and remove the resulting records.
- the words – split the record into words and check how many of them also appear in the next record. If the share is more than 90% or so, treat the record as a duplicate and remove it.
- the tags – first tag your content; you can use an auto-tagging API for this. Once tagging is done, compare the tags of the current record with the tags of other records. If two records share a high proportion of tags, one of them can be considered a duplicate or similar.
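To make a few of these checks concrete, here is a minimal sketch of the front-and-back, center-text, and word-overlap ideas. It is written in Python rather than PHP and MySQL, and it assumes the records have already been loaded into an in-memory list of strings. The function names, the 20-character lengths, and the 90% threshold are only illustrative choices, not a fixed recipe.

```python
def center_chunk(text, start=20, length=20):
    """Return `length` characters of `text` beginning at offset `start`."""
    return text[start:start + length]


def shares_front_and_back(a, b, n=20):
    """True when both the first and the last `n` characters are identical."""
    return a[:n] == b[:n] and a[-n:] == b[-n:]


def word_overlap(a, b):
    """Fraction of the distinct words in `a` that also appear in `b`."""
    words_a = set(a.lower().split())
    words_b = set(b.lower().split())
    if not words_a:
        return 0.0
    return len(words_a & words_b) / len(words_a)


def find_similar(records, chunk_len=20, word_threshold=0.9):
    """Return (i, j) index pairs of records flagged as duplicate or similar.

    Every pair is compared, so this is only practical for modest data sets;
    for large tables the same checks would be pushed into database queries.
    """
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            a, b = records[i], records[j]
            chunk = center_chunk(a, chunk_len, chunk_len)
            # Ignore center chunks that are too short to be meaningful.
            center_match = len(chunk) == chunk_len and chunk in b
            if (shares_front_and_back(a, b, chunk_len)
                    or center_match
                    or word_overlap(a, b) >= word_threshold):
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    sample = [
        "The quick brown fox jumps over the lazy dog near the river bank today",
        "The quick brown fox jumps over the lazy dog near the river bank today!",
        "Completely different text about databases and indexing strategies here",
    ]
    print(find_similar(sample))
```

The function only reports candidate pairs and deletes nothing, so you can review the matches and decide which record of each pair (the newer or the older) to keep.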
I have tried and analysed these methods over time, and with each of them I was able to eliminate similar records. If you have other ideas, please mention them in the comments section.