Julio Sancho's Blog

Personal blog of Julio Sancho

MySQL remove duplicates (large database + fast)

Posted on April 11, 2020 by julifos

In the end, this was for me the fastest method to remove dups from the big db I’m working on:

Duplicate the structure of the database (ie, “db”) to a new database (ie, “db2”, setting the fields you don’t want to be dups to UNIQUE.
Copy db to db2 using INSERT IGNORE:

INSERT IGNORE INTO db2 (`xxx`,`yyy`,`zzz`) SELECT `xxx`,`yyy`,`zzz` FROM `db`

This reduced the amount of time to minutes, while any other method could take many hours, appart from creating large temporary files which could fill the disk holding the temp dir and finally abort the process.

If you don’t need in the end to have those UNIQUE fields (the indexes takes many HD), you should re-dump db2 to a new db without that UNIQUE fields => again some more minutes, but still worth against 12 hours of CPU/RAM intensive working (in a highly mysql responsive profile) with other kind of filtering methods, such as DISTINCT or GROUP BY.

Leave a comment Cancel reply