Removing duplicates from a PostgreSQL database - Siafoo
Removing duplicates from a PostgreSQL database – Siafoo.
The basic assumptions are that you have a table, named foo, that contains a text field, called contents and that the contents field may contain duplicate rows.
For my purposes I am using a table with 30,000 rows of which 8,000 rows must be removed because they duplicate content one or more times. The contents field contains an average of only 180 characters making the hash function very fast but also probably unnecessary.
-
First, as always, backup your database.
-
Add an md5 hash key and populate it with the content hashes (~2 seconds for my data).
# ’sALTER TABLE foo ADD COLUMN hash text; UPDATE foo SET hash = md5(content);
-
Create a unique id (i.e. like a primary key) for each row, if one does not exist.
# ’sCREATE SEQUENCE foo_sequence INCREMENT 1 MINVALUE 1 MAXVALUE 9223372036854775807 START 1 CACHE 1; -- Set the correct owner for the sequence if needed -- ALTER TABLE foo_seq OWNER TO postgres; -- Populate the field ALTER TABLE foo ADD COLUMN id integer; UPDATE foo SET id = nextval('foo_seq');
-
(optional) Estimate the number of duplicates
You can use the query below to estimate the number of duplicates, so that once the deletion takes place you can figure out if something has gone horribly wrong™.
# ’sSELECT sum(dupes.c) - count(dupes.c) FROM (SELECT count(*) AS c FROM foo GROUP BY hash HAVING count(*) > 1 ) AS dupes
-
Delete the duplicates (~3 minutes on my local machine, a dual core AMD)
# ’sDELETE FROM foo WHERE id NOT IN (SELECT min(id) FROM foo GROUP BY hash HAVING count(*) >= 1)
-
Drop the hash column
# ’sALTER TABLE foo DROP COLUMN hash;
Naturally you can modify the above procedure to better fit your data definition and uniqueness requirements. It is also probable that you can achieve some speed ups by using indexes… somewhere.