One of the principles of data integrity is to reduce data redundancy as much as possible and wipe out any duplicate database table entries. Another reason for this is that you need to add unique indexing to one or a group of table columns and you need the one column or combination of columns to be unique through out all the rows of the table.
There are basically 2 ways to achieve that:
- Identifying duplicate entries, leaving just one of them and deleting the rest.
- Identifying only unique table rows and collect them into a new table.
I’ll just go ahead with the 2nd approach.
There are in turn 2 ways to accomplish this task, namely SELECT DISTINCT … FROM … and SELECT … FROM GROUP BY …
SELECT DISTINCT … FROM …
Rather intuitively, this SQL query selects all distinct combination of columns from a table and leaving out duplicate ones:
CREATE TABLE new_orders SELECT DISTINCT receipt, quantity, price FROM orders
One thing to note is that you have to leave out the original primary key such as ID so that all rows are only compared for uniqueness on necessary data fields. It is after this creation query is successfully completed when you add a primary key indexing column to the new_orders table.
SELECT … FROM GROUP BY …
GROUP BY clause is usually used for statistical functions of MySQL such as COUNT() or AVG() to only take a distinguished group of columns into calculation. In this situation, however, it comes handy that it inherently excludes multiple table records with the same value in the specified column and includes just one of them in the returned results.
For example, if you need to include only those records from the old table orders that’s with distinct values in column receipt:
CREATE TABLE new_orders SELECT * FROM orders GROUP BY receipt
Now all the records in the new table new_orders are with distinctly unique receipt values.
Depending on whether the original records are referenced by other tables via ID, you can regenerate the ID or leave them in the new table new_orders.
I am seeking a more complex de-dupe routine. I have 2 tables. The new table was generated by spidering a yellow pages site. The old data was generated the same way some time ago, and people have added and changed records over the years.
I want to keep unique records from the old table but not end up with duplicates. I tried one routine using left join and group, however it did not remove all duplicates because the “name” field sometimes does not match exactly
I want to match the first 20 characters of the name field PLUS all the city field AND all the state field. If all 3 match, the record is not to be copied. If one of the 3 does not match, that record from the old table should be written to an outfile. Then I can grab the records matching the unique id’s, modify the fields to match the new table, and import to the new table.
Thanks, Mike
its very useful please guide for MySql