SQL syntax

How to delete duplicates in the presence of a primary key?


by S.Moiseenko
(2009-07-25)

In the previous paper we considered how to resolve the problem of duplicate rows caused by the lack of a primary key. Now let us consider a more difficult case, in which a key is present but it is synthetic. With an improper design this can lead to rows that are duplicates from the point of view of the subject area. It is strange, but even though I tell my students about the disadvantages of synthetic primary keys, they still use them in their first database projects. Probably people just need to enumerate everything :) I don't want to discuss the trite synthetic key problem here. I would just like to say that if you decide to use a synthetic key as the primary key, you should also create a natural unique key to avoid the situation described below.

So, suppose we have a table with the primary key id and a column name. According to the restrictions of the subject area, the column name must contain unique values. However, if the table is defined as

CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY, name VARCHAR(50));

duplicate rows can appear. It would be better to define the table as

CREATE TABLE T_pk (id INT IDENTITY PRIMARY KEY, name VARCHAR(50) UNIQUE);
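The difference between the two designs can be checked with a short script. This is only a sketch under assumptions: it uses SQLite through Python's sqlite3 module as a stand-in for the SQL Server dialect of the article, INTEGER PRIMARY KEY plays the role of INT IDENTITY, and the table names T_pk_loose and T_pk_strict and the helper try_insert are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical names: one table per design, so both can live side by side.
conn.execute("CREATE TABLE T_pk_loose (id INTEGER PRIMARY KEY, name VARCHAR(50))")
conn.execute("CREATE TABLE T_pk_strict (id INTEGER PRIMARY KEY, name VARCHAR(50) UNIQUE)")


def try_insert(table, name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (name,))
        return True
    except sqlite3.IntegrityError:
        return False


# Insert the same name twice into each design.
loose = [try_insert("T_pk_loose", n) for n in ("John", "John")]
strict = [try_insert("T_pk_strict", n) for n in ("John", "John")]

print(loose)   # the synthetic key alone accepts both rows
print(strict)  # the UNIQUE column rejects the second row
```

The synthetic key gives every row a distinct id, so only the UNIQUE constraint on name can stop the second John.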


Everybody knows which way is better, but sometimes we have to deal with an inherited structure and data that violate the subject area restrictions. For example:

id   name
1    John
2    Smith
3    John
4    Smith
5    Smith
6    Tom

You may ask: what is the difference between this problem and the previous one? The solution here even seems easier. From each group of rows with the same value of name we just need to delete all rows except the one with the minimum (or maximum) value of id. For example:

DELETE FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);
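To see what this statement keeps, it can be run against the sample data. The sketch below assumes SQLite through Python's sqlite3 module instead of SQL Server, with INTEGER PRIMARY KEY standing in for INT IDENTITY; the DELETE itself is the same text as above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_pk (id INTEGER PRIMARY KEY, name VARCHAR(50))")
conn.executemany(
    "INSERT INTO T_pk (id, name) VALUES (?, ?)",
    [(1, "John"), (2, "Smith"), (3, "John"), (4, "Smith"), (5, "Smith"), (6, "Tom")],
)

# Keep the row with the smallest id in every group of equal names.
conn.execute(
    """
    DELETE FROM T_pk
    WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name)
    """
)

remaining = conn.execute("SELECT id, name FROM T_pk ORDER BY id").fetchall()
print(remaining)  # [(1, 'John'), (2, 'Smith'), (6, 'Tom')]
```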

That is right, but I still have not told you everything. Suppose we have a child table T_details that is related to the table T_pk by a foreign key:

CREATE TABLE T_details (
    id_pk INT FOREIGN KEY REFERENCES T_pk ON DELETE CASCADE,
    color VARCHAR(10),
    PRIMARY KEY (id_pk, color)
);

This table can contain data like this:

id_pk   color
1       blue
1       red
2       green
2       red
3       red
4       blue
6       red

For better visibility, let us use the query

SELECT id, name, color FROM T_pk JOIN T_details ON id = id_pk;

to see the names:

id   name    color
1    John    blue
1    John    red
2    Smith   green
2    Smith   red
3    John    red
4    Smith   blue
6    Tom     red

This shows that the data of one person erroneously belong to different parent entries. Furthermore, this result contains duplicate rows as well:

1    John    red
3    John    red

Such data will lead to erroneous analysis. Furthermore, cascade deletion will lead to data loss. For example, if we leave only the rows with the minimum identifier in each group of the T_pk table, we will lose the row

4    Smith   blue

in the table T_details. Consequently, when deleting duplicate entries we should take both tables into account, T_pk and T_details. The data cleaning can be done in two stages:

1. Update the table T_details so that all data related to one name are attached to the row with the minimum id.
2. Delete duplicate entries from the table T_pk, keeping only the row with the minimum id in each group of rows with the same value in the name column.
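The data loss that motivates the two stages can be reproduced before going further. The sketch below is an illustration under assumptions: SQLite through Python's sqlite3 module replaces SQL Server, foreign key enforcement has to be switched on with a PRAGMA, and the helper colors_by_name is invented for the example. It runs the plain DELETE on T_pk and compares the colors known for each name before and after.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.execute("CREATE TABLE T_pk (id INTEGER PRIMARY KEY, name VARCHAR(50))")
conn.execute(
    """
    CREATE TABLE T_details (
        id_pk INTEGER REFERENCES T_pk (id) ON DELETE CASCADE,
        color VARCHAR(10),
        PRIMARY KEY (id_pk, color)
    )
    """
)
conn.executemany(
    "INSERT INTO T_pk (id, name) VALUES (?, ?)",
    [(1, "John"), (2, "Smith"), (3, "John"), (4, "Smith"), (5, "Smith"), (6, "Tom")],
)
conn.executemany(
    "INSERT INTO T_details (id_pk, color) VALUES (?, ?)",
    [(1, "blue"), (1, "red"), (2, "green"), (2, "red"),
     (3, "red"), (4, "blue"), (6, "red")],
)


def colors_by_name():
    """Hypothetical helper: map every name to the set of its colors."""
    result = {}
    rows = conn.execute("SELECT name, color FROM T_pk JOIN T_details ON id = id_pk")
    for name, color in rows:
        result.setdefault(name, set()).add(color)
    return result


before = colors_by_name()

# Delete duplicates from T_pk only; the cascade removes their detail rows.
conn.execute(
    """
    DELETE FROM T_pk
    WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name)
    """
)

after = colors_by_name()
print(sorted(before["Smith"]))  # ['blue', 'green', 'red']
print(sorted(after["Smith"]))   # ['green', 'red']
```

Smith's blue entry was attached to id 4, so it disappears together with that parent row.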
Updating T_details table

The query

SELECT id_pk, name, color,
       -- the ORDER BY list is cut off after "col" in the source;
       -- "color, id_pk" is assumed here, it reproduces the dup values shown below
       RANK() OVER(PARTITION BY name, color ORDER BY name, color, id_pk) dup,
       (SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) min_id
FROM T_pk X JOIN T_details ON id = id_pk;

numbers the rows inside each group of duplicates (a value of dup greater than 1 marks a duplicate row) and finds the minimum value of id in each group of equal names (min_id). Here is the result of that query:

id_pk   name    color   dup   min_id
1       John    blue    1     1
1       John    red     1     1
3       John    red     2     1
4       Smith   blue    1     2
2       Smith   green   1     2
2       Smith   red     1     2
6       Tom     red     1     6
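The result above can be reproduced with a short script. This is a sketch under assumptions: SQLite (version 3.25 or later, for window functions) through Python's sqlite3 module replaces SQL Server, and the window is ordered by id_pk alone, which is enough inside a partition by name and color.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T_pk (id INTEGER PRIMARY KEY, name VARCHAR(50))")
conn.execute(
    "CREATE TABLE T_details (id_pk INTEGER, color VARCHAR(10), PRIMARY KEY (id_pk, color))"
)
conn.executemany(
    "INSERT INTO T_pk (id, name) VALUES (?, ?)",
    [(1, "John"), (2, "Smith"), (3, "John"), (4, "Smith"), (5, "Smith"), (6, "Tom")],
)
conn.executemany(
    "INSERT INTO T_details (id_pk, color) VALUES (?, ?)",
    [(1, "blue"), (1, "red"), (2, "green"), (2, "red"),
     (3, "red"), (4, "blue"), (6, "red")],
)

# dup numbers the rows that share a name and a color;
# min_id is the smallest id among the parents with that name.
rows = conn.execute(
    """
    SELECT id_pk, name, color,
           RANK() OVER (PARTITION BY name, color ORDER BY id_pk) AS dup,
           (SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) AS min_id
    FROM T_pk X JOIN T_details ON X.id = id_pk
    ORDER BY name, color, id_pk
    """
).fetchall()

for row in rows:
    print(row)
```

Only the row (3, 'John', 'red') gets dup = 2, because it repeats the John/red pair already attached to id 1.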

Now we need to replace the value of id_pk with min_id in each row except the third one, because it is a duplicate of the second row. The value dup=2 indicates that. The update query can be written as:

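UPDATE T_details
SET id_pk = min_id
FROM T_details T_d JOIN (
    -- three lines of this statement are cut off in the source; the subquery is
    -- assumed to be the same as in the SELECT used above to compute dup and min_id
    SELECT id_pk, name, color,
           RANK() OVER(PARTITION BY name, color ORDER BY name, color, id_pk) dup,
           (SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) min_id
    FROM T_pk X JOIN T_details ON id = id_pk
) Y ON Y.id_pk = T_d.id_pk
   -- the source breaks off after "T_d.i"; the color match is an assumption
   -- that ties every detail row to its own dup value
   AND Y.color = T_d.color
WHERE dup = 1;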

The updated table T_details will look like this:

id_pk   color
1       blue
1       red
2       blue
2       green
2       red
3       red
6       red

This shows that only one duplicate row is left:

3       red

But there is no need to worry about that row, because the cascade will remove it when the duplicate rows are deleted from the table T_pk:

DELETE FROM T_pk
WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name);

The last query is the second stage of the deletion procedure. The result looks like this:

Table T_pk

id   name
1    John
2    Smith
6    Tom

Table T_details

id_pk   color
1       blue
1       red
2       blue
2       green
2       red
6       red

It only remains to add a constraint to avoid duplicates in the future:

ALTER TABLE T_pk
ADD CONSTRAINT unique_name UNIQUE(name);
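The whole procedure can be rehearsed end to end on the sample data. The sketch below is not the article's exact statements but an approximation under assumptions: it runs on SQLite (3.25 or later) through Python's sqlite3 module, the first stage reads dup and min_id with a query and applies the updates from Python instead of using the T-SQL UPDATE ... FROM form, and a unique index stands in for ALTER TABLE ... ADD CONSTRAINT, which SQLite does not offer for UNIQUE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE in SQLite
conn.execute("CREATE TABLE T_pk (id INTEGER PRIMARY KEY, name VARCHAR(50))")
conn.execute(
    """
    CREATE TABLE T_details (
        id_pk INTEGER REFERENCES T_pk (id) ON DELETE CASCADE,
        color VARCHAR(10),
        PRIMARY KEY (id_pk, color)
    )
    """
)
conn.executemany(
    "INSERT INTO T_pk (id, name) VALUES (?, ?)",
    [(1, "John"), (2, "Smith"), (3, "John"), (4, "Smith"), (5, "Smith"), (6, "Tom")],
)
conn.executemany(
    "INSERT INTO T_details (id_pk, color) VALUES (?, ?)",
    [(1, "blue"), (1, "red"), (2, "green"), (2, "red"),
     (3, "red"), (4, "blue"), (6, "red")],
)

# Stage 1: move every non-duplicate detail row (dup = 1) to the parent
# with the minimum id among the parents that share its name.
plan = conn.execute(
    """
    SELECT id_pk, color,
           RANK() OVER (PARTITION BY name, color ORDER BY id_pk) AS dup,
           (SELECT MIN(id) FROM T_pk WHERE T_pk.name = X.name) AS min_id
    FROM T_pk X JOIN T_details ON X.id = id_pk
    """
).fetchall()
conn.executemany(
    "UPDATE T_details SET id_pk = ? WHERE id_pk = ? AND color = ?",
    [(min_id, id_pk, color) for id_pk, color, dup, min_id in plan if dup == 1],
)

# Stage 2: delete duplicate parents; the cascade removes leftover detail rows.
conn.execute(
    """
    DELETE FROM T_pk
    WHERE id > (SELECT MIN(id) FROM T_pk X WHERE X.name = T_pk.name)
    """
)

# Guard against new duplicates (a unique index in place of the constraint).
conn.execute("CREATE UNIQUE INDEX unique_name ON T_pk (name)")

t_pk = conn.execute("SELECT id, name FROM T_pk ORDER BY id").fetchall()
t_details = conn.execute(
    "SELECT id_pk, color FROM T_details ORDER BY id_pk, color"
).fetchall()

try:
    conn.execute("INSERT INTO T_pk (name) VALUES ('John')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(t_pk)
print(t_details)
print(duplicate_rejected)
```

The final contents match the two result tables shown earlier, and a second John is refused once the unique index exists.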