Duplicates: detection and eradication ... PDF Print E-mail
User Rating: / 0
PoorBest 
Wednesday, 28 May 2008

1. The different types of duplicates
We must distinguish duplication absolute duplicates on and near duplicates.

Duplicates absolute tuples are strictly identical. All columns lines doublonnees must contain the same data. Given the possibility of the presence of a key in most tables, these types of duplicates that are not in the key tables.

Sample duplicates absolute:
NUM FIRST NAME
---------------- ----------- --------
33 DUPONT Marc
33 DUPONT Marc
33 DUPONT Marc
Duplicates on tuples are identical except for the key component values. With the exception of forming the key columns, other columns lines doublonnees must contain the same data. This is the kind of duplicates the most common since the majority of tables are equipped with keys.

Sample duplicates on:
NUM FIRST NAME
---------------- ----------- --------
47 DUPOND Alain
59 DUPOND Alain
The quasis duplicates are made by tuples whose semantic content and logic is the same, with slight differences as a differentiation capitalization or typographical errors due to data entry errors.

Example of near duplicates:
NUM FIRST NAME
---------------- ----------- --------
65 Dupond Frederic
78 DUPOND Frederic 
81 DUPOND fred
99 DUPON eric
We shall now see how to detect and eradicate all of these types of duplicates. But first, here is our game tests:

CREATE TABLE T_DOUBLON (NUM INT, NAME VARCHAR (32), FIRST VARCHAR (16))
INSERT INTO T_DOUBLON VALUES (33, 'DUPONT', 'marc') INSERT INTO T_DOUBLON VALUES (33, 'DUPONT', 'marc') INSERT INTO T_DOUBLON VALUES (33, 'DUPONT', 'marc') INSERT INTO T_DOUBLON VALUES ( 47, 'DUPOND', 'alain') INSERT INTO T_DOUBLON VALUES (59, 'DUPOND', 'alain') INSERT INTO T_DOUBLON VALUES (65, 'Smith', 'Frederick') INSERT INTO T_DOUBLON VALUES (78, 'DUPOND' , 'Frederic ') INSERT INTO T_DOUBLON VALUES (81, 'DUPOND', 'fred ') INSERT INTO T_DOUBLON VALUES (99, 'DUPON', ' eric')

2. Duplicates absolute
The principle of absolute duplication detection is simple. They are counting lines with a combination of all columns and introduce a filter on the result of counting by detecting values above 1:

SELECT COUNT (*) AS NBR_DOUBLES, NUM, NAME, FIRST GROUP FROM T_DOUBLON BY NUM, NAME, FIRST HAVING COUNT (*)> 1
NBR_DOUBLES NUM FIRST NAME
----------- ----------- -------- ----------------
3 33 DUPONT Marc
If such duplication is a simplicity to detect child, it does not apply to delete ... Indeed, it should be kept for at least one line and delete the rest. But like any information is similar, all lines will be discontinued, unless, and this is our solution, introducing a discriminatory ...

2.1. Suppression by adding a discriminator column
The technique is to add a column, feed this column with all different values, values that can serve as a unique key. To feed this column, you can use a slider to be numbered or if your RDBMS becomes available, using a function of ranking or ID generation.

2.1.1. Using a cursor
Here is an example for SQL Server:
ALTER TABLE T_DOUBLON ADD COLID INT
DECLARE @ i INT DECLARE @ colid INT SET @ i = 1 DECLARE CURSOR CURSEUR_DOUBLON FOR SELECT COLID FROM T_DOUBLON FOR OPEN UPDATE CURSEUR_DOUBLON FETCH CURSEUR_DOUBLON INTO colid WHILE @ @ @ FETCH_STATUS = 0 BEGIN SET COLID UPDATE T_DOUBLON = @ i WHERE CURSEUR_DOUBLON SET OF CURRENT @ i = @ i + 1 FETCH CURSEUR_DOUBLON INTO @ colid END CLOSE CURSEUR_DOUBLON DEALLOCATE CURSEUR_DOUBLON

NUM NAME FIRST COLID
---------------- -------- ----------- -----------
33 DUPONT Marc 1
33 DUPONT Marc 2
33 DUPONT Marc 3
47 DUPOND Alain 4
59 DUPOND Alain 5
65 Frederick Smith 6
78 DUPOND Frederic  7
81 DUPOND fred  8
99 DUPON  eric 9
Therefore, to remove duplication "33 DUPONT marc" just a simple basic complaint:
DELETE FROM T_DOUBLON T WHERE T. COLID <ANY (COLID SELECT FROM T_DOUBLON T2 WHERE T. COLID <> T2.COLID AND T. NUM T2.NUM = = NAME AND T. T. T2.NOM AND FIRST = T2.PRENOM)

NUM NAME FIRST COLID
---------------- -------- ----------- -----------
33 DUPONT Marc 3
47 DUPOND Alain 4
59 DUPOND Alain 5
65 Frederick Smith 6
78 DUPOND Frederic  7
81 DUPOND fred  8
99 DUPON  eric 9
Of course we can now remove the column COLID introduced to our needs suppression ...

2.1.2. Using an office ranking
A function of ranking, gives a number to rank lines of a table. Some RDBMS allow the use of such a function as rowid. The principle is the following:
add new column
fuel this column with the function of ranking
delete data by application
delete the column ranking
Here is an example that uses a function that generates a unique identifier, or GUID, (SQL Server):
ALTER TABLE T_DOUBLON ADD COLUID uniqueidentifier

NUM NAME FIRST COLUID
----------- -------- ---------------- --------------- ---
33 DUPONT Marc NULL
33 DUPONT Marc NULL
33 DUPONT Marc NULL
47 DUPOND Alain NULL
59 DUPOND Alain NULL
65 Dupond Frederic  NULL
78 DUPOND Frederic  NULL
81 DUPOND fred  NULL
99 DUPON Eric NULL

UPDATE T_DOUBLON SET COLUID = newid ()
NUM NAME FIRST COLUID
----------- -------- ---------------- --------------- ---------------------
33 DUPONT marc 2FEB898A-709C-4C8B-9A4D-DDE82D73CA06
33 DUPONT marc 659DB730-B4F3-4078-A21F-9A2250CE4284
33 DUPONT marc DE8C63A6-3940-4246-AB6C-5BBA122A4507
47 DUPOND alain 82A33D79-2806-4FA7-AD95-F6B209D18424
59 DUPOND alain EE8F99EE-4E25-47A2-8988-1D7702A61A35
65 Dupond Frederic  28B3CAD0-8977-46EB-81A2-409B098C4087
78 DUPOND Frederic  12A11D89-0448-46EF-B8F5-988BCB380598
81 DUPOND fred  D75E6872-0026-4051-83D2-648C63BA8ED6
99 DUPON  eric D45C8563-DF53-4800-A310-DE2869380D6B

DELETE FROM T_DOUBLON T WHERE T. COLUID> ANY (SELECT COLUID FROM T_DOUBLON T2 WHERE T. COLUID <> T2.COLUID AND T. NUM T2.NUM = = NAME AND T. T. T2.NOM AND FIRST = T2.PRENOM)

NUM NAME FIRST COLUID
----------- -------- ---------------- --------------- ---------------------
33 DUPONT marc DE8C63A6-3940-4246-AB6C-5BBA122A4507
47 DUPOND alain 82A33D79-2806-4FA7-AD95-F6B209D18424
59 DUPOND alain EE8F99EE-4E25-47A2-8988-1D7702A61A35
65 Dupond Frederic  28B3CAD0-8977-46EB-81A2-409B098C4087
78 DUPOND Frederic  12A11D89-0448-46EF-B8F5-988BCB380598
81 DUPOND fred  D75E6872-0026-4051-83D2-648C63BA8ED6
99 DUPON  eric D45C8563-DF53-4800-A310-DE2869380D6B

ALTER TABLE T_DOUBLON DROP COLUMN COLUID

NUM FIRST NAME
---------------- ----------- --------
33 DUPONT Marc
47 DUPOND Alain
59 DUPOND Alain
65 Dupond Frederic 
78 DUPOND Frederic 
81 DUPOND fred 
99 DUPON  eric
This elegant solution is preferable in the case or your RDBMS is equipped with such a feature.

2.1.3. Remove with a temporary table
Another formula to eliminate duplication, is to use a temporary table the same structure as one with Duplicate supplies delivered by the single keyword DISTINCT. It then eliminates all duplicate the original table, then re-inject the contents of the temporary table in the original table.
Here is an example SQL Server:

SELECT DISTINCT NUM, NAME, FIRST INTO T_DOUBLON_TMP BY GROUP FROM T_DOUBLON NUM, NAME, FIRST HAVING COUNT (*)> 1
DELETE FROM T_DOUBLON T1 WHERE EXISTS (SELECT * FROM T_DOUBLON T2 WHERE T1.NUM = T2.NUM AND T1.NOM = T2.NOM AND T1.PRENOM = T2.PRENOM GROUP BY NUM, NAME, FIRST HAVING COUNT (*)> 1 )

NUM FIRST NAME
---------------- ----------- --------
47 DUPOND Alain
59 DUPOND Alain
65 Dupond Frederic 
78 DUPOND Frederic 
81 DUPOND fred 
99 DUPON  eric

INSERT INTO SELECT T_DOUBLON NUM, NAME, FIRST FROM T_DOUBLON_TMP
NUM FIRST NAME
---------------- ----------- --------
33 DUPONT Marc
47 DUPOND Alain
59 DUPOND Alain
65 Dupond Frederic 
78 DUPOND Frederic 
81 DUPOND fred 
99 DUPON  eric
DROP TABLE T_DOUBLON_TMP

2.1.4. Remove without changing the schema
Sometimes users do not have the necessary rights on the basis for modifying the scheme and in particular add or delete a column to a table or even create and destroy a table, even temporarily.

A useful solution in this case is to add information in a column already set the table.
The choice of the column in which we will add the information should be dictated by the following:
column sought little research and not linked to another table (FK)
column containing enough space to add at least one figure

In our example, it is clearly NAME column, which entails the maximum free space. But it is very likely that this is also the most requested research. However, the column FIRST is probably the least sought. The size of the column is 16 characters and is filled with up to 8. It leaves us 8 characters, which can in theory go from 0 to 99 999 999 until the numbering of 100 million lines ... Widely clearly enough in our case.

But to facilitate the "backward" and therefore easier to clean this column, we'll use a marking ... Let us see how, using an example for SQL Server:
DECLARE @ first_name VARCHAR (16) DECLARE @ i INT SET @ i = 0 DECLARE CURSOR CURSEUR_DOUBLON SELECT FOR FIRST FROM T_DOUBLON FOR OPEN UPDATE CURSEUR_DOUBLON FETCH CURSEUR_DOUBLON INTO first_name WHILE @ @ @ FETCH_STATUS = 0 BEGIN SET FIRST UPDATE T_DOUBLON = @ first_name + '# '+ CAST (@ i AS VARCHAR (7)) WHERE OF CURRENT CURSEUR_DOUBLON SET @ i = @ i + 1 FETCH CURSEUR_DOUBLON INTO @ first_name END CLOSE CURSEUR_DOUBLON DEALLOCATE CURSEUR_DOUBLON

NUM FIRST NAME
----------- ----------- ----------------
33 DUPONT Marc # 0
33 DUPONT Marc # 1
33 DUPONT Marc # 2
47 DUPOND Alain # 3
59 DUPOND Alain # 4
65 Dupond Frederic  # 5
78 DUPOND Frederic  # 6
81 DUPOND fred  # 7
99 DUPON Eric # 8

DELETE FROM T_DOUBLON T1 WHERE T1.PRENOM> ANY (SELECT T2.PRENOM FROM T_DOUBLON T2 WHERE T1.NUM = T2.NUM AND T1.NOM = T2.NOM AND T1.PRENOM <> T2.PRENOM)

NUM FIRST NAME
----------- ----------- ----------------
33 DUPONT Marc # 0
47 DUPOND Alain # 3
59 DUPOND Alain # 4
65 Dupond Frederic  # 5
78 DUPOND Frederic  # 6
81 DUPOND fred  # 7
99 DUPON Eric # 8

UPDATE T_DOUBLON FIRST SET = SUBSTRING (FIRST, 1, CHARINDEX ('#', FIRST) -1)

NUM FIRST NAME
----------- ----------- ----------------
33 DUPONT Marc
47 DUPOND Alain
59 DUPOND Alain
65 Dupond Frederic 
78 DUPOND Frederic 
81 DUPOND fred 
99 DUPON  eric
We see immediate interest to have added the tag "#" to clean up the faster column polluted by our discriminating ...
TIP: before choosing your discriminant, make sure it is not present in the column by a petition seeking the columns containing such a character. For that use the SQL LIKE or better, a search function of character.

3. Duplicates on
The treatment and detection of duplicates on is much easier since the discriminant, usually the key is already present in the table.
Here is an example of such duplication detection:
* SELECT DISTINCT FROM T_DOUBLON T1 WHERE EXISTS (SELECT * FROM T_DOUBLON T2 WHERE T1.NUM <> T2.NUM AND T1.NOM = T2.NOM AND T1.PRENOM = T2.PRENOM)

NUM FIRST NAME
-------------------------------- ----------- ------- ---------
47 DUPOND Alain
59 DUPOND Alain
And here's how to eradicate them:



DELETE FROM T_DOUBLON T WHERE T. NUM <ANY (NUM FROM SELECT T_DOUBLON T2 WHERE T. NUM <> T2.NUM NAME AND T. T2.NOM = = FIRST AND T. T2.PRENOM)

NUM FIRST NAME
-------------------------------- ----------- ------- ---------
33 DUPONT Marc
33 DUPONT Marc
33 DUPONT Marc
59 DUPOND Alain
65 Dupond Frederic 
78 DUPOND Frederic 
81 DUPOND fred 
99 DUPON  eric

4. Quasi duplicates
The case of quasi duplication is much more delicate. Indeed we must agree on the value of information. But a classic technique to flush out the most is to combine a partial search and levelling of the breakage.

Example:
* SELECT DISTINCT FROM T_DOUBLON T1 WHERE EXISTS (SELECT * FROM T_DOUBLON T2 WHERE T1.NUM <> T2.NUM AND (LOWER (T1.NOM) LIKE '%' + LOWER (T2.NOM) + '%' OR LOWER (T2 . NOM) LIKE '%' + LOWER (T1.NOM )+'%') AND (LOWER (T1.PRENOM) LIKE '%' + LOWER (T2.PRENOM) + '%' OR LOWER (T2.PRENOM) LIKE '%' + LOWER (T1.PRENOM) +'%'))

NUM FIRST NAME
-------------------------------- ----------- ------- ---------
65 Dupond Frederic 
78 DUPOND Frederic 
81 DUPOND fred 
99 DUPON  eric
Of course their eradication reflects the same problem-called and the same answers as those seen previously.

 
< Prev   Next >
School Joomla Templates and Joomla Tutorials