|
Duplicates: detection and eradication ...
|
|
|
|
|
Wednesday, 28 May 2008
|
1. The different types of duplicates We must distinguish duplication absolute duplicates on and near duplicates. Duplicates absolute tuples are strictly identical. All columns lines doublonnees must contain the same data. Given the possibility of the presence of a key in most tables, these types of duplicates that are not in the key tables. Sample duplicates absolute: NUM FIRST NAME ---------------- ----------- -------- 33 DUPONT Marc 33 DUPONT Marc 33 DUPONT Marc Duplicates on tuples are identical except for the key component values. With the exception of forming the key columns, other columns lines doublonnees must contain the same data. This is the kind of duplicates the most common since the majority of tables are equipped with keys. Sample duplicates on: NUM FIRST NAME ---------------- ----------- -------- 47 DUPOND Alain 59 DUPOND Alain The quasis duplicates are made by tuples whose semantic content and logic is the same, with slight differences as a differentiation capitalization or typographical errors due to data entry errors. Example of near duplicates: NUM FIRST NAME ---------------- ----------- -------- 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric We shall now see how to detect and eradicate all of these types of duplicates. But first, here is our game tests: CREATE TABLE T_DOUBLON (NUM INT, NAME VARCHAR (32), FIRST VARCHAR (16)) INSERT INTO T_DOUBLON VALUES (33, 'DUPONT', 'marc') INSERT INTO T_DOUBLON VALUES (33, 'DUPONT', 'marc') INSERT INTO T_DOUBLON VALUES (33, 'DUPONT', 'marc') INSERT INTO T_DOUBLON VALUES ( 47, 'DUPOND', 'alain') INSERT INTO T_DOUBLON VALUES (59, 'DUPOND', 'alain') INSERT INTO T_DOUBLON VALUES (65, 'Smith', 'Frederick') INSERT INTO T_DOUBLON VALUES (78, 'DUPOND' , 'Frederic ') INSERT INTO T_DOUBLON VALUES (81, 'DUPOND', 'fred ') INSERT INTO T_DOUBLON VALUES (99, 'DUPON', ' eric') 2. Duplicates absolute The principle of absolute duplication detection is simple. They are counting lines with a combination of all columns and introduce a filter on the result of counting by detecting values above 1: SELECT COUNT (*) AS NBR_DOUBLES, NUM, NAME, FIRST GROUP FROM T_DOUBLON BY NUM, NAME, FIRST HAVING COUNT (*)> 1 NBR_DOUBLES NUM FIRST NAME ----------- ----------- -------- ---------------- 3 33 DUPONT Marc If such duplication is a simplicity to detect child, it does not apply to delete ... Indeed, it should be kept for at least one line and delete the rest. But like any information is similar, all lines will be discontinued, unless, and this is our solution, introducing a discriminatory ... 2.1. Suppression by adding a discriminator column The technique is to add a column, feed this column with all different values, values that can serve as a unique key. To feed this column, you can use a slider to be numbered or if your RDBMS becomes available, using a function of ranking or ID generation. 2.1.1. Using a cursor Here is an example for SQL Server: ALTER TABLE T_DOUBLON ADD COLID INT DECLARE @ i INT DECLARE @ colid INT SET @ i = 1 DECLARE CURSOR CURSEUR_DOUBLON FOR SELECT COLID FROM T_DOUBLON FOR OPEN UPDATE CURSEUR_DOUBLON FETCH CURSEUR_DOUBLON INTO colid WHILE @ @ @ FETCH_STATUS = 0 BEGIN SET COLID UPDATE T_DOUBLON = @ i WHERE CURSEUR_DOUBLON SET OF CURRENT @ i = @ i + 1 FETCH CURSEUR_DOUBLON INTO @ colid END CLOSE CURSEUR_DOUBLON DEALLOCATE CURSEUR_DOUBLON NUM NAME FIRST COLID ---------------- -------- ----------- ----------- 33 DUPONT Marc 1 33 DUPONT Marc 2 33 DUPONT Marc 3 47 DUPOND Alain 4 59 DUPOND Alain 5 65 Frederick Smith 6 78 DUPOND Frederic 7 81 DUPOND fred 8 99 DUPON eric 9 Therefore, to remove duplication "33 DUPONT marc" just a simple basic complaint: DELETE FROM T_DOUBLON T WHERE T. COLID <ANY (COLID SELECT FROM T_DOUBLON T2 WHERE T. COLID <> T2.COLID AND T. NUM T2.NUM = = NAME AND T. T. T2.NOM AND FIRST = T2.PRENOM) NUM NAME FIRST COLID ---------------- -------- ----------- ----------- 33 DUPONT Marc 3 47 DUPOND Alain 4 59 DUPOND Alain 5 65 Frederick Smith 6 78 DUPOND Frederic 7 81 DUPOND fred 8 99 DUPON eric 9 Of course we can now remove the column COLID introduced to our needs suppression ... 2.1.2. Using an office ranking A function of ranking, gives a number to rank lines of a table. Some RDBMS allow the use of such a function as rowid. The principle is the following: add new column fuel this column with the function of ranking delete data by application delete the column ranking Here is an example that uses a function that generates a unique identifier, or GUID, (SQL Server): ALTER TABLE T_DOUBLON ADD COLUID uniqueidentifier NUM NAME FIRST COLUID ----------- -------- ---------------- --------------- --- 33 DUPONT Marc NULL 33 DUPONT Marc NULL 33 DUPONT Marc NULL 47 DUPOND Alain NULL 59 DUPOND Alain NULL 65 Dupond Frederic NULL 78 DUPOND Frederic NULL 81 DUPOND fred NULL 99 DUPON Eric NULL UPDATE T_DOUBLON SET COLUID = newid () NUM NAME FIRST COLUID ----------- -------- ---------------- --------------- --------------------- 33 DUPONT marc 2FEB898A-709C-4C8B-9A4D-DDE82D73CA06 33 DUPONT marc 659DB730-B4F3-4078-A21F-9A2250CE4284 33 DUPONT marc DE8C63A6-3940-4246-AB6C-5BBA122A4507 47 DUPOND alain 82A33D79-2806-4FA7-AD95-F6B209D18424 59 DUPOND alain EE8F99EE-4E25-47A2-8988-1D7702A61A35 65 Dupond Frederic 28B3CAD0-8977-46EB-81A2-409B098C4087 78 DUPOND Frederic 12A11D89-0448-46EF-B8F5-988BCB380598 81 DUPOND fred D75E6872-0026-4051-83D2-648C63BA8ED6 99 DUPON eric D45C8563-DF53-4800-A310-DE2869380D6B DELETE FROM T_DOUBLON T WHERE T. COLUID> ANY (SELECT COLUID FROM T_DOUBLON T2 WHERE T. COLUID <> T2.COLUID AND T. NUM T2.NUM = = NAME AND T. T. T2.NOM AND FIRST = T2.PRENOM) NUM NAME FIRST COLUID ----------- -------- ---------------- --------------- --------------------- 33 DUPONT marc DE8C63A6-3940-4246-AB6C-5BBA122A4507 47 DUPOND alain 82A33D79-2806-4FA7-AD95-F6B209D18424 59 DUPOND alain EE8F99EE-4E25-47A2-8988-1D7702A61A35 65 Dupond Frederic 28B3CAD0-8977-46EB-81A2-409B098C4087 78 DUPOND Frederic 12A11D89-0448-46EF-B8F5-988BCB380598 81 DUPOND fred D75E6872-0026-4051-83D2-648C63BA8ED6 99 DUPON eric D45C8563-DF53-4800-A310-DE2869380D6B
ALTER TABLE T_DOUBLON DROP COLUMN COLUID NUM FIRST NAME ---------------- ----------- -------- 33 DUPONT Marc 47 DUPOND Alain 59 DUPOND Alain 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric This elegant solution is preferable in the case or your RDBMS is equipped with such a feature. 2.1.3. Remove with a temporary table Another formula to eliminate duplication, is to use a temporary table the same structure as one with Duplicate supplies delivered by the single keyword DISTINCT. It then eliminates all duplicate the original table, then re-inject the contents of the temporary table in the original table. Here is an example SQL Server: SELECT DISTINCT NUM, NAME, FIRST INTO T_DOUBLON_TMP BY GROUP FROM T_DOUBLON NUM, NAME, FIRST HAVING COUNT (*)> 1 DELETE FROM T_DOUBLON T1 WHERE EXISTS (SELECT * FROM T_DOUBLON T2 WHERE T1.NUM = T2.NUM AND T1.NOM = T2.NOM AND T1.PRENOM = T2.PRENOM GROUP BY NUM, NAME, FIRST HAVING COUNT (*)> 1 ) NUM FIRST NAME ---------------- ----------- -------- 47 DUPOND Alain 59 DUPOND Alain 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric INSERT INTO SELECT T_DOUBLON NUM, NAME, FIRST FROM T_DOUBLON_TMP NUM FIRST NAME ---------------- ----------- -------- 33 DUPONT Marc 47 DUPOND Alain 59 DUPOND Alain 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric DROP TABLE T_DOUBLON_TMP 2.1.4. Remove without changing the schema Sometimes users do not have the necessary rights on the basis for modifying the scheme and in particular add or delete a column to a table or even create and destroy a table, even temporarily. A useful solution in this case is to add information in a column already set the table. The choice of the column in which we will add the information should be dictated by the following: column sought little research and not linked to another table (FK) column containing enough space to add at least one figure In our example, it is clearly NAME column, which entails the maximum free space. But it is very likely that this is also the most requested research. However, the column FIRST is probably the least sought. The size of the column is 16 characters and is filled with up to 8. It leaves us 8 characters, which can in theory go from 0 to 99 999 999 until the numbering of 100 million lines ... Widely clearly enough in our case. But to facilitate the "backward" and therefore easier to clean this column, we'll use a marking ... Let us see how, using an example for SQL Server: DECLARE @ first_name VARCHAR (16) DECLARE @ i INT SET @ i = 0 DECLARE CURSOR CURSEUR_DOUBLON SELECT FOR FIRST FROM T_DOUBLON FOR OPEN UPDATE CURSEUR_DOUBLON FETCH CURSEUR_DOUBLON INTO first_name WHILE @ @ @ FETCH_STATUS = 0 BEGIN SET FIRST UPDATE T_DOUBLON = @ first_name + '# '+ CAST (@ i AS VARCHAR (7)) WHERE OF CURRENT CURSEUR_DOUBLON SET @ i = @ i + 1 FETCH CURSEUR_DOUBLON INTO @ first_name END CLOSE CURSEUR_DOUBLON DEALLOCATE CURSEUR_DOUBLON NUM FIRST NAME ----------- ----------- ---------------- 33 DUPONT Marc # 0 33 DUPONT Marc # 1 33 DUPONT Marc # 2 47 DUPOND Alain # 3 59 DUPOND Alain # 4 65 Dupond Frederic # 5 78 DUPOND Frederic # 6 81 DUPOND fred # 7 99 DUPON Eric # 8 DELETE FROM T_DOUBLON T1 WHERE T1.PRENOM> ANY (SELECT T2.PRENOM FROM T_DOUBLON T2 WHERE T1.NUM = T2.NUM AND T1.NOM = T2.NOM AND T1.PRENOM <> T2.PRENOM) NUM FIRST NAME ----------- ----------- ---------------- 33 DUPONT Marc # 0 47 DUPOND Alain # 3 59 DUPOND Alain # 4 65 Dupond Frederic # 5 78 DUPOND Frederic # 6 81 DUPOND fred # 7 99 DUPON Eric # 8 UPDATE T_DOUBLON FIRST SET = SUBSTRING (FIRST, 1, CHARINDEX ('#', FIRST) -1) NUM FIRST NAME ----------- ----------- ---------------- 33 DUPONT Marc 47 DUPOND Alain 59 DUPOND Alain 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric We see immediate interest to have added the tag "#" to clean up the faster column polluted by our discriminating ... TIP: before choosing your discriminant, make sure it is not present in the column by a petition seeking the columns containing such a character. For that use the SQL LIKE or better, a search function of character. 3. Duplicates on The treatment and detection of duplicates on is much easier since the discriminant, usually the key is already present in the table. Here is an example of such duplication detection: * SELECT DISTINCT FROM T_DOUBLON T1 WHERE EXISTS (SELECT * FROM T_DOUBLON T2 WHERE T1.NUM <> T2.NUM AND T1.NOM = T2.NOM AND T1.PRENOM = T2.PRENOM) NUM FIRST NAME -------------------------------- ----------- ------- --------- 47 DUPOND Alain 59 DUPOND Alain And here's how to eradicate them: DELETE FROM T_DOUBLON T WHERE T. NUM <ANY (NUM FROM SELECT T_DOUBLON T2 WHERE T. NUM <> T2.NUM NAME AND T. T2.NOM = = FIRST AND T. T2.PRENOM) NUM FIRST NAME -------------------------------- ----------- ------- --------- 33 DUPONT Marc 33 DUPONT Marc 33 DUPONT Marc 59 DUPOND Alain 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric 4. Quasi duplicates The case of quasi duplication is much more delicate. Indeed we must agree on the value of information. But a classic technique to flush out the most is to combine a partial search and levelling of the breakage. Example: * SELECT DISTINCT FROM T_DOUBLON T1 WHERE EXISTS (SELECT * FROM T_DOUBLON T2 WHERE T1.NUM <> T2.NUM AND (LOWER (T1.NOM) LIKE '%' + LOWER (T2.NOM) + '%' OR LOWER (T2 . NOM) LIKE '%' + LOWER (T1.NOM )+'%') AND (LOWER (T1.PRENOM) LIKE '%' + LOWER (T2.PRENOM) + '%' OR LOWER (T2.PRENOM) LIKE '%' + LOWER (T1.PRENOM) +'%')) NUM FIRST NAME -------------------------------- ----------- ------- --------- 65 Dupond Frederic 78 DUPOND Frederic 81 DUPOND fred 99 DUPON eric Of course their eradication reflects the same problem-called and the same answers as those seen previously.
|
|
|