|
I. Preamble The methods of bringing data that we see are based on functions that measure the alignment or the removal of two strings. These methods are intended to find good value in a table reference values accurate data from incorrect or imprecise. They are often useful in the construction of some solutions Informatics decision to classify the data during the migration (part of ETL, or ELT). II. Distance Levenshtein The distance Levenshtein owes its name to Vladimir Levenshtein which has set in 1965. It calls Levenshtein distance between two words source and target the minimum cost to go from source to target performing elementary operations such as: substitution of a character; Adding in character; Removing a character; By combining operations at each weighing one finds the cost per weight summation of the various operations. The Levenshtein distance generally operates on data type string, but you can use it on all channels and symbols in particular binary data. The algorithm presented a strong complexity (mxn operations) may be written in a simplified manner through a matrix that we will develop through a table. It is nonetheless an algorithm expensive ... That is why we will see a first implementation of this basic function, and an implementation limiting the depth of calculation to yield results more quickly. Note already there is not a single implementation of the Levenshtein distance, but a family of algorithms that consider whether or not this or that operation and valuent the different weights depending on the problem to be addressed . For example, in SQL can be seen that: If the type of data is CHAR and that the substitution is valuee to 1 then added as the abolition of one character may be value to 3 because of the shift required in the chain. However, if the type of data is VARCHAR and that the substitution is valuee to 1 then adding may be value to 9 because of the need to expand storage space to do so, but for the abolition this valuation may remain at 4 because it is not necessary to readjust the storage space, but just change the value of the parameter that specifies the actual length of the data stored. In short everything depends on the point of view in which we go. III. The implementation of Levenshtein "standard" In our examples we considered that all operations of the algorithm Levenshtein should be valued at 1. For example, to get the word HOLDERS DEPORTA the word, we need 6 unit operations: 1 suppression of D EPORTA 2 Suppression of E PORTA 3 modification of A to E DOOR 4 Adding U PORTEU 5 Adding R CARRIER 6 Adding S HOLDERS Here is a first implementation of the algorithm to Transact SQL: CREATE FUNCTION F_DISTANCE_LEVENSHTEIN (@ SOURCE VARCHAR (8000), @ TARGET VARCHAR (8000)) INT RETURNS AS BEGIN -- Management of side effects -- Null input IF @ SOURCE IS NULL OR @ TARGET IS NULL RETURN NULL -- Identical IF = @ @ SOURCE TARGET RETURN 0 DECLARE @ LN_SOURCE INT DECLARE @ LN_CIBLE INT SET @ LN_SOURCE = LEN (@ SOURCE) + 1 SET @ LN_CIBLE = LEN (@ TARGET) + 1 -- Empty string with full string IF LN_SOURCE = @ @ LN_CIBLE OR 1 = 1 BEGIN RETURN LN_SOURCE + @ @ LN_CIBLE - 2 END DECLARE @ MATRIX TABLE (ID Int, Value INT) DECLARE @ i INT DECLARE @ j INT DECLARE @ TMP Int DECLARE @ V1 Int DECLARE @ V2 Int DECLARE @ V3 Int DECLARE @ Vmin Int DECLARE @ COST INT / * Initialize the MATRIX * / SET @ i = 0 WHILE (@ i <LN_SOURCE @ * @ LN_CIBLE) BEGIN INSERT INTO @ MATRIX VALUES (@ i, 0) SET @ i = @ i + 1 END / * Initialize the first line * / SET @ i = 0 WHILE (@ i <@ LN_SOURCE) BEGIN SET @ tmp = 0 UPDATE @ DIE SET Value = @ i WHERE ID = @ TMP SET @ i = @ i + 1 END / * Initializing the first column * / SET @ i = 0 WHILE @ i <@ LN_CIBLE BEGIN SET tmp = @ @ @ i * LN_SOURCE UPDATE @ DIE SET Value = @ i WHERE ID = @ TMP SET @ i = @ i + 1 END / * Compare the two chains * / SET @ i = 1 WHILE @ i <@ LN_SOURCE / Column * * / BEGIN SET @ j = 1 WHILE @ j <@ LN_CIBLE / * * Line / BEGIN / * Check if the characters correspond * / IF SUBSTRING (@ SOURCE, @ i, 1) = SUBSTRING (@ TARGET, @ j, 1) / * Equal * / SET @ COST = 0 ELSE / * ALTERNATIVE, INSERTION * / SET @ = 1 COST / * Playing the left box * / SELECT @ v1 = Value +1 DIE FROM @ WHERE ID = j * @ @ @ LN_SOURCE + i - 1 / * Reading of the case d above * / SELECT @ = v2 Value +1 DIE FROM @ WHERE ID = (@ j-1) * @ @ i + LN_SOURCE / * Reading of the box diagonal left high * / SELECT @ @ = v3 COST + Value DIE FROM @ WHERE ID = (@ j-1) * @ LN_SOURCE + @ i-1 / * It takes the value of the smallest and is inserted in the current case * / IF @ V1 <= @ @ V2 AND V1 <= @ V3 SET @ @ = Vmin V1 ELSE IF @ V2 <= @ V3 SET @ @ = Vmin V2 ELSE SET Vmin = @ @ V3 SET tmp = @ @ @ j * + @ i LN_SOURCE UPDATE @ DIE SET Value = @ Vmin WHERE ID = @ TMP SET @ j = @ j +1 END SET @ i = @ i +1 END / * It returns the cost of processing * / SELECT @ tmp = Value DIE FROM @ WHERE ID = (@ LN_CIBLE-1) * @ LN_SOURCE + @ LN_SOURCE-1 RETURN @ TMP END GO To test the effects, here Thursday the tests that I propose: Either a table containing a column clothing color and color table containing the target reference colours. We will use the office to try to find the value color reference of each garment, closest to those seized by careless operators ... CREATE TABLE T_VETEMENT_VTM (VTM_DESCRIPTION VARCHAR (16), VTM_COULEUR VARCHAR (32)) GO INSERT INTO T_VETEMENT_VTM VALUES ( 'Pantalon1', 'Red') INSERT INTO T_VETEMENT_VTM VALUES ( 'Robe2', 'green'eau') INSERT INTO T_VETEMENT_VTM VALUES ( 'Robe3', 'Rape.) INSERT INTO T_VETEMENT_VTM VALUES ( 'Pull4', 'Yellow straw') INSERT INTO T_VETEMENT_VTM VALUES ( 'Robe5', 'Green') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pantalon6', 'Grey blue') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pantalon7', 'Green') INSERT INTO T_VETEMENT_VTM VALUES ( 'Robe8', 'Red') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pull9', 'white') INSERT INTO T_VETEMENT_VTM VALUES ( 'Robe10', 'White') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pantalon11', 'Grey mouse') INSERT INTO T_VETEMENT_VTM VALUES ( 'Robe12', 'pink') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pull13', 'blue') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pantalon14', 'Red brick') INSERT INTO T_VETEMENT_VTM VALUES ( 'Pull15', 'Brown') INSERT INTO T_VETEMENT_VTM VALUES ( 'Chemise16', 'Vrt') INSERT INTO T_VETEMENT_VTM VALUES ( 'Chemise17', 'Blenc') GO CREATE TABLE T_REF_COULEUR_RCL (RCL_COULEUR VARCHAR (16)) GO INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Black') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'White') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Green') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Yellow') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Blue') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Red') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Rose') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Violet') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Black') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Grey') INSERT INTO T_REF_COULEUR_RCL VALUES ( 'Brown') GO The following query will give us an approach to the solution: SELECT VTM_DESCRIPTION, VTM_COULEUR, RCL_COULEUR, dbo.F_DISTANCE_LEVENSHTEIN (VTM_COULEUR, RCL_COULEUR) AS DL FROM T_VETEMENT_VTM V CROSS JOIN T_REF_COULEUR_RCL C WHERE dbo.F_DISTANCE_LEVENSHTEIN (VTM_COULEUR, RCL_COULEUR) = (SELECT MIN (dbo.F_DISTANCE_LEVENSHTEIN (VTM_COULEUR, RCL_COULEUR)) FROM T_VETEMENT_VTM CROSS JOIN T_REF_COULEUR_RCL WHERE VTM_DESCRIPTION = V. VTM_DESCRIPTION) BY ORDER 1, 4 ASC VTM_DESCRIPTION VTM_COULEUR RCL_COULEUR DL ---------------- -------------------------------- -- -------------- ----------- Chemise16 Vrt Verde 1 Chemise17 Blenc Blanc 1 Pantalon1 Red Red 0 Pantalon11 mouse Gris Gris 1 Red brick Pantalon14 Green 3 Red brick Pantalon14 Yellow 3 Red brick Pantalon14 Blue 3 Red brick Pantalon14 Rouge 3 Red brick Pantalon14 Rose 3 Pantalon6 Grey blue Blue 0 Pantalon7 Green Green 0 Pull13 blue Blue 1 Pull15 Brown Brown 0 Pull4 Yellow straw Blue 2 Pull9 white White 0 Robe10 Blanche White 2 Robe12 rose Rose 0 Robe2 Green water Green 3 Robe2 Green water Yellow 3 Robe2 Green Blue Water 3 Robe3 Rape. Violet 2 Robe5 Green Verde 1 Robe8 Red Red 0 But the big disadvantage of this method is that, with so little data, the combination is such that the mass calculations already requires 3 seconds of calculations on a high-end PC. This is obviously unworkable in production with large masses of information.
IV. A Levenshtein "limited" The idea is then to limit the depth poll. Looking at the results, we find that good approaches have been with a Levenshtein distance of between 0 and 2. On three we already pantalon14 and robe2 which gives us too many matches. We can introduce into the code this limitation by modifying the algorithm as follows: CREATE FUNCTION F_DISTANCE_LEVENSHTEIN_LIMITE (@ SOURCE VARCHAR (8000), @ TARGET VARCHAR (8000), @ LIMIT INT) INT RETURNS AS BEGIN -- Management of side effects -- Null input IF @ SOURCE IS NULL OR TARGET IS NULL @ @ LIMIT OR IS NULL RETURN NULL -- Limit of 0 and different! IF LIMITED @ @ = 0 AND SOURCE <> @ TARGET RETURN NULL -- Lower limit to zero: IF @ LIMIT <0 RETURN NULL -- Identical IF = @ @ SOURCE TARGET RETURN 0 DECLARE @ LN_SOURCE INT DECLARE @ LN_CIBLE INT SET @ LN_SOURCE = LEN (@ SOURCE) + 1 SET @ LN_CIBLE = LEN (@ TARGET) + 1 -- Empty string with full string IF LN_SOURCE = @ @ LN_CIBLE OR 1 = 1 BEGIN RETURN LN_SOURCE + @ @ LN_CIBLE - 2 END DECLARE @ MATRIX TABLE (ID Int, Value INT) DECLARE @ i INT DECLARE @ j INT DECLARE @ TMP Int DECLARE @ V1 Int DECLARE @ V2 Int DECLARE @ V3 Int DECLARE @ Vmin Int DECLARE @ COST INT / * Initialize the MATRIX * / SET @ i = 0 WHILE (@ i <LN_SOURCE @ * @ LN_CIBLE) BEGIN INSERT INTO @ MATRIX VALUES (@ i, 0) SET @ i = @ i + 1 END / * Initialize the first line * / SET @ i = 0 WHILE (@ i <@ LN_SOURCE) BEGIN SET @ tmp = 0 UPDATE @ DIE SET Value = @ i WHERE ID = @ TMP SET @ i = @ i + 1 END / * Initializing the first column * / SET @ i = 0 WHILE @ i <@ LN_CIBLE BEGIN SET tmp = @ @ @ i * LN_SOURCE UPDATE @ DIE SET Value = @ i WHERE ID = @ TMP SET @ i = @ i + 1 END / * Compare the two chains * / SET @ i = 1 WHILE @ i <@ LN_SOURCE / Column * * / BEGIN SET @ j = 1 WHILE @ j <@ LN_CIBLE / * * Line / BEGIN / * Check if the characters correspond * / IF SUBSTRING (@ SOURCE, @ i, 1) = SUBSTRING (@ TARGET, @ j, 1) / * Equal * / SET @ COST = 0 ELSE / * ALTERNATIVE, INSERTION * / SET @ = 1 COST / * Playing the left box * / SELECT @ v1 = Value +1 DIE FROM @ WHERE ID = j * @ @ @ LN_SOURCE + i - 1 / * Reading of the case d above * / SELECT @ = v2 Value +1 DIE FROM @ WHERE ID = (@ j-1) * @ @ i + LN_SOURCE / * Reading of the box diagonal left high * / SELECT @ @ = v3 COST + Value DIE FROM @ WHERE ID = (@ j-1) * @ LN_SOURCE + @ i-1 / * It takes the value of the smallest and is inserted in the current case * / IF @ V1 <= @ @ V2 AND V1 <= @ V3 SET @ @ = Vmin V1 ELSE IF @ V2 <= @ V3 SET @ @ = Vmin V2 ELSE SET Vmin = @ @ V3 SET tmp = @ @ @ j * + @ i LN_SOURCE UPDATE @ DIE SET Value = @ Vmin WHERE ID = @ TMP -- If the limit is exceeded, it returns NULL! IF EXISTS (SELECT * DIE FROM @ WHERE ID = (@ LN_CIBLE-1) * @ LN_SOURCE + @ LN_SOURCE-1 AND Value> @ LIMIT) RETURN NULL SET @ j = @ j +1 END SET @ i = @ i +1 END / * It returns the cost of processing * / SELECT @ tmp = Value DIE FROM @ WHERE ID = (@ LN_CIBLE-1) * @ LN_SOURCE + @ LN_SOURCE-1 RETURN @ TMP END In our case, using a limit of 2, we get a more consistent answer already: -- The best approaches with a limit of 2 SELECT VTM_DESCRIPTION, VTM_COULEUR, RCL_COULEUR, dbo.F_DISTANCE_LEVENSHTEIN (VTM_COULEUR, RCL_COULEUR) AS DL FROM T_VETEMENT_VTM V CROSS JOIN T_REF_COULEUR_RCL C WHERE dbo.F_DISTANCE_LEVENSHTEIN_LIMITE (VTM_COULEUR, RCL_COULEUR, 2) = (SELECT MIN (dbo.F_DISTANCE_LEVENSHTEIN_LIMITE (VTM_COULEUR, RCL_COULEUR, 2)) FROM T_VETEMENT_VTM CROSS JOIN T_REF_COULEUR_RCL WHERE VTM_DESCRIPTION = V. VTM_DESCRIPTION) BY ORDER 1, 4 ASC VTM_DESCRIPTION VTM_COULEUR RCL_COULEUR DL ---------------- -------------------------------- -- -------------- ----------- Chemise16 Vrt Verde 1 Chemise17 Blenc Blanc 1 Pantalon1 Red Red 0 Pantalon11 mouse Gris Gris 1 Pantalon6 Grey blue Blue 0 Pantalon7 Green Green 0 Pull13 blue Blue 1 Pull15 Brown Brown 0 Pull4 Yellow straw Blue 2 Pull9 white White 0 Robe10 Blanche White 2 Robe12 rose Rose 0 Robe3 Rape. Violet 2 Robe5 Green Verde 1 Robe8 Red Red 0 You will note, however, there are still differences. For example, the grey blue Pantalon6 was found rather than blue grey .... When the Pull4 it was found while blue is yellow straw. In fact Levenshtein distance is very little credibility when occurrences have very different lengths. There are various means to correct that. First to valuant strongly insertions and deletions and low substitutions. After taking into account the difference in length and weighing the result of Levenshtein. Finally using other measures such as the difference HAMMING I leave you to work to change skilfully algorithm that Levenshtein. As for me, I'll show you the difference HAMMING ... V. Difference Hamming It is simply the number of different symbols from one channel to another ... Just read sequentially the two channels by comparing each letter to the position in two words. If the letter is identical on account 1. The difference this Hamming may be coded as follows Transact SQL: CREATE FUNCTION F_DIFFERENCE_HAMMING (@ SOURCE VARCHAR (8000), @ TARGET VARCHAR (8000)) INT RETURNS AS BEGIN IF @ SOURCE IS NULL OR @ TARGET RETURN IS NULL NULL SOURCE IF @ @ =''OR''RETURN TARGET = LEN (@ SOURCE) + LEN (@ TARGET) DECLARE @ COUNT INT DECLARE @ I INT, @ L INT SET @ I = 1 SET @ L = LEN (@ SOURCE) IF LEN (@ TARGET)> @ L SET @ L = LEN (@ TARGET) SET @ COUNT = 0 WHILE @ I <= @ L BEGIN IF @ I> LEN (@ SOURCE) BEGIN SET @ @ COUNT COUNT = + LEN (@ TARGET) - LEN (@ SOURCE) + 1 BREAK END IF @ I> LEN (@ TARGET) BEGIN SET @ @ COUNT COUNT = + LEN (@ SOURCE) - LEN (@ TARGET) + 1 BREAK END IF SUBSTRING (@ SOURCE, @ I, 1) <> SUBSTRING (@ TARGET, @ I, 1) SET @ @ COUNT COUNT = + 1 SET @ @ I = I + 1 END RETURN @ COUNT END GO This algorithm is inexpensive because linear and gives some good results: -- The best approaches Hamming SELECT VTM_DESCRIPTION, VTM_COULEUR, RCL_COULEUR, dbo.F_DIFFERENCE_HAMMING (VTM_COULEUR, RCL_COULEUR) AS DL FROM T_VETEMENT_VTM V CROSS JOIN T_REF_COULEUR_RCL C WHERE dbo.F_DIFFERENCE_HAMMING (VTM_COULEUR, RCL_COULEUR) = (SELECT MIN (dbo.F_DIFFERENCE_HAMMING (VTM_COULEUR, RCL_COULEUR)) FROM T_VETEMENT_VTM CROSS JOIN T_REF_COULEUR_RCL WHERE VTM_DESCRIPTION = V. VTM_DESCRIPTION) BY ORDER 1, 4 ASC VTM_DESCRIPTION VTM_COULEUR RCL_COULEUR DL ---------------- -------------------------------- -- -------------- ----------- Chemise16 Vrt Green 4 Chemise16 Vrt Grey 4 Chemise17 Blenc Blanc 1 Pantalon1 Red Red 0 Pantalon11 Gris Gris mouse 8 Red brick Pantalon14 Rouge 8 Pantalon6 blue Gris Gris 6 Pantalon7 Green Green 0 Pull13 blue Blue 2 Pull15 Brown Brown 0 Pull4 Yellow Yellow straw 8 Pull9 white White 0 Robe10 Blanche White 3 Robe12 rose Rose 0 Robe2 Green water Green 7 Robe3 Rape. Violet 3 Robe5 Green Green 2 Robe8 Red Red 0 But our view is Chemise16 green and grey while Pantalon6 is seen specifically grey ... Can we still improve our score? I thought working on a more direct approach, particularly in passing, not by the difference, but by direct letters. I called this algorithm INFERENCE BASIC. VI. Inference basic Frederic Brouard algorithm. It is neither more nor less than to compare the letters of a string to another, always advancing in the same direction: on advance letter by letter in the word target; it positions itself to the first letter in the word source; if a letter is found in the word source on account 1 and we remain positioned to this letter in the word source; the comparison is not commutative, again on target and reversing a source; then return on the best of both counts. This is an algorithm that Levenshtein cheaper, but a little more than Hamming. It can also be optimized by the fact that the polling reverse is not necessary if the count 0 refers to the first pass (no joint letter). Example: Either by inference to compare basic words and ANNANAS BANANAS: First pass: B A N A N E A N N A N A S The first pass gives 3 common symbols Second passe: A N N A N A S B A N A N E It gives 4 common symbols. The direct inference is therefore 4. Here is the code of this feature: CREATE FUNCTION F_INFERENCE_BASIQUE (@ SOURCE VARCHAR (8000), @ TARGET VARCHAR (8000)) INT RETURNS AS BEGIN IF @ SOURCE IS NULL OR @ TARGET RETURN IS NULL NULL SOURCE IF @ @ =''OR''RETURN TARGET = 0 DECLARE @ COUNT1 INT, @ COUNT2 INT DECLARE @ I INT, @ J INT, @ L INT DECLARE @ C CHAR (1) SET @ COUNT1 = 0 SET @ COUNT2 = 0 SET @ I = 1 SET @ J = 1 SET @ L = LEN (@ SOURCE) -- First pass source to target WHILE @ I <= @ L BEGIN SET @ C = SUBSTRING (@ SOURCE, @ I, 1) IF CHARINDEX (C @, @ TARGET, @ J)> 0 BEGIN SET COUNT1 = @ @ COUNT1 + 1 SET @ J = CHARINDEX (C @, @ TARGET, @ J) + 1 SET @ @ I = I + 1 CONTINUES END SET @ @ I = I + 1 END IF @ COUNT1 RETURN 0 = 0 SET @ I = 1 SET @ J = 1 SET @ L = LEN (@ TARGET) -- Second pass target to source WHILE @ I <= @ L BEGIN SET @ C = SUBSTRING (@ TARGET, @ I, 1) IF CHARINDEX (C @, @ SOURCE, @ J)> 0 BEGIN SET COUNT2 = @ @ COUNT2 + 1 SET @ J = CHARINDEX (C @, @ SOURCE, @ J) + 1 SET @ @ I = I + 1 CONTINUES END SET @ @ I = I + 1 END -- The best counting IF @ COUNT1 <@ COUNT2 SET COUNT1 = @ @ COUNT2 RETURN @ COUNT1 END GO In our example, the results speak for themselves: VTM_DESCRIPTION VTM_COULEUR RCL_COULEUR IB ---------------- -------------------------------- -- -------------- ----------- Chemise16 Vrt Green 3 Chemise17 Blenc Blanc 4 Pantalon1 Red Red 5 Pantalon11 Gris Gris mouse 4 Red brick Pantalon14 Red 5 Pantalon6 Grey blue Blue 4 Pantalon6 Gris Gris blue 4 Pantalon7 Green Green 4 Pull13 blue Blue 4 Pull15 Brown Brown 6 Pull4 Yellow straw Yellow 5 Pull9 white Blanc 5 Robe10 Blanche White 5 Robe12 rose Rose 4 Robe2 Green water Green 4 Robe3 Rape. Violet 4 Robe5 Green Green 4 Robe8 Red Red 5 The only case that has not been settled is the Pantalon6 grey blue! This is indeed the only case undecidable by the mere fact of a machine ... VII. Comparison of computing time The time for calculating these various algorithms to test our game are as follows (the complaints have been launched without the ORDER BY clause): method UC Time (ms) Time cast (ms) Levenshtein 10 375 10 959 Levenshtein limited to 2 11 562 11 984 Hamming 407 447 Inference basic 797 830 It finally realizes that the Levenshtein limited does not provide conclusive results and frankly that the test limitation is ultimately more expensive than its "free". It notes that the cost of basic inference is less than twice that of the difference Hamming. But given his best performance in terms of bringing grounds it is paid in many cases ...
|