Preamble

My thinking on this subject is not new: I started this kind of work before SQL Server offered such a service, and I have continued it recently in projects at my company. I have gathered these elements together in this document, and I hope you enjoy reading it.

Some definitions

All developers know the meaning of the word syntax (well, good developers do!). I will therefore not insult our readers by quoting the entry from the Petit Robert or the Grand Larousse...

Semantics

The meaning of the word semantics is less widely known, especially since it depends on the meaning of words, and therefore on semantics... This is not a tautology, and although such gibberish will inevitably make you smile, I simply want to point out that the mind has real difficulty distinguishing semantics from syntax. I will therefore give only a definition by contrast: if syntax deals with grammar, semantics deals with meaning, that is, with what my words and sentences are trying to say, in other words the ideas they convey.

1. Basic indexing

1.1. Keywords and index

This is an old method, widely used in documents published on paper. There are two different modes: the abstract, and the index, also called a thesaurus.

Abstracts are generally used in legal documentation and are placed at the top of the document, often in the title. Before the publication is read, they let the reader know the legal context, that is, what the document deals with. They consist essentially of words or phrases, rarely of sentences.

The index is usually a list of carefully chosen words, or of all the proper names cited in a document, with, for each entry, a reference to the pages where that word or proper name appears.
We can now populate the table of words, using only the words that were retained:

    insert into MOT (MOT_ID, MOT_MOT) values (1, 'small');
    insert into MOT (MOT_ID, MOT_MOT) values (2, 'chat');
    insert into MOT (MOT_ID, MOT_MOT) values (3, 'is');
    insert into MOT (MOT_ID, MOT_MOT) values (4, 'death');
    insert into MOT (MOT_ID, MOT_MOT) values (5, 'fact');
    insert into MOT (MOT_ID, MOT_MOT) values (6, 'beautiful');
    insert into MOT (MOT_ID, MOT_MOT) values (7, 'sun');
    insert into MOT (MOT_ID, MOT_MOT) values (8, 'shining');
    insert into MOT (MOT_ID, MOT_MOT) values (9, 'birds');
    insert into MOT (MOT_ID, MOT_MOT) values (10, 'sing');
    insert into MOT (MOT_ID, MOT_MOT) values (11, 'understand');
    insert into MOT (MOT_ID, MOT_MOT) values (12, 'undertake');
    insert into MOT (MOT_ID, MOT_MOT) values (13, 'udders');
    insert into MOT (MOT_ID, MOT_MOT) values (14, 'trade');
    insert into MOT (MOT_ID, MOT_MOT) values (15, 'take');
    insert into MOT (MOT_ID, MOT_MOT) values (16, 'time');
    insert into MOT (MOT_ID, MOT_MOT) values (17, 'live');
    insert into MOT (MOT_ID, MOT_MOT) values (18, 'Saying');
    insert into MOT (MOT_ID, MOT_MOT) values (19, 'other');
    insert into MOT (MOT_ID, MOT_MOT) values (20, 'life');
    insert into MOT (MOT_ID, MOT_MOT) values (21, 'high');
    insert into MOT (MOT_ID, MOT_MOT) values (22, 'low');
    insert into MOT (MOT_ID, MOT_MOT) values (23, 'confused');
    insert into MOT (MOT_ID, MOT_MOT) values (24, 'low');
    insert into MOT (MOT_ID, MOT_MOT) values (25, 'Guitar');
    insert into MOT (MOT_ID, MOT_MOT) values (26, 'bass');
    insert into MOT (MOT_ID, MOT_MOT) values (27, 'violin');
    insert into MOT (MOT_ID, MOT_MOT) values (28, 'must');
    insert into MOT (MOT_ID, MOT_MOT) values (29, 'eat');
    insert into MOT (MOT_ID, MOT_MOT) values (30, 'from');
    insert into MOT (MOT_ID, MOT_MOT) values (31, 'dying');
    insert into MOT (MOT_ID, MOT_MOT) values (32, 'little');
    insert into MOT (MOT_ID, MOT_MOT) values (33, 'end');
    insert into MOT (MOT_ID, MOT_MOT) values (34, 'caution');
    insert into MOT (MOT_ID, MOT_MOT) values (35, 'mere');
    insert into MOT (MOT_ID, MOT_MOT) values (36, 'security');
    insert into MOT (MOT_ID, MOT_MOT) values (37, 'boys');

Last phase of our algorithm: record the correspondences between words and texts in the INDEX table:

    insert into INDEX (TXT_ID, MOT_ID) values (1, 1);
    insert into INDEX (TXT_ID, MOT_ID) values (1, 2);
    insert into INDEX (TXT_ID, MOT_ID) values (1, 3);
    insert into INDEX (TXT_ID, MOT_ID) values (1, 4);
    insert into INDEX (TXT_ID, MOT_ID) values (2, 5);
    insert into INDEX (TXT_ID, MOT_ID) values (2, 6);
    insert into INDEX (TXT_ID, MOT_ID) values (2, 7);
    insert into INDEX (TXT_ID, MOT_ID) values (2, 8);
    insert into INDEX (TXT_ID, MOT_ID) values (2, 9);
    insert into INDEX (TXT_ID, MOT_ID) values (2, 10);
    insert into INDEX (TXT_ID, MOT_ID) values (3, 11);
    insert into INDEX (TXT_ID, MOT_ID) values (3, 12);
    insert into INDEX (TXT_ID, MOT_ID) values (3, 13);
    insert into INDEX (TXT_ID, MOT_ID) values (3, 14);
    insert into INDEX (TXT_ID, MOT_ID) values (4, 15);
    insert into INDEX (TXT_ID, MOT_ID) values (4, 16);
    insert into INDEX (TXT_ID, MOT_ID) values (4, 17);
    insert into INDEX (TXT_ID, MOT_ID) values (4, 18);
    insert into INDEX (TXT_ID, MOT_ID) values (4, 19);
    insert into INDEX (TXT_ID, MOT_ID) values (5, 20);
    insert into INDEX (TXT_ID, MOT_ID) values (5, 21);
    insert into INDEX (TXT_ID, MOT_ID) values (5, 22);
    insert into INDEX (TXT_ID, MOT_ID) values (6, 23);
    insert into INDEX (TXT_ID, MOT_ID) values (6, 24);
    insert into INDEX (TXT_ID, MOT_ID) values (6, 25);
    insert into INDEX (TXT_ID, MOT_ID) values (6, 26);
    insert into INDEX (TXT_ID, MOT_ID) values (6, 27);
    insert into INDEX (TXT_ID, MOT_ID) values (7, 17);
    insert into INDEX (TXT_ID, MOT_ID) values (7, 28);
    insert into INDEX (TXT_ID, MOT_ID) values (7, 29);
    insert into INDEX (TXT_ID, MOT_ID) values (8, 30);
    insert into INDEX (TXT_ID, MOT_ID) values (8, 31);
    insert into INDEX (TXT_ID, MOT_ID) values (8, 32);
    insert into INDEX (TXT_ID, MOT_ID) values (9, 4);
    insert into INDEX (TXT_ID, MOT_ID) values (9, 18);
    insert into INDEX (TXT_ID, MOT_ID) values (9, 33);
    insert into INDEX (TXT_ID, MOT_ID) values (10, 3);
    insert into INDEX (TXT_ID, MOT_ID) values (10, 34);
    insert into INDEX (TXT_ID, MOT_ID) values (10, 35);
    insert into INDEX (TXT_ID, MOT_ID) values (10, 36);
    insert into INDEX (TXT_ID, MOT_ID) values (10, 37);

1.2.3. The queries

We now need to work out the queries that will answer text searches. A good way to solve the problem would be to find a single query able to handle the majority of cases. For example, a single query able to search for a text containing:

- one word or another ('LIVE' or 'EAT', or both)
- one word and another ('LIVE' and 'EAT')
- three or more words ('LOW' and 'GUITAR' and 'BASS')
- at least two out of three words (among 'Death', 'SELF', 'LIVE')!
- etc.

OR query

Search for the texts containing one word or another: 'LIVE' or 'EAT' (or both)

    select distinct t.TXT_ID
    from TEXT t
         join INDEX d on t.TXT_ID = d.TXT_ID
         join MOT m on d.MOT_ID = m.MOT_ID
    where m.MOT_MOT in ('LIVE', 'EAT')

AND query

Search for the texts containing one word and another: 'LIVE' and 'EAT'

The query follows the same pattern as before, with the addition of a grouping and a filter on the aggregate:

    select distinct t.TXT_ID
    from TEXT t
         join INDEX d on t.TXT_ID = d.TXT_ID
         join MOT m on d.MOT_ID = m.MOT_ID
    where m.MOT_MOT in ('LIVE', 'EAT')
    group by t.TXT_ID
    having count(*) >= 2

Search for the texts containing three words: 'LOW' and 'GUITAR' and 'BASS'

    select distinct t.TXT_ID
    from TEXT t
         join INDEX d on t.TXT_ID = d.TXT_ID
         join MOT m on d.MOT_ID = m.MOT_ID
    where m.MOT_MOT in ('LOW', 'GUITAR', 'BASS')
    group by t.TXT_ID
    having count(*) >= 3

Query combining OR and AND

Search for the texts containing at least two of the three words 'Death', 'SELF', 'LIVE'

    select distinct t.TXT_ID
    from TEXT t
         join INDEX d on t.TXT_ID = d.TXT_ID
         join MOT m on d.MOT_ID = m.MOT_ID
    where m.MOT_MOT in ('Death', 'SELF', 'LIVE')
    group by t.TXT_ID
    having count(*) >= 2

Generalized, parameterized query

In general:

    select distinct t.TXT_ID
    from TEXT t
         join INDEX d on t.TXT_ID = d.TXT_ID
         join MOT m on d.MOT_ID = m.MOT_ID
    where m.MOT_MOT in (:param1)
    group by t.TXT_ID
    having count(*) >= :param2

where:

    :param1 is a list of words separated by commas
    :param2 is the number of these words that must be present simultaneously

Thus the first query (OR with two words) can be expressed as:

    select distinct t.TXT_ID
    from TEXT t
         join INDEX d on t.TXT_ID = d.TXT_ID
         join MOT m on d.MOT_ID = m.MOT_ID
    where m.MOT_MOT in ('LIVE', 'EAT')
    group by t.TXT_ID
    having count(*) >= 1
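The generalized query can be tried out with a small script. What follows is only a minimal sketch in Python with SQLite, under several assumptions: the table definitions are not shown in this section, so the three tables are reduced here to the columns used by the queries; TEXT and INDEX are quoted because INDEX is a reserved word in SQLite; only a lower-cased subset of the sample rows (texts 4, 7 and 9) is loaded; and the helper names build_demo_db and search are invented for the illustration. Since a single bind parameter cannot carry a whole list of words, the sketch generates one placeholder per word instead of a single :param1.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database holding a subset of the sample data."""
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema: only the columns used by the queries.
    conn.executescript('''
        create table "TEXT"  (TXT_ID integer primary key);
        create table MOT     (MOT_ID integer primary key, MOT_MOT text);
        create table "INDEX" (TXT_ID integer, MOT_ID integer);
    ''')
    words = {4: "death", 15: "take", 16: "time", 17: "live", 18: "saying",
             19: "other", 28: "must", 29: "eat", 33: "end"}
    index = [(4, 15), (4, 16), (4, 17), (4, 18), (4, 19),
             (7, 17), (7, 28), (7, 29),
             (9, 4), (9, 18), (9, 33)]
    text_ids = sorted({txt_id for txt_id, _ in index})
    conn.executemany('insert into "TEXT" (TXT_ID) values (?)',
                     [(txt_id,) for txt_id in text_ids])
    conn.executemany("insert into MOT (MOT_ID, MOT_MOT) values (?, ?)",
                     list(words.items()))
    conn.executemany('insert into "INDEX" (TXT_ID, MOT_ID) values (?, ?)', index)
    return conn

def search(conn, words, min_count):
    """Return the ids of the texts containing at least min_count of the words."""
    if not words:
        return []
    # One "?" per word: a bind parameter cannot stand for a whole list.
    placeholders = ", ".join("?" for _ in words)
    sql = f'''
        select t.TXT_ID
        from "TEXT" t
             join "INDEX" d on t.TXT_ID = d.TXT_ID
             join MOT m on d.MOT_ID = m.MOT_ID
        where m.MOT_MOT in ({placeholders})
        group by t.TXT_ID
        having count(*) >= ?
        order by t.TXT_ID
    '''
    # SQLite compares text case-sensitively by default, so the words are
    # lower-cased to match the stored form.
    params = [w.lower() for w in words] + [min_count]
    return [row[0] for row in conn.execute(sql, params)]

if __name__ == "__main__":
    conn = build_demo_db()
    print(search(conn, ["LIVE", "EAT"], 1))   # OR:  [4, 7]
    print(search(conn, ["LIVE", "EAT"], 2))   # AND: [7]
```

Only the threshold changes between the OR and the AND search, which is the point of the generalized form: a threshold of 1 gives OR, a threshold equal to the number of words gives AND, and anything in between gives "at least n among these words".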
ANNEX: Pascal (Delphi) primitives

Basic functions: cleaning the string, cutting the string into words, and removing the stop words (the "mots noirs", literally "black words", of the identifiers below).

    uses SysUtils, Classes;  // Trim and TStringList

    type
      CelMot = record
        word: string;
        idm: integer;
        key: boolean;
      end;
      TabMot = array of CelMot;

    var
      // this table is filled by the following query:
      //   select MNR_MOT_NOIR
      //   from T_MOT_NOIR
      //   order by MNR_MOT_NOIR
      TabMotNoir: TabMot;
      // length of the longest stop word
      i_mot_noir_maxi_long: integer;

    // IsMotNoir is defined further down but used by BreakStr
    function IsMotNoir(unMot: string): boolean; forward;

    //**********************************************************************//
    //  CLEANING THE CHARACTER STRING                                       //
    //**********************************************************************//
    function CleanStr(var aString: string): boolean;
    // on input, the string must already be in lowercase
    var
      i: integer;
      newStr: string;
    begin
      CleanStr := false;
      newStr := '';
      for i := 1 to length(aString) do
        case aString[i] of
          // accepted as is: the characters from a to z and from 0 to 9
          'a'..'z': newStr := newStr + aString[i];
          '0'..'9': newStr := newStr + aString[i];
          // cleaning of accented and miscellaneous characters
          'à', 'á', 'â', 'ã', 'ä', 'å', 'À', 'Á', 'Â', 'Ã', 'Ä', 'Å':
            newStr := newStr + 'a';
          'æ', 'Æ': newStr := newStr + 'ae';
          'ç', 'Ç': newStr := newStr + 'c';
          'é', 'è', 'ê', 'ë', 'É', 'È', 'Ê', 'Ë': newStr := newStr + 'e';
          'î', 'ï', 'ì', 'í', 'Ì', 'Í', 'Î', 'Ï': newStr := newStr + 'i';
          'ñ', 'Ñ': newStr := newStr + 'n';
          'ô', 'ö', 'ð', 'ò', 'ó', 'Ò', 'Ó', 'Ô', 'Õ', 'Ö':
            newStr := newStr + 'o';
          'œ', 'Œ': newStr := newStr + 'oe';
          'ù', 'ú', 'û', 'ü', 'Ù', 'Ú', 'Û', 'Ü': newStr := newStr + 'u';
          'ý', 'ÿ', 'Ý', 'Ÿ': newStr := newStr + 'y';
        else
          // in all other cases, replace by a space unless a space already
          // ends the result string (the length test avoids reading
          // newStr[0] while the result is still empty)
          if (length(newStr) > 0) and (newStr[length(newStr)] <> ' ') then
            newStr := newStr + ' ';
        end;
      aString := newStr;
      CleanStr := true;
    end;

    //**********************************************************************//
    //  CUTTING THE CHARACTER STRING INTO WORDS                             //
    //**********************************************************************//
    function BreakStr(aString: string; var words: TabMot;
                      supMotNoir: boolean; supLet: boolean): boolean;
    var
      i: integer;
      word: string;
      car: string;
    begin
      BreakStr := false;
      word := '';
      // strip stray characters at the beginning and end
      aString := trim(aString);
      // character-by-character analysis of the cleaned string
      for i := 1 to length(aString) do
      begin
        car := copy(aString, i, 1);
        // it is a space
        if car = ' ' then
        begin
          // no word stored so far
          if length(word) = 0 then continue;
          // one-letter word, and those are not accepted
          if (length(word) = 1) and supLet then
          begin
            word := '';
            continue;
          end;
          // stop word, and those are to be removed
          if supMotNoir and IsMotNoir(word) then
          begin
            word := '';
            continue;
          end;
          // word retained
          setLength(words, length(words) + 1);
          words[length(words) - 1].word := word;
          word := '';
        end
        else
          word := word + car;
      end;
      // handling of a possible final word, with the same filters
      if length(word) > 0 then
        if not ((length(word) = 1) and supLet) then
          if not (supMotNoir and IsMotNoir(word)) then
          begin
            setLength(words, length(words) + 1);
            words[length(words) - 1].word := word;
          end;
      BreakStr := true;
    end;

    //**********************************************************************//
    //  IS IT A STOP WORD?                                                  //
    //  (binary search in the table of stop words)                          //
    //**********************************************************************//
    function IsMotNoir(unMot: string): boolean;
    var
      iCel: integer;
      bMax, bMin: integer;
    begin
      IsMotNoir := false;
      if length(unMot) > i_mot_noir_maxi_long then exit;
      // nothing to search in an empty table
      if length(TabMotNoir) = 0 then exit;
      // binary search in the (sorted) table of stop words
      bMin := 0;
      bMax := length(TabMotNoir) - 1;
      while bMax <> bMin do
      begin
        iCel := (bMax + bMin) div 2;
        if unMot = TabMotNoir[iCel].word then
        begin
          // this is a stop word!
          IsMotNoir := true;
          exit;
        end;
        if unMot < TabMotNoir[iCel].word then
          bMax := iCel
        else
          bMin := iCel;
        if bMax = bMin + 1 then
        begin
          if (unMot = TabMotNoir[bMax].word) or
             (unMot = TabMotNoir[bMin].word) then
            IsMotNoir := true;
          exit;
        end;
      end;
      // only reached when the table holds a single word
      IsMotNoir := (unMot = TabMotNoir[bMin].word);
    end;

    //**********************************************************************//
    //  SORTING THE TABLE OF WORDS, with counting of occurrences            //
    //  using a temporary TStringList                                       //
    //**********************************************************************//
    // Example: lesMots holds the following words on input:
    //   lesMots[0].word = 'lost',  lesMots[0].idm = null
    //   lesMots[1].word = 'found', lesMots[1].idm = null
    //   lesMots[2].word = 'found', lesMots[2].idm = null
    //   lesMots[3].word = 'found', lesMots[3].idm = null
    // on output, we then have:
    //   lesMots[0].word = 'found', lesMots[0].idm = 3
    //   lesMots[1].word = 'lost',  lesMots[1].idm = 1
    function SortTab(var lesMots: TabMot): boolean;
    var
      ListMot: TStringList;
      i, j: integer;
    begin
      SortTab := false;
      ListMot := TStringList.Create;
      try
        // a sorted list silently drops duplicates unless told otherwise,
        // which would make the counting below impossible
        ListMot.Duplicates := dupAccept;
        ListMot.Sorted := true;
        // feed the string list
        for i := 0 to length(lesMots) - 1 do
          ListMot.Add(lesMots[i].word);
        // empty the table
        setLength(lesMots, 0);
        // the table takes the size of the number of lines of the TStringList
        setLength(lesMots, ListMot.Count);
        // copy the TStringList into the table, counting the duplicates;
        // j is the index of the last distinct word written
        j := -1;
        for i := 0 to ListMot.Count - 1 do
          if (j >= 0) and (lesMots[j].word = ListMot.Strings[i]) then
            lesMots[j].idm := lesMots[j].idm + 1
          else
          begin
            j := j + 1;
            lesMots[j].word := ListMot.Strings[i];
            lesMots[j].idm := 1;
          end;
      finally
        ListMot.Free;
      end;
      // shrink the table to the number of distinct words
      setLength(lesMots, j + 1);
      SortTab := true;
    end;

2. Semantic search

But our work is not finished.
Indeed, if the word car is not in our index but vehicle, automobile, auto or cars (plural) are, it is desirable that the query still return results, through similarity of concept. This is semantic search...

Semantic relatedness of a word

A word may be linked to another in four ways:

- the word is an inflected form (plural, feminine, conjugation...)
- the word is a more or less close synonym
- the word is a related word built on the same root
- it is a synonymous expression

Example: hospital

    inflected forms:        hospitals...
    synonyms:               clinic, hospice, asylum, polyclinic...
    related words:          hospitalize, hospice...
    synonymous expressions: health care establishment, health care centre, medical home...

It is therefore necessary to define in the database a link between the various words, a link that we will qualify (inflected form, synonym, related word...). Of course, the "dictionary" will have to be enriched with these semantic relationships, or such a dictionary purchased as an electronic file.

3. Misspelled words

Another search problem is that the user may have misspelled a word. Studies conducted on this subject have shown that the main errors are either a transposition of letters within the word, that is, a typing error, or a spelling mistake.

3.1. Typing errors

To find a word in which some letters have been transposed, it is enough to take all the letters of that word and arrange them in a defined order; alphabetical order seems the most appropriate here.

For example, our user, wanting to find the word guitare (French for guitar), inadvertently typed guitrae. By storing in the database, for each word, the list of its letters in alphabetical order, we would have stored aegirtu. If the direct search for the word fails, we can then sort the letters of the word that was typed and try to find a word made of the same letters.

Obviously this technique is not a panacea, because two words such as portes (French for doors) and poster have the same letters but nothing in common in meaning!

3.2. Spelling errors

Donald Knuth suggested, several decades ago, a simple method for trying to find a misspelled word. He started from the assumption that if a misspelled word is cut in two, the error is very likely to be located either in the right half or in the left half of the word. From this hypothesis he concluded that the other half is spelled correctly. He then looked in his dictionary for the words beginning with the left half and for those ending with the right half. The list of words found in this way should then include the correctly spelled word.

For example, our user, wanting to find the word guitare, types guithare. The algorithm above will find the words starting with guit... and those ending with ...hare, which are listed below:

    guitare (guitar)
    guitariste (guitarist)
    guitoune
    cithare (zither)

4. Measuring relevance

Most search engines associated with knowledge bases assign a weighting to their results, so as to present the most relevant results first and the less relevant ones at the end of the list. Without going into detail, the weighting can be thought of as a fuzzy logic in which a match is either perfect (value 1) or nonexistent (value 0). Between the two, different values are possible in order to qualify the relevance of the search.

A relevance index of 1 is assigned when all the words have been found exactly as typed. For other matches, the index could for example be lowered for each word according to the table below:

    Match                               Weight
    Exact                               100%
    Inflected form (gender, plural)      90%
    Synonym                              80%
    Inflected form (conjugation)         70%
    Related word                         60%
    Typing error                         40%
    Spelling error                       30%

Example 1: the user searches for "hospital" and "guitar"

    Results:
    guitar, hospital       1
    guitar, hospitals      0.95
    guitar, clinic         0.9
    guitar, hospitalize    0.8

Example 2: the user searches for "hospital" and "guithare" (spelling error)

    Results:
    guitar, hospital       0.65
    guitar, hospitals      0.6
    guitar, clinic         0.55
    guitar, hospitalize    0.5
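The text does not say how the per-word weights are combined into the index of a result. The sketch below assumes that the index is simply the arithmetic mean of the weights of the words searched for; this assumption reproduces the figures of both examples except the last line of the second one, where the mean gives 0.45 and the text lists 0.5. The names WEIGHTS and relevance, and the labels used for the kinds of match, are invented for the illustration.

```python
# Weights from the table above, kept as integer percentages so that the
# sums stay exact. The keys are labels invented for this sketch.
WEIGHTS = {
    "exact": 100,
    "inflection_gender_plural": 90,
    "synonym": 80,
    "inflection_conjugation": 70,
    "related": 60,
    "typo": 40,
    "spelling": 30,
}

def relevance(match_kinds):
    """Relevance index in [0, 1] for one result.

    match_kinds holds one label per word of the search, describing how
    that word was matched. Assumption: the index is the mean of the
    per-word weights.
    """
    if not match_kinds:
        return 0.0
    total = sum(WEIGHTS[kind] for kind in match_kinds)
    return total / (100 * len(match_kinds))

if __name__ == "__main__":
    # "guitar" found exactly, "hospital" found through the plural "hospitals"
    print(relevance(["exact", "inflection_gender_plural"]))  # 0.95
    # "guithare" corrected as a spelling error, "hospital" found exactly
    print(relevance(["spelling", "exact"]))                  # 0.65
```

Other ways of combining the weights, such as multiplying them, would penalize a result with several approximate matches more heavily; the mean is used here only because it fits the figures given.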
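Returning to the two techniques of section 3, both can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are invented, the dictionary is a tiny hand-made list, and the misspelled word is cut at half its length rounded down, a detail the text does not specify. The French spellings (guitare, cithare, portes) are used because both examples rely on the letters of the original words.

```python
def signature(word):
    """Letters of the word in alphabetical order (section 3.1)."""
    return "".join(sorted(word))

def build_signature_index(dictionary):
    """Map each signature to the dictionary words sharing it."""
    index = {}
    for word in dictionary:
        index.setdefault(signature(word), []).append(word)
    return index

def half_word_candidates(misspelled, dictionary):
    """Words starting with the left half or ending with the right half
    of the misspelled word (section 3.2)."""
    half = len(misspelled) // 2  # assumed split point
    left, right = misspelled[:half], misspelled[half:]
    return [w for w in dictionary if w.startswith(left) or w.endswith(right)]

if __name__ == "__main__":
    dictionary = ["guitare", "guitariste", "guitoune", "cithare",
                  "portes", "poster"]
    index = build_signature_index(dictionary)

    # Typing error: the letters are right but in the wrong order.
    print(index.get(signature("guitrae"), []))   # ['guitare']
    # The limit of the technique: two unrelated words, same letters.
    print(index[signature("portes")])            # ['portes', 'poster']

    # Spelling error: one half of the word is assumed to be correct.
    print(half_word_candidates("guithare", dictionary))
    # ['guitare', 'guitariste', 'guitoune', 'cithare']
```

In a database, the signature would be stored in an indexed column next to each word, so that the fallback search stays a simple equality lookup; the half-word search corresponds to two LIKE predicates, one on the prefix and one on the suffix.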