Recently my friend Sonali and I completed a white paper on full-text search engines, with a detailed case study of SQL Server 2008 full-text search and Lucene.Net.
The important thing in understanding full-text search is being aware of the various terms used in the process. The paragraph below briefly explains what full-text search is, and the later sections describe the terminology used in the process.
This information is collected from various websites, including MSDN and the official Apache Lucene.Net site.
Full Text Search
In text retrieval, full text search refers to a technique for searching a computer-stored document or database. In a full text search, the search engine examines all of the words in every stored document as it tries to match search words supplied by the user.
Full text search is often divided into two tasks: indexing and searching.
The indexing stage will scan the text of all the documents and build a list of search terms, often called an index.
When a query arrives, either programmatically or as a result of a user request, the full-text engine accesses the sorted and optimized word index to identify which documents contain the requested term(s). The engine creates a list of documents that qualify, typically provided as a list of pointers into the main document index.
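The two stages above can be sketched with a toy inverted index. This is an illustrative sketch only, not how any production engine is implemented; the `docs` dictionary and the `search` function are invented for the example:

```python
# Toy inverted index: term -> set of document IDs (illustrative sketch).
docs = {
    1: "the quick brown fox",
    2: "the lazy brown dog",
    3: "a quick red fox",
}

# Indexing stage: scan every document and record which docs contain each word.
index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        index.setdefault(word, set()).add(doc_id)

# Searching stage: look up each query term and intersect the posting lists,
# yielding the documents that contain all of the requested terms.
def search(query):
    postings = [index.get(w, set()) for w in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

print(search("quick fox"))  # → [1, 3]
```

The returned document IDs play the role of the "pointers into the main document index" mentioned above.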
General terms used in full text search engines...
Crawling | Also known as full-text population: the process of creating and maintaining the full-text index by scanning the source data.
Simple term | One or more specific words or phrases. In full-text search, a word is considered a token. A token is identified by the appropriate word breakers, following the linguistic rules of the specified language. A valid phrase can consist of multiple words, with or without punctuation between them.
For example, "croissant" is a word, and "café au lait" is a phrase. Words and phrases such as these are called simple terms.
Prefix term | A word or phrase in which the words begin with the specified text. A prefix-term search with a condition such as CONTAINS ( table.column, ' "felso*" ' ) yields a match for the word "felsok". A prefix search returns all entries in the column that contain text beginning with the specified prefix. For example, to search for all rows that contain the prefix top-, as in topple, topping, and top itself, the condition in the SQL Server full-text query is CONTAINS ( table.column, ' "top*" ' ).
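Outside SQL Server, the same prefix-matching behavior can be sketched in a few lines. This is a toy illustration over an in-memory term list, not the engine's actual implementation:

```python
# Toy prefix-term search over a list of indexed terms.
terms = ["top", "topple", "topping", "total", "felsok"]

def prefix_search(prefix):
    # "top*" in CONTAINS corresponds to matching every term beginning with "top".
    return [t for t in terms if t.startswith(prefix)]

print(prefix_search("top"))  # → ['top', 'topple', 'topping']
```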
Generated term | Inflectional forms of a specific word. The inflectional forms are the different tenses of a verb, or the singular and plural forms of a noun. For example, consider a search for the inflectional forms of the word "drive". If various rows in the table include the words "drive", "drives", "drove", "driving", and "driven", all would be in the result set, because each can be inflectionally generated from the word "drive".
Proximity term | A word or phrase close to another word or phrase. In text processing, a proximity search looks for documents in which two or more separately matching terms occur within a specified distance, where distance is the number of intermediate words or characters. Proximity of the words in a document implies a relationship between them. For example, a search for "red brick house" could match phrases such as "red house of brick" or "house made of red brick". By limiting the proximity, these phrases can be matched while avoiding documents where the words are scattered across a page or appear in unrelated articles in an anthology. As another example, suppose you want to find the rows in which the word "ice" is near the word "hockey", or in which the phrase "ice skating" is near the phrase "ice hockey". Whether two terms or phrases are considered near each other is calculated internally, and many data points are considered when calculating nearness. Most internet search engines implement implicit proximity search: they automatically rank results higher when the user's keywords have a good overall proximity score.
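A minimal way to check proximity is to compare token positions and count the words between two terms. The function below is an illustrative sketch, not how any engine computes its internal nearness score:

```python
def within_distance(text, term_a, term_b, max_between):
    """Return True if term_a and term_b occur with at most
    max_between words between them, in either order."""
    words = text.lower().split()
    pos_a = [i for i, w in enumerate(words) if w == term_a]
    pos_b = [i for i, w in enumerate(words) if w == term_b]
    # Distance = number of intermediate words between any pair of occurrences.
    return any(abs(i - j) - 1 <= max_between for i in pos_a for j in pos_b)

print(within_distance("red house of brick", "red", "brick", 2))  # → True
print(within_distance("red roof on a big brick house", "red", "brick", 2))  # → False
```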
Compound term processing | A way to retrieve information in which matching happens on the basis of compound terms. Compound terms are built by combining two or more simple terms; for example, "triple" is a single-word term, but "triple heart bypass" is a compound term.
Compound processing increases the relevance of search results without missing anything important. By forming compound (i.e., multi-word) terms and placing them in the search engine's index, the search can be performed with a higher degree of accuracy, because the ambiguity inherent in single words is no longer a problem.
Concept search (conceptual search) | An automated information retrieval method used to search electronically stored unstructured text (e.g., emails, digital archives, scientific literature) for information that is conceptually similar to the information provided in a search query. Concept-based searching is becoming accepted as a reliable and efficient search method that is more likely to produce relevant results than keyword or Boolean searches. This kind of search uses compound term processing.
Thesaurus | A thesaurus defines user-specified synonyms for terms. For example, if an entry "{car, automobile, truck, van}" is added to a thesaurus, you can search for the thesaurus form of the word "car". All rows in the queried table that include the words "automobile", "truck", "van", or "car" appear in the result set, because each of these words belongs to the synonym expansion set containing the word "car".
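Thesaurus expansion can be sketched as replacing a query term with its whole synonym set before matching. A toy illustration, assuming the `{car, automobile, truck, van}` entry from the example above:

```python
# Toy thesaurus: each entry is a set of mutually synonymous terms.
thesaurus = [{"car", "automobile", "truck", "van"}]

def expand(term):
    # Replace a query term with its full synonym set (or just itself).
    for synonyms in thesaurus:
        if term in synonyms:
            return synonyms
    return {term}

rows = ["my automobile broke", "rent a van", "fast bicycle"]
matches = [r for r in rows if expand("car") & set(r.split())]
print(matches)  # → ['my automobile broke', 'rent a van']
```

Searching for "car" matches rows containing "automobile" or "van" even though the literal word "car" never appears in them.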
Weighted term | A weighting value indicates the degree of importance of each word and phrase within a set of words and phrases. For example, in a query searching for multiple terms, you can assign each search word a weight value indicating its importance relative to the other words in the search condition. The results for this type of query return the most relevant rows first, according to the relative weights you have assigned.
The result set contains documents or rows containing any of the specified terms (or content between them); however, some results are considered more relevant than others because of the variation in the weights associated with the different searched terms.
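The ranking effect of weighted terms can be sketched by summing the weights of the query terms each document contains. The weights and documents below are invented for the illustration:

```python
# Toy weighted-term ranking: each query term carries a relative weight,
# and a document's score is the sum of weights of the terms it contains.
weighted_query = {"performance": 0.8, "comfort": 0.2}

documents = {
    "doc1": "comfort and style",
    "doc2": "performance tuning guide",
    "doc3": "performance and comfort",
}

def score(text):
    words = set(text.lower().split())
    return sum(w for term, w in weighted_query.items() if term in words)

# Most relevant rows first, as in a weighted full-text query.
ranked = sorted(documents, key=lambda d: score(documents[d]), reverse=True)
print(ranked)  # → ['doc3', 'doc2', 'doc1']
```

All three documents match (each contains at least one term), but doc3 ranks highest because it contains both weighted terms.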
Noise words | Words that occur very frequently and are not helpful in resolving searches. Such words are referred to as noise words.
Stemming | Stemming is the process of reducing inflected words to their root. For example, if you are looking for "developing", you are probably also interested in "developed", "develop", or "developer". During the indexing phase, the stemming process normalizes all these inflected words to their root, "develop", and it does the same when querying the index (if you search for "development", it will search for "develop"). Obviously this is tied to the language of the text.
For example, SnowballAnalyzer is a Lucene.Net analyzer that tokenizes text based on grammar rules and adds a stemming phase at the end using the Snowball stemming language.
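Real stemmers such as Snowball apply language-specific grammar rules; the basic idea can be sketched with naive suffix stripping. This is a toy approximation, not the Snowball algorithm:

```python
def naive_stem(word):
    # Strip common English suffixes, longest first (a crude approximation;
    # the length check avoids mangling very short words).
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

for w in ["developing", "developed", "developer", "develops"]:
    print(w, "->", naive_stem(w))  # all normalize to "develop"
```

Applying the same normalization at both index time and query time is what lets a search for one inflected form match the others.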
Lucene-specific terminology
Directory | The directory is where Lucene indexes are stored: it can be a physical folder on the filesystem (FSDirectory) or an area in memory (RAMDirectory). The index structure is compatible with all ports of Lucene, so an index created with .NET can be searched with Java, or the other way around.
IndexWriter | This component is responsible for managing the indexes: it creates indexes, adds documents to an index, and optimizes the index.
Analyzer | The complexity of indexing resides in this component. The analyzer contains the policy for extracting index terms from the text. Several analyzers are available, both in the core library and in the contrib project of Lucene, and the Java version has even more analyzers that have not been ported to .NET yet.
StandardAnalyzer is the most commonly used analyzer in Lucene.Net; it tokenizes the text based on European-language grammars, lowercases everything, and removes English stopwords. SnowballAnalyzer is another example; it works exactly like the standard one but adds one more step at the end: a stemming phase, using the Snowball stemming language.
Document and fields | A document is a single entity that is put into the index, and it contains many fields, which are, as in a database, the individual pieces of information that make up a document.
A Lucene document is the unit of the indexing and searching process. We add documents to the index, and after searching for text we get the results back as a list of documents. A document is an unstructured collection of fields; a field is essentially a name/value pair. All of a field's parameters are specified in its constructor, for example:
new Lucene.Net.Documents.Field("bug_id", Convert.ToString(bug_id), Lucene.Net.Documents.Field.Store.YES, Lucene.Net.Documents.Field.Index.NOT_ANALYZED, Lucene.Net.Documents.Field.TermVector.YES);
Boosting | In Lucene, boosting is the ability to make something (a document, a field, a search term) more important than the others.
For example, to rank matches on a post's title higher than matches on its content, set a boost on the title field:
Field title = ...; title.SetBoost(2.0f);
SQL Server full-text search specific terminology
Full-text catalog | A full-text catalog contains zero or more full-text indexes. Full-text catalogs must reside on a local hard drive associated with the instance of SQL Server; each catalog can serve the indexing needs of one or more tables within a database. Full-text catalogs cannot be stored on removable drives, floppy disks, or network drives, except when you attach a read-only database that contains a full-text catalog.
Word breakers and stemmers | Word breakers and stemmers perform linguistic analysis on all full-text indexed data. Linguistic analysis involves finding word boundaries (word breaking) and conjugating verbs (stemming). For a given language, a word breaker tokenizes text based on the lexical rules of that language, and a stemmer generates the inflectional forms of a particular word based on the rules of that language. Both word breakers and stemmers are language specific, and a language's word breaker must be registered before it can be used. To view the list of languages whose word breakers are currently registered with SQL Server, use the following Transact-SQL statement: SELECT * FROM sys.fulltext_languages. Several licensed third-party word breakers are shipped with SQL Server 2008, and you can manually load additional third-party word breakers (and stemmers) for several languages (Danish, Polish, and Turkish). This is covered in depth on MSDN at http://msdn.microsoft.com/en-us/library/ms142509(v=SQL.105).aspx
Noise words/stopwords | Frequently occurring words that do not help the search. For example, for the English locale, words such as "a", "and", "is", and "the" are considered noise words. These words are ignored to prevent the full-text index from becoming bloated. Noise words are also known as stopwords; in SQL Server 2008, the noise words of SQL Server 2005 have been replaced by stopwords.
Stoplists | Beginning in SQL Server 2008, a system stoplist is provided that contains a basic set of stopwords (noise words). In short, stopwords are managed in databases using objects called stoplists. A stoplist is a list of stopwords that, when associated with a full-text index, is applied to full-text queries on that index. You can create, alter, and drop stoplists with simple queries such as: CREATE FULLTEXT STOPLIST myStoplist; After creating a stoplist, you can associate it with a full-text index when creating or altering the index on a table.