21st Jul, 2007

Full text search

I was doing a research on
implementing search using mySQL for my site. I learnt the following things. I
would like to share it with you.

One of the more
useful MySQL features is the ability to search for text using a
FULLTEXT index. Currently this is only available if you use the
MyISAM table type (which is the default table type, so if you don’t know
what table type you’re using, it’ll most likely be MyISAM). A fulltext
index can be created for a TEXT, CHAR or VARCHAR type field, or
combination of fields.

Example:

mysql> CREATE TABLE articles (
-> id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
-> title VARCHAR(200),
-> body TEXT,
-> FULLTEXT (title,body)
-> );

mysql> INSERT INTO articles (title,body) VALUES
-> (’MySQL Tutorial’,'DBMS stands for DataBase …’),
-> (’How To Use MySQL Well’,'After you went through a …’),
-> (’Optimizing MySQL’,'In this tutorial we will show …’),
-> (’1001 MySQL Tricks’,'1. Never run mysqld as root. 2. …’),
-> (’MySQL vs. YourSQL’,'In the following database comparison …’),
-> (’MySQL Security’,'When configured properly, MySQL …’);

mysql> SELECT * FROM articles
-> WHERE MATCH (title,body) AGAINST (’database’);
+—-+——————-+——————————————+
| id | title | body |
+—-+——————-+——————————————+
| 5 | MySQL vs. YourSQL | In the following database comparison … |
| 1 | MySQL Tutorial | DBMS stands for DataBase … |
+—-+——————-+——————————————+
2 rows in set (0.00 sec)

The MATCH() function performs a natural language search for a string
against a text collection. A collection is a set of one or more columns
included in a FULLTEXT index. The search string is given as the argument
to AGAINST(). For each row in the table, MATCH() returns a relevance
value; that is, a similarity measure between the search string and the
text in that row in the columns named in the MATCH() list.

A list of the
main features of a standard FULLTEXT search follows:

  • Excludes partial words
  • Full-text searches are supported for MyISAM tables only.
  • The argument to AGAINST() must be a
    constant string
  • Excludes words less than 4 characters in length (3 or less)
  • Excludes words that appear in more than half the rows
  • Hyphenated words are treated as two words (arthur-neil)
  • Rows are returned in order of relevance, descending
  • Words in the stopword list (common words) are also excluded from
    the search results.

MySQL can perform boolean full-text searches using the IN BOOLEAN MODE
modifier:

mysql> SELECT * FROM articles WHERE MATCH (title,body)
-> AGAINST (’+MySQL -YourSQL’ IN BOOLEAN MODE);
+—-+———————–+————————————-+
| id | title | body |
+—-+———————–+————————————-+
| 1 | MySQL Tutorial | DBMS stands for DataBase … |
| 2 | How To Use MySQL Well | After you went through a … |
| 3 | Optimizing MySQL | In this tutorial we will show … |
| 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. … |
| 6 | MySQL Security | When configured properly, MySQL … |
+—-+———————–+————————————-+

The + and - operators indicate that a word is required to be present or
absent, respectively, for a match to occur. Thus, this query retrieves
all the rows that contain the word “MySQL” but that do not contain
the word “YourSQL”.

Boolean full-text searches have these characteristics:

* They do not use the 50% threshold.
* They do not automatically sort rows in order of decreasing
relevance. You can see this from the preceding query result: The row with the
highest relevance is the one that contains “MySQL” twice, but it
is listed last, not first.
* They can work even without a FULLTEXT index, although a search
executed in this fashion would be quite slow.
* The minimum and maximum word length full-text parameters apply.
* The stopword list applies.

MySQL’s full-text search capability has few user-tunable
parameters.

  • The minimum and maximum lengths of words to be
    indexed are defined by the ft_min_word_len and
    ft_max_word_len system variables. The default
    minimum value is four characters; the default maximum is
    version dependent. If you change either value, you must
    rebuild your FULLTEXT indexes. For
    example, if you want three-character words to be searchable, you
    can set the ft_min_word_len variable by
    putting the following lines in an option file:

    [mysqld]
    ft_min_word_len=3

    Then you must restart the server and rebuild your
    FULLTEXT indexes. Note particularly the remarks
    regarding myisamchk in the
    instructions following this list.

  • To override the default stopword list, set the
    ft_stopword_file system variable. The
    stopword list is free-form. That is, you may use any
    non-alphanumeric character such as newline, space, or comma to
    separate stopwords. Exceptions are the underscore character
    (‘_’) and a single apostrophe
    (‘'’) which are treated as part of a word.
  • The 50% threshold for natural language searches is
    determined by the particular weighting scheme chosen. To
    disable it, look for the following line in
    myisam/ftdefs.h:

    #define GWS_IN_USE GWS_PROB

    Change that line to this:

    #define GWS_IN_USE GWS_FREQ

    Then recompile MySQL. There is no need to rebuild the
    indexes in this case.

To know more about Full text Search and other mysql search features
refer,

  • http://www.databasejournal.com/features/mysql/article.php/1578331
  • http://www.onlamp.com/pub/a/onlamp/2003/06/26/fulltext.html
  • http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

Leave a response

Your response:

Categories and Tags

Advertising