MySQL’s full-text search functions provide a simple framework for an easily implemented, approximate site search. Many sites, written in an interpreted language and powered by MySQL, can use MySQL’s full-text search to avoid third party dependencies.
The basics of the MySQL full-text search functions are well-documented in the MySQL online documentation. For those lacking patience, here is a quick rundown.
Full-text searching is somewhat akin to a
LIKE condition, but is much faster, requiring a
FULLTEXT index to be created for the table columns targeted in the search. To search the
description columns of a table,
entries, the following statement would create the proper index:
ALTER TABLE entries ADD FULLTEXT(title, description)
To search these columns for the text, “python threading,” the
MATCH...AGAINST functions are used:
SELECT id, MATCH(title, description) AGAINST ('python threading') AS score FROM entries ORDER BY score DESC
Notice that we keep the result of the match. The value returned is a float representing the relevance of the match. The higher the number, the more relevant the match.
There are several caveats to the full-text search. In particular, any words that are common between many entries are treated as noise and their relevance in any search is diminished. This means that were every article in
entries to be about threading in Python, searching for “python threading” may not return extremely relevant results. Refer to the MySQL docs for more information.
The hard part
If the content to be searched is not conveniently located in one table, things get more complex. In this case, a method must be devised to create an intermediary table to contain the search target.
This might be accomplished with a cron script that aggregates the information nightly or using stored procedures to keep the target table updated.
A common case is to weight the search to favor more recent results. Assuming that each entry has a
DATETIME field named
timestamp, this is easily accomplished by using the entry’s age to modify the score.
For an even reduction to the score based on the article’s age, divide the score by the age, which is determined with
(MATCH(title, description) AGAINST ('python threading'))/GREATEST(1, DATEDIFF(NOW(), TIMESTAMP))
DATEDIFF returns the difference in days, an entry written today could cause division by zero.
GREATEST means that entries written today and yesterday have equal weight, but prevents results from omitting today’s articles.
A quick test of this will show that results become wildly incorrect after a few days as the text match score begins to diminish further with age. This effect can be reduced by taking the
LOG of the age, making the divisor increase less and less the greater the age.
LOG(GREATEST(1, DATEDIFF(NOW(), TIMESTAMP)))
The use of
LOG causes a steep drop initially, smothing over time. For a less dramatic effect, substituting the square root causes a similar drop in the weight of the entry’s age over time, but diminishing less starkly over time and without the initial steep drop.
SQRT(GREATEST(1, DATEDIFF(NOW(), TIMESTAMP)))
The complete SQL statement is now:
SELECT id, (MATCH(title, description) AGAINST ('python threading'))/SQRT(GREATEST(1, DATEDIFF(NOW(), TIMESTAMP))) AS score FROM entries ORDER BY score DESC