Full-text searching with MySQL

MySQL’s full-text search functions provide a simple framework for an easily implemented, approximate site search. Many sites, written in an interpreted language and powered by MySQL, can use MySQL’s full-text search to avoid third party dependencies.

The basics

The basics of the MySQL full-text search functions are well-documented in the MySQL online documentation. For those lacking patience, here is a quick rundown.

Full-text searching is somewhat akin to a LIKE condition, but is much faster, requiring a FULLTEXT index to be created for the table columns targeted in the search. To search the title and description columns of a table, entries, the following statement would create the proper index:

ALTER TABLE entries ADD FULLTEXT(title, description)

To search these columns for the text, “python threading,” the MATCH...AGAINST functions are used:

SELECT id, MATCH(title, description) AGAINST ('python threading') AS score
FROM entries
ORDER BY score DESC

Notice that we keep the result of the match. The value returned is a float representing the relevance of the match. The higher the number, the more relevant the match.

There are several caveats to the full-text search. In particular, any words that are common between many entries are treated as noise and their relevance in any search is diminished. This means that were every article in entries to be about threading in Python, searching for “python threading” may not return extremely relevant results. Refer to the MySQL docs for more information.

The hard part

If the content to be searched is not conveniently located in one table, things get more complex. In this case, a method must be devised to create an intermediary table to contain the search target.

This might be accomplished with a cron script that aggregates the information nightly or using stored procedures to keep the target table updated.

Refining results

A common case is to weight the search to favor more recent results. Assuming that each entry has a DATETIME field named timestamp, this is easily accomplished by using the entry’s age to modify the score.

For an even reduction to the score based on the article’s age, divide the score by the age, which is determined with DATEDIFF(NOW(), timestamp).

(MATCH(title, description) AGAINST ('python threading'))/GREATEST(1, DATEDIFF(NOW(), timestamp))

Since DATEDIFF returns the difference in days, an entry written today could cause division by zero.
GREATEST means that entries written today and yesterday have equal weight, but prevents results from omitting today’s articles.

A quick test of this will show that results become wildly incorrect after a few days as the text match score begins to diminish further with age. This effect can be reduced by taking the LOG of the age, making the divisor increase less and less the greater the age.

LOG(GREATEST(1, DATEDIFF(NOW(), timestamp)))

The use of LOG causes a steep drop initially, smothing over time. For a less dramatic effect, substituting the square root causes a similar drop in the weight of the entry’s age over time, but diminishing less starkly over time and without the initial steep drop.

SQRT(GREATEST(1, DATEDIFF(NOW(), timestamp)))

The complete SQL statement is now:

SELECT id,
  (MATCH(title, description) AGAINST ('python threading'))/SQRT(GREATEST(1, DATEDIFF(NOW(), timestamp)))
  AS score
FROM entries
ORDER BY score DESC

Submit article

Artful Code