PostgreSQL/SQLAlchemy/TurboGears search

PostgreSQL 8.3+ has integrated the old tsearch2 plugin into the distribution.  This lets you do very formal "stemmed" textual queries on a body of text (set of columns in a table).  For instance, imagine we have a table "version_text" that stores all of the text in our version control system in the "text" column.  We can create an English text-search index like so:

create index fulltext_idx on version_text using 
gin(to_tsvector('english',text));

which will create an automatically updating index of the "stemmed" words for each word in "text" and allow us to query it as part of our SQLAlchemy searches like so:

from myproject.model import *
from sqlalchemy import func
DBSession.query( VersionText ).filter(
    VersionText.text.op( '@@' )( func.plainto_tsquery(search) )
)

and this will quite happily do stemmed-text searches through our VersionText index.  The where clause this creates can be combined with any other filter/join etceteras criteria to allow us to restrict our searches to a subset of records.

You can also store the results of to_tsvector() in a column in your table with a trigger to update it.  Even more involved configurations are likely for "industrial" usage, such as having the indices be per-language based on a separate language/configuration column for the text.  There's also a ts_headline( text, query ) function that can be used to generate HTML-formatted summary text for the match

Thing is, the search mechanism is not a very satisfying text-search.  The plainto_tsquery function works by doing stemming on the query terms, so if you want to search for 'Vietnamese' you wind up having to type V-i-e-t-n-a-m into your ajax-text-search box before any results will match the query.  That is, it won't match until the 'm' is added and the stem matches the stem for Vietnamese.

Comparatively, a simplistic regex query, while far less efficient and needing quite a bit of pre-processing to transform user queries into regexen (replacing special characters and the like), produces more natural-feeling results:

DBSession.query( VersionText ).filter(
   VersionText.text.op( 'ILIKE' )( '%'+search+'%' )
)

What you'd really like, I figure, is:

  • have a query that could search through a btree'd index of those lexemes defined in the index for those records that match and give you a result that showed all matching lexemes as suggestions for search-terms (that is, you'd type 'V-i' and a pop-down list would suggest "Violin|Viola|Vietnam...", with *just* values from your actual lexemes in the body of records being searched)
  • maybe make suggestions for Levenshtein edit distance X from your current search term

I haven't yet figured out how to do an index that would give that effect, that is, to do textual search/matching on the lexemes in a ts_vector field.  I figure that it must be available somewhere, but the docs don't tend to mention the low-level mechanisms for interacting with a ts_vector, so for now, no text-search in my application.

[Update] Apparently PostgreSQL 8.4 supports prefix-searching by using the function to_query with a :* suffix, like so: to_query( 'prefix:*' )

Comments

  1. Simon Willison

    Simon Willison on 10/03/2009 4:34 p.m. #

    I've been looking for a good method of implementing Ajax autocomplete against a database as well. The best option I've seen so far is to break each term in to three letter substrings (as many as you can make, so the string "equador" would be stored as "equ", "qua", "uad", "ado" and "dor") and store those in an indexed column - then only offer autocomplete on search strings that are three letters or longer. That's how Playdar from www.playdar.org does it, using SQLite. I haven't tried this out myself yet though.

Comments are closed.

Pingbacks

Pingbacks are closed.