Simple Full Text Search for App Engine
Revised 2009-07-12 to include addition of stemming, multi-word exact matching, multiple search entities, and stowing of designated parent titles in index entity keys.
After rewatching Brett Slatkin's Building Scalable, Complex Apps talk at Google I/O, I came across the old SearchableModel module and realized it could be greatly improved by using the "Relation Index" technique described in the talk. So after a day of hacking, I'm releasing an open source (MIT License) simple solution to full text search for App Engine.
This full text search module has stemming and multi-word (two and three-word) search phrases. It uses the new Task Queue API to offload indexing. If you want more features like built-in autosuggest, take a look at gae-search. [Update 2009-07-10] They have now released a free non-commercial version.
Because this module uses the relation index method described in Slatkin's talk, it's easy to add to your existing models. All indexing properties are segregated into separate Search Index entities.
An example test app can be tested live here or downloaded from the appengine-search github repository. Because pyporter2 is included as a git submodule, after you clone the repository, make sure to run a "git submodule update --init" command. That will pull down the most recent pyporter2.
Let's say you have a Page model. You can use multiple inheritance to mix-in the Searchable module:
class Page(Searchable, db.Model): author_name = db.StringProperty() content = db.TextProperty() myPage = Page(content='My amazing content!') myPage.put() myPage.enqueue_indexing(url='/tasks/searchindexing') # or if you don't worry about timeouts myPage.index()
After you edit or create a Page, you can either enqueue indexing (via the new Task Queue API) or do indexing immediately if you know you won't time out. Both approaches are baked in. (Note: if you run the test app on your dev server, remember that tasks in the queue must be manually started via the development console.)
You can limit indexing to particular properties like so:
class ContentIndexedPage(Searchable, db.Model): author = db.StringProperty() content = db.TextProperty() random_junk = db.StringProperty() INDEX_ONLY = ['author', 'content']
To retrieve Page keys or entities given search words, just use a search() method:
Page.search('statue of liberty') # Returns entities
Page.search('stuff', keys_only=True) # Returns list of (key,title)
When searching for 'statue of liberty', the search method will return a list of keys where full phrase matches are listed first. The query results for ANDed keywords, e.g., 'statue' AND 'liberty', are also returned. Please see the code documentation for more information.
Key-only searches will return a list of (key,title) tuples so you can actually display some information from the parent without getting the entire parent entity. How does this work? First you use the INDEX_TITLE_FROM_PROP class variable to declare which string property will be the "title" for your model:
class Page(Searchable, db.Model): author = db.StringProperty() content = db.TextProperty() title = db.StringProperty() INDEX_TITLE_FROM_PROP = 'title'
When index entities are created for the above Page model, we can encode the particular page's title in the key names of index entities. Think of it as a way of stowing one piece of information that can be retrieved with a low-cost query, so searches could display links with useful descriptors instead of retrieving the whole parent entity. If you change the value of the parent's title, you should reset the indexes like so:
some_page.title = 'New Title Here' some_page.put() some_page.indexed_title_changed()
Reiterating the basic idea: lists of search phrases are held on a separate Search Index entity that declares the indexed entity as a parent. This gives us a number of benefits over the old SearchableModel. Because the phrase lists are in a separate entity, we don't have the exploding indices issue any more.
If there are more than approximately 5,000 search phrases, additional Search Index entities are created to work around the 5,000 indexed properties limit. In the case of large multi-entity indices for a Page, it's possible that keywords in a search phrase will be in different entities, so ANDed keywords could fail. However, if you use a search phrase like "statue of liberty", the entire three-word phrase will be matched and work correctly.
For the typical blog entry or article, it is rare that over 5,000 search phrases will be generated from content and require multiple index entities. The entire "Alice in Wonderland" novel generated silghtly over 10,000 search phrases. You could set INDEX_MULTI_WORD to False and greatly reduce the number of search phrases as well, at the cost of precise matching of two- and three-word phrases:
class HugeContentPage(Searchable, db.Model): content = db.TextProperty() INDEX_MULTI_WORD = False
Stemming can be toggled in the Model class declaration, although it's on by default to handle inflections. To turn off stemming, do this:
class LiteralMatchPage(Searchable, db.Model): content = db.TextProperty() INDEX_STEMMING = False
Also, some people have wondered how to setup testing for App Engine python development. I've included some files under the projects /test directory that might help out. The test app uses nosetests to run a doctest, functional test, and unit test. Thanks go to the makers of nose-gae and other testers who've blogged about their lessons.
Comments are closed
18 Comments
Looks nifty, can't wait to try it! :)
Thanks for releasing the source and for the writeup!
agreed, this is very compelling. the asynchronous indexing is definitely a win (thanks task queue!), but personally i'm even more excited about the indexed properties limit workaround. that in particular would make this a much more useful interim solution until app engine offers real full text search. i can't wait!
really nice code, is stemming already supported ?
how do you manage to get this beautiful code highlighting ?
thx for the code release :)
Re: Article by Bill (2009-07-01)
It is now :)
Hi Bill,
Thanks for sharing this.
It looks like this will build indexes differently from Google's searchable model. Do you have any idea how they compare in terms of CPU usage? Google's seems to be a bit of a hog.
Re: Article by Bill (2009-07-11)
You are correct. There's an issue with doing AND of non-adjacent search words held on separate index entities. I should have caught that. We could treat search words as a literal search phrase together with two and three-word indexes (in progress). In that case, the index would contain "hello world" in one entity. Have to think about it, and suggestions are welcome.
the index are stored as is in lower case. By converting the word to
ascii you can match non-English words better. E.g. searching for
"bjorn" should match "björn," like google does.
Also, afaict the module didn't support paginating the search results
which I had to add. Performance is linear to the number of results
returned. It became unacceptably slow when more than 100 results were
returned. To add pading I used a technique similar to the one
described here:
http://code.google.com/intl/sv-SE/appengine/articles/paging.html
Basically you have a bookmark which is an entity key and an offset
that specifies how many entites you want to jump forward or
backward. Note that that requires an extra index because you have to
sort the __key__ attribute in descending order to be able to go to
previous pages. I think the technique can be extended to work with
arbitrary sort orders too, but I haven't gotten that far yet.
Then to decide how many result pages to show you need to have a
separate word count entity. So if you search for 'foo bar' and foo
occurs 100 and bar 200 times in the indexed entities, then if the
number of documents is 1000 then you will roughly get 1000 *
(200/1000) * (100/1000) = 20 hits (assuming there is no correlation
between the occurance of the word 'foo' and 'bar').
If you are interested in the code I can put it online somewhere.
You should strip tags from the comments
Re: Article by Bill (2009-08-21)
Re: Article by vern (2009-08-27)
When i run the sample code i get "No module named pyporter2". What is wrong?