How to Search Text Documents with Whoosh

Whoosh is a pure-Python library of classes and functions for indexing text and then searching the index. If an application needs to search text documents, Whoosh can handle the task. This post summarizes the main steps needed to implement search with Whoosh.

Text Search

Using Whoosh consists of indexing documents and then querying (searching) the index.
First, we import the required modules:

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

To index documents, we need to define a folder where Whoosh will save the index files.

import os.path
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

We also need to define a Schema, which specifies the fields of the documents in an index. Each document can have multiple fields, such as title, content, url, date, etc.

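schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))

With the schema defined, we create the index in the folder and add a few documents through a writer. The changes are saved only when the writer is committed.

ix = index.create_in("indexdir", schema)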

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my python document! hello big world",
                    path=u"/a")
writer.add_document(title=u"Second try", content=u"This is the second example hello world.",
                    path=u"/b")
writer.add_document(title=u"Third time's the charm", content=u"More examples. Examples are many.",
                    path=u"/c")

writer.commit()

Once the index is created, we can search the documents through it:

from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("hello world")
    # terms=True records which query terms matched in each hit
    results = searcher.search(query, terms=True)
    print(results[0])

    for r in results:
        print(r, r.score)
        # Was this results object created with terms=True?
        if results.has_matched_terms():
            # What terms matched in the results?
            print(results.matched_terms())

    # What terms matched in each hit?
    print("matched terms")
    for hit in results:
        print(hit.matched_terms())

The output that we get:

<Hit {'path': '/b', 'title': 'Second try', 'content': 'This is the second example hello world.'}>
<Hit {'path': '/b', 'title': 'Second try', 'content': 'This is the second example hello world.'}> 2.124137931034483
{('content', b'hello'), ('content', b'world')}
<Hit {'path': '/a', 'title': 'My document', 'content': 'This is my python document! hello big world'}> 1.7906976744186047
{('content', b'hello'), ('content', b'world')}
matched terms
[('content', b'hello'), ('content', b'world')]
[('content', b'hello'), ('content', b'world')]

Whoosh has many features that can enhance searching. For example, we can retrieve more documents that are similar to a given search hit with more_like_this(). This requires that the field we match on is vectored or stored, or that we have access to the original text (such as from a database). Here is an example:

    # Still inside the "with ix.searcher()" block from above
    print("more_results")
    first_hit = results[0]
    more_results = first_hit.more_like_this("content")
    print(more_results)

Output:

more_results
<Top 1 Results for Or([Term('content', 'example', boost=0.6588835188105945), Term('content', 'second', boost=0.6588835188105945), Term('content', 'hello', boost=0.5617184491361429), Term('content', 'world', boost=0.5617184491361429)]) runtime=0.0038603000000136944>  

To find the number of matched documents we can call len(results), but on a very large index this can cause a noticeable delay. We can avoid the delay by asking for a low and a high estimate instead:

found = results.scored_length()
if results.has_exact_length():
    print("Scored", found, "of exactly", len(results), "documents")
else:
    low = results.estimated_min_length()
    high = results.estimated_length()

    print("Scored", found, "of between", low, "and", high, "documents")    

Below is the full Python source code for the steps above, followed by references to the Whoosh documentation and other articles about Whoosh, including how to use Whoosh with pandas and how to use Whoosh with web2py in a web crawling project.

# -*- coding: utf-8 -*-

from whoosh.fields import Schema, TEXT, ID
from whoosh import index

import os.path

# To create an index in a directory, use index.create_in.
# The directory has to exist first.
if not os.path.exists("indexdir"):
    os.mkdir("indexdir")

schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT(stored=True))


ix = index.create_in("indexdir", schema)

writer = ix.writer()
writer.add_document(title=u"My document", content=u"This is my python document! hello big world",
                    path=u"/a")
writer.add_document(title=u"Second try", content=u"This is the second example hello world.",
                    path=u"/b")
writer.add_document(title=u"Third time's the charm", content=u"More examples. Examples are many.",
                    path=u"/c")

writer.commit()


from whoosh.qparser import QueryParser

with ix.searcher() as searcher:
    query = QueryParser("content", ix.schema).parse("hello world")
    # terms=True records which query terms matched in each hit
    results = searcher.search(query, terms=True)
    print(results[0])

    for r in results:
        print(r, r.score)
        # Was this results object created with terms=True?
        if results.has_matched_terms():
            # What terms matched in the results?
            print(results.matched_terms())

    # What terms matched in each hit?
    print("matched terms")
    for hit in results:
        print(hit.matched_terms())

    # Find documents similar to the best hit
    print("more_results")
    first_hit = results[0]
    more_results = first_hit.more_like_this("content")
    print(more_results)

    # Counting matches may need the open searcher, so this stays inside the with block
    found = results.scored_length()
    if results.has_exact_length():
        print("Scored", found, "of exactly", len(results), "documents")
    else:
        low = results.estimated_min_length()
        high = results.estimated_length()

        print("Scored", found, "of between", low, "and", high, "documents")

References

1. Quickstart
2. Developing a fast Indexing and Full text Search Engine with Whoosh: A Pure-Python Library
3. Whoosh, Pandas, and Redshift: Implementing Full Text Search in a Relational Database
4. Using Whoosh with web2py