Non-Standard Ways of Using Lucene

For our recent online shop project, we required a full-text, multi-criteria product search. Lucene, the popular Java search engine, is an ideal candidate for this functionality. But in order to meet the high performance requirements, we had to extend its usage beyond standard full-text search. This post describes our solution, including index switching and the use of Lucene as a simple NoSQL database.

Searching is a frequent activity on the web and one of the most important features of every online shop. With a powerful search facility, customers can easily specify what they want and find the product they are looking for. This includes all kinds of standard catalog search (e.g. by category) as well as full-text search.

Traditional web shop solutions implement a catalog search with relational database queries. Full-text search tasks can then be implemented either by native features of the databases, or by using external indexing engines. Lucene is an open-source Java indexing engine library. It is used by a huge number of web sites and applications. Internally, Lucene stores data in a flat storage structure, where each record consists of several fields as key/value pairs. In Lucene terms, such a record is called a Document.

We have had very good experience with Lucene in previous projects and decided to extend its usage beyond standard full-text search. Here are a couple of non-standard usages of Lucene that we’ve used in our projects and that you might find interesting too.

Criteria search with Lucene

Our recent shop project had high performance requirements for a multi-criteria search. Specifically, we required an average response time of less than 100 ms with more than 200 concurrent queries. The application itself is written in Java and can be clustered easily.

We are using RAMDirectory to load the index entirely into RAM (see the section “Searching in RAM” below). This is especially convenient because we have a pre-built index on disk and can simply read the whole thing into RAM for faster searching. In such a setup, the size of the search index is limited only by the heap memory available to the JVM process. A mid-sized shop contains up to hundreds of thousands of products, which should not reach the usual memory limits. In our case, the index size is about 15 MByte per language, adding up to 60 MByte in total. However, if the number of indexed items is very high (resulting in big indexes) or a distributed search is needed, then Solr could be considered as an alternative. See also our blog series about using Solr/Lucene with Hadoop.

Mapping Relational Data to the Lucene Index

Our first challenge was to represent relations in the Lucene index, i.e. searchable data that comes from embedded or associated entities (*-to-many associations). For example, when searching products, one might restrict the search to specific categories, such as shoes. In relational databases, a SQL query on an entity and its associations can easily join several tables by primary/foreign keys. Such a join cannot be done with Lucene’s document model in a straightforward way.

However, there is a workaround at the cost of storing redundant data: a document may contain several fields with the same key, e.g. key/value = “category/shoes” and “category/basketball-shoes”. This can be used for *-to-many associations, where the associated entity is represented by its business key (e.g. the unique category key from the ERP system) or by its primary key from the database. All required data must then be collected during index creation and assembled into index documents. In our case, this means the transformation of the domain model (left side) into Lucene’s flat document structure (right side):

Extracting information from a domain entity and its associated entities into a single Lucene document.

When searching for a product with a given relation, e.g. for products of a specific category, the search is performed for all documents that contain the relation’s key with the given value, e.g. documents containing the key “category” with the value “basketball-shoes”.
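The flattening and the lookup by relation key can be illustrated with a small sketch. It is deliberately written in plain Java without any Lucene dependency: FlatDocument is a hypothetical stand-in for a Lucene Document, and ProductFlattener, the field names and the linear scan in search() are assumptions made for illustration only. In a real index, the lookup would be a term query on the “category” field rather than a loop.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a flat index document: field key -> all values stored under it. */
final class FlatDocument {
    private final Map<String, List<String>> fields = new LinkedHashMap<>();

    /** Adds a value; repeated calls with the same key produce a multi-valued field. */
    FlatDocument add(String key, String value) {
        fields.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        return this;
    }

    /** All values stored under the key, or an empty list if the key is absent. */
    List<String> values(String key) {
        List<String> found = fields.get(key);
        return found == null ? Collections.<String>emptyList() : found;
    }

    /** True if the document carries exactly this key/value pair. */
    boolean matches(String key, String value) {
        return values(key).contains(value);
    }
}

final class ProductFlattener {
    /** Flattens a product and the keys of its associated categories into one document. */
    static FlatDocument flatten(String productId, String name, List<String> categoryKeys) {
        FlatDocument doc = new FlatDocument().add("id", productId).add("name", name);
        for (String categoryKey : categoryKeys) {
            doc.add("category", categoryKey); // same key, one field per association
        }
        return doc;
    }

    /** Returns the ids of all documents that contain the given key/value pair. */
    static List<String> search(List<FlatDocument> index, String key, String value) {
        List<String> ids = new ArrayList<>();
        for (FlatDocument doc : index) {
            if (doc.matches(key, value)) {
                ids.addAll(doc.values("id"));
            }
        }
        return ids;
    }
}
```

The redundancy mentioned above is visible here: the category key is copied into every product document that belongs to the category, which is why a change to a category requires touching all of those documents.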

Please note that this approach is not suitable when the associated entity in a “*-to-many” relation changes frequently, because in this case either the complete index or at least all documents containing the changed data would have to be updated. This might be a time-consuming operation. However, for our shop engine, this concern is not relevant, because data is changed only within the underlying ERP system, and each publication triggers a complete rebuild of the index (see below). Thus, the approach can be used without any limitations.

A sidenote: If you’re using JPA with Hibernate, you might want to have a look at Hibernate Search. It uses special annotations and automates the indexing and extracting of data from the entities, and even supports the indexing of embedded and associated JPA entities. For one-to-many and many-to-many relations, it uses the same “trick” as mentioned above.

Lucene as a NoSQL Database

Our first versions of the search returned only the IDs of the database records that met the given condition. The records were then loaded from the database. This approach, combined with a properly configured Hibernate second-level query cache, is a good fit for most use cases.

However, the performance can still be improved. For example, the search result page does not display all product information: typically only brand, product name and price are shown, but not the detailed product description. Lucene allows storing “result values” in the index. These are not processed, i.e. they are kept untokenized, and can be retrieved for displaying the result. For example, like this:

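Field field = new Field("brand", brand.getName(),
        Field.Store.YES, Field.Index.NOT_ANALYZED); // assumed arguments (Lucene 3.x style): stored, untokenized; the original line was cut off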

And after performing the search, the value can be easily retrieved from the query result:

String brand = document.get("brand");

Like the brand, we also store product name and price in the index, so that all data needed to build the search result page can be retrieved from Lucene alone. Specifically, there is no need to make an extra database query afterwards to load them. A similar effect can be achieved by a second-level cache containing all products.
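The difference between values that are only searchable and values that are stored for display can be sketched as follows. This is a plain-Java model without Lucene, intended only to illustrate the idea: IndexEntry, ResultPage, the whitespace-style tokenization and the output format are all hypothetical and not part of Lucene or of our shop code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical index entry: tokens used only for matching, plus stored values returned verbatim. */
final class IndexEntry {
    final Set<String> searchTokens = new HashSet<>();
    final Map<String, String> stored = new LinkedHashMap<>();

    /** Tokenizes text for matching; the original text itself is not kept. */
    IndexEntry index(String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!token.isEmpty()) {
                searchTokens.add(token);
            }
        }
        return this;
    }

    /** Keeps the value untokenized so that it can be shown on the result page. */
    IndexEntry store(String key, String value) {
        stored.put(key, value);
        return this;
    }
}

final class ResultPage {
    /** Builds result lines from stored values only; no database lookup is involved. */
    static List<String> render(List<IndexEntry> index, String queryToken) {
        String token = queryToken.toLowerCase(Locale.ROOT);
        List<String> lines = new ArrayList<>();
        for (IndexEntry entry : index) {
            if (entry.searchTokens.contains(token)) {
                lines.add(entry.stored.get("brand") + " " + entry.stored.get("name")
                        + ", " + entry.stored.get("price"));
            }
        }
        return lines;
    }
}
```

In this model the long product description would only be passed to index(), while brand, name and price would be passed to store(), so the entry stays small and still carries everything the result page needs.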

Concurrent Rebuilding and Searching with Index Switching

The index creation runs in an external process that is triggered after the data from the ERP system has been imported into the SQL database. This is typically done once a day, or at most several times a day. The imported data is read-only in the SQL database. Once the import is finished, all available products are indexed, so the index is always rebuilt completely.

Unfortunately, Lucene has a restriction in that an index cannot be updated while it is open for reading by another process. But this issue can easily be solved by working with two indexes: one for searching, the other for updating. While one index is used by the shop for searching, the other can be recreated in the background. Once the new index is ready, the clustered shop application is notified via JMS and opens the index that has just been recreated. The original search index stays untouched and is thus available for the next import and index re-creation.

The principle of index switching allows concurrent rebuilding and searching by different JVM processes.
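The bookkeeping behind the switch can be sketched in a few lines of plain Java. This is only a minimal model under stated assumptions: the class IndexSwitch and the location names are hypothetical, and the JMS notification and the actual opening of the Lucene index are left out. switchOver() stands for whatever the shop application does when it receives the message that the rebuild has finished.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper that tracks which of two index locations serves searches. */
final class IndexSwitch {
    private final String[] locations;
    private final AtomicInteger active = new AtomicInteger(0);

    IndexSwitch(String first, String second) {
        this.locations = new String[] {first, second};
    }

    /** Location currently used for searching. */
    String searchLocation() {
        return locations[active.get()];
    }

    /** Location that may be rebuilt without disturbing running searches. */
    String rebuildLocation() {
        return locations[1 - active.get()];
    }

    /** Flips the roles once the rebuild is complete; returns the new search location. */
    String switchOver() {
        return locations[active.updateAndGet(i -> 1 - i)];
    }
}
```

After a switch, the previous search location becomes the rebuild location, so the two indexes simply alternate roles with every import.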

Searching in RAM

The index is normally stored in the file system, and all operations are performed there. If there is enough JVM heap space, the index files can be loaded into memory using RAMDirectory, so that search operations no longer require any disk access. This increases performance if the file system does not already keep recently accessed files in shared memory:

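if (ramSearch) {
    // Copy the on-disk index into heap memory; searches then need no disk access.
    searcher = new IndexSearcher(new RAMDirectory(directory));
} else {
    // Search the index directly in the file system.
    searcher = new IndexSearcher(directory);
}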

Other high-performance options are discussed in chapter 9.4 of the latest edition of the Lucene in Action book.