Lucene Solr Revolution 2013 in Dublin
10 Nov 2013
I just returned from Lucene Solr Revolution Europe, the conference on everything Lucene and Solr which this year was held in Dublin. I always like to recap what I took from a conference so here are some impressions.
In the spirit of last year's conference, which was merged with ApacheCon and held in a soccer stadium in Sinsheim, this year's venue was a rugby stadium. It seems to be quite common for conferences to be held there, and the location was well suited. Some of the room changes meant walking quite a distance, but that was manageable.
As there were four tracks in parallel, choosing which talk to attend could be difficult; there were so many interesting things to choose from. Fortunately all the talks have been recorded and will be made available for free on the conference website.
The following is a selection of the talks that were most valuable to me.
Keynote: Michael Busch on Lucene at Twitter
Michael Busch is a regular speaker at search conferences because Twitter is doing some interesting things. On the one hand they have to handle near-real-time search, massive data sets and lots of requests. On the other hand they can always be sure that their documents are short and bounded in size. They maintain two different data stores as Lucene indices: the real-time index, which contains the most recent data, and the archive index, which makes older tweets searchable. They introduced the archive index only a few months ago, which in my opinion led to a far more reliable search experience. They have done some really interesting things, like encoding the position info of a term together with the doc id, because they only need a few bits to address positions in a 140-character document. They also changed some aspects of the posting list encoding because they always display results sorted by date. They are trying to make their changes more general so that they can be contributed back to Lucene.
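As a rough illustration of the doc id and position trick (not Twitter's actual code), a posting can be packed into a single 32-bit int. The 24/8 bit split, the class name `PackedPosting` and its method names are assumptions made for this sketch; the talk recap only says that a few bits are enough for positions in a 140-character document.

```java
// Hypothetical sketch: pack a doc id and a term position into one 32-bit int.
// The 24/8 bit split is an assumption, not the layout Twitter uses.
final class PackedPosting {
    // 8 bits address positions 0..255, enough for a 140-character document.
    static final int POSITION_BITS = 8;
    static final int POSITION_MASK = (1 << POSITION_BITS) - 1;
    // The remaining 24 bits hold the doc id.
    static final int MAX_DOC_ID = (1 << (32 - POSITION_BITS)) - 1;

    static int encode(int docId, int position) {
        if (docId < 0 || docId > MAX_DOC_ID) {
            throw new IllegalArgumentException("doc id out of range: " + docId);
        }
        if (position < 0 || position > POSITION_MASK) {
            throw new IllegalArgumentException("position out of range: " + position);
        }
        // Doc id in the high bits, position in the low bits.
        return (docId << POSITION_BITS) | position;
    }

    static int docId(int posting) {
        // Unsigned shift so a set top bit does not turn the doc id negative.
        return posting >>> POSITION_BITS;
    }

    static int position(int posting) {
        return posting & POSITION_MASK;
    }
}
```

With a layout like this, one int per occurrence replaces a separate position entry, at the cost of limiting how many documents a single segment can hold.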
Solr Indexing and Analysis Tricks by Erik Hatcher
Hacking Lucene and Solr for Fun and Profit by Grant Ingersoll
Grant Ingersoll presented some applications of Lucene and Solr that do not directly involve search, such as classification, recommendations and analytics. Some examples were taken from his excellent book Taming Text (watch this blog for a review of the book in the near future).
Schemaless Solr and the Solr Schema REST API by Steve Rowe
One reason for the success of Elasticsearch is its ease of use. You can download it and start indexing documents immediately without doing any configuration work. One of the features that enables this is the autodiscovery of fields by value. Starting with Solr 4.4 you can use Solr in a similar way by configuring it to manage your schema. Unknown fields are then created automatically, based on the first value that is extracted by the configured parsers. As with Elasticsearch, you shouldn't rely on this feature exclusively, so there is also a way to add new fields of a certain type via the Schema REST API. When Solr is in managed mode it rewrites the schema file itself, so you might lose changes you made manually. For the future the developers are even thinking about moving away from XML for the managed mode, as there are better options when readability doesn't matter.
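The "first value wins" behaviour, and why relying on it exclusively is risky, can be sketched in a few lines. This is a simplified illustration, not Solr's implementation: the classes `FieldTypeGuesser` and `GuessingSchema`, the parser order and the type names are all hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of guessing a field type from a value.
// Parser order and type names are assumptions, not Solr's configuration.
final class FieldTypeGuesser {
    static String guessType(String value) {
        String v = value.trim();
        if (v.equalsIgnoreCase("true") || v.equalsIgnoreCase("false")) {
            return "boolean";
        }
        try {
            Long.parseLong(v);
            return "long";
        } catch (NumberFormatException ignored) {
            // not an integer, try the next parser
        }
        try {
            Double.parseDouble(v);
            return "double";
        } catch (NumberFormatException ignored) {
            // not a number either, fall back to text
        }
        return "text";
    }
}

// A schema that fixes the type of an unknown field on first sight.
final class GuessingSchema {
    private final Map<String, String> fieldTypes = new LinkedHashMap<>();

    String typeFor(String field, String value) {
        // Only the first value seen for a field decides its type.
        return fieldTypes.computeIfAbsent(field, f -> FieldTypeGuesser.guessType(value));
    }
}
```

If the first document happens to carry "42" in a field that later contains "42b", the field has already been fixed as a number, which is the kind of surprise that explicitly adding fields through the Schema REST API avoids.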
Stump the Chump with Chris Hostetter
This seems to be a tradition at Lucene Solr Revolution. Chris Hostetter has to find solutions to problems that were submitted beforehand or are posed by the audience. It's a fun event but you can also learn a lot.
Query Latency Optimization with Lucene by Stefan Pohl
Stefan first introduced some basic latency factors and how to measure them. He recommended not instrumenting the low-level Lucene classes when profiling your application, as those rely heavily on HotSpot optimizations. Besides introducing the basic mechanisms of how conjunction (AND) and disjunction (OR) work, he described some recent Lucene improvements that can speed up your application, among them LUCENE-4571, the new minShouldMatch implementation, and LUCENE-4752, which allows custom ordering of documents in the index.
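The basic mechanics of conjunction and disjunction over sorted posting lists can be sketched as follows. This is a naive illustration over plain int arrays, not Lucene's scorer code, and it does not reproduce the optimizations from LUCENE-4571; the class `PostingListOps` and its method names are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Naive sketch of AND / OR over posting lists sorted by ascending doc id.
final class PostingListOps {

    // Conjunction: always advance the list that is behind.
    static List<Integer> and(int[] a, int[] b) {
        List<Integer> hits = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.length && j < b.length) {
            if (a[i] == b[j]) {
                hits.add(a[i]);
                i++;
                j++;
            } else if (a[i] < b[j]) {
                i++;
            } else {
                j++;
            }
        }
        return hits;
    }

    // Disjunction: merge all lists and keep a doc id only if at least
    // minShouldMatch lists contain it (minShouldMatch = 1 is a plain OR).
    static List<Integer> or(int minShouldMatch, int[]... lists) {
        List<Integer> hits = new ArrayList<>();
        int[] cursor = new int[lists.length];
        while (true) {
            // Smallest doc id under any cursor; MAX_VALUE means all lists are exhausted.
            int next = Integer.MAX_VALUE;
            for (int k = 0; k < lists.length; k++) {
                if (cursor[k] < lists[k].length) {
                    next = Math.min(next, lists[k][cursor[k]]);
                }
            }
            if (next == Integer.MAX_VALUE) {
                break;
            }
            // Count the lists positioned on that doc id and move them forward.
            int matches = 0;
            for (int k = 0; k < lists.length; k++) {
                if (cursor[k] < lists[k].length && lists[k][cursor[k]] == next) {
                    matches++;
                    cursor[k]++;
                }
            }
            if (matches >= minShouldMatch) {
                hits.add(next);
            }
        }
        return hits;
    }
}
```

The sketch shows why disjunctions tend to cost more than conjunctions: the AND loop can discard a document as soon as one list lacks it, while the naive OR loop visits every doc id in every list.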
Relevancy Hacks for eCommerce by Varun Thacker
Varun introduced the basics of relevancy sorting in Lucene and Solr and how those might affect product searches. TF/IDF is sometimes not the best solution ("IDF is a measurement of rarity not necessarily importance"). He also showed ways to influence relevancy: implementing a custom Similarity class, boosting and function queries.
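To see why rarity and importance can diverge, it helps to look at the numbers. The sketch below follows the shape of the classic Lucene formulas (square-root term frequency, logarithmic inverse document frequency); treat the exact constants as an assumption, and note that the class `ClassicTfIdf` is a stand-alone illustration, not a Lucene Similarity implementation.

```java
// Stand-alone sketch of classic tf and idf factors.
// Assumption: tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1)).
final class ClassicTfIdf {

    // Grows with the number of occurrences, but with diminishing returns.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Grows as fewer documents contain the term.
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }
}
```

In a hypothetical catalog of 1000 products, a misspelled brand name that occurs in a single document gets a much higher idf than a category term that occurs in 500 documents, although the category term is probably what the shopper cares about. Boosts and function queries are ways to correct for that.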
What is in a Lucene Index by Adrien Grand
Adrien started with the basics of a Lucene index and how it differs from a database index: the dictionary structure, segments and merging. He then moved on to topics like the structure of the posting list, term vectors, the FST terms index and the difference between stored fields and doc values. This is a talk full of interesting details on the internal workings of Lucene and the implications for the performance of your application.
As mentioned, I couldn't attend all the talks I would have liked to. I heard especially good things about the following talks, which I will watch as soon as they are available:
- Integrate Solr with Real-Time Stream Processing Applications by Timothy Potter
- The Typed Index by Christoph Goller
- Implementing a Custom Search Syntax Using Solr, Lucene and Parboiled by John Berryman
I really enjoyed Lucene Solr Revolution. Not only were there a lot of interesting talks to listen to, but it was also a good opportunity to meet new people. On both evenings there were get-togethers with free drinks and food, which must have cost LucidWorks a fortune. I couldn't attend the closing remarks, but I heard they announced that they want to move to smaller, national events in Europe instead of the central conference. I hope those will still be events that attract so many committers and interesting people.