Berlin Buzzwords 2012
07 Jun 2012
Berlin Buzzwords is an annual conference on search, store and scale technology. I've heard good things about it before and finally got convinced to go there this year. The conference itself lasts for two days but there are additional events before and afterwards so if you like you can spend a whole week.
As I had to travel on Sunday anyway I took an earlier train to attend the barcamp in the early evening. It started with a short introduction of the concepts and the scheduling. Participants could suggest topics that they would either be willing to introduce themselves or were just interested in. There were three rooms prepared, a larger one and two smaller ones.
Among others I attended sessions on HBase, designing hybrid applications, Apache Tika and Apache Jackrabbit Oak.
HBase is a distributed database built on top of the Hadoop filesystem. It seems to be used more often than I would have expected. It was interesting to hear about the problems and solutions of other people.
The next session on hybrid relational and NoSQL applications stayed rather high level. I liked the remark by one guy that Solr, the underdog of NoSQL, often is the first application where people are ok with dismissing some guarantees regarding their data. Adding NoSQL should be exactly like this.
I only recently started using Tika directly so it was really interesting to see where the project is heading in the future. I was surprised to hear that there now also is a TikaServer that can do things similar to those I described for Solr. That's something I want to try in action.
Jackrabbit Oak is a next generation content repository that is mostly driven by the Day team of Adobe. Some of the ideas sound really interesting but I got the feeling that it may still take some time until it can really be used. Jukka Zitting also gave a lightning talk on this topic at the conference; the slides are available here.
The atmosphere in the sessions was really relaxed so even though I expected to only listen I took the chance to participate and ask some questions. This probably is the part that makes a barcamp as effective as it is. As you are constantly participating you stay really concentrated on the topic.
The first day started with a great keynote by Leslie Hawthorn on building and maintaining communities. She compared a lot of the aspects of community work with gardening and introduced OpenMRS, a successful project building a medical record platform. Though I currently am not actively involved in an open source project I could relate to a lot of the situations she described. All in all an inspiring start of the main conference.
Next I attended a talk on building hybrid applications with MongoDB. Nothing new for me but I am glad that a lot of people now recommend splitting monolithic applications into smaller services. This also is a way to experiment with different languages and techniques without having to migrate large parts of an application.
A JCR view of the world provided some examples on how to model different structures using a content tree. Though really introductory it was interesting to see what kind of applications can be built using a content repository. I also liked the attitude of the speaker: The presentation was delivered using Apache Sling which uses JCR under the hood.
Probably the highlight of the first day was the talk by Grant Ingersoll on Large Scale Search, Discovery and Analytics. He introduced all the parts that make up larger search systems and showed the open source tools he uses. To increase the relevance of the search results you have to integrate solutions that adapt to the behaviour of the users. That's probably one of the big takeaways of the whole conference for me: Always collect data on your users' searches to have it available when you want to tune the relevance, either manually or through some learning techniques. The slides of the talk are worth looking at.
The rest of the day I attended several talks on the internals of Lucene. Hardcore stuff, I would be lying if I said I understood everything but it was interesting nevertheless. I am glad that some really smart people are taking care that Lucene stays as fast and feature rich as it is.
The day ended with interesting discussions and some beer at the Buzz Party and BBQ.
The first talk of the second day on Smart Autocompl... by Anne Veling was fantastic. Anne demonstrated a rather simple technique for doing semantic analysis of search queries for specialized autocompletion for the largest travel information system in the Netherlands. The query gets tokenized and then each field of the index (e.g. street or city) is queried for each of the tokens. This way you can already guess which might be good field matches.
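To make the idea a bit more concrete, here is a minimal sketch in Python. It is not the implementation shown in the talk: the tiny in-memory index, the sample values, the prefix matching and the function names are all my own assumptions, and a real system would query a search index instead of a dictionary.

```python
# Hypothetical toy "index": field name -> set of known lowercase values.
# A real system would ask a search index (e.g. Solr) instead.
FIELD_VALUES = {
    "street": {"damrak", "kalverstraat", "coolsingel"},
    "city": {"amsterdam", "rotterdam", "utrecht"},
}


def tokenize(query):
    """Split the query into lowercase whitespace-separated tokens."""
    return query.lower().split()


def guess_fields(query, field_values=FIELD_VALUES):
    """For each token, list the fields that contain a value starting with it."""
    guesses = {}
    for token in tokenize(query):
        guesses[token] = sorted(
            field
            for field, values in field_values.items()
            if any(value.startswith(token) for value in values)
        )
    return guesses


if __name__ == "__main__":
    # Each token is checked against every field to see where it could belong.
    print(guess_fields("Damrak Amst"))
```

With the per-token guesses in hand, the autocompletion could then prefer suggestions where every token is matched by a different field, though how the results are combined and ranked is left open here.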
Another talk introduced a scalable tool for preprocessing of documents, Hydra. It stores the documents as well as mapping data in a MongoDB instance and you can parallelize the processing steps. The concept sounds really interesting, I hope I can find time to have a closer look.
In the afternoon I attended several talks on Elasticsearch, the scalable search server. Interestingly a lot of people seem to use it more as a storage engine than for searching.
One of the tracks was cancelled, so Ted Dunning introduced new stuff in Mahout instead. He's a really funny speaker and though I am not deep into machine learning I was glad to hear that you are allowed to use and even contribute to Mahout even if you don't have a PhD.
In the last track of the day Alex Pinkin showed 10 problems and solutions that you might encounter when building a large app using Solr. Quite a lot of useful advice.
The event took place at Urania, a smaller conference center and theatre. It was mostly well suited but some of the talks were so full that you either had to sit on the floor or weren't even able to enter the room. I understand that it is difficult to predict how many people will attend a certain event but some talks probably should have been scheduled in different rooms.
The food was really good and though it first looked like the distribution was a bottleneck this worked pretty well.
This year Berlin Buzzwords had a rather unusual format. Most of the talks were only 20 minutes long with some exceptions that were 40 minutes long. I have mixed feelings about this: On the one hand it was great to have a lot of different topics. On the other hand some of the concepts definitely would have needed more time to fully explain and grasp. Respect to all the speakers who had to think about what they would talk about in such a short timeframe.
Berlin Buzzwords is a fantastic conference and I will definitely go there again.