and the languages of Singapore


Florian Hopf / @fhopf

elasticsearch

  • distributed search engine
  • HTTP and JSON
  • written in Java, uses Lucene
  • text search support for many languages

Installation


wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.4.1.zip
# zip is for Windows and Linux
unzip elasticsearch-5.4.1.zip
elasticsearch-5.4.1/bin/elasticsearch

HTTP and JSON

Indexing data


curl -XPOST "http://localhost:9200/voxxed/doc" -d '
{
  "title": "Hello world!",
  "content": "Hello Voxxed Days Singapore!"
}'

Searching data


curl -XPOST "http://localhost:9200/voxxed/doc/_search" -d '
{
  "query": {
    "match": {
      "content": "Singapore"
    }
  }
}'

Searching data


{
  "took" : 127,
  [...]
  },
  "hits" : {
    "total" : 1,
    "max_score" : 0.2876821,
    "hits" : [
      {
        "_index" : "voxxed",
        "_type" : "doc",
        "_id" : "AVwAP4Aw9lCQvRKyIhgJ",
        "_score" : 0.2876821,
        "_source" : {
          "title" : "Hello world!",
          "content" : "Hello Voxxed Days Singapore!"
        }
      }
    ]
  }
}

Inverted index

Term        Doc Id
days        1
hello       1
singapore   1
voxxed      1
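
A minimal sketch of how such a term table could be built. This is a toy model in Python, not Lucene's actual data structure; the function names `analyze` and `build_inverted_index` are made up for illustration, and the regex-based tokenization only approximates the standard analyzer.

```python
import re
from collections import defaultdict


def analyze(text):
    # Toy stand-in for the standard analyzer: extract word runs, then lowercase.
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(docs):
    # Map every term to the sorted ids of the documents that contain it.
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyze(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {1: "Hello Voxxed Days Singapore!"}
index = build_inverted_index(docs)

# Print one line per term, similar to the table on the slide.
for term, doc_ids in index.items():
    print(term, doc_ids)
```

A search for a term then becomes a dictionary lookup such as `index.get("singapore", [])` instead of a scan over all documents.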

Analyzing

  • Process incoming text
  • Analyzer
    • Tokenizer
    • TokenFilter
  • Default: Standard Analyzer
    • Splits on word boundaries
    • Lowercases
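
The tokenizer and token filter chain can be sketched as follows. This is an illustrative Python model under simplifying assumptions, not the Lucene API: the `Analyzer` class, `standard_tokenizer` and `lowercase_filter` are hypothetical names, and the regex only roughly imitates word-boundary splitting.

```python
import re


def standard_tokenizer(text):
    # Rough approximation of word-boundary splitting: runs of word characters.
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    # Token filter: normalizes every token to lowercase.
    return [token.lower() for token in tokens]


class Analyzer:
    """One tokenizer followed by any number of token filters."""

    def __init__(self, tokenizer, filters):
        self.tokenizer = tokenizer
        self.filters = filters

    def analyze(self, text):
        tokens = self.tokenizer(text)
        for token_filter in self.filters:
            tokens = token_filter(tokens)
        return tokens


standard = Analyzer(standard_tokenizer, [lowercase_filter])
print(standard.analyze("Hello Voxxed Days Singapore!"))
```

Swapping the tokenizer or appending further filters to the list is what a language-specific or custom analyzer amounts to in this model.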

Language specific analyzing

  • Prebuilt analyzers available
    • Character normalization
    • Stemming: Reduce words to base form
    • ...
  • Custom analyzers
  • Configured upfront in the mapping

Configuring the English analyzer


curl -XPUT "http://localhost:9200/voxxed_en" -d'
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text", 
          "analyzer": "english"
        }
      }  
    }
  }
}'

Analyzing

  • Index terms determine search quality
  • Analysis happens at both index time and search time
  • Stemming: days -> day
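
A small sketch of why the same analysis has to run at index time and at search time. The one-rule `toy_stem` function is a made-up stand-in; the real English analyzer uses a far more complete stemmer, and the punctuation handling here is deliberately simplistic.

```python
def toy_stem(term):
    # Hypothetical single rule: strip a trailing plural "s" (days -> day).
    if len(term) > 3 and term.endswith("s") and not term.endswith("ss"):
        return term[:-1]
    return term


def analyze_english(text):
    # Split on whitespace, trim punctuation, lowercase, then stem.
    tokens = [token.strip("!?.,").lower() for token in text.split()]
    return [toy_stem(token) for token in tokens if token]


# Index time: the stemmed terms are what end up in the index.
indexed_terms = set(analyze_english("Hello Voxxed Days Singapore!"))


def matches(query):
    # Search time: the query goes through the same analysis before lookup.
    return any(term in indexed_terms for term in analyze_english(query))


print(sorted(indexed_terms))
print(matches("day"), matches("Days"), matches("daily"))
```

Because both sides are reduced to `day`, a query for either `day` or `Days` finds the document; skipping the stemmer on one side only would break the match.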

Languages of Singapore

  • 4 official languages:
    • Malay
    • Mandarin
    • Tamil
    • English
  • National language: Malay

Malay

Mari kita rakyat Singapura
Sama-sama menuju bahagia

Malay


curl -XPOST "http://localhost:9200/voxxed/doc" -d'
{
  "title": "Majulah Singapura",
  "content": "Mari kita rakyat Singapura Sama-sama menuju bahagia"
}'

Malay


curl -XPOST "http://localhost:9200/voxxed/doc/_search" -d'
{
  "query": {
    "match": {
      "content": "bahagia"
    }
  }
}'

Malay

  • Standard Analyzer works fine for Malay
  • No dedicated Malay analyzer available
  • Indonesian Stemmer could help with some normalization

Tamil

ஆறின கஞ்சி பழங் கஞ்சி
Cold food is (soon) old food

Tamil


{
  "query": {
    "match": {
      "content": "கஞ்சி"
    }
  }
}

Tamil

  • No special handling for Tamil
  • Standard tokenizer splits words correctly
  • Elasticsearch compares terms at the byte level
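
The byte-level comparison can be illustrated with a short Python sketch. It assumes, for simplicity, that whitespace splitting yields the same tokens the standard tokenizer would produce for this proverb; `term_matches` is a hypothetical helper, not an Elasticsearch call.

```python
text = "ஆறின கஞ்சி பழங் கஞ்சி"

# Store each token as its UTF-8 byte sequence, duplicates collapse into one term.
indexed_terms = {token.encode("utf-8") for token in text.split()}


def term_matches(query):
    # A term matches only if its bytes are identical to an indexed term.
    return query.encode("utf-8") in indexed_terms


print(len(indexed_terms))
print(term_matches("கஞ்சி"))
```

No knowledge of the Tamil script is needed for an exact match; the flip side is that any difference in the byte sequence, however small, means no match.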

Mandarin

你好新加坡
Hello Singapore

Mandarin

  • No whitespace
  • Standard Tokenizer splits text into single characters
  • CJK Analyzer splits and builds bigrams

Mandarin

你好新加坡
你好 好新 新加 加坡
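
The bigram construction above can be sketched in a few lines of Python. This is a simplified model of what the CJK Analyzer does for a run of Chinese characters, not its actual implementation; `cjk_bigrams` is a name chosen for this example.

```python
def cjk_bigrams(text):
    # A single character has no neighbour, so it is emitted on its own.
    if len(text) < 2:
        return [text] if text else []
    # Otherwise emit every pair of adjacent characters.
    return [text[i:i + 2] for i in range(len(text) - 1)]


indexed = cjk_bigrams("你好新加坡")
query = cjk_bigrams("新加坡")

print(indexed)
print(query)
# Every bigram of the query also occurs among the indexed bigrams.
print(all(term in indexed for term in query))
```

The query for Singapore is found because its two bigrams are both in the index, even though no step ever identified where the actual word boundaries are.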

Mandarin


{
  "query": {
    "match": {
      "content": "新加坡"
    }
  }
}

Mandarin

  • CJK Analyzer builds bigrams
  • Alternative: Smart Chinese Plugin

Conclusion

  • Each language has its own peculiarities
  • Basic support for all of them in Elasticsearch/Lucene
  • Working with multiple languages can be challenging

Thank you