Indexing Tweets with Logstash

Florian Hopf / @fhopf

Why?

Fun

Sentiment Analysis

  • What do people think about a brand?
  • How do people like my new product?
  • How effective is my current ad?

Twitter River

Rivers are deprecated

  • Cluster stability
  • Scalability
  • Fault tolerance

Logstash to the rescue

Logstash for logfiles

Logstash for Tweets

Input


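input {
  twitter {
    # Credentials of your Twitter application (elided here)
    consumer_key => "..."
    consumer_secret => "..."
    oauth_token => "..."
    oauth_token_secret => "..."
    # Track tweets that contain any of these keywords
    keywords => [ "logstash", "elasticsearch" ]
    # Record the full tweet object as delivered by the Twitter API
    full_tweet => true
  }
}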

Filter


filter {
}

Output


output {
  stdout {
    codec => rubydebug
  }
  elasticsearch {
    protocol => "http"
    host => "localhost"
    index => "twitter"
    document_type => "tweet"
  }
}

Tweets are arriving


"created_at": "Wed Aug 26 11:45:59 +0000 2015",
"id": 636504862134521900,
"text": "Looking forward  to be at #elasticsearch FFM this evening. I'll be giving a short talk on how to index tweets with #logstash",
[...]
"user": {
  "id": 313122677,
  "name": "Florian Hopf",
  "screen_name": "fhopf",
  "location": "Karlsruhe",
  [...]

Aggregate on username?


curl -XPOST "http://localhost:9200/twitter/_search" -d'
{
    "aggs": {
        "users": {
            "terms": {
                "field": "user.name"
            }
        }
    }
}'

Aggregate on username?


"buckets": [
{
   "key": "florian",
   "doc_count": 1
},
{
   "key": "hopf",
   "doc_count": 1
}
]
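
The split into "florian" and "hopf" happens because string fields are analyzed by default: the value is broken into lowercase tokens, and the terms aggregation counts those tokens rather than the original value. A rough Python sketch of that effect follows. It assumes a simplified tokenizer (lowercase, split on non-word characters) as a stand-in for the real analyzer, and `analyze` and `terms_buckets` are illustrative names, not Elasticsearch APIs.

```python
import re
from collections import Counter


def analyze(value):
    """Rough stand-in for an analyzer: lowercase, then split on non-word characters."""
    return [token for token in re.split(r"\W+", value.lower()) if token]


def terms_buckets(values):
    """Count each distinct token once per document, as a terms aggregation
    on an analyzed field would. Sorted by key here to keep the output stable."""
    counts = Counter()
    for value in values:
        counts.update(set(analyze(value)))
    return sorted(counts.items())


print(terms_buckets(["Florian Hopf"]))
```

With a single tweet by "Florian Hopf" this yields one bucket per token, which mirrors the result shown on the slide.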

Missing a proper mapping

The Elasticsearch output accepts an index template


  elasticsearch {
    protocol => "http"
    host => "localhost"
    index => "twitter"
    document_type => "tweet"
    template => "twitter_template.json"
    template_name => "twitter"
  }

Dynamic template


{
  "template": "twitter",
  [...]
  "mappings": {
    "tweet": {
      "dynamic_templates": [ {
        [...]
      }, {
        "string_fields": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "string", "index": "analyzed", "omit_norms": true,
            "fields": {
              "raw": { "type": "string", "index": "not_analyzed", "ignore_above": 256 }
            }
          }
        }
      } ]
      [...]
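
To make the effect of the `string_fields` template more concrete, here is a small Python model of what gets indexed for one string value: an analyzed main field plus a `raw` sub-field that keeps the whole value as a single term. This is only a sketch. The tokenizer is a simplification of the real analyzer, and `index_string_field` is a hypothetical helper, not part of Logstash or Elasticsearch.

```python
import re

IGNORE_ABOVE = 256  # value of "ignore_above" on the "raw" sub-field in the template


def index_string_field(value):
    """Model the terms stored for one string value under the string_fields template."""
    # Main field: analyzed, approximated by lowercasing and splitting on non-word characters.
    tokens = [t for t in re.split(r"\W+", value.lower()) if t]
    # "raw" sub-field: not_analyzed, so the whole value is one term.
    # Values longer than ignore_above are not indexed in the sub-field.
    raw = value if len(value) <= IGNORE_ABOVE else None
    return {"analyzed": tokens, "raw": raw}


print(index_string_field("Florian Hopf"))
print(index_string_field("x" * 300)["raw"])
```

The first call shows both representations side by side. The second shows that an overly long value is left out of `raw`, which keeps very long strings from bloating the not_analyzed sub-field.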

Aggregate on username?


curl -XPOST "http://localhost:9200/twitter/_search" -d'
{
    "aggs": {
        "users": {
            "terms": {
                "field": "user.name.raw"
            }
        }
    }
}'

Aggregate on username?


"buckets": [
{
   "key": "Florian Hopf",
   "doc_count": 1
}
]
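
The same aggregation can also be issued from a script. The following Python sketch builds the request body and reads the buckets out of a response. It assumes the local node and `twitter` index from the slides, and the helper names (`terms_agg_body`, `bucket_counts`, `search`) are made up for illustration. `search` needs a running Elasticsearch node, so it is defined but not called here.

```python
import json
import urllib.request


def terms_agg_body(field, name="users"):
    """Build the request body for a terms aggregation on the given field."""
    return {"aggs": {name: {"terms": {"field": field}}}}


def bucket_counts(response, name="users"):
    """Turn the buckets of a terms aggregation response into a {key: doc_count} dict."""
    buckets = response["aggregations"][name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def search(body, url="http://localhost:9200/twitter/_search"):
    """POST the body to Elasticsearch (assumed to run locally) and parse the JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Bucket from the slide, wrapped in the "aggregations" envelope of a search response.
sample = {"aggregations": {"users": {"buckets": [{"key": "Florian Hopf", "doc_count": 1}]}}}
print(json.dumps(terms_agg_body("user.name.raw")))
print(bucket_counts(sample))
```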

Resources

  • 29. - 01.10.2015
  • 30.09.: Search Driven Applications
  • 01.10.: Einführung in Elasticsearch (Introduction to Elasticsearch)