
I'm getting erroneous results when performing a terms aggregation on the names field in my index. The following is the mapping I have used for the names field:

{
  "dbnames": {
    "properties": {
      "names": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Here are the results I'm getting for a simple terms aggregation on the field:

"aggregations": {
  "names": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "John Martin",
        "doc_count": 1
      },
      {
        "key": "John martin",
        "doc_count": 1
      },
      {
        "key": " Victor Moses",
        "doc_count": 1
      }
    ]
  }
}

As you can see, the same name with different casing is shown as separate buckets in the aggregation. What I want is for the names to be grouped together irrespective of case.

  • Which way should they be grouped: under John Martin, John martin, or something else, e.g. lowercased john martin? Commented Oct 30, 2015 at 4:19

2 Answers


The easiest way would be to make sure you properly case the value of your names field at indexing time.

If that is not an option, the other way to go about it is to define an analyzer that will do it for you and set that analyzer as the index_analyzer for the names field. Such a custom analyzer needs to use the keyword tokenizer (i.e. take the whole value of the field as a single token) and the lowercase token filter (i.e. lowercase the value):

curl -XPUT localhost:9200/your_index -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "casing": {               <--- custom casing analyzer
            "filter": [
              "lowercase"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "names": {
          "type": "string",
          "index_analyzer": "casing"      <--- use your custom analyzer
        }
      }
    }
  }
}'
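To make it clear what the casing analyzer above actually does, here is a rough sketch (plain Python, not Elasticsearch itself) of the equivalent normalization: the keyword tokenizer emits the whole field value as a single token, and the lowercase token filter then lowercases that token.

```python
def casing_analyzer(value):
    """Sketch of keyword tokenizer + lowercase filter."""
    token = value            # keyword tokenizer: whole value = one token
    return [token.lower()]   # lowercase token filter

# Both casings normalize to the same single term,
# so they land in the same aggregation bucket.
assert casing_analyzer("John Martin") == ["john martin"]
assert casing_analyzer("John martin") == ["john martin"]
```

Note that this is why whitespace is preserved: unlike the standard analyzer, the keyword tokenizer never splits the value, so "john martin" stays one term.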

Then we can index some data:

curl -XPOST localhost:9200/your_index/your_type/_bulk -d '
{"index":{}}
{"names": "John Martin"}
{"index":{}}
{"names": "John martin"}
{"index":{}}
{"names": "Victor Moses"}
'

And finally, the terms aggregation on the names field returns the expected results:

curl -XPOST localhost:9200/your_index/your_type/_search -d '{
  "size": 0,
  "aggs": {
    "dbnames": {
      "terms": {
        "field": "names"
      }
    }
  }
}'

Results:

{
  "dbnames": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "john martin",
        "doc_count": 2
      },
      {
        "key": "victor moses",
        "doc_count": 1
      }
    ]
  }
}



There are two options here:

  1. Use the not_analyzed option. This has the disadvantage that the same string with different cases won't be seen as one term.
  2. Use the keyword tokenizer + lowercase filter. This does not have the above issue.

I have neatly outlined these two approaches and how to use them here - https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer
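The difference between the two options can be sketched in a few lines of Python (a simulation of the term counting, not Elasticsearch itself), using the sample names from the question:

```python
from collections import Counter

names = ["John Martin", "John martin", "Victor Moses"]

# Option 1: not_analyzed -- the raw value is the indexed term, so case matters
# and the two casings of the same name end up in separate buckets.
not_analyzed = Counter(names)

# Option 2: keyword tokenizer + lowercase filter -- each value becomes one
# lowercased term, so differently cased duplicates share a bucket.
keyword_lowercase = Counter(n.lower() for n in names)

assert not_analyzed["John Martin"] == 1 and not_analyzed["John martin"] == 1
assert keyword_lowercase["john martin"] == 2
```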

