
I'm getting erroneous results when performing a terms aggregation on the names field in my index. The following is the mapping I have used for the names field:

{
  "dbnames": {
    "properties": {
      "names": {
        "type": "string",
        "index": "not_analyzed"
      }
    }
  }
}

Here are the results I'm getting for a simple terms aggregation on the field:

"aggregations": {
  "names": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "John Martin",
        "doc_count": 1
      },
      {
        "key": "John martin",
        "doc_count": 1
      },
      {
        "key": " Victor Moses",
        "doc_count": 1
      }
    ]
  }
}

As you can see, the same name with different casing is shown as separate buckets in the aggregation. What I want is for the names to be grouped together irrespective of case.

  • Which way should they be grouped: under John Martin, John martin, or something else, e.g. lowercased john martin? Commented Oct 30, 2015 at 4:19

2 Answers


The easiest way would be to make sure you properly case the value of your names field at indexing time.

If that is not an option, the other way to go about it is to define an analyzer that will do it for you and set that analyzer as the index_analyzer for the names field. Such a custom analyzer needs to use the keyword tokenizer (i.e. take the whole value of the field as a single token) and the lowercase token filter (i.e. lowercase the value):

curl -XPUT localhost:9200/your_index -d '{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "casing": {               <--- custom casing analyzer
            "filter": [
              "lowercase"
            ],
            "tokenizer": "keyword"
          }
        }
      }
    }
  },
  "mappings": {
    "your_type": {
      "properties": {
        "names": {
          "type": "string",
          "index_analyzer": "casing"      <--- use your custom analyzer
        }
      }
    }
  }
}'
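To make it clear what the casing analyzer above actually does, here is a rough sketch (plain Python, not Elasticsearch itself) of the equivalent normalization: the keyword tokenizer emits the whole field value as a single token, and the lowercase token filter then lowercases that token.

```python
def casing_analyzer(value):
    """Sketch of keyword tokenizer + lowercase filter."""
    token = value            # keyword tokenizer: whole value = one token
    return [token.lower()]   # lowercase token filter

# Both casings normalize to the same single term,
# so they land in the same aggregation bucket.
assert casing_analyzer("John Martin") == ["john martin"]
assert casing_analyzer("John martin") == ["john martin"]
```

Note that this is why whitespace is preserved: unlike the standard analyzer, the keyword tokenizer never splits the value, so "john martin" stays one term.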

Then we can index some data:

curl -XPOST localhost:9200/your_index/your_type/_bulk -d '
{"index":{}}
{"names": "John Martin"}
{"index":{}}
{"names": "John martin"}
{"index":{}}
{"names": "Victor Moses"}
'

And finally, the terms aggregation on the names field returns the expected results:

curl -XPOST localhost:9200/your_index/your_type/_search -d '{
  "size": 0,
  "aggs": {
    "dbnames": {
      "terms": {
        "field": "names"
      }
    }
  }
}'

Results:

{
  "dbnames": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "john martin",
        "doc_count": 2
      },
      {
        "key": "victor moses",
        "doc_count": 1
      }
    ]
  }
}



There are two options here:

  1. Use the not_analyzed option. This has the disadvantage that the same string with different cases won't be seen as one term.
  2. Use the keyword tokenizer + lowercase filter. This does not have the above issue.

I have neatly outlined these two approaches and how to use them here - https://qbox.io/blog/elasticsearch-aggregation-custom-analyzer
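The difference between the two options can be sketched in a few lines of Python (a simulation of the term counting, not Elasticsearch itself), using the sample names from the question:

```python
from collections import Counter

names = ["John Martin", "John martin", "Victor Moses"]

# Option 1: not_analyzed -- the raw value is the indexed term, so case matters
# and the two casings of the same name end up in separate buckets.
not_analyzed = Counter(names)

# Option 2: keyword tokenizer + lowercase filter -- each value becomes one
# lowercased term, so differently cased duplicates share a bucket.
keyword_lowercase = Counter(n.lower() for n in names)

assert not_analyzed["John Martin"] == 1 and not_analyzed["John martin"] == 1
assert keyword_lowercase["john martin"] == 2
```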

