
Suppose I have a table in SQL where I combine two fields into one:

A    | B
-----|---------
BMW  | 3-Series
BMW  | X3

select A + ' ' + B as Name  =>  "BMW 3-Series", "BMW X3"

I dump the result into a temp table and run a wildcard search on it, which returns each match with a count:

select Name, count(Name) as Frequency from Table where Name like '%3%' group by Name

    Name         | Frequency
    -------------|----------
    BMW 3-Series | 1
    BMW X3       | 1

Now, how do I achieve the same in Elasticsearch, given that A and B are separate fields?

I tried this:

{
  "query": {
    "query_string": {
      "fields": ["A", "B"],
      "query": "3"
    }
  },
  "aggs": {
    "count": {
      "terms": {
        "field": "A"
      },
      "aggs": {
        "count": {
          "terms": {
            "field": "B"
          }
        }
      }
    }
  }
}

How do I add a regular expression to the query?

1 Answer

A key difference between SQL and Elasticsearch is that by default, string fields are analyzed at index time, and you can control how they are analyzed with Analyzers.

The default analyzer, the Standard Analyzer, will produce tokens from the input and store these in an inverted index. You can see what tokens would be generated for a given input by using the Analyze API:

curl -XPOST "http://localhost:9200/_analyze?analyzer=standard" -d'
{
  "text": "3-Series"
}'

which yields the output

{
  "tokens": [
    {
      "token": "3",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "series",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Knowing this, if you use a search query that undergoes analysis at search time, such as the Query String Query, there is no need for regular expression or wildcard queries, provided you analyze the input in a way that supports your use case.
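That said, if you really do want a regular expression, Elasticsearch has a regexp query. Keep in mind it matches against the individual tokens in the inverted index, not the original field value, so with the standard analyzer `3-Series` is stored as the tokens `3` and `series`. A minimal sketch (field name from your question):

```json
{
  "query": {
    "regexp": {
      "B": ".*3.*"
    }
  }
}
```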

You may decide to index "BMW 3-Series" in one field and analyze it in different ways using multi_fields, or keep the values in separate fields as you have and search across both.
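The multi-fields variant might look like this sketch (Elasticsearch 2.x syntax; the combined `name` field and its `raw` sub-field are hypothetical names, not part of the mapping below). The analyzed top-level field supports full-text search while the `not_analyzed` sub-field keeps the exact value for sorting and aggregations:

```json
{
  "mappings": {
    "car": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
```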

Here's an example to get you started. Given we have the following POCO

public class Car
{
    public string Make { get; set; }
    public string Model { get; set; }
}

We can set up the following index

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var carsIndex = "cars";
var connectionSettings = new ConnectionSettings(pool)
        .DefaultIndex(carsIndex);

var client = new ElasticClient(connectionSettings);

client.CreateIndex(carsIndex, ci => ci
    .Settings(s => s
        .Analysis(analysis => analysis
            .Tokenizers(tokenizers => tokenizers
                .Pattern("model-tokenizer", p => p.Pattern(@"\W+"))
            )
            .TokenFilters(tokenfilters => tokenfilters
                .WordDelimiter("model-words", wd => wd
                    .PreserveOriginal()
                    .SplitOnNumerics()
                    .GenerateNumberParts()
                    .GenerateWordParts()
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("model-analyzer", c => c
                    .Tokenizer("model-tokenizer")
                    .Filters("model-words", "lowercase")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Car>(mm => mm
            .AutoMap()
            .Properties(p => p
                .String(s => s
                    .Name(n => n.Model)
                    .Analyzer("model-analyzer")
                )
            )
        )
    )
);
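For readers not using NEST, the raw index settings and mappings this generates should look roughly like the following (a sketch; tokenizer, filter, and analyzer names as defined above):

```json
PUT /cars
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "model-tokenizer": { "type": "pattern", "pattern": "\\W+" }
      },
      "filter": {
        "model-words": {
          "type": "word_delimiter",
          "preserve_original": true,
          "split_on_numerics": true,
          "generate_number_parts": true,
          "generate_word_parts": true
        }
      },
      "analyzer": {
        "model-analyzer": {
          "type": "custom",
          "tokenizer": "model-tokenizer",
          "filter": ["model-words", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "car": {
      "properties": {
        "make":  { "type": "string" },
        "model": { "type": "string", "analyzer": "model-analyzer" }
      }
    }
  }
}
```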

We create a cars index with a custom analyzer to use for the Model field. This custom analyzer separates the input into tokens on any non-word character; a token filter then splits each token on numeric characters, generating a token that preserves the original, tokens for the number part(s), and tokens for the word part(s). Finally, all tokens are lowercased.
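The behaviour can be approximated in a few lines of Python, purely to illustrate the token stream (this simulates the idea, not the actual Lucene implementation):

```python
import re

def model_analyzer(text):
    """Rough simulation of the custom analyzer: split on non-word
    characters, then split each token on letter/number boundaries
    (keeping the original token, as preserve_original does), and
    lowercase everything."""
    tokens = []
    for raw in re.split(r"\W+", text):
        if not raw:
            continue
        # split into runs of letters and runs of digits
        parts = re.findall(r"[A-Za-z]+|[0-9]+", raw)
        if len(parts) > 1:
            tokens.append(raw.lower())  # preserve_original
        tokens.extend(p.lower() for p in parts)
    return tokens

print(model_analyzer("X3"))        # ['x3', 'x', '3']
print(model_analyzer("3-Series"))  # ['3', 'series']
```

Both inputs now share the token `3`, which is why a plain search for "3" can match both models without any wildcards.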

We can test what the model-analyzer will do to our inputs, to see whether it suits our needs:

curl -XPOST "http://localhost:9200/cars/_analyze?analyzer=model-analyzer" -d'
{
  "text": "X3"
}'

produces

{
  "tokens": [
    {
      "token": "x3",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "3",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}

and

curl -XPOST "http://localhost:9200/cars/_analyze?analyzer=model-analyzer" -d'
{
  "text": "3-Series"
}'

produces

{
  "tokens": [
    {
      "token": "3",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "series",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    }
  ]
}

This looks suitable for the problem at hand. Now, if we index some documents and perform a search, we should get the results we're looking for

client.Index<Car>(new Car { Make = "BMW", Model = "3-Series" });
client.Index<Car>(new Car { Make = "BMW", Model = "X3" });

// refresh the index so that documents are available to search
client.Refresh(carsIndex);

client.Search<Car>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Make)
                .Field(c => c.Model)
            )
            .Query("3")
        )
    )
);
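The equivalent raw search request (what the NEST call above sends, roughly) would be:

```json
GET /cars/_search
{
  "query": {
    "query_string": {
      "fields": ["make", "model"],
      "query": "3"
    }
  }
}
```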

yields the following results

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.058849156,
    "hits" : [ {
      "_index" : "cars",
      "_type" : "car",
      "_id" : "AVTbhENDDGlNKQ4qnluJ",
      "_score" : 0.058849156,
      "_source" : {
        "make" : "BMW",
        "model" : "3-Series"
      }
    }, {
      "_index" : "cars",
      "_type" : "car",
      "_id" : "AVTbhEOXDGlNKQ4qnluK",
      "_score" : 0.058849156,
      "_source" : {
        "make" : "BMW",
        "model" : "X3"
      }
    } ]
  }
}
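To also reproduce the per-name counts from the SQL GROUP BY, you could nest terms aggregations, as in your attempt. Note that a terms aggregation on an analyzed field buckets by token, so you would want `not_analyzed` sub-fields for this; the `make.raw` and `model.raw` sub-fields in this sketch are an assumption and are not part of the mapping defined above:

```json
{
  "query": {
    "query_string": {
      "fields": ["make", "model"],
      "query": "3"
    }
  },
  "aggs": {
    "by_make": {
      "terms": { "field": "make.raw" },
      "aggs": {
        "by_model": {
          "terms": { "field": "model.raw" }
        }
      }
    }
  }
}
```

Each `by_model` bucket then carries a `doc_count`, which plays the role of the Frequency column in your SQL output.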

Hope this has given you some ideas.
