
Suppose I have a table in SQL where I combine two fields into one:

A    | B
-----|---------
BMW  | 3-Series
BMW  | X3

select A + ' ' + B as Name  =>  "BMW 3-Series", "BMW X3"

I dump the result into a temp table and run a wildcard search on it, which returns each match with a count:

select Name, count(Name) as Frequency from Table where Name like '%3%' group by Name

    Name         | Frequency
    -------------|----------
    BMW 3-Series | 1
    BMW X3       | 1

Now, how do I achieve the same in Elasticsearch, given that A and B are separate fields?

I tried this:

{
  "query": {
    "query_string": {
      "fields": ["A", "B"],
      "query": "3"
    }
  },
  "aggs": {
    "count": {
      "terms": {
        "field": "A"
      },
      "aggs": {
        "count": {
          "terms": {
            "field": "B"
          }
        }
      }
    }
  }
}

How do I add a regular expression to the query?

1 Answer

A key difference between SQL and Elasticsearch is that by default, string fields are analyzed at index time, and you can control how they are analyzed with Analyzers.

The default analyzer, the Standard Analyzer, will produce tokens from the input and store these in an inverted index. You can see what tokens would be generated for a given input by using the Analyze API:

curl -XPOST "http://localhost:9200/_analyze?analyzer=standard" -d'
{
  "text": "3-Series"
}'

which yields the output

{
  "tokens": [
    {
      "token": "3",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<NUM>",
      "position": 0
    },
    {
      "token": "series",
      "start_offset": 2,
      "end_offset": 8,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

Knowing this, if you use a search query that undergoes analysis at search time, such as the Query String Query, there is no need for regular expression or wildcard queries, provided you analyze the input in a way that supports your use case.
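That said, if you really do want a regular expression, Elasticsearch has a regexp query. Keep in mind it matches against the individual tokens in the inverted index, not the original field value, so with the standard analyzer `3-Series` is stored as the tokens `3` and `series`. A minimal sketch (field name from your question):

```json
{
  "query": {
    "regexp": {
      "B": ".*3.*"
    }
  }
}
```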

You may decide to index "BMW 3-Series" in one field and analyze it in different ways using multi_fields, or keep the values in separate fields as you have and search across both.
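The multi-fields variant might look like this sketch (Elasticsearch 2.x syntax; the combined `name` field and its `raw` sub-field are hypothetical names, not part of the mapping below). The analyzed top-level field supports full-text search while the `not_analyzed` sub-field keeps the exact value for sorting and aggregations:

```json
{
  "mappings": {
    "car": {
      "properties": {
        "name": {
          "type": "string",
          "fields": {
            "raw": { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  }
}
```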

Here's an example to get you started. Given we have the following POCO

public class Car
{
    public string Make { get; set; }
    public string Model { get; set; }
}

We can set up the following index

var pool = new SingleNodeConnectionPool(new Uri("http://localhost:9200"));
var carsIndex = "cars";
var connectionSettings = new ConnectionSettings(pool)
        .DefaultIndex(carsIndex);

var client = new ElasticClient(connectionSettings);

client.CreateIndex(carsIndex, ci => ci
    .Settings(s => s
        .Analysis(analysis => analysis
            .Tokenizers(tokenizers => tokenizers
                .Pattern("model-tokenizer", p => p.Pattern(@"\W+"))
            )
            .TokenFilters(tokenfilters => tokenfilters
                .WordDelimiter("model-words", wd => wd
                    .PreserveOriginal()
                    .SplitOnNumerics()
                    .GenerateNumberParts()
                    .GenerateWordParts()
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("model-analyzer", c => c
                    .Tokenizer("model-tokenizer")
                    .Filters("model-words", "lowercase")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Car>(mm => mm
            .AutoMap()
            .Properties(p => p
                .String(s => s
                    .Name(n => n.Model)
                    .Analyzer("model-analyzer")
                )
            )
        )
    )
);
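For readers not using NEST, the raw index settings and mappings this generates should look roughly like the following (a sketch; tokenizer, filter, and analyzer names as defined above):

```json
PUT /cars
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "model-tokenizer": { "type": "pattern", "pattern": "\\W+" }
      },
      "filter": {
        "model-words": {
          "type": "word_delimiter",
          "preserve_original": true,
          "split_on_numerics": true,
          "generate_number_parts": true,
          "generate_word_parts": true
        }
      },
      "analyzer": {
        "model-analyzer": {
          "type": "custom",
          "tokenizer": "model-tokenizer",
          "filter": ["model-words", "lowercase"]
        }
      }
    }
  },
  "mappings": {
    "car": {
      "properties": {
        "make":  { "type": "string" },
        "model": { "type": "string", "analyzer": "model-analyzer" }
      }
    }
  }
}
```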

We create a cars index with a custom analyzer to use for the Model field. This custom analyzer separates the input into tokens on any non-word character; a token filter then splits each token on numeric characters, generating a token that preserves the original, tokens for the number part(s), and tokens for the word part(s). Finally, all tokens are lowercased.
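The behaviour can be approximated in a few lines of Python, purely to illustrate the token stream (this simulates the idea, not the actual Lucene implementation):

```python
import re

def model_analyzer(text):
    """Rough simulation of the custom analyzer: split on non-word
    characters, then split each token on letter/number boundaries
    (keeping the original token, as preserve_original does), and
    lowercase everything."""
    tokens = []
    for raw in re.split(r"\W+", text):
        if not raw:
            continue
        # split into runs of letters and runs of digits
        parts = re.findall(r"[A-Za-z]+|[0-9]+", raw)
        if len(parts) > 1:
            tokens.append(raw.lower())  # preserve_original
        tokens.extend(p.lower() for p in parts)
    return tokens

print(model_analyzer("X3"))        # ['x3', 'x', '3']
print(model_analyzer("3-Series"))  # ['3', 'series']
```

Both inputs now share the token `3`, which is why a plain search for "3" can match both models without any wildcards.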

We can test what the model-analyzer will do to our inputs, to see whether it suits our needs:

curl -XPOST "http://localhost:9200/cars/_analyze?analyzer=model-analyzer" -d'
{
  "text": "X3"
}'

produces

{
  "tokens": [
    {
      "token": "x3",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "x",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "3",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    }
  ]
}

and

curl -XPOST "http://localhost:9200/cars/_analyze?analyzer=model-analyzer" -d'
{
  "text": "3-Series"
}'

produces

{
  "tokens": [
    {
      "token": "3",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "series",
      "start_offset": 2,
      "end_offset": 8,
      "type": "word",
      "position": 1
    }
  ]
}

This looks suitable for the problem at hand. Now, if we index some documents and perform a search, we should get the results we're looking for

client.Index<Car>(new Car { Make = "BMW", Model = "3-Series" });
client.Index<Car>(new Car { Make = "BMW", Model = "X3" });

// refresh the index so that documents are available to search
client.Refresh(carsIndex);

client.Search<Car>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Make)
                .Field(c => c.Model)
            )
            .Query("3")
        )
    )
);
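The equivalent raw search request (what the NEST call above sends, roughly) would be:

```json
GET /cars/_search
{
  "query": {
    "query_string": {
      "fields": ["make", "model"],
      "query": "3"
    }
  }
}
```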

yields the following results

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 0.058849156,
    "hits" : [ {
      "_index" : "cars",
      "_type" : "car",
      "_id" : "AVTbhENDDGlNKQ4qnluJ",
      "_score" : 0.058849156,
      "_source" : {
        "make" : "BMW",
        "model" : "3-Series"
      }
    }, {
      "_index" : "cars",
      "_type" : "car",
      "_id" : "AVTbhEOXDGlNKQ4qnluK",
      "_score" : 0.058849156,
      "_source" : {
        "make" : "BMW",
        "model" : "X3"
      }
    } ]
  }
}
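To also reproduce the per-name counts from the SQL GROUP BY, you could nest terms aggregations, as in your attempt. Note that a terms aggregation on an analyzed field buckets by token, so you would want `not_analyzed` sub-fields for this; the `make.raw` and `model.raw` sub-fields in this sketch are an assumption and are not part of the mapping defined above:

```json
{
  "query": {
    "query_string": {
      "fields": ["make", "model"],
      "query": "3"
    }
  },
  "aggs": {
    "by_make": {
      "terms": { "field": "make.raw" },
      "aggs": {
        "by_model": {
          "terms": { "field": "model.raw" }
        }
      }
    }
  }
}
```

Each `by_model` bucket then carries a `doc_count`, which plays the role of the Frequency column in your SQL output.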

Hope this has given you some ideas.
