How to prevent certain fields from getting indexed in Elasticsearch

Question

I need to prevent fields whose value is "null" (null as a string) or "" (empty string) from getting indexed in Elasticsearch. That is, when I fetch a document, _source should contain the rest of the fields but not the fields with such values. I am using a normalizer as below:

{
  "analysis": {
    "normalizer": {
      "my_normalizer": {
        "filter": [
          "uppercase"
        ],
        "type": "custom"
      }
    }
  }
}

Are any additional settings required above or in the field mappings?

P.S. I am using Elasticsearch 7.6.1.

This is answered in another question I posted: stackoverflow.com/questions/64751852/…
– user11725513, Commented Nov 10, 2020 at 11:13

Briomkez · Accepted Answer · 2020-11-11 22:29:44Z

You can have a look at Elasticsearch ingest pipelines. They are applied before indexing (and, in your case, analysis) takes place, so a field removed by a pipeline is neither indexed nor stored in _source.

Concretely, you could add an ingest pipeline that removes the affected fields when they meet the conditions you listed. Something like:

PUT _ingest/pipeline/remove_invalid_value
{
  "description": "my pipeline that removes empty string and null strings",
  "processors": [
    {
      "remove": {
        "field": "field1",
        "ignore_missing": true,
        "if": "ctx.field1 == \"null\" || ctx.field1 == \"\""
      }
    },
    {
      "remove": {
        "field": "field2",
        "ignore_missing": true,
        "if": "ctx.field2 == \"null\" || ctx.field2 == \"\""
      }
    },
    {
      "remove": {
        "field": "field3",
        "ignore_missing": true,
        "if": "ctx.field3 == \"null\" || ctx.field3 == \"\""
      }
    }
  ]
}

Then you can either specify the pipeline in each index request or set it as the default_pipeline or final_pipeline in your index settings. You can also put this setting in an index template.
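
As a rough sketch of both options, assuming an index named my-index (a placeholder, not taken from the question) and the remove_invalid_value pipeline defined above:

```console
# Option 1: name the pipeline on a single index request
PUT my-index/_doc/1?pipeline=remove_invalid_value
{
  "field1": "null",
  "field2": "",
  "field3": "some value"
}

# Option 2: make it the default pipeline for every index request to my-index
PUT my-index/_settings
{
  "index.default_pipeline": "remove_invalid_value"
}
```

With either option, the stored _source of the sample document above should contain only field3. A pipeline named explicitly on a request takes precedence over default_pipeline, whereas final_pipeline always runs last and cannot be overridden per request.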

(Script) Loop Approach

If you don't want to write a long list of remove actions, you can try to use a script processor, something like this:

PUT _ingest/pipeline/remove_invalid_fields
{
  "description": "remove fields",
  "processors": [
    {
      "script": {
        "source": """
          // drop each listed field whose value is the string "null" or an empty string
          for (x in params.to_delete_on_condition) {
            if (ctx[x] == "null" || ctx[x] == "") {
              ctx.remove(x);
            }
          }
          """,
        "params": {
          "to_delete_on_condition": [
            "field1",
            "field2",
            "field3"
          ]
        }
      }
    }
  ]
}

It iterates over the list and removes the field if the condition matches.
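
Before attaching the pipeline to an index, you may want to check it with the simulate API. The request below is only a sketch: the sample document and the extra field named other are made up for illustration.

```console
POST _ingest/pipeline/remove_invalid_fields/_simulate
{
  "docs": [
    {
      "_source": {
        "field1": "null",
        "field2": "",
        "field3": "keep me",
        "other": ""
      }
    }
  ]
}
```

The _source in the response should keep field3 and other: field3 does not match the condition, and other is not in the to_delete_on_condition list, so the script never looks at it.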

Accessing nested fields in scripts is not trivial, as reported in many answers, but it should be doable. The idea is that nested.field should be accessed as ctx['nested']['field'].
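
A minimal sketch of that idea, assuming the document has an object field literally called nested with a sub-field called field (both names, and the pipeline name, are placeholders):

```console
PUT _ingest/pipeline/remove_invalid_nested_field
{
  "description": "remove nested.field when it holds \"null\" or an empty string",
  "processors": [
    {
      "script": {
        "source": """
          // skip documents where the parent is missing or is not an object
          if (ctx['nested'] instanceof Map) {
            def value = ctx['nested']['field'];
            if (value == "null" || value == "") {
              ctx['nested'].remove('field');
            }
          }
        """
      }
    }
  ]
}
```

The instanceof check is there so the script does not fail on documents that lack the parent object. Note that if field was the only key, an empty nested object is left behind in _source.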

edited Nov 11, 2020 at 22:29

answered Nov 8, 2020 at 14:00

Briomkez



2 Comments

user11725513 Over a year ago

Thanks for the response, @Briomkez. I have multiple fields in the index. Is there a way to apply these processors to all the fields in a document?

Briomkez Over a year ago

Check out my edit. I tried to address the question in your comment :)
