MongoDB: multi-lingual (accent insensitive), case insensitive search, with partial words?

Question

For the application we are developing we need to allow our searches to support accents, be case insensitive and search for partial words. For example, given the product name "La Niña" in our collection, the following searches should be expected to return the entry:

La Niña
niña
nina
nin
La nin

So far I have tried two approaches, each with apparent limitations, based on testing and some research:

Regex
- supports case insensitive and partial searches
- does not support accents, so niña != nina
Text Search
- supports case insensitive, accents and partial phrases
- does not support partial words

Example regex search, as we have used:

function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Escape the user input so it is matched literally, then search case-insensitively.
const escapedStr = escapeRegExp(searchTerm);
await Product.find({ name: new RegExp(escapedStr, 'i') });

Example text search, as we have used:

// On the schema
storeSchema.index({ name: 'text' });

// Searching:
await Product.find({ $text: { $search: searchTerm } })
  .collation({ locale: 'en', strength: 1 });

By the way, we have set the schemas in question to use collation strength level 1.

Some approaches I am considering, if MongoDB doesn't provide a solution:

- a shadow name field (not sure of the right term?) with the accents removed
- a separate full-text search engine
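
The shadow field idea could look roughly like the sketch below. It is only an illustration under assumptions: `foldForSearch`, `buildShadowFilter` and the field name `nameFolded` are hypothetical names, and the folded value would have to be written to `nameFolded` whenever `name` is saved. The same folding is applied to the search term, so the regex only ever compares accent-free, lowercase text.

```javascript
// Hypothetical helper: fold a string to a lowercase, accent-free form.
function foldForSearch(text) {
  return text
    .normalize('NFD')                 // decompose "ñ" into "n" + combining tilde
    .replace(/[\u0300-\u036f]/g, '')  // drop the combining diacritical marks
    .toLowerCase();
}

// Same escaping as in the regex example: match the term literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Build a filter against the assumed shadow field `nameFolded`.
function buildShadowFilter(searchTerm) {
  const folded = foldForSearch(searchTerm);
  return { nameFolded: new RegExp(escapeRegExp(folded)) };
}

// Assumed usage with the Product model from the question:
//   await Product.find(buildShadowFilter(searchTerm));
```

An unanchored regex like this generally cannot use an index efficiently, so this is probably only reasonable for smaller collections.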

Can anyone help here?

Note: we are using mongoose 5.9.5, with node 12.16.2 and mongodb 4.3.8, running in mongo cloud.

Tunmise Ogunniyi · Accepted Answer · 2020-07-29 12:42:08Z

I believe Text Search is what you need. There are two other features of Text Search that help fulfill the partial word match requirement you described in the question.

Stop Words: Given a language option, MongoDB Text Search is capable of identifying words that shouldn't influence search results. The frequency of usage of these words is such that they appear in almost every sentence, for example, in English, words like "the", "a", "of", are all stop words. These words are stripped off the search phrase before the actual search takes place.
Word Stemming: Given a language option, MongoDB Text Search is capable of identifying the root version of a word. For example, in English, the stem of "identifying" would be "identify", so they both would match in a text search.

I was able to figure out with Google Translate that the "La Niña" example you gave is in Spanish.

If I insert the following into a sample product collection:

db.products.insertMany([
  { "term" : "La Niña" },
  { "term" : "niña" },
  { "term" : "nina" },
  { "term" : "nin" },
  { "term" : "La nin" },
])

By specifying a language option of "spanish" on my Text Search query:

db.products.find({ $text: { $search: "La Niña", $language: "spanish" } })

MongoDB would effectively match that with all the products that were previously inserted. You can get a list of the supported language options for MongoDB here.

I'm not 100% sure of how the accent matching works though.
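
On the accent matching: as far as I know, version 3 text indexes are diacritic insensitive by default, and the `$text` operator accepts `$caseSensitive` and `$diacriticSensitive` flags. The sketch below just spells those flags out; `buildTextFilter` is a hypothetical helper name, and both flags are set to what I understand to be their defaults.

```javascript
// Hypothetical helper: build a $text filter with the sensitivity flags made explicit.
function buildTextFilter(searchTerm, language) {
  return {
    $text: {
      $search: searchTerm,
      $language: language,         // e.g. "spanish", drives stop words and stemming
      $caseSensitive: false,       // "La" and "la" are treated the same
      $diacriticSensitive: false,  // "niña" and "nina" are treated the same
    },
  };
}

// Assumed usage in the mongo shell, against the sample collection above:
//   db.products.find(buildTextFilter("La Niña", "spanish"))
```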
