
I'm trying to implement a robust search function in my NestJS/Mongoose application that can handle partial matches while being case-insensitive and diacritics-insensitive (ignoring accents).

My current aggregation setup uses an $or block combining $text (for diacritic insensitivity on whole words) and $regex with $options: 'i' (for partial matching and case insensitivity). However, $regex remains sensitive to diacritics, so my test that combines all three requirements fails.

My function:

  public async getCustomers(query?: GetCustomersQueryParamsDto) {
    const pipeline: PipelineStage[] = []

    if (query?.search) pipeline.push(this._searchCustomersStage(query.search))

    const customers = await this.customer.aggregate<CustomerWithBillsStats>(pipeline)

    return { customers }
  }


  private _searchCustomersStage(str: string): PipelineStage {
    return {
      $match: {
        $or: [
          { $text: { $search: str } }, // for full-text search (diacritics insensitive but exact match)
          { name: { $regex: str, $options: 'i' } }, // for partial match (can partially match but not diacritics insensitive)
        ],
      },
    }
  }

The problem is, when searching for the partial string 'elo' against a name like 'Élodie', the following happens in my _searchCustomersStage:

  1. { $text: { $search: 'elo' } }: Fails, as $text performs token/whole-word matching, not substring matching.
  2. { name: { $regex: 'elo', $options: 'i' } }: Fails, as PCRE regex with the i option is diacritics-sensitive and treats e and É as different characters.
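The second failure can be reproduced outside MongoDB. The following is a minimal sketch using JavaScript's RegExp rather than MongoDB's PCRE engine, so it only illustrates the analogous behaviour; `foldDiacritics` is a hypothetical helper, not part of the code above. It assumes the accents involved fall in the Unicode Combining Diacritical Marks block (U+0300 to U+036F).

```typescript
// Hypothetical helper: decompose accented characters (NFD), drop the
// combining marks, then lowercase. 'É' becomes 'E' + U+0301, then 'e'.
function foldDiacritics(input: string): string {
  return input
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
}

const caseInsensitive = new RegExp('elo', 'i')

// The i flag folds case only, so 'É' never matches 'e'.
console.log(caseInsensitive.test('Élodie')) // false

// Once the stored value is folded, the same pattern matches.
console.log(caseInsensitive.test(foldDiacritics('Élodie'))) // true
```

So the regex itself is fine; what is missing is a diacritics-folded value for it to run against.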

Example of a test that fails:

it('should search based on name and be diacritics(accents) insensitive, partial match and case insensitive', async () => {
        await customerModel.insertMany([
          { ...generateCustomer(), name: 'Élodie' }, // match: contains "elo"
          { ...generateCustomer(), name: 'Brandon' }, // no match
          { ...generateCustomer(), name: 'Daniel' }, // no match
        ])

        const { customers } = await customerService.getCustomers({ search: 'elo' })
        expect(customers).toHaveLength(1)
      })

The failure output:

● CustomerService › getCustomers › search › should search based on name and be diacritics(accents) insensitive, partial match and case insensitive

    expect(received).toHaveLength(expected)

    Expected length: 1
    Received length: 0
    Received array:  []

      398 |
      399 |         const { customers } = await customerService.getCustomers({ search: 'elo' })
    > 400 |         expect(customers).toHaveLength(1)
          |                           ^
      401 |       })
      402 |
      403 |       it('should search based on shortName when exact match', async () => {

      at Object.<anonymous> (test/customer/integration/customer.service.spec.ts:400:27)

What is the most effective and performant way to modify the MongoDB aggregation pipeline so that the partial search is diacritics-insensitive while remaining case-insensitive?

P.S:

I configured my indexes like this:

@Module({
  imports: [
    MongooseModule.forFeatureAsync([
      {
        name: Customer.name,
        useFactory: () => {
          const schema = CustomerSchema

          schema.index({ name: 1 }, { unique: true }) // index for search and ensure that `name` is unique between customers

          schema.index({ name: 'text' }) // index for full-text search (diacritics-insensitive but must be exact match)

          schema.index({ name: 1, _id: 1 }) // compound index for pagination sorting

          return schema
        },
      },
    ]),
  ],
  providers: [CustomerService],
  controllers: [CustomerController],
  exports: [MongooseModule, CustomerService],
})
export class CustomerModule {}

These are some test cases that succeed:

it('should search based on name and return customers when exact match', async () => {
        await customerModel.insertMany([
          { ...generateCustomer(), name: 'Alpha' },
          { ...generateCustomer(), name: 'Bravo' },
          { ...generateCustomer(), name: 'Charlie' },
        ])

        const { customers } = await customerService.getCustomers({ search: 'Charlie' })
        expect(customers).toHaveLength(1)
      })

      it('should search based on name and return customers when partial match', async () => {
        await customerModel.insertMany([
          { ...generateCustomer(), name: 'Eleanor' }, // no match
          { ...generateCustomer(), name: 'Marcelo' }, // match: contains "elo"
          { ...generateCustomer(), name: 'Brandon' }, // no match
          { ...generateCustomer(), name: 'Elohim' }, // match: contains "elo"
          { ...generateCustomer(), name: 'Daniel' }, // no match
        ])

        const { customers } = await customerService.getCustomers({ search: 'elo' })
        expect(customers).toHaveLength(2)
      })

      it('should search based on name and be case insensitive and return found customers', async () => {
        await customerModel.insertMany([
          { ...generateCustomer(), name: 'Alpha' },
          { ...generateCustomer(), name: 'Bravo' },
          { ...generateCustomer(), name: 'Charlie' },
        ])

        const { customers } = await customerService.getCustomers({ search: 'cHarLiE' })
        expect(customers).toHaveLength(1)
      })

      it('should search based on name and be diacritics(accents) insensitive', async () => {
        await customerModel.insertMany([
          { ...generateCustomer(), name: 'Élodie' }, // match: contains "elo"
          { ...generateCustomer(), name: 'Brandon' }, // no match
          { ...generateCustomer(), name: 'Daniel' }, // no match
        ])

        const r1 = await customerService.getCustomers({ search: 'Elodie' })
        expect(r1.customers).toHaveLength(1)

        const r2 = await customerService.getCustomers({ search: 'elodie' })
        expect(r2.customers).toHaveLength(1)
      })

  • Have you considered using collation? mongodb.com/docs/manual/reference/collation Commented Nov 9 at 3:22
  • I tried collation. The problem is that it doesn't support partial matching: $regex ignores the collation, so collation only helps with exact matches in $match. Commented Nov 9 at 8:27
  • The best would be storing a normalized field. Or Atlas search indexes. Commented Nov 10 at 18:09
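A minimal sketch of the normalized-field idea from the last comment, under stated assumptions: `nameNormalized` is a hypothetical extra field that would hold the folded, lowercased copy of `name`, and `foldDiacritics` and `escapeRegex` are illustrative helpers that are not in the original code. Only the accents in the Combining Diacritical Marks block (U+0300 to U+036F) are handled.

```typescript
// Hypothetical field name; it is not part of the schema in the question.
const NORMALIZED_FIELD = 'nameNormalized'

// Decompose (NFD), strip combining marks, lowercase.
function foldDiacritics(input: string): string {
  return input
    .normalize('NFD')
    .replace(/[\u0300-\u036f]/g, '')
    .toLowerCase()
}

// Escape regex metacharacters so user input is matched literally.
function escapeRegex(input: string): string {
  return input.replace(/[.*+?^${}()|[\]\\]/g, '\\$&')
}

// Build a $match stage that runs a substring regex against the folded field.
// No 'i' option is needed because both sides are already lowercased.
function searchCustomersStage(search: string) {
  return {
    $match: {
      [NORMALIZED_FIELD]: { $regex: escapeRegex(foldDiacritics(search)) },
    },
  }
}
```

The field would have to be written wherever `name` is written, for example with `foldDiacritics(name)`. Note that the tests in the question use `insertMany`, which does not run Mongoose `save` middleware, so a `pre('save')` hook alone would not populate it there. An unanchored regex still cannot use tight index bounds, so this fixes correctness rather than guaranteeing performance.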

1 Answer


This is a perfect candidate for the techniques and technology detailed in my article MongoDB Text Search: Substring Pattern Matching Including Regex and Wildcard, Use Search Instead (Part 3). We have recently released Search and Vector Search (formerly Atlas-only) for Community, and they are also available for Enterprise. They let you control indexing and querying in very precise and scalable ways.
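For orientation, here is a rough sketch of what a Search-based stage might look like. It is not taken from the article, and it rests on several assumptions: the deployment has Search enabled, the index name `customer_name_search` is hypothetical, and the field options shown (an `autocomplete` mapping with `nGram` tokenization and `foldDiacritics`) should be checked against the Search documentation for your server version.

```typescript
// Hypothetical search index definition for the `name` field.
// nGram tokenization indexes substrings, so 'elo' can match inside 'Marcelo';
// foldDiacritics lets 'elo' match 'Élodie'.
const customerSearchIndex = {
  name: 'customer_name_search', // assumed name, pick your own
  definition: {
    mappings: {
      dynamic: false,
      fields: {
        name: {
          type: 'autocomplete',
          tokenization: 'nGram',
          minGrams: 2,
          maxGrams: 15,
          foldDiacritics: true,
        },
      },
    },
  },
}

// Build a $search stage that queries the index above.
// $search has to be the first stage of the aggregation pipeline.
function searchCustomersStage(search: string) {
  return {
    $search: {
      index: customerSearchIndex.name,
      autocomplete: { query: search, path: 'name' },
    },
  }
}
```

With this shape the stage would replace the $match/$or stage from the question rather than sit alongside it, since $text and $search cannot be combined in one stage.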


1 Comment

Hi, thanks for the explanation! From which MongoDB Community version are Search and Vector Search officially available?
