

I have a Mongo collection, CollectionA, where each top-level object contains a nested array of meetings. Each meeting has a start and end time, for example:

CollectionA = [
    {
        "id": 1,
        "meetings": [
            {"start": "2025-01-01T10:00:00", "end": "2025-01-01T11:00:00"},
            {"start": "2025-01-10T09:00:00", "end": "2025-01-10T09:30:00"},
        ]
    },
    {
        "id": 2,
        "meetings": [
            {"start": "2025-03-01T14:00:00", "end": "2025-03-01T15:00:00"}
        ]
    },
    ...
]

I frequently need to filter these objects by a date range — for example:

“Find all objects that have at least one meeting overlapping [query_start, query_end].”

However, this dataset can be large (thousands of objects, each with dozens of nested intervals), and I also use pagination to load CollectionA gradually.
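For reference, the overlap test I mean is the standard one: intervals [a, b] and [c, d] overlap iff a < d and c < b. A minimal, illustrative Python helper (the `overlaps` name is mine, not from my codebase):

```python
from datetime import datetime

def overlaps(m_start, m_end, q_start, q_end):
    """True iff [m_start, m_end] intersects [q_start, q_end]."""
    return m_start < q_end and q_start < m_end

# Meeting on Jan 1 vs. query window Jan 5-6: no overlap.
m = (datetime(2025, 1, 1, 10), datetime(2025, 1, 1, 11))
q = (datetime(2025, 1, 5), datetime(2025, 1, 6))
print(overlaps(*m, *q))  # → False
```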


🧠 Current Problem

Current approach:

  1. Filter CollectionA on some top-level fields.
  2. Filter CollectionA further on the nested meetings array.

This approach is getting expensive, and splitting the whole collection is not something we are planning for.

Here's my pipeline:

[
    {'$match': {'is_deleted': False, 'seller_company_id': ObjectId('XXX'), 'is_hidden': False
        }
    },
    {'$lookup': {'from': 'Company', 'let': {'companyId': '$seller_company_id'
            }, 'pipeline': [
                {'$match': {'$expr': {'$eq': ['$_id', '$$companyId'
                            ]
                        }
                    }
                },
                {'$project': {'configuration.crm_stages': 1
                    }
                }
            ], 'as': 'company'
        }
    },
    {'$unwind': '$company'
    },
    {'$addFields': {'crm_stage_info': {'$ifNull': [
                    {'$first': {'$filter': {'input': {'$ifNull': ['$company.configuration.crm_stages',
                                        []
                                    ]
                                }, 'as': 'stage', 'cond': {'$eq': ['$$stage.name', '$crm_stage.name'
                                    ]
                                }
                            }
                        }
                    }, None
                ]
            }
        }
    },
    {'$addFields': {'crm_stage': {'$cond': [
                    {'$ne': ['$crm_stage_info', None
                        ]
                    }, '$crm_stage_info', '$crm_stage'
                ]
            }
        }
    },
    {'$addFields': {'meetings': {'$filter': {'input': {'$ifNull': ['$meetings',
                            []
                        ]
                    }, 'as': 'meeting', 'cond': {'$and': [
                            {'$ne': ['$$meeting.is_meeting_deleted', True
                                ]
                            },
                            {'$and': [
                                    {'$gte': ['$$meeting.start_meet', datetime.datetime(2025,
                                            10,
                                            7,
                                            18,
                                            30, tzinfo=datetime.timezone.utc)
                                        ]
                                    },
                                    {'$lte': ['$$meeting.start_meet', datetime.datetime(2025,
                                            10,
                                            15,
                                            18,
                                            29,
                                            59,
                                            999000, tzinfo=datetime.timezone.utc)
                                        ]
                                    }
                                ]
                            }
                        ]
                    }
                }
            }
        }
    },
    {'$project': {'_id': 1, 'name': 1, 'is_lead_qualified': 1, 'is_from_calendar': {'$ifNull': ['$is_from_calendar', False
                ]
            }, 'is_hidden': {'$ifNull': ['$is_hidden', False
                ]
            }, 'is_closed': {'$ifNull': ['$is_closed', False
                ]
            }, 'is_closed_won': {'$ifNull': ['$is_closed_won', False
                ]
            }, 'is_closed_lost': {'$ifNull': ['$is_closed_lost', False
                ]
            }, 'updated_on': 1, 'created_on': 1, 'user_id': 1, 'average_sales_score': 1, 'total_sales_score': 1, 'crm_info': 1, 'crm_stage': 1, 'recent_meeting_stage': 1, 'meetings._id': 1, 'meetings.title': 1, 'meetings.start_meet': 1, 'meetings.end_meet': 1, 'meetings.meeting_stage': 1, 'meetings.bot_id': 1, 'meetings.is_completed': 1, 'meetings.is_meet_proper': 1, 'meetings.is_copilot_allowed': 1, 'meetings.crm_push_error': 1, 'meetings.is_crm_data_pushed': 1, 'meetings.crm_push_error_info': 1
        }
    },
    {'$match': {'meetings': {'$ne': []
            }
        }
    },
    {'$sort': {'meetings.start_meet': -1
        }
    },
    {'$facet': {'totalCount': [
                {'$count': 'count'
                }
            ], 'results': [
                {'$skip': 1
                },
                {'$limit': 50
                }
            ]
        }
    },
    {'$project': {'results': 1, 'total': {'$ifNull': [
                    {'$arrayElemAt': ['$totalCount.count',
                            0
                        ]
                    },
                    0
                ]
            }
        }
    }
]

So I need a data structure or indexing strategy that can:

  1. Precompute something at the top level.
  2. Allow me to filter objects by a time range.
  3. Tell me with certainty whether there exists at least one nested interval in that range.
  4. Work efficiently with pagination (i.e., I can skip irrelevant objects quickly).

🧩 Example

Suppose my query is:

query_start = "2025-01-05T00:00:00"
query_end   = "2025-01-06T00:00:00"

With a min start / max end summary, object id=1 would still look like a candidate (its meetings span Jan 1 to Jan 10), yet no single meeting overlaps the query range. That means I'm still checking the nested meetings inside this object for no reason, which scales poorly. I want to fetch only the matching top-level objects efficiently, ideally in a way that lets our filtering be index-based.


💭 What I’ve Considered

  • Storing the minimum start time and the maximum end time at the top level. This will surely shrink the search space, but it will still produce many false positives whose nested arrays we then have to iterate.
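To illustrate why the envelope still yields false positives, a quick sketch in pure Python, using the example document from above:

```python
from datetime import datetime

doc = {  # object id=1 from the example above
    "id": 1,
    "meetings": [
        {"start": datetime(2025, 1, 1, 10), "end": datetime(2025, 1, 1, 11)},
        {"start": datetime(2025, 1, 10, 9), "end": datetime(2025, 1, 10, 9, 30)},
    ],
}

# Precomputed top-level envelope: min start / max end over the array.
env_start = min(m["start"] for m in doc["meetings"])
env_end = max(m["end"] for m in doc["meetings"])

q_start, q_end = datetime(2025, 1, 5), datetime(2025, 1, 6)

# The envelope says "candidate" ...
print(env_start < q_end and q_start < env_end)  # → True
# ... but no single meeting actually overlaps the query: a false positive.
print(any(m["start"] < q_end and q_start < m["end"] for m in doc["meetings"]))  # → False
```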

🚀 What I’m Looking For

I’m looking for the best data structure, algorithmic approach or better solution that:

  • Reduces or eliminates false positives (so I don’t iterate objects unnecessarily).
  • Allows quick filtering by time range.
  • Works well with pagination (i.e., sequential fetching of matching objects).
  • Can be implemented in MongoDB with reasonable preprocessing.

📊 Constraints

  • Each object can have 10–200 nested intervals.
  • Typical queries are small date ranges (1–3 days).
  • Total objects: 11k+
  • Each object is very data-heavy
  • Performance matters — I’d like to minimize per-query iteration.
  • I can afford a preprocessing step to build an index or compressed structure.

💬 Question

What is an efficient way to precompute or index nested time ranges so that:

  • I can quickly find top-level objects with at least one nested interval overlapping a query range,
  • without scanning every nested array,
  • and while supporting pagination?

Would appreciate any advice, data structure recommendations (Interval Tree, Segment Tree, compressed range lists, time buckets, etc.), or real-world patterns you’ve used in similar “nested interval query” scenarios.
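To make the time-bucket candidate concrete, here's the kind of precomputation I have in mind: store the distinct calendar days each object's meetings cover, then match with $in. The field name `meeting_days` is hypothetical:

```python
from datetime import datetime, timedelta

def meeting_days(meetings):
    """Distinct calendar days covered by any meeting (UTC-naive here)."""
    days = set()
    for m in meetings:
        d = m["start"].date()
        while d <= m["end"].date():
            days.add(d.isoformat())
            d += timedelta(days=1)
    return sorted(days)

meetings = [
    {"start": datetime(2025, 1, 1, 10), "end": datetime(2025, 1, 1, 11)},
    {"start": datetime(2025, 1, 10, 9), "end": datetime(2025, 1, 10, 9, 30)},
]
days = meeting_days(meetings)
print(days)  # → ['2025-01-01', '2025-01-10']

# Stored per object, this array could be matched with an indexable query:
query = {"meeting_days": {"$in": ["2025-01-05", "2025-01-06"]}}
```

Because membership is exact at day granularity, the Jan 5–6 query would skip object id=1 entirely, unlike the min/max envelope.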

  • Wrt "Current Approach: Filter the CollectionA based on some top level fields; filter the CollectionA further based on nested meetings array" — show what you have tried so far (code or query). Read: How to create a Minimal, Reproducible Example. Commented Oct 15 at 10:10
  • Did you try creating an index on meetings.start and meetings.end? Commented Oct 15 at 10:10
  • @aneroid Yes, I did add the indices, but they were never used when I ran the pipeline. Commented Oct 15 at 11:34
  • This is why the MRE is important. Your very first stage in the pipeline in your current edit has NOTHING to do with meeting start & end: {'$match': {'is_deleted': False, 'seller_company_id': ObjectId('XXX'), 'is_hidden': False}}. Then you have a lookup and other stages until the 6th stage, where you have your meeting.start & meeting.end in a $filter expression! Your documents are now being processed in the pipeline/RAM. The index is for the initial disk fetch. Commented Oct 15 at 13:34
  • Also, your pipeline has conditions on fields which are not in your example documents, so it's not "reproducible" for us. Commented Oct 15 at 13:35

1 Answer


Your very first stage in the pipeline in your current edit has nothing to do with meeting start & end:

{'$match': {'is_deleted': False, 'seller_company_id': ObjectId('XXX'), 'is_hidden': False}}

Then you have a $lookup stage and some others until the 6th stage where you have meeting.start & meeting.end in a $filter expression! Your documents are now being processed in the pipeline/RAM. The index is for the initial disk fetch at the start.


Now, your actual meeting-filtering stage doesn't appear to require data from the lookup or other documents. So you can move that to the first stage and use $elemMatch.

db.collection.aggregate([
  {
    $match: {
      is_deleted: false,
      seller_company_id: "ObjectId('XXX')",
      is_hidden: false,
      meetings: {
        $elemMatch: {
          // use `False` in Python
          is_meeting_deleted: false,
          start: {
            // use datetime.datetime in Python
            $gte: ISODate("2025-10-07T18:30:00Z"),
            $lt: ISODate("2025-10-15T18:30:00Z")
          }
        }
      }
    }
  },
  {
    $unwind: "$meetings"
  },
  {
    $match: {
      // repeat the check above
      "meetings.is_meeting_deleted": false,
      "meetings.start": {
        // use datetime.datetime in Python
        $gte: ISODate("2025-10-07T18:30:00Z"),
        $lt: ISODate("2025-10-15T18:30:00Z")
      }
    }
  }
])

Add the rest of your stages after these.

Mongo Playground with improved data. The 2nd meeting of id=1 and the 1st meeting of id=2 will match.

Stages:

  1. The first $match stage will use an index on meeting start & end if you have it.
  2. Add the other fields to the index too. That will reduce the number of matched documents.
  3. Then unwind meetings and match again so that only those specific meetings are in the result. (This second match won't use the index.)
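If you're creating that index from Python, the compound index could be declared roughly like this: equality-matched fields first, the range field last. This is a sketch with field names taken from your pipeline, not tested against your cluster:

```python
# Compound index: equality fields first, then the range field last.
index_spec = [
    ("is_deleted", 1),
    ("seller_company_id", 1),
    ("is_hidden", 1),
    ("meetings.start_meet", 1),  # or meetings.start, per your real schema
]

# With PyMongo, assuming `db` is a Database handle:
# db.CollectionA.create_index(index_spec)
print([field for field, _ in index_spec])
```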

Notes:

  • Your existing match for meeting uses start_meet which doesn't exist in your example docs. Those just have start.
  • You are matching start with >= and then again start <= 18:29:59.999. That can just be "less than 18:30", i.e. start < 18:30:00.
  • You aren't checking meeting end in your filter stage, so the overlap logic is probably wrong. Fix that when using the pipeline above.
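On that last point: if meetings also store an end time (assuming a field named end_meet to mirror start_meet), the full overlap test in the first $match would be "starts before the window ends AND ends after the window starts", something like:

```python
from datetime import datetime, timezone

q_start = datetime(2025, 10, 7, 18, 30, tzinfo=timezone.utc)
q_end = datetime(2025, 10, 15, 18, 30, tzinfo=timezone.utc)

# Overlap: meeting starts before the window ends AND ends after it starts.
match_stage = {
    "$match": {
        "meetings": {
            "$elemMatch": {
                "is_meeting_deleted": False,
                "start_meet": {"$lt": q_end},
                "end_meet": {"$gt": q_start},  # field name assumed
            }
        }
    }
}
```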

Edit: Converted the pipeline above to Python, copy-pasted from MongoDB Compass:

[
    {
        '$match': {
            'is_deleted': False, 
            'seller_company_id': ObjectId('XXX'), 
            'is_hidden': False, 
            'meetings': {
                '$elemMatch': {
                    'is_meeting_deleted': False, 
                    'start': {
                        '$gte': datetime(2025, 10, 7, 18, 30, 0, tzinfo=timezone.utc), 
                        '$lt': datetime(2025, 10, 15, 18, 30, 0, tzinfo=timezone.utc)
                    }
                }
            }
        }
    },
    {
        '$unwind': '$meetings'
    },
    {
        '$match': {
            'meetings.is_meeting_deleted': False, 
            'meetings.start': {
                '$gte': datetime(2025, 10, 7, 18, 30, 0, tzinfo=timezone.utc), 
                '$lt': datetime(2025, 10, 15, 18, 30, 0, tzinfo=timezone.utc)
            }
        }
    }
]

1 Comment

Thanks a lot @aneroid. I made the change as you mentioned, and there wasn't much improvement. I feel what you suggested makes sense at the query level. I am new to this; I think the issue is not just here. Even when I don't filter by date at all, I still see a huge response time. Whatever operation I do to fetch multiple objects of CollectionA, to be displayed on my dashboard along with meetings, I get huge latency. I think there's something at the infra level. I have an M20 instance in Mongo. Any feedback will be highly appreciated.
