In Summary

We achieved the following for our question search in Mongo:

  • p99 from >30s to 100ms
  • average 2s to 50ms

If you’re curious how, keep reading :)

Background

Allen Digital has a few million questions curated over the years, belonging to various topics and streams. These questions have been asked in previous tests and may have appeared in some papers, but most remain unutilized.
As our platform matures, we want to build an ecosystem that revolves around these questions and uses them in various places. Be it proctored tests, guided progressions, random questions from a topic, custom tests, or even personalized tests for students. To make this possible we need to fetch and filter the questions fast and in real-time with the help of quick backend systems.

While aiming for latency improvements in microservices, we need to optimize every possible layer. One such important layer is the database. Databases are, in fact, the most critical piece in this puzzle. In our case, it’s even more critical because of the vast amount of data we have. The few million questions we discussed are just the latest versions; if we consider the different versions with modified content, corrected answers, etc, the number goes up several folds (5x). Your DB reads cannot be slow when you want your API to be fast. We use MongoDB to store the questions.

Data and Initial Query

Today, we talk about how we optimized our query to make it performant and do things that were out of reach before.
The following is how the data looks: many unnecessary fields for this post have been omitted for clarity.

// https://github.com/jatinrungtaallen/mongo/blob/main/data.json
{
"_id": {
"$oid": "662fdd39e2f07061ae2576a2"
},
"qnsLevel": 3,
"questionId": "871f6a20-0650-11ef-a316-5a6e36d5de66",
"source": 3,
"status": 2,
"taxonomyData": [
{
"taxonomyId": "1701423699JY",
"classId": "750",
"subjectId": "1327",
"topicId": "1357",
"subtopicId": "1360"
},
{
"taxonomyId": "1701423699JY",
"classId": "1499",
"subjectId": "2068",
"topicId": "2098",
"subtopicId": "2101"
}
],
"type": 1,
"version": 1701648000000,
"customTags": [
{
"tag_name": "Q-REEL",
"tag_type": "boolean",
"value": "yes"
}
]
}

And these are the indexes that we started with,

// https://github.com/jatinrungtaallen/mongo/blob/main/indexes.json
[
{ "status": 1 },
{ "customTags.tag_name": 1 },
{ "taxonomyData.taxonomyId": 1, "taxonomyData.topicId": 1 },
{ "taxonomyData.taxonomyId": 1, "taxonomyData.subtopicId": 1 }
]

We started with the following query

// https://github.com/jatinrungtaallen/mongo/blob/main/query1.js
db.questions
.find({
customTags: { $elemMatch: { tag_name: 'Q-REEL', value: 'yes' } },
status: 2,
qnsLevel: 2,
$or: [
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '1418' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '664' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '664' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '1416' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '456' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '231' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '456' },
{ 'taxonomyData.taxonomyId': '1703061260BO', 'taxonomyData.subtopicId': '1416' },
],
})

There are several repeated subtopicId in the query. This will be important later. Initially, this query worked very well, but as the data grew rapidly (we migrated more data from our question bank in the new platform), the query soon hit performance bottlenecks. Even fetching ten questions like this would take ~30s (we terminated the query here) for larger queries. The average performance would be somewhere around 2s. We wanted to do randomization and use the whole question pool but were always limited to the first ten questions because DB randomization ($sample) on this data would never return.

Seeing this, we were disappointed and decided to drop randomization for the time being and focus on the first ten questions because 30 seconds with a loader was still doable, even though it was a subpar experience.

Optimization #1

We started digging into how we could optimize our query. Since our taxonomyData was an array, MongoDB was creating a multikey index. We quickly found out about $elemMatch and how it could be used to make the queries faster. This link had several examples that helped us understand what could be going wrong in our query.

We quickly changed the query to the following and noticed several improvements.

// https://github.com/jatinrungtaallen/mongo/blob/main/query2.js
db.questions
.find({
customTags: { $elemMatch: { tag_name: 'Q-REEL', value: 'yes' } },
status: 2,
qnsLevel: 2,
taxonomyData: {
$elemMatch: {
$or: [
{ taxonomyId: '1703061260BO', subtopicId: '1418' },
{ taxonomyId: '1703061260BO', subtopicId: '664' },
{ taxonomyId: '1703061260BO', subtopicId: '1418' },
{ taxonomyId: '1703061260BO', subtopicId: '667' },
{ taxonomyId: '1703061260BO', subtopicId: '667' },
{ taxonomyId: '1703061260BO', subtopicId: '1416' },
{ taxonomyId: '1703061260BO', subtopicId: '664' },
{ taxonomyId: '1703061260BO', subtopicId: '456' },
{ taxonomyId: '1703061260BO', subtopicId: '231' },
{ taxonomyId: '1703061260BO', subtopicId: '1467' },
{ taxonomyId: '1703061260BO', subtopicId: '124' },
],
},
},
})
.explain('executionStats');

The query would run for ~15s (down from 30s earlier), and the average time was down to ~800ms. It was undoubtedly an improvement.

We don’t have the graph handy for this query because it was a long time ago, but here’s a similar one we changed using the above optimization recently. Notice how the performance improves significantly because of $elemMatch

Now the query was nicely indexed, we enabled randomization, but it broke the database again. About half of the queries would return, but the others would take too long. Our students still had to wait for randomized questions in their practice.

Optimization #2

The query was still taking time, so we added a new index,

{ "taxonomyData.taxonomyId": 1, "taxonomyData.subtopicId": 1, "status": 1 }

However, it did not improve the query performance when we thought it would because we were filtering even more than before, right?

We tried to understand the explain output and quickly concluded that Mongo used the wrong index for our query, { "status": 1 } . We were unsure why and decided to add a hint for our newly created index.

Adding the hint improved the performance significantly. We were down from 800ms to 150ms average latency. This was nice, and we ran in this configuration for a few months.

However, the p99 was still something that was very far. We would get occasional spikes, and the large queries would still run beyond 5s. We added a query timeout at 2s and let the call fail, as only a few calls would take more time than this.

Adding randomization still would take the average time to 800ms, so we decided not to have it and started looking for other alternatives to randomization, like

  • Fetch the first 200 questions and randomly select 10 out of them. This would give our students a better experience but fails to meet our expectations. We debated for a long and decided to park it for later as other priorities had come up, and this would be a stop-gap solution.

We added one more index, thinking it would perform better, but no performance was improved.

{ "taxonomyData.taxonomyId": 1, "taxonomyData.subtopicId": 1, "status": 1, "qnsLevel": 1 }

We tried several combinations and concluded that Mongo was not using status and qnsLevel fields from the index. It only used the multikey prefix and ignored the other fields.

We also figured out why Mongo was not using the correct index during this process. Here is a snippet of the query explaining the stats. The full version is attached for reference.

// https://github.com/jatinrungtaallen/mongo/blob/main/explain1.js
{
"stage": "FETCH",
"inputStage": {
"stage": "OR",
"inputStages": [
{
"stage": "IXSCAN",
"keyPattern": {
"taxonomyData.taxonomyId": 1,
"taxonomyData.subtopicId": 1,
"status": 1
},
"indexName": "taxonomyData.taxonomyId_1_taxonomyData.subtopicId_1_status_1",
"multiKeyPaths": {
"taxonomyData.taxonomyId": ["taxonomyData"],
"taxonomyData.subtopicId": ["taxonomyData"],
"status": []
},
"direction": "forward",
"indexBounds": {
"taxonomyData.taxonomyId": ["[\"1701423699JY\", \"1701423699JY\"]"],
"taxonomyData.subtopicId": ["[\"1416\", \"1416\"]"],
"status": ["[MinKey, MaxKey]"]
}
},
{
"stage": "IXSCAN",
"keyPattern": {
"taxonomyData.taxonomyId": 1,
"taxonomyData.subtopicId": 1,
"status": 1,
"qnsLevel": 1
},
"indexName": "taxonomyData.taxonomyId_1_taxonomyData.subtopicId_1_status_1_qnsLevel_1",
"multiKeyPaths": {
"taxonomyData.taxonomyId": ["taxonomyData"],
"taxonomyData.subtopicId": ["taxonomyData"],
"status": [],
"qnsLevel": []
},
"indexBounds": {
"taxonomyData.taxonomyId": ["[\"1701423699JY\", \"1701423699JY\"]"],
"taxonomyData.subtopicId": ["[\"1360\", \"1360\"]"],
"status": ["[MinKey, MaxKey]"],
"qnsLevel": ["[MinKey, MaxKey]"]
}
}
]
}
}

The query planner took too long to decide which index combinations to use. Since we had multiple indexes with the same prefix, it would exhaust the plan limit trying out different combinations.

If there were 2 indexes and N subtopics, the query planner would have to try 2^N combinations just for this. That is why adding a hint improved query performance by several folds.

We also noticed the index bounds for status and qnsLevel would always be the full range; it didn’t help the index. We could not get Mongo to include those at this point, no matter what we tried.

The duplicate subtopics were also taking execution time as noticed in the explain plan. Mongo was including them and then dropping them.

// https://github.com/jatinrungtaallen/mongo/blob/main/explain1.js
{
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 168,
totalKeysExamined: 19578,
totalDocsExamined: 16623,
executionStages: {
stage: 'FETCH',
nReturned: 1,
executionTimeMillisEstimate: 109,
works: 19582,
advanced: 1,
needTime: 19580,
needYield: 0,
saveState: 20,
restoreState: 20,
isEOF: 1,
docsExamined: 16623,
alreadyHasObj: 0,
inputStage: {
stage: 'OR',
nReturned: 16623,
executionTimeMillisEstimate: 32,
works: 19582,
advanced: 16623,
needTime: 2958,
needYield: 0,
saveState: 20,
restoreState: 20,
isEOF: 1,
dupsTested: 19578,
dupsDropped: 2955,
}
}
}

We quickly filtered the duplicates in our app after realizing this. That also helped shave around ~30ms from our average time.

Since Mongo wasn’t using some fields in the index, it would fetch all those docs (16k in the above snippet) and then filter in memory. We are using a general disk provided by Mongo Atlas, and fetching these many documents in just the average case is very slow. We considered changing to NVME SSDs but decided against it as it would cost us almost 1.5x what we were paying and for an uncertain gain.

So, after all these, we reduced the average query time from 2s to 120ms. We still hadn’t enabled randomization yet and were almost happy with the performance. But our next product (custom tests) launch meant that randomization was necessary. We wouldn’t want two tests on the same topic to have the same questions when we had millions of them at our disposal.

Enter Atlas Search (Optimization #3)

Since we were already using Mongo Atlas, we thought of trying Atlas Search.

The initial sync took time, and we were ready to go.

It used ~14GB of disk space after we could search and replicate our above query. We used Keyword Analyzer for our fields because we wanted a full match. Using the default analyzer would lead to unexpected results and a large node size.

The modified query looks something like this,

// https://github.com/jatinrungtaallen/mongo/blob/main/atlas_query1.js
db.questions.aggregate([
{
$search: {
compound: {
must: [
{ equals: { path: 'status', value: 2 } },
{ equals: { path: 'qnsLevel', value: 2 } },
{
embeddedDocument: {
path: 'customTags',
operator: {
compound: {
must: [
{ text: { query: 'yes', path: 'customTags.value' } },
{ text: { query: 'Q-REEL', path: 'customTags.tag_name' } },
],
},
},
},
},
{
compound: {
minimumShouldMatch: 1,
should: [
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{ text: { query: '1418', path: 'taxonomyData.subtopicId' } },
],
},
},
},
},
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{ text: { query: '664', path: 'taxonomyData.subtopicId' } },
],
},
},
},
},
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{ text: { query: '667', path: 'taxonomyData.subtopicId' } },
],
},
},
},
},
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{ text: { query: '1416', path: 'taxonomyData.subtopicId' } },
],
},
},
},
},
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{ text: { query: '231', path: 'taxonomyData.subtopicId' } },
],
},
},
},
},
],
},
},
],
},
},
},
{ $sample: { size: 10 } },
]).explain('executionStats');

This ran in less than 50ms! Just by enabling Atlas Search, we brought down the query time significantly. We quickly tried randomization in the console, which ran in ~150ms. We had found the holy grail of optimizations. We enabled randomization in prod and monitored the performance.

The spikes in the yellow line are us trying randomization for our internal users to see how the performance was before Atlas search.

We were now serving random questions within the same time we took to serve the first ten questions for a given subtopic! We did not compare the query performance without randomization, but it would have been significantly lower.

The worst query we observed still took 1s to run. p99 was down to 700ms with randomisation enabled

Optimization #4

Our questions contain content, images, and other metadata, which were also being indexed. That led to the enormous index size. This meant we could only update the index with downtime if we did not add any extra disk. We made a custom mapping and indexed only the required fields. That reduced the index size to a mere ~500Mb

We saw memory usage decrease proportionally.

While we were thrilled with what we achieved, the worst query we observed still took 1s to run. p99 was down to 700ms

In the hunger for more, we decided to look into the query plan for the above. It was a complicated nested structure, but this part stood out the most:

// https://github.com/jatinrungtaallen/mongo/blob/main/atlas_explain1.js
[
{
path: 'compound.must[3].compound.should[2]',
type: 'WrappedToParentBlockJoinQuery',
stats: {
context: {
millisElapsed: 0.170723,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('40'),
},
},
match: { millisElapsed: 0 },
score: { millisElapsed: 0 },
},
args: {
query: {
type: 'BooleanQuery',
stats: {
context: { millisElapsed: 0 },
match: { millisElapsed: 0 },
score: { millisElapsed: 0 },
},
args: {
must: [
{
path: 'compound.must[3].compound.should[2].compound.must[1]',
type: 'TermQuery',
stats: {
context: {
millisElapsed: 0.069281,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('40'),
},
},
match: { millisElapsed: 0 },
score: { millisElapsed: 0 },
},
args: {
path: 'taxonomyData.subtopicId',
value: '667',
},
},
{
path: 'compound.must[3].compound.should[2].compound.must[0]',
type: 'TermQuery',
stats: {
context: {
millisElapsed: 0.053304,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('40'),
},
},
match: { millisElapsed: 0 },
score: { millisElapsed: 0 },
},
args: {
path: 'taxonomyData.taxonomyId',
value: '1703061260BO',
},
},
],
mustNot: [],
should: [],
filter: [],
minimumShouldMatch: 0,
},
},
},
}
]

The plan was filled with the above structure, and Mongo was going through the index multiple times as many subqueries for a taxonomyId_subtopicId were there (apart from other filters). It was taking a lot of time.

We changed the query structure to something like this,

// https://github.com/jatinrungtaallen/mongo/blob/main/atlas_query2.js
db.questions.aggregate([
{
$search: {
compound: {
must: [
{ equals: { path: 'status', value: 2 } },
{ equals: { path: 'qnsLevel', value: 3 } },
{
embeddedDocument: {
path: 'customTags',
operator: {
compound: {
must: [
{ text: { query: 'yes', path: 'customTags.value' } },
{ text: { query: 'Q-REEL', path: 'customTags.tag_name' } },
],
},
},
},
},
{
compound: {
minimumShouldMatch: 1,
should: [
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{
compound: {
minimumShouldMatch: 1,
should: [
{ text: { query: '1418', path: 'taxonomyData.subtopicId' } },
{ text: { query: '664', path: 'taxonomyData.subtopicId' } },
{ text: { query: '667', path: 'taxonomyData.subtopicId' } },
{ text: { query: '1416', path: 'taxonomyData.subtopicId' } },
{ text: { query: '231', path: 'taxonomyData.subtopicId' } },
],
},
},
],
},
},
},
},
],
},
},
],
},
},
},
{ $count: 'count' },
]).explain('executionStats');

Every taxonomyId would only have a single block, and the subtopicId would be in OR conditions inside this block. Earlier, each taxonomyId_subtopicId was in a block of its own. We can also write the above query in this format for more readability

// https://github.com/jatinrungtaallen/mongo/blob/main/atlas_query2_alt.js
db.questions.aggregate([
{
$search: {
compound: {
must: [
{ equals: { path: 'status', value: 2 } },
{ equals: { path: 'qnsLevel', value: 3 } },
{
embeddedDocument: {
path: 'customTags',
operator: {
compound: {
must: [
{ text: { query: 'yes', path: 'customTags.value' } },
{ text: { query: 'Q-REEL', path: 'customTags.tag_name' } },
],
},
},
},
},
{
compound: {
minimumShouldMatch: 1,
should: [
{
embeddedDocument: {
path: 'taxonomyData',
operator: {
compound: {
must: [
{ text: { query: '1703061260BO', path: 'taxonomyData.taxonomyId' } },
{
queryString: {
query:
'1418 OR 664 OR 667 OR 1416 OR 231',
defaultPath: 'taxonomyData.subtopicId',
}
}
]
}
}
}
}
]
}
}
]
}
}
},
{ $count: "count" },
]).explain('executionStats');

These two queries generate the same query plan and have the same performance. We chose the first version in our code because it’s easier to generate programmatically, and also, if we needed to add other conditions, it would be trivial.
The query plan changed to something like this

// https://github.com/jatinrungtaallen/mongo/blob/main/atlas_explain2.js
[
{
path: 'compound.must[3].compound.should.compound.must[0]',
type: 'TermQuery',
stats: {
context: {
millisElapsed: 0.065064,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('39'),
},
},
match: { millisElapsed: 0 },
score: {
millisElapsed: 0.008271,
invocationCounts: { score: Long('8') },
},
},
args: {
path: 'taxonomyData.taxonomyId',
value: '1703061260BO',
},
},
{
path: 'compound.must[3].compound.should.compound.must[1]',
type: 'BooleanQuery',
stats: {
context: {
millisElapsed: 0.477606,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('39'),
},
},
match: {
millisElapsed: 0.012053,
invocationCounts: { nextDoc: Long('8') },
},
score: {
millisElapsed: 0.021702,
invocationCounts: { score: Long('8') },
},
},
args: {
must: [],
mustNot: [],
should: [
{
path: 'compound.must[3].compound.should.compound.must[1].compound.should[1]',
type: 'TermQuery',
stats: {
context: {
millisElapsed: 0.07381,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('37'),
},
},
match: {
millisElapsed: 0.007747,
invocationCounts: { nextDoc: Long('3') },
},
score: {
millisElapsed: 0.016377,
invocationCounts: { score: Long('3') },
},
},
args: {
path: 'taxonomyData.subtopicId',
value: '664',
},
},
{
path: 'compound.must[3].compound.should.compound.must[1].compound.should[0]',
type: 'TermQuery',
stats: {
context: {
millisElapsed: 0.090813,
invocationCounts: {
createWeight: Long('1'),
createScorer: Long('37'),
},
},
match: { millisElapsed: 0 },
score: { millisElapsed: 0 },
},
args: {
path: 'taxonomyData.subtopicId',
value: '1418',
},
},
],
filter: [],
minimumShouldMatch: 1,
},
},
];

This improved the average performance by ~50%.

The p99 also fell to around 100ms, down from 700ms

This is where we will stop in the optimization journey as we have achieved so much. It was a fun ride, and we learned so much about how Mongo works over the weeks that we took this challenge.

Bonus Optimization

While we did not use this optimization for ourselves as we went the Atlas search route, this might be helpful for not using Atlas Search.

While figuring out Optimization #4 for duplicate blocks in the atlas search query, could we improve our Optimization #2 using the same principles? It turns out we can,

// https://github.com/jatinrungtaallen/mongo/blob/main/bonusquery.js
db.questions.find({
customTags: { $elemMatch: { tag_name: 'Q-REEL', value: 'yes' } },
status: 2,
qnsLevel: 2,
taxonomyData: {
$elemMatch: {
taxonomyId: '1703061260BO', subtopicId: {$in: ['1418','664','667','1416']},
},
},
}).explain("executionStats");

Instead of several blocks, we have a single block and subtopicId with the $in operator. This works the same as OR in our atlas search query.

The execution plan is something like this

executionStats: {
executionSuccess: true,
nReturned: 1,
executionTimeMillis: 50,
totalKeysExamined: 4901,
totalDocsExamined: 4891,
executionStages: {
stage: 'FETCH',
nReturned: 1,
executionTimeMillisEstimate: 47,
works: 4901,
advanced: 1,
needTime: 4899,
needYield: 0,
saveState: 5,
restoreState: 5,
isEOF: 1,
docsExamined: 4891,
alreadyHasObj: 0,
inputStage: {
stage: 'IXSCAN',
nReturned: 4891,
executionTimeMillisEstimate: 7,
works: 4901,
advanced: 4891,
needTime: 9,
needYield: 0,
saveState: 5,
restoreState: 5,
isEOF: 1,
keyPattern: {
'taxonomyData.taxonomyId': 1,
'taxonomyData.subtopicId': 1,
status: 1,
qnsLevel: 1,
},
indexName: 'taxonomyData.taxonomyId_1_taxonomyData.subtopicId_1_status_1_qnsLevel_1',
isMultiKey: true,
multiKeyPaths: {
'taxonomyData.taxonomyId': ['taxonomyData'],
'taxonomyData.subtopicId': ['taxonomyData'],
status: [],
qnsLevel: [],
},
isUnique: false,
isSparse: false,
isPartial: false,
indexVersion: 2,
direction: 'forward',
indexBounds: {
'taxonomyData.taxonomyId': ['["1703061260BO", "1703061260BO"]'],
'taxonomyData.subtopicId': ['["1416", "1416"]', '["1418", "1418"]', '["664", "664"]', '["667", "667"]'],
status: ['[2, 2]'],
qnsLevel: ['[2, 2]'],
},
keysExamined: 4901,
seeks: 10,
dupsTested: 4891,
dupsDropped: 0,
},
},
},

It is taking the compound index properly now and using the ranges properly. docsExamined dropped from ~16k to ~5k and execution time dropped to 50ms, similar to our Atlas Search query.

We will probably figure out later on why docsExamined has this huge impact on latency, as we also want to compare Atlas search performance with the above query performance.

TLDR

So overall, we went from

  • p99 from >30s to 100ms
  • average 2s to 50ms
  • first ten questions to random ten questions from the whole question pool

The code used in the post is available at — https://github.com/jatinrungtaallen/mongo

We are always on the lookout for passionate individuals who want to make a difference in the Indian education system. Visit our career page to learn more about the exciting opportunities at ALLEN Digital and how you can be part of our mission: lnkd.in/gyZiNZXv

Join ALLEN!
(Session 2024 - 25)

Choose class
Choose your goal
Preferred Mode
Choose State