Data migrations are one of those engineering tasks that seem straightforward until the dataset reaches millions of records. At that point, the traditional read-transform-write pattern stops being a minor inconvenience and starts to carry real costs: downtime risk, CPU load, and operational complexity.
Data migration in MongoDB becomes especially challenging when it involves restructuring nested fields inside each document. We recently faced exactly that problem: migrating over 4 million records that required a full MongoDB schema migration, reshaping nested arrays into normalized objects across every document in the collection. Based on previous experience with similar workloads, a conventional approach would have taken over a day to complete. Instead, we performed the entire transformation inside MongoDB using an aggregation pipeline and finished in under two hours.
This post breaks down:
- Why the traditional approach breaks at scale,
- How the in-database migration strategy works, and
- What you need to apply it to your own large MongoDB collections.
Let’s dive into it.
Why Traditional MongoDB Migrations Break at Scale
The most common migration approach moves data between the database and the application layer. The flow looks like this:
MongoDB → Application → Transform → MongoDB
The migration usually:
- Queries a batch of documents,
- Transforms them in application code, and
- Sends update operations back to MongoDB.
While easy to implement, this method introduces three compounding inefficiencies as dataset size grows:
- Network overhead is the first bottleneck. Each document must travel from the database to the application server and back. At 4 million documents, even small payloads generate massive I/O traffic.
- Round trips multiply the problem. Every batch requires both a read and a write operation, and a script that updates documents individually pays at least two network round trips per document in the collection.
- Application CPU usage is the third drain. All transformation logic runs on the application server instead of inside the database engine. When dealing with millions of documents, this shifts load to the wrong layer and limits parallelization.
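To put rough numbers on the round-trip cost, here is a back-of-envelope calculation. The 1 ms latency and per-document updates are illustrative assumptions, not measurements from our migration:

```python
# Back-of-envelope latency cost for per-document round trips.
# Assumptions (illustrative): 4M documents, one read and one write
# round trip each, 1 ms average network latency per round trip.
DOCS = 4_000_000
ROUND_TRIPS_PER_DOC = 2
LATENCY_S = 0.001

total_seconds = DOCS * ROUND_TRIPS_PER_DOC * LATENCY_S
print(f"{total_seconds / 3600:.1f} hours of pure network latency")  # 2.2 hours
```

Even before counting serialization or application CPU, the wire time alone is in the hours; bulk writes and batching shrink it, but never to zero.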
When these costs accumulate, even a well-optimized data migration in MongoDB becomes a multi-hour, or multi-day, operation.
A Smarter Strategy to Migrate MongoDB: Staying Inside the Database
Instead of pulling data out of MongoDB, we moved the transformation logic into MongoDB itself using an aggregation pipeline. This approach takes advantage of MongoDB's native ability to reshape documents, transform schema structures, and write the results back to a collection without ever leaving the database engine.
The migration flow becomes:
MongoDB → Aggregation Pipeline → $merge → MongoDB
In this model, MongoDB handles reading, transforming, and writing, eliminating millions of network round trips and dramatically reducing execution time.
For teams already dealing with performance bottlenecks at the data layer, this isn’t just a migration tactic. It reflects a broader principle: computation should live as close to the data as possible. If you’re thinking through how that principle applies to your overall backend architecture and code performance, we’ve covered that in depth separately.
What a MongoDB Schema Migration Looks Like in Practice
Imagine a collection called user_reports. Each document previously stored event data as an array:
{
  "_id": ObjectId("…"),
  "events": [
    { "category": "login", "count": 3 },
    { "category": "purchase", "count": 1 }
  ]
}
The new schema requires a normalized structure:
{
  "_id": ObjectId("…"),
  "metrics": {
    "login": 3,
    "purchase": 1
  }
}
A traditional approach would read each document, convert the array into an object in application code, and write the update back. This MongoDB schema migration, converting a flat array into a keyed object, is exactly the kind of transformation that the aggregation pipeline handles natively, without a single line of application-layer code.
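For contrast, here is what the traditional pattern looks like, simulated against in-memory lists so it runs standalone; the `round_trips` counter stands in for driver network calls, and all names are illustrative rather than a real MongoDB driver API:

```python
# Simulated read-transform-write migration against in-memory lists.
# The round_trips counter stands in for driver network calls, making
# the per-batch cost of the traditional pattern visible.
round_trips = 0

def read_batch(source, offset, size):
    global round_trips
    round_trips += 1  # one read round trip per batch
    return source[offset:offset + size]

def write_batch(dest, docs):
    global round_trips
    round_trips += 1  # one write round trip per batch
    dest.extend(docs)

source = [{"_id": i, "events": [{"category": "login", "count": i}]}
          for i in range(10)]
dest, offset, BATCH = [], 0, 4

while batch := read_batch(source, offset, BATCH):
    # Application-layer reshaping: events array -> metrics object.
    transformed = [{"_id": d["_id"],
                    "metrics": {e["category"]: e["count"] for e in d["events"]}}
                   for d in batch]
    write_batch(dest, transformed)
    offset += BATCH

print(len(dest), round_trips)  # 10 documents, 7 round trips
```

Every document crosses the network twice; at 4 million documents, that loop is where the hours go.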
How MongoDB Aggregation Pipeline Performance Works at Scale
MongoDB aggregation pipelines allow complex document reshaping using built-in operators. For this data migration in MongoDB, the pipeline uses two stages:
[
  {
    $set: {
      metrics: {
        $arrayToObject: {
          $map: {
            input: "$events",
            as: "event",
            in: ["$$event.category", "$$event.count"]
          }
        }
      }
    }
  },
  {
    $merge: {
      into: "user_reports",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "discard"
    }
  }
]
End to end, the operation does three things:
- It selects documents,
- Transforms their structure using $set and $arrayToObject, and
- Writes the result back into the same collection via $merge.
All without leaving MongoDB. MongoDB aggregation pipeline performance at this scale is possible because the database engine processes transformations in memory, without serializing documents over the network.
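To see what the $set stage actually computes, here is a plain-Python equivalent for a single document; field names follow the user_reports example above:

```python
# Plain-Python equivalent of the $set stage: $map turns each event
# into a [key, value] pair, and $arrayToObject folds those pairs
# into a single object.
events = [
    {"category": "login", "count": 3},
    {"category": "purchase", "count": 1},
]

pairs = [[e["category"], e["count"]] for e in events]  # $map
metrics = dict(pairs)                                  # $arrayToObject
print(metrics)  # {'login': 3, 'purchase': 1}
```

The difference is that MongoDB evaluates this per document inside the engine, with no serialization boundary in between.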
The difference compared to application-layer processing isn't incremental; it's structural. This is the kind of database-level optimization that becomes essential when you're building systems for scale.
If you’re evaluating when to use MongoDB versus other options for your stack, our guide on choosing the right tech stack for long-term growth walks through the trade-offs in detail.
Why $merge Makes It Possible to Migrate MongoDB Efficiently
The key operator enabling this approach is $merge. It allows aggregation results to be written directly back into a collection, instead of issuing separate update queries. MongoDB handles the write operation as part of the pipeline execution itself.
This effectively turns the aggregation pipeline into a migration engine, capable of running a complete data migration in MongoDB in place, with no external writes and no application-layer bottleneck.
$merge supports five behaviors when a matching document is found (whenMatched):
- replace,
- keepExisting,
- merge,
- fail, and
- A custom update pipeline.
For most schema transformations, merge is the right choice: it updates only the fields produced by the pipeline and leaves unrelated fields intact.
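A minimal sketch of those merge semantics in plain Python, with illustrative document shapes: fields produced by the pipeline overwrite their counterparts, and fields the pipeline never touched survive.

```python
# Sketch of $merge's whenMatched: "merge" semantics: incoming fields
# from the pipeline win on overlap; untouched existing fields survive.
def when_matched_merge(existing, incoming):
    merged = dict(existing)
    merged.update(incoming)  # pipeline output wins on overlapping fields
    return merged

existing = {"_id": 1, "created_at": "2024-01-01",
            "events": [{"category": "login", "count": 3}]}
incoming = {"_id": 1, "metrics": {"login": 3}}

result = when_matched_merge(existing, incoming)
print(result["created_at"], result["metrics"])  # 2024-01-01 {'login': 3}
```

Contrast with replace, which would drop `created_at` entirely because it does not appear in the pipeline output.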
Understanding MongoDB’s operator model at this level is the kind of knowledge that compounds. Engineers who can reason about data transformation inside the database rather than defaulting to application-layer scripting consistently ship more performant systems.
Processing Documents in Batches During Data Migration in MongoDB
Even when transformations run inside the database, processing 4 million documents in a single pipeline execution carries risk. Memory limits, oplog pressure, and lock contention can all create instability in production environments.
To keep the system stable, we processed documents in batches. The workflow was:
- Query a batch of document IDs using a cursor
- Execute the aggregation pipeline scoped to those IDs
- Apply the schema transformation via $merge
- Advance the cursor and repeat until all documents were processed
Batch processing delivers three operational advantages:
- Predictable memory usage,
- Safer execution in production, and
- Easier progress tracking.
You know exactly where the migration stands at any point. This pattern also makes it easier to implement rollback logic or resume interrupted migrations. A batch-scoped pipeline can be re-run against a specific ID range without risk of double-applying transformations, as long as the $merge behavior is idempotent.
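The batching loop can be sketched in plain Python, with an in-memory dict standing in for the collection and slices of sorted IDs standing in for the cursor; the shape of the loop, not the driver calls, is the point:

```python
# In-memory sketch of the cursor-based batching loop. Re-running a
# batch is safe because already-migrated documents (no "events" field
# left) are skipped, which makes each pass idempotent.
def run_batch(coll, ids):
    for _id in ids:
        doc = coll[_id]
        if "events" in doc:  # skip documents already migrated
            doc["metrics"] = {e["category"]: e["count"]
                              for e in doc.pop("events")}

coll = {i: {"_id": i, "events": [{"category": "login", "count": i}]}
        for i in range(10)}
all_ids, BATCH = sorted(coll), 3

for start in range(0, len(all_ids), BATCH):  # advance the "cursor"
    run_batch(coll, all_ids[start:start + BATCH])

run_batch(coll, all_ids[:3])  # re-running a completed batch is a no-op
print(all("metrics" in d and "events" not in d for d in coll.values()))  # True
```

In the real migration, `run_batch` would execute the aggregation pipeline with a `$match` on the ID range instead of mutating dicts locally.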
For teams working with systems under load, performance testing strategies using K6 can help validate that batch sizes are calibrated correctly before running migrations against production data.
Results
| Migration Strategy | Documents | Duration |
| --- | --- | --- |
| Traditional read-transform-write | <500k | >24 hours |
| Aggregation pipeline migration | >4M | <2 hours |
By keeping the entire data migration in MongoDB inside the database engine, the migration avoided millions of network round trips and finished in under two hours, processing 8x more documents in roughly one-twelfth the time.
Key Lessons for Engineering Teams
Running a successful data migration in MongoDB at scale isn't just about picking the right operator; it requires a set of principles that apply across any large collection. These are the four that made the biggest difference in our approach.
Move Computation Closer to the Data
Whenever possible, transformations should run inside the database engine rather than in the application layer. The network is almost always the bottleneck, not the query.
Aggregation Pipelines Are Powerful Tools to Migrate MongoDB at Scale
MongoDB’s aggregation framework can handle complex schema transformations that many teams implement unnecessarily in application code. Before writing a migration script, check whether a pipeline operator already solves the problem.
$merge Enables Efficient In-Place Updates
Using $merge, aggregation pipelines can safely write results back to the same collection with fine-grained control over matched and unmatched document behavior.
Batch Processing Improves Reliability
Breaking large migrations into batches keeps memory usage predictable and makes it easier to monitor progress, pause safely, and resume without re-running completed work.
Final Thoughts
By using MongoDB aggregation pipelines and $merge, teams can migrate even very large MongoDB collections efficiently, without application-layer involvement and without destabilizing the production environment. For engineering teams working with large MongoDB collections, this approach is worth adding to your standard migration playbook. If you’re thinking about how these techniques fit into a broader engineering culture, our post on generalist software engineers in the AI era covers how adaptable technical thinking compounds across problems like this one.
FAQs
Is it safe to run an aggregation pipeline migration on a production database?
Yes, with the right approach. Using $merge with batching limits the impact on production traffic. We recommend running migrations during low-traffic windows and using cursor-based batching to control execution scope and memory usage.
Does this approach work for all MongoDB schema migrations?
It works best for transformations that can be expressed with aggregation operators: array flattening, field renaming, type conversions, and nested document restructuring. For MongoDB schema migration scenarios that require external data lookups or complex business logic, a hybrid approach may be more appropriate.
What’s the difference between $merge and $out when you migrate MongoDB?
$out replaces the entire collection with the pipeline output. $merge writes results back into an existing collection on a per-document basis, with configurable behavior for matched and unmatched documents. For schema updates where you want to update specific fields without overwriting others, $merge is almost always the right choice.
How do I determine the right batch size for data migration in MongoDB?
Start conservatively: 1,000 to 5,000 documents per batch is a reasonable baseline. Monitor memory usage and oplog size during the first few batches and adjust upward if the system remains stable. The goal is the largest batch that keeps execution time per batch under 30 seconds.
Can this technique be used with MongoDB Atlas?
Yes. MongoDB Atlas fully supports aggregation pipelines and $merge. Atlas also provides real-time performance monitoring, which makes it easier to track MongoDB aggregation pipeline performance and identify any resource contention during large migrations.
