This is Fundbase Nerds, written by the team behind Fundbase.

Mongoid embedded collection drawbacks

Posted by Marek Stanczyk

MongoDB embedded documents are a cool and useful concept, but they often don’t come naturally to someone coming from SQL databases, like many of us. So once we discover embedding and its features, we might be tempted to solve all our problems with it.

Let me first mention some advantages of embedding:

As usual, there is another side of the coin. Referenced collections provide more flexibility and allow better normalization, but my point here is not to compare designs. Instead, I want to show you some practical drawbacks of embedding (on Mongoid 4).

Increased size of documents slows down the queries

Regardless of how simple the query is, retrieving large documents from the database takes longer. Much longer (this query matches 4 documents with ~1300 embedded positions each):

# complete documents
[1] pry(main)> Benchmark.measure { Portfolio.where(name: name).entries }.real
=> 4.782978
# just the ids
[2] pry(main)> Benchmark.measure { Portfolio.where(name: name).pluck(:id) }.real
=> 0.010231
# without positions
[3] pry(main)> Benchmark.measure { Portfolio.where(name: name).without(:positions).entries }.real
=> 0.032833

The difference is large enough to encourage the use of only/without. This can, however, cause a problem when you need to save a document that was loaded with only some of its attributes: it’s marked as readonly, so you need to reload it first (there was also a bug where reload didn’t reset the readonly flag, which has since been fixed).

Querying the embedded collection is slow

So we embedded our data into the parent document and are happy that we don’t need to query the database for related documents, expecting fast in-memory Array-like lookups. However,

# load a `portfolio` and `fund` first
[1] pry(main)> Benchmark.measure { portfolio.positions.where(fund_id: fund.id).map(&:to_s) }.real
=> 0.457588
[2] pry(main)> Benchmark.measure { portfolio.positions.select { |position| position.fund_id == fund.id }.map(&:to_s) }.real
=> 0.04752
[3] pry(main)> Benchmark.measure { portfolio.positions.where(:start_date.gte => Date.new(2016, 1, 1)).map(&:to_s) }.real
=> 0.1306
[4] pry(main)> Benchmark.measure { portfolio.positions.select { |position| position.start_date >= Date.new(2016, 1, 1) }.map(&:to_s) }.real
=> 0.027501

Not so expected, right? The reason is that the embedded collection is a richer class than Array - it allows querying and chaining, and thus is slower (explained in this ticket). So, when you don’t need the chaining and have many lookups, you might prefer to use the embedded collection as a regular Array (you might need to first convert it to one with e.g. #entries).
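If the lookups always go by the same key, you can go one step further than a plain Array and build a Hash index once. The following is only a minimal sketch in plain Ruby, not something taken from our codebase: the Position struct is a hypothetical stand-in for the embedded document, the data is made up, and the positions Array is assumed to be what you would get from something like portfolio.positions.entries.

```ruby
require "benchmark"
require "date"

# Hypothetical stand-in for an embedded Position document; only the two
# fields used in the queries above are modelled.
Position = Struct.new(:fund_id, :start_date)

# Made-up data: pretend this Array came from `portfolio.positions.entries`.
positions = Array.new(1300) do |i|
  Position.new(i % 50, Date.new(2015, 1, 1) + i)
end

# Build the index once: fund_id => Array of positions with that fund_id.
positions_by_fund = positions.group_by(&:fund_id)

fund_ids = (0...50).to_a

# One linear scan over the whole Array per lookup.
scan_time = Benchmark.realtime do
  fund_ids.each { |id| positions.select { |pos| pos.fund_id == id } }
end

# One Hash lookup per fund against the prebuilt index.
index_time = Benchmark.realtime do
  fund_ids.each { |id| positions_by_fund.fetch(id, []) }
end

puts format("select: %.6fs, index: %.6fs", scan_time, index_time)
```

Whether the index pays off depends on how many lookups you do per load. For a handful of lookups a simple select on the Array is probably enough.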

The bottom line is that we need to think twice before embedding a relation: how large is the collection going to be, and do the advantages of locality and atomicity outweigh the potential problems caused by the increased size of the parent document? One option (besides a has_many relation) is a has_one relation to a separate document that embeds all the related documents for a single parent. That keeps the parent document smaller while still keeping all its subdocuments together, which can make sense in some scenarios (e.g. calculations involving several lookups of subdocuments).


Marek Stanczyk

Marek is a Fullstack developer at Fundbase.
Loves beautiful code and enjoys developing with Ruby and Rails the most.