10 Rules for fair ORM benchmarks

Vasil Rangelov - Sep 10 - Dev Community

I often stumble upon ORM benchmarks, and there seems to be a general problem of them not being fair, though not necessarily for the reasons you may think. While I have yet to create my own benchmark that I would consider "fair", I feel I should write out what I think any fair benchmark of an ORM should be. And if someone else reads this and creates a fair benchmark satisfying all of these... awesome. Please share in the comments.

It's worth noting that the "rules" I am about to write are more like "guidelines", but the word "rule" is probably still appropriate, in the sense that when you do stray from them, you should have a good reason to, beyond "it was easy to create the benchmark this way".

Why 10? No specific reason really... That's just what it came to once the list was finished. There may be more I just haven't considered, or that were too obvious to add (e.g. "compare results from tests on the same machine"... duh... That doesn't need enumerating).

1. Compare ORMs under the same programming language and database engine

This one probably seems obvious, but still worth enumerating, as you can sometimes see articles talking about significant performance improvements when switching one or both of these things.

Your programming language and database engine of choice will tie you to a performance ceiling, so it's not fair to compare e.g. an SQLite DB accessed via a Rust ORM vs a PostgreSQL database accessed from a NodeJS ORM.

It is valuable to compare database clients of the same database engine across languages (e.g. MySQL over PHP vs MySQL over NodeJS), to determine this performance ceiling for your app, given the database engine already in place. It is also valuable to compare database clients of different database engines within the same language (e.g. better-sqlite3 vs node:sqlite in NodeJS), to determine your app's performance ceiling given the programming language you are already skilled in.

Comparing both programming language and database engine at the same time, or worse, adding an ORM into the mix of what the benchmark is testing, is not helpful in determining the best stack. It just gives an unfair advantage to whichever of your chosen stacks happens to be most optimized for your benchmark scenario.

2. Compare the same underlying database client

Even when you have the same database engine and language, there may be multiple database clients. PHP has Mysqli and PDO for MySQL, NodeJS has several SQLite clients, including, most recently, a natively built-in one, etc.

Tests across ORMs should use the same underlying database client. Ideally, whichever one is deemed the most efficient across them.

When a certain ORM doesn't support a certain client, it's OK to still include that ORM when comparing, but the client used should be highlighted, and if there is a shared client that all ORMs can use, it should also be included as a "control" (i.e. how much improvement to expect if the ORMs without support were to add it).
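As a concrete illustration, here is a minimal sketch of what such a "control" could look like, assuming a NodeJS benchmark over SQLite with better-sqlite3 as the shared client. The bench() helper is hypothetical and just stands in for whatever timing harness the benchmark uses; the ORM tests would be configured to use the same better-sqlite3 connection.

// A baseline test using the shared raw client directly.
const Database = require('better-sqlite3');

const db = new Database('bench.db');
const selectUser = db.prepare('SELECT id, name FROM users WHERE id = ?');

// bench() is an assumed timing helper, not part of better-sqlite3.
bench('raw better-sqlite3 (control)', () => {
  selectUser.get(1); // the same query the ORMs under test are asked to produce
});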

3. Compare the same schema and data across ORMs

Pre-populating the database with random fake data is OK. Re-creating the schema and populating with (different) random fake data is also OK. However, this "seeding" should be done separately from the CRUD-related benchmarks for the different ORMs, and should only be re-done once all ORMs have been tested with the current data set. You can occasionally see benchmarks that combine those into the same benchmark per ORM under test.

This is not fair, because the generation of random fake data is a potential variable in the timing that is outside of what is being benchmarked - the ORM. Further, if you are using the ORM to do the insert, you may or may not be using the ORM's capabilities optimally, or realistically. Batch inserting should be its own test, ideally taking the data as some pre-prepared array of values.
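For instance, a fair setup might look roughly like this sketch (the generateFakeUsers(), bench(), resetSchema() and insertMany() names are made up for illustration, not any specific ORM's API): the fake data is generated once, outside of any timed region, and the exact same array is reused for every ORM under test.

// Seeding data is prepared once, before any timing starts.
const users = generateFakeUsers(10_000);

for (const orm of ormsUnderTest) {
  // Schema (re-)creation also happens outside the timed region.
  await orm.resetSchema();

  bench(`${orm.name}: batch insert`, async () => {
    // Only the insert itself is timed, using the pre-prepared array.
    await orm.insertMany('users', users);
  });
}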

4. Compare the same underlying SELECT queries

This is the biggest offender in pretty much all benchmarks I've seen. They just look at it from the user code perspective - give some filter input with the ORM's defaults, ensure the same output, compare times. Easy, right? Wrong.

The truth is that the biggest bottleneck in an application is the SQL queries themselves, not the ORM. So if you care about performance enough to check a benchmark, your choice of ORM shouldn't be based on the ORM's defaults, but on whether it allows you to create the most efficient queries you can.

Basically, the EXPLAIN output of the SELECT queries in all ORMs should be the same. It's OK to have minor syntax variations like different alias names, quoting or lack thereof, use of "AS" or lack thereof, semantically insignificant brackets around expressions and the like. But having one ORM end up with a full table scan, while another ends up with an indexed select, is not fair.

A fair benchmark should ideally disclose the raw SQL query that is ultimately created with all ORMs tested (not "roughly", but "semantically equivalent to"), so that adding a new ORM to the test or newer versions of the existing ORMs can be held to the same standard. Turn on any logging you can while researching (either at the ORM level, or at the database client level, or at the database server level... doesn't matter), to fine tune the query, and turn off this logging once the actual finalized benchmarking is running.
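One way to verify this, sketched below for a PostgreSQL setup where all ORMs share the pg client: capture each ORM's generated SELECT (via whatever logging you enabled while researching) and run it through EXPLAIN. getLoggedSql() is a stand-in for that capture step, and the sketch assumes the captured SQL contains literal values rather than placeholders.

const { Client } = require('pg');

const client = new Client(); // connection settings taken from the environment
await client.connect();

for (const orm of ormsUnderTest) {
  const sql = getLoggedSql(orm); // hypothetical helper: the SELECT the ORM actually generated
  const { rows } = await client.query(`EXPLAIN ${sql}`);
  console.log(orm.name, rows.map((r) => r['QUERY PLAN']).join('\n'));
}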

If an ORM doesn't let you in any way create a certain query without resorting to a query builder, or worse, raw SQL, then that query shouldn't be checked across the ORMs being benchmarked.

For example, some ORMs don't let you populate inner objects with joins. Instead, they always use a subquery. The choice between the two, however, is significant, and which one is more efficient depends on the scenario. There's also the option of fetching the related data as a JSON value. An ORM should let you use any of those options, and ideally default to what it can determine to be most efficient for the overall request (a tradeoff between query time and hydration time).

If an ORM does allow you to fine tune the generated query, but requires some non-default options to get there, then ok - add those options. If the ORM needs to go through a less happy path to finally generate that query, and/or takes longer to return the populated object, that's fair game in a benchmark of ORMs.

If the only query you can generate across all ORMs under test is a slow one, then it's fair to call this a missing feature in all of the ORMs that can't generate the more efficient query. In those cases, it may even be worth comparing a query builder version used with the less feature-rich ORM against the feature-rich ORM where the query builder is not required. It is still a bit less fair to do that though, so it should be explicitly called out in the benchmarking results that the query builder was required for that specific test.

5. Compare the same database transaction boundaries

A big bottleneck for many databases, one that database clients (and by extension ORMs) then have to deal with, is locks, which are in turn influenced by the use of database transactions. As soon as a transaction is committed, the database releases the locks it has on the data, which means further queries make the database re-obtain locks, which may cause a performance penalty that is not the ORM's fault. If no transaction control queries are present, each query is its own transaction.

A lot of benchmarks do not account for this, and just blindly issue a set of queries which may or may not be running in the same database transaction. Different ORMs deal with this differently, and have different defaults, but they do all offer you some way to explicitly set when a transaction starts and ends. When comparing ORMs, those transaction boundaries should also be taken into account, and made to match.

If one ORM doesn't let you fine tune a set of queries' transaction boundaries in a particular way that another ORM shows to be more efficient, it's valuable to call this out as a shortcoming of that ORM. In practice, though, what happens more often is that some ORMs simply do not insert transaction management queries, which means you have to explicitly run your test code in a transaction per iteration of the benchmark.
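As a sketch of what that can look like (the ORM method names here are illustrative, not any specific ORM's API): one ORM may expose its own transaction helper, while for an ORM that issues no transaction control queries itself, wrapping the same work in explicit BEGIN/COMMIT on the shared client keeps the boundaries identical.

// ORM with a built-in transaction helper (illustrative API).
bench('ormA: two writes, one transaction', async () => {
  await ormA.transaction(async (trx) => {
    await trx.insert('orders', order);
    await trx.insert('order_items', items);
  });
});

// ORM without transaction management: explicit BEGIN/COMMIT on the shared client.
bench('ormB: two writes, one transaction', async () => {
  await client.query('BEGIN');
  await ormB.insert('orders', order);
  await ormB.insert('order_items', items);
  await client.query('COMMIT');
});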

6. Do not run multiple ORMs concurrently or in parallel

In the real world, you're unlikely to be using multiple ORMs in the same application. The benchmark should reflect this. Do not feature the different ORMs running concurrently. If they have to be in the same process/application for the benchmark tool to work, run all tests for one ORM before moving to the next. If you can make a separate application project for each ORM under test (with every dependency other than the ORM being the same, of course), and run each application one after the other, that may be even better, but either way, let only one ORM at a time interact with the database.

The reason for this goes back to database locks - one ORM's interaction with the database could end up slowing the other one, or (depending on your test scenario) even cause deadlocks.

It's fair to run multiple concurrent/parallel HTTP requests to the same application, if they are using the same ORM. In fact, that is probably the best and most realistic way to do a benchmark, as opposed to running a synthetic benchmark in a single process. The higher RPS (Requests Per Second) you can achieve, the better.
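For example, a NodeJS benchmark could drive each ORM-backed application with a load testing tool such as autocannon. The sketch below uses what I believe is its programmatic API (options worth double-checking against its documentation); each ORM's application would be started and measured separately, one after the other.

const autocannon = require('autocannon');

// Run concurrent HTTP requests against the endpoint backed by the ORM under test.
const result = await autocannon({
  url: 'http://localhost:3000/items',
  connections: 100, // concurrent connections
  duration: 30,     // seconds
});

console.log('average RPS:', result.requests.average);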

7. Be consistent about serialized JSON object vs JSON string

A lot of ORMs provide JSON serialization controls and helpers, to serialize the result set to JSON. Even when they don't, they may have some fast path when the result set object is sent as JSON in the HTTP response. That fast path may in turn be faster than serializing the result to JSON explicitly/naively and passing the resulting string in the response.

Using either approach in your benchmark is fair, but whichever one you pick, be consistent about it across ORMs. If the difference between these ways of serialization is significant enough, and yet the output shape can be made the same, it would be worth having tests in both modes for any ORM that supports both modes.
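To make the two modes concrete, here is a rough sketch with Express-style handlers. The orm.findItems() and orm.serialize() calls are placeholders for whatever the ORM under test actually provides; the point is simply that whichever mode you pick should be the one used for every ORM.

// Mode A: explicit/naive serialization of the result object.
app.get('/items', async (req, res) => {
  const result = await orm.findItems(filter);
  res.type('application/json').send(JSON.stringify(result));
});

// Mode B: the ORM's own serialization helper (if it has one).
app.get('/items', async (req, res) => {
  const result = await orm.findItems(filter);
  res.type('application/json').send(orm.serialize(result));
});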

8. Minimize custom post-processing

It may seem unfair to have the resulting HTTP response shape between various ORMs be different, and to an extent, it is, depending on how vastly different the shape is. One ORM using { items: ... , count: ... } while another uses {data: ..., dataCount: ... } is probably not worth normalizing, but anything beyond that probably is. To counter such differences, some benchmarks try to further populate a custom object/array with the result set returned by the ORM.

However, whether that part is fair or not depends on the algorithms you use for that part, and whether they are needed in the first place. This means it is easy to construct a test optimized for one ORM over another. To keep things fair, it may be worth trying different algorithms for each ORM, and sticking with whichever one performed best for that ORM, to give it the best chance. Acknowledge the algorithms you tried in the results, and if the difference is significant enough, include both. That would help readers, as well as the ORM maintainers.

Examples of where things can go wrong include using

const response = { items: [] };
for (const row of rowset) {
  //post processing of row
  response.items.push(row);
}

vs

const response = { items: [] };
const l = rowset.length;
for (let i = 0; i < l; ++i) {
  const row = rowset[i];
  //post processing of row
  response.items.push(row);
}

Either form could be better or worse for an ORM, depending on whether it returns an array of objects, a generator, or some other iterable object (note that the indexed loop only applies when the result is an actual array, or at least array-like). Both of the forms above are likely to be more efficient than

const response = { items: [] };
rowset.forEach((row) => {
  //post processing of row
  response.items.push(row);
});

or

const response = { items: rowset.map((row) => {
  //post processing of row
  return row;
}) };

In all of the above examples, we assume row has a different shape that we're normalizing before returning an HTTP response, but if you can do that on the ORM level, you won't really need to do any sort of item post processing in the first place.
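As a sketch of that idea (the column-to-property mapping syntax below is hypothetical, not a real ORM API; in plain SQL it's simply column aliases like created_at AS createdAt):

// Rows come back already in the response shape, so no per-row post processing is needed.
const rowset = await orm.select('users', {
  columns: { id: 'id', createdAt: 'created_at' }, // hypothetical alias mapping
});
const response = { items: rowset };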

An example of where this comes up is with ORMs supporting custom types (objects) for scalar values. If one ORM doesn't have this feature, but another one has an option to ignore custom types in favor of the raw types, it's fair to compare the two while disabling custom types in the ORM that has that option.

If an ORM supports this feature, but has no option to ignore it, comparison with an ORM without this feature is unfair, because the custom type is extra work for the ORM that supports this. Instead, you can compare entities that don't use custom types in the first place.

If all ORMs you are comparing have custom scalar types, the custom type should ideally be the same class, or if that's not possible due to ORM interface constraints, it should at least use the most efficient serialization path the ORM can provide.

A test involving custom types should be isolated from the rest, as not everyone needs custom types in their code base.

9. Be consistent about batched vs unbatched writes

A lot of benchmarks about insert/update/delete iterate over items, and admittedly, if proper transaction boundaries and queries are in place (see rules 4 and 5), this can indeed result in the same set of queries, making this sort of a benchmark fair.

However, most ORMs also offer abstractions/helpers for doing insert/update/delete on deeply nested object hierarchies and collections. These end up producing a single query that affects multiple rows, or perhaps a set of queries, but a set that is smaller than the one produced when iterating over all items. This is a scenario a realistic application is likely to encounter. It is worth including tests both for single insert/update/delete and for batched insert/update/delete. Different ORMs may have one scenario optimized better than the other, and depending on which one you are likely to use more frequently in your application, it is fair to make your choice of ORM based on that.

Note however that the queries generated when batched writes are performed may differ. Similarly to selects, it's worth trying to fine tune those to be the same where possible, or acknowledging and disclosing the different set of queries generated where that's not possible.
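As a sketch of the two write styles as separate tests (method names are again illustrative; with rule 5 applied, the unbatched version runs inside one explicit transaction per iteration):

// Unbatched: one INSERT per row, all within a single transaction.
bench('per-row inserts', async () => {
  await orm.transaction(async (trx) => {
    for (const item of items) {
      await trx.insert('items', item);
    }
  });
});

// Batched: one multi-row INSERT (or a smaller set of queries).
bench('batched insert', async () => {
  await orm.insertMany('items', items);
});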

10. Notify ORM authors about your benchmark effort

Whether it shows the ORM in a good or a bad light, authors of maintained ORMs will care about your benchmark... if it is fair. The fairer and more comprehensive your benchmark already is by the time you reach out to maintainers, the more useful feedback you are likely to get from them about how you can further optimize their ORM's test code, or even better, you may cause them to optimize for some of your scenarios in their newer releases.

It's only fair to give them a chance to present their ORM in the best light possible in front of viewers of your benchmark results.
