ORM - the Killer of Scalability

I'm that jerk ready with a presentation to explain why ORM sucks and you should never use it.

Around 2004-05, software engineering was consumed with design pattern madness - not entirely unlike space madness but with code.  Design patterns aren't in themselves terrible thing -- algorithmic software tends to follow similar patterns and understanding these creates a decent engineering toolbox.  

Design patterns were categorized.  They were named and numbered.  Books were published.  Interview questions devised to show off who could memorize the most patterns and recite them at will.  Java people decided design patterns were the WAY and the LIGHT and shoved as many into their codebase as possible.

During this period, a book called Patterns of Enterprise Architecture by Martin Fowler (full disclosure: I have a copy) was published and contained a list of interesting design patterns describing the integrations and communications between systems.  Among them was the Object Relational Model.  This posited that rows in database tables were objects in a pure OOD/OOA sense. Code could work on them like objects and generate SQL to reflect the interactions between these objects.  Developers did not need to work directly with SQL to the database, a process that was inherently neither patterny nor objecty.  Developers could instead work in levels of abstraction and let the ORM engine do its job of dealing with the database's personal issues directly.  

There are two views of data in a RBDMS.  Straight SQL posits that databases are repositories of information that answers questions about data.  ORM posits that databases are repositories of information that represents objects in the system.   ORMs aren't terrible from a purely abstract sense except for the core assumption that a database row == a serialized object, and it isn't.  Computer scientists give this fallacy the fancy name of Object Relational Impedance Mismatch.

Not to be deterred by something minor like "database data is not object oriented," Hibernate, enemy of all mankind, was born.  A giant implacable humorless ORM with complexity layered on complexity, hibernate became a standard tool in the Java developer toolbox.  

Developers ran screaming from hibernate and fled to Ruby on Rails which re-implemented hibernate in the cuddly and user friendly ActiveRecord.  Between ActiveRecord and ruby's inherently bad garage collection algorithm (finally getting some love in ruby 2!) the entire Internet came to a crawl because ORMs, if they are anything, are slow.  That's the great tradeoff -- object manipulation of data vs performance.

ORM turns accessing database into an epic struggle of DBA vs. Developer.  Some Sin greatest hits:

  1. ORM wants to pull in objects so it pulls in entire rows with SELECT * unless explicitly told otherwise causing enormous time and network overheads.
  2. ORM wants objects to work with objects, spawning JOINS and often JOINS with JOINS that do JOINS.
  3. ORM hides the generated SQL code making it difficult to optimize.
  4. ORMs love VARCHARs.  Love them.  Everything is a default VARCHAR(255) field in an ORM.
  5. Lack of flexibility when needing to shard by data.
  6. Lack of flexibility when implementing read-write splitting.
  7. Developers go and generate schemas in code which are often interpreted in a less than optimized data type in the database and almost always miss critical INDEXes.  Default datatypes are rarely the correct datatypes for the job.
  8. Using an ORM does not negate the need to pull slow queries, run them through an EXPLAIN plan, and find the missing indexes.  Yet ORMs hide the SQL leaving the slow queries as a nasty surprise.
  9. Hand written queries almost always run faster than generated ones.  Even simple ones.
  10. In ActiveRecord and HIbernate (at least) giant layers of abstraction and reflection and metaprogramming layered between the code and the database serve as a tangle of execution designed to do nothing but make things slow.  In Java this isn't noticeable but in an interpreted language...

Bad SQL queries who pull back large, unnecessary blobs of data on every read or do not employ indexes are the main source of slow web code.   ORMs generate bad SQL queries.  OED.

Yes, the argument for an ORM is that one can learn the ins and outs of the ORM's DSL and, with enough practice and finesse, can get the ORM to generate decent queries.  But one can spend time learning the DSL or just learn to write standard SQL queries.  

The Python and Node.JS communities have eschewed ORM for the most part.  Python's SQLAlchemy has ORM as an optional add-on -- if it gets used at all.  It's mostly a Data Mapper.  Node.j's Sequelize is already going down the ActiveRecord route -- it has many of the same sins.  Sad, as it's a system designed to kill Node's performance.

The biggest source of poor performance in a system are the mystery queries, some often enormous, generated by ORM metaprogramming magic.  The best way to speed up code is to ditch the ORM and write the queries directly.  Even stodgy ruby with it's garbage collection issues can get enormous benefits simply by moving away from ActiveRecord to a data mapper.  

Simply to sum up, friends don't let friends write easy but slow code.

Of course, if moving to a data mapper or stored procedures doesn't speed things up, then we talk about read-write splitting... but that is another article for a different day.