Your Database Hates Your Strings

Back in Grad School, I dimly remember sitting in a conversation about crafting a fundamentals of computer engineering class for incoming computer engineering freshmen to, presumably, scare the living hell out of them.  The lofty goal was to introduce a basic overview of computer architecture internals (registers, stacks, heaps, accessing RAM, interrupts, etc) and if the student had not fled to computer science, a different engineering discipline, or out of engineering entirely, the student was a keeper.

Also, I was, at the time, utterly convinced that understanding how a computer worked made dealing with pointers and algorithms simpler.  Computer organization was important!  We're not talking about fancy loop unrolling algorithms or branch prediction or anything!  We're talking about registers.  Of course, I learned to program using, among other things, the MASM, so what the hell do I know.*

Now a days, we teach fundamentals of programming with Java where there are fewer pointers and more garbage collection.

BUT!  Had you, gentle reader, taken that fundamentals of computer engineering class you would have learned a truism of computer architecture and organization:

Computers Hate Strings

Computers have no idea what to do with strings.  They are these things crammed into memory or into disk somewhere with some beginning and some ending iterated over in 8 bit chunks (or, you know, a whole hell of alot more for UTF-8) until running into some kind of NULL pointer.  And then it ends?  Or we go flying off the end into heap exceptions?  Maybe?

Computers only know how to deal with numbers.  Sure, floats are their own special kind of hell but at the end of the day, a computer can deal with a float because floats are chunks of integers and it has its own special chunk of THING to deal with floats (the floating point unit).  Integers are awesome.  A computer can deal with integers all day long, and very very very very very fast.  A computer can just pop two numbers into a register and go YUP or NOPE when doing a sort.  But iterating through long lists of pointer data to find algorithmic matches is a much slower proposition.

Because somewhere deep inside your computer's little brain is the original core of the original computers who also dealt entirely with numbers.  It don't do words.  You need a human for words.

So anyway, to databases.  An RDBMS is an engine on a computer to slurp pages of somewhat ordered data off a hard disk and maintain secondary tables (called indexes) to push data through a B-Tree algorithm to find data quickly.  The better the index, the closer the data is to pure integer data, the faster that algorithm works its magic.  The worse the index, the more stringy that data is, the worst the query.  

In a worst case, take a query fragment like this: LIKE '%DATA%'

A bit of query like this makes the database's poor head explode.  "Full table scan ahead!" it says.  Because when it says well, I don't have a number, and what is LIKE SOME NUMBERS and then some text AND LIKE SOME MORE NUMBERS the database just collapses in a heap of slag.   This results in underperforming queries.  Databases don't know how to do lexical search.*

The database engine wants to help.  It wants to help you.  But you need to help it.  It's a partnership!  A computational hug! And what the database wants are numbers.  It doesn't want any of your goddamn string data.  And worse, it doesn't want any of your goddamn variable string data.  If you gotta have string data, fine.  Be of a fixed size.  Use a CHAR field.  But oh my god we wave our hands around in the air and just die if we have no idea how long our data is and now we have to iterate over that pointer in memory every time to figure out where the end of the string is because WE DO NOT KNOW and now we have to match it with a mate and oh I am dying. 

B-Trees are spiffy algorithms that, with a properly maintained indexes and against fixed width data, can ensure a nice O(log2) running time for most SELECT statements against an RDBMS.  This means that a table with 100M rows has an average of 26 lookups to find the right data.  This is fast.  This is amazing. Databases will do this for you if the data in tables -- at least data being queried -- is mostly numeric data.  No sweat.

This is a little rantier than I would like but here's the core nugget: we have invented NoSQL partially to solve a non-problem and a data architecture problem, which is we don't know how to craft tables where the main indexed fields are SMALLINT, INTs or LONGs and then go nuts about how they are slow and we need to install, I don't know, CouchDB to serve a form.   There are seriously excellent uses for NoSQL databases -- especially in analytics of enormous bulk or non-uniform data and I can think of a dozen uses --  but for most things the problem is in the schema constructions where the main data are VARCHARs instead of INTs.  Databases don't like strings.  Computers don't like strings.

We can do big data if we just remember that maxim.

Thinking about schemas:

  • Use integer data if possible in query parameters.  
  • One may need to abstract out string data into lookup tables to force an integer lookup.  But lookup tables can be cached and joined in memory of the application.  Or not.  Caching is a friend and 100 little computers with 2G RAM > 1 computer with 64G RAM.  
  • There are plenty of times when string data cannot be avoided, but from a disk and RAM standpoint, fixed width CHAR > VARCHAR when it comes to indexing and searching data.
  • And if you're querying string data, index the data.  

I'm not convinced modern RDBMS systems are slow.  I am convinced most developers think in strings and computers, especially databases, only think in the colors of numbers.  Thinking in numbers makes computers go zooomy zoom zoom fast.  

  • I know how to use OllyDbg, that's what I know.
  • Although we can get past the string problem.  Stuff like Lucene knows all about lexical search.   Here is where we get away from RDBMSes.