NoSQL's Time and Place

I meant to put together a quick little ERD for Project Butterfly last night and blog about it today so the application can get built, but Harold Ramis died and now I am morally obligated to watch Stripes and Ghostbusters instead.

Anyway, contrary to popular opinion, I don't hate NoSQL databases, but I think this is a technology that has a time and a place.  It's less structured than a relational database (typically no transactions, no JOINs, no foreign key relationships) but more structured than a traditional key-value cache (memcached, Redis).  It's a case of the right tool for the right job.*
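
To make the "structure" part concrete, here is a minimal sketch of what the relational side buys you, using Python's built-in sqlite3 module as a stand-in for a real RDBMS.  The customers and invoices tables and the kv dictionary are made up for illustration; they are not the Project Butterfly schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE invoices ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " cents INTEGER NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Egon')")
conn.commit()

# Transaction: both inserts land or neither does.
fk_rejected = False
try:
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO invoices (customer_id, cents) VALUES (1, 500)")
        # Customer 99 does not exist, so the foreign key rejects this row.
        conn.execute("INSERT INTO invoices (customer_id, cents) VALUES (99, 700)")
except sqlite3.IntegrityError:
    fk_rejected = True

count_after_rollback = conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]

# A valid invoice, then a JOIN back to the customer it belongs to.
with conn:
    conn.execute("INSERT INTO invoices (customer_id, cents) VALUES (1, 500)")
joined = conn.execute(
    "SELECT c.name, i.cents FROM invoices i JOIN customers c ON c.id = i.customer_id"
).fetchall()

# A plain key-value store (a dict here) takes whatever it is handed.
kv = {}
kv["invoice:2"] = {"customer_id": 99, "cents": 700}  # nothing checks that customer 99 exists
```

The failed transaction leaves the invoices table empty, while the dictionary happily keeps an invoice pointing at a customer that was never created.  Whether that matters is the whole question.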

The dividing point for me is this: if the data in the system corrupts, do I care?  When I think about systems as a whole, healthy unit, here's how I go about picking my tool:

Relational SQL (transactional RDBMS):

  • Personally identifiable information (user records)
  • Billing transactions
  • Billing information
  • Inventory and SKU-based systems
  • Legally liable information
  • Medical information
  • Data covered by PCI or HIPAA with strict privacy controls
  • Data where losing a single record would be a company-ending event

NoSQL/Caching Systems:

  • Sessions
  • Caching data fragments
  • Tracking clicks for marketing
  • Leaderboards
  • Non-permanent messages 
  • Unstructured data on its way to becoming structured data (e.g. Hadoop batch processing systems)
  • Comment systems
  • Transitory system state information
  • Real time analytics

These two buckets are fundamentally different.  

Here's a great example: Reddit.  The very important pieces of information -- user information, stories, etc. -- are stored in PostgreSQL.  Votes are stored in Cassandra.  Votes are transitory and per-user, and they need to be stored in a distributed fashion at high volume.

Also, no one says the system you develop can't have both in the architecture.  
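
A rough sketch of what "both" can look like in one code path, again in Python.  The LossyCache class, the billing table, and the checkout function are hypothetical names invented for this example, and the in-memory cache is only a stand-in for whatever cache or NoSQL tier you would really run.

```python
import sqlite3
import time


class LossyCache:
    """In-memory stand-in for a cache tier: entries expire and vanish on restart."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self._data = {}

    def set(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        self._data[key] = (now + self.ttl, value)

    def get(self, key, now=None):
        now = time.monotonic() if now is None else now
        entry = self._data.get(key)
        if entry is None or entry[0] <= now:
            self._data.pop(key, None)  # expired or missing: treat as gone
            return None
        return entry[1]


db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE billing ("
    " id INTEGER PRIMARY KEY, customer TEXT NOT NULL, cents INTEGER NOT NULL)"
)
sessions = LossyCache(ttl_seconds=1800)


def checkout(session_id, customer, cents):
    # Data I care about goes to the relational store inside a transaction.
    with db:
        db.execute("INSERT INTO billing (customer, cents) VALUES (?, ?)", (customer, cents))
    # Data I can afford to lose goes to the cache.
    sessions.set(session_id, {"customer": customer, "last_action": "checkout"})


checkout("sess-1", "Egon", 500)
```

If the cache falls over, the user logs in again.  If the billing row disappears, someone calls a lawyer.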

If I were going to write a real-time analytics package for, say, scientific computing (science data has really interesting issues with RDBMSs, but that's a story for another day), with the display written in my new favorite toy, Processing, and knowing full well that the data only needs to persist for the duration of the display and that losing a few data points is not a huge deal, I would definitely pull a NoSQL store off the shelf.  I'd write batch jobs to get the unstructured data into columnar format and write a Java plugin for Processing to pull the data into something Processing can use.  (Processing is a JVM language for art and data visualization.)  But Project Butterfly, which is largely a customer transactional system with customer data, billing data, and inventory management, needs to go into a traditional transactional RDBMS.
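
For a feel of the batch-job half of that idea, here is a small Python sketch that turns loosely formatted readings into columns and simply drops anything it can't parse.  The JSON-lines input, the field names, and the to_columns function are all assumptions made for the example; a real job would depend on the instrument and the store.

```python
import json


def to_columns(lines, fields):
    """Turn raw JSON lines into one list per field, skipping bad records."""
    columns = {name: [] for name in fields}
    skipped = 0
    for line in lines:
        try:
            record = json.loads(line)
            row = [float(record[name]) for name in fields]
        except (ValueError, KeyError, TypeError):
            # Malformed or incomplete reading: losing it is acceptable here.
            skipped += 1
            continue
        for name, value in zip(fields, row):
            columns[name].append(value)
    return columns, skipped


raw = [
    '{"t": 0.0, "temp": 21.5}',
    "garbage",                    # not JSON at all
    '{"t": 1.0}',                 # missing the temp field
    '{"t": 2.0, "temp": 22.1}',
]
columns, skipped = to_columns(raw, ["t", "temp"])
```

Two of the four readings survive and two are thrown away without anyone losing sleep, which is exactly the kind of loss a billing system can't tolerate.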

I don't hate the toys.  I just want the toys to be used in the right place. 

* NoSQL is so loosely defined that Redis and memcached are sometimes considered NoSQL databases.  They aren't data stores.  They are in-memory key-value caches, and extremely useful ones at that.  If you turn them off, you should have no expectation that the data persists.