Here's how to decide if you should use a traditional, boring transactional database or go with a spiffy new NoSQL system:
If you lose that data, someone sues.
In particular, someone sues you.
... or something waaaaaay more interesting happens, like you fail an audit, you lose accreditation, you lose customers over lost data, your company goes under from customer service fallout, you know, that sort of thing.
Yeah, you're right, JOINs suck and JOINs are slow, and writing to databases in an assured and atomic manner is much slower than caching data in RAM until the system can batch-write to disk at some later date. But hey, you know what's really super fast and doesn't do JOINs and doesn't ever have to write to disk? Piping customer data to /dev/null. That's damn fast, and a web app using that as a data store is going to have fabulous response times! It will never row or table lock. Oh, will it scale! Cheaply! And it is every bit as good and reliable as storing customer data in RAM when the system takes some sort of weird blip.
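To belabor the joke: here's a tongue-in-cheek sketch of the /dev/null "data store" in Python. The record text is made up; the point is that writes are instant because nothing is ever kept.

```python
import os

# The world's fastest data store: every write succeeds immediately,
# because /dev/null (os.devnull on any platform) discards everything.
with open(os.devnull, "w") as datastore:
    datastore.write("precious customer record\n")  # "persisted" in record time

# Reads, of course, return nothing -- the data is gone forever.
with open(os.devnull) as datastore:
    data = datastore.read()

print(repr(data))  # ''
```

Fabulous response times, zero durability.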
Contrary to many breathless tech blogs and most of Twitter, regular boring old transactional databases powered by SQL92 are going nowhere. ACID (atomicity, consistency, isolation, durability) algorithmically guarantees that the set of operations going into and out of a database is consistent and reliable. Following ACID, what goes in there gets there, and what comes out of there is what you expect. Yay for data!
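A minimal sketch of the "A" in ACID, using Python's built-in sqlite3. The accounts table and the mid-transfer failure are invented for illustration; the behavior -- a half-finished transaction rolls back as if it never happened -- is the real guarantee.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")
conn.commit()

try:
    # The connection context manager opens a transaction:
    # commit on success, rollback on any exception.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - 50 "
                     "WHERE name = 'alice'")
        # Simulate a crash mid-transfer: this statement fails...
        conn.execute("INSERT INTO no_such_table VALUES (1)")
except sqlite3.OperationalError:
    pass  # ...and the whole transaction is rolled back.

# Atomicity: alice was never debited, because the transfer never completed.
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(balances)  # {'alice': 100, 'bob': 0}
```

What goes in there gets there, all of it or none of it.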
There's a time and a place to be exciting and bleeding edge, and there's a time and a place to be conservative. Handling personally identifiable information is an awesome time to be old and stodgy and boring. Financial data is when we reach for the most reliable platform there is. There's stuff we want to ensure gets written every time. There's a time for boringness.
And yeah, old boring ACID transactional databases do scale. It takes work and planning. It means understanding the database's strengths and weaknesses and digging into the engine of choice. Facebook does it. So does YouTube.
I think NoSQL databases are neat -- I'm a huge Cassandra fan and a Solr/Lucene fan* -- and they have massively important use cases. But as with a TSQL database, it's important to understand the tool and to know when it is okay to lose data in a massively distributed system and when it is not.
And guess what? The number of use cases for NoSQL is small -- growing, but still small. They are the "you will not be sued if you lose a machine's worth of cached RAM" sets of data. The use case for TSQL, however, is "any time you have data of any sort." So let's start there and work through the massive scaling issues as they arise, on a case-by-case basis.
For Project Butterfly's data, we will put it in a nice transactional database where the customer data will, at the least, get written to disk at the time of INSERT and we won't lose any customers on the floor.
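As a sketch of what "written to disk at the time of INSERT" buys you, here's the durability half of the story in sqlite3. The customers table and the "Ada" row are invented; Project Butterfly's actual schema is not specified in this post.

```python
import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "butterfly.db")

conn = sqlite3.connect(path)
conn.execute("PRAGMA synchronous = FULL")  # fsync on commit: bytes hit the platter
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO customers (name) VALUES ('Ada')")
conn.commit()  # the durability point: the row is on disk before we move on
conn.close()

# Reopen the file from disk -- the customer survived, which is more than
# you can say for a row cached in RAM when the weird blip hits.
conn2 = sqlite3.connect(path)
rows = conn2.execute("SELECT name FROM customers").fetchall()
print(rows)  # [('Ada',)]
```

No customers on the floor.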
Anyway, I didn't invent the "send your data to /dev/null" thing. It came from this spiffy video. It's just that every time I read another YAAAAAAAAAAAAY NOSQL blog post I think about it and laugh.
* I have already invoked MongoDB once. If I were going to build a cloud-based analytics engine on non-uniform data, I'd probably use Cassandra instead of Hadoop/Flume/Accumulo, but if I were building something smaller, like a pure web log analytics engine, then Mongo might be the right tool.