In some ways, this tidal wave of data is the pollution problem of the information age. All information processes produce it. If we ignore the problem, it will stay around forever. And the only way to successfully deal with it is to pass laws regulating its generation, use and eventual disposal. - Bruce Schneier
Even the smallest application generates quantitative and measurable data - log data, usage data, number of times downloaded, number of times executed. A normal web application (even this blog) has reams of data powering it: the blog posts, blog settings, log files, statistics, reports.
The entire point of a computer is to store and vomit back up data on command. It ultimately doesn't do much else except for the occasional computation on said data which it then also stores and vomits back up. Computers create data, consume data, and are ultimately all about shuffling small piles of data around to create bigger and bigger piles of data until the world is consumed by terabytes and petabytes of the stuff in huge mountains now called.... big data. Which computers then sift to find... even more data.*
Data can be good, it can be evil, but to a computer, it's just data.
When starting a new application, the approach to storing, reading, and manipulating the data is the single most important decision one can make -- more important than the OS or the tools used to build it. One can write the fastest low-level C program on Earth, but if it sits around waiting on a socket for a 60-second query to execute, it's just going to get very good at waiting on that query. And, by the same token, one can write the most convoluted Ruby application, but if all the queries are sub-millisecond it will still support respectable user concurrency.
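The back-of-envelope math here is worth making explicit. A minimal sketch (the numbers are illustrative, not benchmarks) of how per-query latency caps what a single blocking worker can serve:

```python
# How query latency caps throughput for one worker that blocks
# on a single query at a time. Illustrative numbers only.

def requests_per_minute(query_seconds: float) -> float:
    """Upper bound on requests per minute for one blocking worker."""
    return 60.0 / query_seconds

slow = requests_per_minute(60.0)   # the 60-second query: 1 request/minute
fast = requests_per_minute(0.001)  # a sub-millisecond query: ~60,000/minute

print(slow, fast)
```

No amount of fast application code closes a gap that wide; only the data layer can.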
It is not just about access speed (though that is hugely important). An application's data architecture is the holistic approach to how data is created, used, stored, and retrieved by the application and the application's management systems. The right approach is the difference between a successful application and one that fails.
We ask ourselves some questions:
- What kind of data is this? Is it customer data? What is its shape?
- How is the data queried? What kind of answers does the data provide?
- How much data are we expecting in a reasonable time frame? Megs? Gigs? Petabytes?
- What is the tolerance for data loss?
- What generates logs and how are they collected?
- What intelligence can we derive from the data we create?
- What sort of reports will be run on this data?
Back to the example: for Project Butterfly, even in its simplest possible incarnation -- just handing out reservations for a convention -- we already see three different types of data:
- Customer transactional data describing customers, their attendance choices, and the probability of attendance;
- Web and application log data describing the usage patterns of the application itself;
- Reporting data sent back to convention planners so they can properly size for the expected population of convention-goers.
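To make the three flavors concrete, here is a minimal sketch of what each shape might look like. The class and field names are hypothetical, not the real Project Butterfly schema:

```python
from dataclasses import dataclass

# Hypothetical record shapes for the three data flavors above.
# Names and fields are illustrative assumptions, not the real schema.

@dataclass
class Reservation:
    """Customer transactional data: who is coming, and how likely."""
    customer_id: str
    session: str
    attendance_probability: float

@dataclass
class AccessLogEntry:
    """Web/application log data: how the application itself is used."""
    timestamp: str
    path: str
    status: int

@dataclass
class AttendanceReport:
    """Reporting data for planners: expected headcount per session."""
    session: str
    expected_attendees: int
```

Note how differently each would be read and written: reservations are updated one row at a time, log entries are append-only, and reports are aggregates computed over the other two. That difference is exactly why one store rarely fits all three.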
That is two different kinds of business intelligence out of three flavors of data usage -- and the system hasn't even taken a credit card yet. We have customer intelligence and system intelligence, and both are critical to the overall health and success of the project.
In the next couple of posts we'll deal with these two classes of data and choose data stores for them suited to their core architecture and the needs of the running system.
- With, granted, some pretty sweet tools.