Big Systems and Logging

In distributed systems, the systems need to log, and they need to log everything.  

This seems like an obvious statement but it's not.  In the past, systems rarely logged beyond exception logging and system logging.  They didn't need to; if a website ran on a single monolithic server, and the website experienced errors for the customer, the developer could log into the server, examine the exception logs and take action.  If the website needed to be patched the developer could build and roll out a patch to the server.  Done and done.

If the system has 500 backend servers or more all running the web code and a handful of them are experiencing errors, it's impractical to log into all the systems and examine them one by one (1) so we need an entire logging architecture.  And logging architecture leads us right into the wonderful new world of Hadoop and ElasticSearch.  

But it's not just about getting errors.  It's about getting analytics.  It's about getting answers about the system as it runs.  We want to know things like:

  • What is the rate (transactions/second) of someone buying a widget at any given time?
  • What is the flow of my users from the front page all the way down through the microservices and back?
  • What is the rate of my system errors?
  • What is the rate of my system successes?
  • What is the rate of my system errors in proportion to the number of overall system successes?
  • Are people trying to abuse or hack my system?  How?  In what manner?
  • What percentage of a given action completes versus fails?
  • Can I get an aggregate view of the system working in near real time?
  • What questions can I answer about how my system is used?

Yes, it's possible to pull this kind of information out of OLAP databases with ETL (extract, transform, load) jobs, but nowadays we can do all sorts of incredibly sweet and interesting things with logs.

Before digging into logging architectures and building something basic, here are some guidelines for good logging.  There is a huge gulf between no logging, terrible logging, and good logging.  Most of the time we're at no logging and then add in bad logging.  Good logging is logging that leads to a good overall picture of the system.

Guidelines

  • Always prefix a logline with a time and date stamp.
  • Keep an action to a single logline.  Mining multi-logline messages may be fine for grep but it makes life difficult when hundreds of loglines appear per second.
  • Log whenever there is an error but also log whenever there is a successful action.
  • Put in the following, if possible, into a single log line:
    • The name of the process
    • The name of the overall functionality in this process
    • SUCCESS/FAIL/ERROR
    • the REST call action taken
    • user identifiable information
    • the source IP

For example:

2013-05-07 19:40:00 microservice.js:create_account SUCCESS action=create status=ok user_id=awesome_dude src_ip=111.222.333.444 return_infos=0

Here I could parse on the date, the service, the call, the action, the status, the user_id, the source IP, or any combination of the above.
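For example, a tiny Node.JS helper can emit lines in that format.  This is a minimal sketch, not any particular logging library -- the function name and fields are made up for illustration:

// logline.js -- hypothetical single-line logger in the format above
function logline(service, func, status, fields) {
  // "YYYY-MM-DD HH:MM:SS" timestamp prefix
  var ts = new Date().toISOString().replace('T', ' ').substring(0, 19);
  // flatten key=value pairs into one line so they are easy to split later
  var kv = Object.keys(fields).map(function (k) {
    return k + '=' + fields[k];
  }).join(' ');
  console.log(ts + ' ' + service + ':' + func + ' ' + status + ' ' + kv);
}

logline('microservice.js', 'create_account', 'SUCCESS', {
  action: 'create', status: 'ok', user_id: 'awesome_dude',
  src_ip: '111.222.333.444', return_infos: 0
});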

There's also a more advanced trick: give the user a GUID when they enter the system and pass that GUID through all of their calls, so a single search on that GUID pulls up all actions by that user across all components through the entire system.  Very useful when mining data.  
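In Express, that trick can be a few lines of middleware.  Again, a rough sketch with assumed names -- the X-Request-Id header and the request_id property are conventions I'm making up here, not a standard:

// request-id.js -- hypothetical middleware for the GUID trick
var crypto = require('crypto');

module.exports = function requestId(req, res, next) {
  // reuse an ID passed in by an upstream service, or mint a new one
  req.request_id = req.headers['x-request-id'] ||
    crypto.randomBytes(16).toString('hex');
  // echo it back so downstream services and clients can pass it along
  res.setHeader('X-Request-Id', req.request_id);
  next();
};

// usage: app.use(require('./request-id')); then include req.request_id
// in every logline for that request.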

There are also some super cool gotchas:

  • This much logging will slowly kill system performance and lead to IOPS-based disasters
  • We need a system to deal with all the logging because this cannot go into a database -- it will kill a database. Any database.  Even the big Oracle data warehouse things database.
  • Disk needs to be monitored and alerted.
  • Except for log4j, most stacks need to roll their own loggers.  For example, there are no good loggers available right now for node.js.  There are acceptable ones but none that can be considered good.

This is where web systems and big data meet.  There are arguments that web logging is not the best place to start working with big data, but it's what the system was designed to handle and the most obvious source of data to pump into one.

The next few articles will add a basic logging infrastructure to project butterfly -- probably, like everything else so far, enormous overkill for the needs of the system -- and look at some very baby big data options for doing data mining beyond the normal SQL systems.  

(1) Some of us have done this.  Do not do this.

 

Technology's Bro Problem

I have been busy pushing something large and unwieldy online.  Amazingly, it took up all my time and attention so there was a slight blogging drought.  When I'm less fried I may write about the experience a bit but probably not.

The New York Times today had an article called Technology's Man Problem.  Make with the clicky and read it.  The blog will still be here when you get back.

The article is fairly spot on except I disagree on a point: technology doesn't have a man problem.  There's no problem with grown, adult, mature men in technology.  Technology has a bro problem.  

Let's face it: the hot startup culture worships the young, white, male brogrammer.  We've fallen in love with the myth of the young kid fresh out of Harvard making a billion dollar startup based on the CMS he built with his friends who goes off to live a lavish Hollywood lifestyle atop his awesome company.  He's hot, he's awesome, he's smart, he drives a great car and has a stripper girlfriend, and he is absolutely under no circumstances a nerd.

When I was growing up, we worshipped nerds like Steve Wozniak and Richard Garriott -- guys who were many things but certainly not bros.  Raise your hand if you thought the Woz was awesome.  Bill Gates was King Nerd.  

And the field was woman friendly, as far as something based heavily on mathematics could be.  Further back in time, Ada Lovelace invented the first algorithm.  Six women programmed ENIAC and won the war.  Anita Borg!  Hell, Roberta Williams at Sierra wrote all my favorite games.  

Don't forget, the unbro-iest of them all:  a nice gay dude invented the foundation of Computer Science.

But now we have bros.  Computers used to be about mathematics, algorithms, and engineering.  Now it's about getting a million to download apps about boobs in 24 hours.  

I would argue that the core problem with startup culture and the plague of bros is that computers have stopped being about computer science and started being about marketing. Combining computers with love and money is a toxic situation.  How popular can we be?  How many eyeballs can we get?  How awesome does the Internet think I am?  Can I raise $5M in funding if I make this super cool widget everyone wants?  DO I HAVE MY OWN SUBREDDIT?  Internet will you come and love me pleaaase?

And well that attracts bros instead of nice quiet nerds.  Nerds are endearing.  Bros are not.  So the womens, they hemorrhage from computers and go to do something more rewarding, like professionally getting their teeth drilled, rather than hang out with bros.  Then we get the New York Times writing about the lack of women due to the bros and the bros get tons of attention and they get outraged and we have a circular problem that's like an ouroboros eating its own tail.  

It's a deep culture problem and I don't think we're going to go back to the 80s when computers were like physics and no one went into it because it was deeply technical.  Only thing I really have on offer is to make being a bro culturally unacceptable in the community and, with the Internet, I have no idea how to make that happen.  Social media will get bigger, not smaller, and will open more avenues to a bigger more aggressive frat boy who craves affirmation of the anonymous hordes.  Why be worshipped by millions when you can be worshipped by billions.  

Anyway *wave hands around wildly* it's a problem.  There are no quick solutions because between the anything goes atmosphere and computers being about love instead of algorithms, we're all kind of screwed.  

It's a good article, though, and has lots of good links.

 

Adding Caching to Project Butterfly in Node.JS

Since Project Butterfly is written entirely in Node.JS, it needs a Node.JS compatible memcached library.  We will use the Overclocked/MC library as it's full featured -- memcached clustering, consistent hashing, configurable options, and supports the full memcached API.  Also, it has nice documentation on its web site.  Documentation wins out over slick but undocumented features every time.

I haven't mentioned Node.JS programming technique much because of a lack of code samples, but like every other operation in Node.JS, getting, setting or incr/decr-ing a key requires a callback to process the result of the call.  Non-callback calls in Node.JS lead to blocking and reduce the efficiency of the scheduler. 1  It's a different technique than normal non-event-driven programming.
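As a rough sketch of what that callback style looks like with the MC library -- the signatures here are paraphrased from its docs, so treat them as assumptions and check the real documentation, and renderPage/buildFragment are hypothetical helpers:

// callback-style cache access with the Overclocked/MC library (sketch)
var mc = require('mc');
var cache = new mc.Client(['cache01:11211', 'cache02:11211']);

cache.connect(function () {
  cache.get('fe:front_page_fragment', function (err, response) {
    if (!err && response && response['fe:front_page_fragment']) {
      renderPage(response['fe:front_page_fragment']);      // cache hit
    } else {
      buildFragment(function (fragment) {                  // cache miss
        cache.set('fe:front_page_fragment', fragment,
          { flags: 0, exptime: 60 },                       // expire in 60s
          function (err, status) { renderPage(fragment); });
      });
    }
  });
});

Nothing in there blocks; every step hands its result to the next callback.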

Technically, it's fine to build one giant memcached cluster and let the front end and backend servers share it as long as the keys are namespaced.  But it's also fine to take the same pool of servers and logically separate them into two clusters in the applications' memcached configuration files.  This makes the clusters easier to monitor, easier to balance and easier to scale on demand if need be.  

Also, build systems out in pairs for the redundancy/failover/DR needs.

Project Butterfly with two MemcacheD cache clusters

I'm somewhat convinced LucidChart is going to come take my car for the amount that I've been abusing their free service.

Memcached Front End Cache

Holds...

  • Rendered page fragments
  • REST API call fragments
  • Convention pass counts/running counts of passes burned/fast checks on capacity
  • Session data (later)
  • Rate Limiting on Login/Failed Passwords (later)
  • Anything that could be pulled from the backend

Fragment caching is more a Rails thing than a Node.JS thing but caching partial responses from AJAX calls will help scalability.  It's a key/value cache and can be configured to hoover up the entire RAM allotment of a VM.  Use the memory.

Memcached Back End Cache

Holds...

  • Queries.
  • More queries.
  • Honestly, it should hold queries.
  • It should expire out queries after a reasonable amount of time and then cache queries.
  • Any solutions derived from those queries -- roll up data that might go into queries, say.
  • Also, queries.

Web applications have an 80% reads/20% writes profile against the database.  Memcached's job is to keep the code from hitting that database on the 80% reads so the database can deal with the writes.  It's up to the code to expire keys from memcached whenever it updates data stored in the cache, and then reload the cache from the database.  
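The write path, sketched out -- the db helper, the key name, and the cache.del signature are assumptions for illustration:

// update the database, then expire the cached query so the next
// read repopulates it (cache-aside invalidation sketch)
function updateUserEmail(db, cache, userId, email, done) {
  db.query('UPDATE users SET email = ? WHERE id = ?', [email, userId],
    function (err) {
      if (err) { return done(err); }
      // the next read misses the cache, hits the database once,
      // and re-caches the fresh row
      cache.del('be:user:' + userId, function () { done(null); });
    });
}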

If there's tons of data to keep in memory, yes, backend caches can grow to enormous size.

Note: this is all server-side data caching.  Adding it will give the code a great lift.  Yet even with two memcached clusters caching data, this doesn't cover asset caching, nginx compression and tuning, or CDNs.  It's about half the caching in a running system.

Honestly, cache until it hurts and then cache some more.  Memory is the fastest resource in a live running system.

  1. Cough callback hell cough

Memcached vs Redis?

Ah, the great question of our time.

Question: Which is better, memcached or redis?

Answer:  It depends on the nature of the project and its data.

At a basic level, they are both in-memory key/value lookup tables designed for high speed and high availability.  They both cluster.  They both add value to a large project by providing in-memory objects without needing to reference a data store.  They have comparable data access speeds.

Ultimately, though, past a single instance, they have different deployment and scaling architectures.  Memcached is a flat array of additive instances sharded by a hash value in the client.  Redis is a classic master-slave architecture which scales out to slaves of slaves.   A picture makes their two different core philosophies clear:

Memcached has a flat architecture while Redis uses Master-Slave and Slave of Slaves.

Making a technical and architectural decision is difficult, especially on a component as critical as caching.  Here are some of the pros and cons of each system:

Memcached Pros:

  • Low complexity
  • Simple to configure
  • Few command macros == simple to master
  • Atomic increment and decrement
  • Simple to cluster -- uses a hashing algorithm at the client to find keys in a cluster
  • Runs like a rock -- memcached requires a nuclear strike to fall over
  • Can withstand a member dying
  • Many years in production
  • Every programming language has a memcached library.  

Memcached Cons:

  • Doesn't do anything besides be an in-memory key/value store
  • Caches sharded by client do not scale across AWS zones
  • Unbalanced memcached clusters require a full system restart
  • Adding a member to the pool requires reconfiguring and rebooting the client
  • Seriously doesn't do anything besides be an in-memory key/value store

Redis Pros:

  • Stores data in a variety of formats: strings, lists, hashes, sets and sorted sets
  • Pipelining!  Multiple commands at once
  • Blocking reads -- will sit and wait until another process writes data to the cache
  • Mass insertion of data to prime a cache
  • Does pub/sub... if you want to do pub/sub through your cache
  • Partitions data across multiple redis instances
  • Can back data to disk

Redis Cons:

  • Super complex to configure -- requires consideration of data size to configure well
  • SENTINEL, the automated failover which promotes a slave to master, is perpetually on the redis unstable branch
  • Master-slave architecture means if the master wipes out, and SENTINEL doesn't work, the system is sad
  • Lots o' server administration for monitoring and partitioning and balancing... 

From a completely neutral position it's a push -- unless the plan is to do crazy things with the caching system like spread it across multiple availability zones, or use it as a publish/subscriber system instead of using ZeroMQ or RabbitMQ or Zookeeper, it comes down to religious arguments and the importance of maintaining that Redis master.  

For Project Butterfly, I will choose memcached because it's simple and it meets the needs of the project.  Even with features like rate limiting or keeping in-memory counts of used convention passes in real time, memcached works fine.  The project won't use any of Redis's advanced features.  Redis does neat things, but this project doesn't hit any of them.

My advice: if you need to cache database queries or REST round trips or numbers or page fragments, use memcached.  Most cached data types and caching needs are simple and straightforward.  This is 80% of all caching needs.  

If you need to do operations on cached datasets at once, or need to spread one enormous cache over a geographically challenged area, or read-write splitting against caching will enormously help performance and alleviate cache connection starvation, use redis.  

But where would I use something like redis?  Where would I need its extra features, where its featureset overwhelms its complexity?

  • Leaderboards

Redis can manage sets in memory, which gives it an advantage here over memcached.  Leaderboards are sets of items ordered just in time and served from memory to a page.  A database is too slow -- and it's all disposable data.  Here both the pipelining and the mapping of keys to sorted sets are powerful features (see the sketch after this list).

  • On the Fly Voting Systems

Again, pipelining writes to multiple items and ordering sets in memory, while a system reads them out and streams results -- websockets! -- in real time, makes implementing this feature simpler and more streamlined.

  • Page Clicks and Analytics

One can implement a page click and analytics engine on memcached backed by a database, but redis is really good at counting lists and sets of things.  Of all of redis's features, its ability to map keys to sorted sets is where it exceeds memcached -- and counting something like page clicks per set of pages, then summing those numbers into analytics that can be pumped via a worker into a bigger analytics engine, is one place where the redis choice is the right one.
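Back to the leaderboard case, since it's the clearest win: a minimal sketch on a Redis sorted set using the node redis client.  The key name and scores are made up for illustration:

// leaderboard on a sorted set (sketch)
var redis = require('redis');
var client = redis.createClient();

// add or update a player's score; ZADD keeps the set ordered by score
client.zadd('leaderboard', 4200, 'awesome_dude', function (err) {
  if (err) { return console.error(err); }
  // pull the top 10, highest score first, with scores attached
  client.zrevrange('leaderboard', 0, 9, 'WITHSCORES', function (err, top10) {
    console.log(top10); // [ 'awesome_dude', '4200', ... ]
    client.quit();
  });
});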

One last thing: regardless of the choice, a caching system is not a database.  Some folks out there want to replace their RDBMS with a cache.  Eventually, if that data has any worth, that data has to make its way via asynchronous jobs back to a real data store.  Man cannot live on caching alone.  A system needs caching and a database.  

Caching and Time Perception in Web Page Loads

When a visitor hits a site, the site has 250ms to load the page.  

Under 250ms, the page is golden.  The site feels snappy.  The browser runs smooth.  The site is healthy.  

Between 250ms - 1s, the site feels loaded.   Not overly so.  Users will not browse away in under a second.  It's not fantastic but it is workable.  

Between 1-4s, the site has user attrition.  Some will stay but a proportion of users will browse away, close the tab, complain the site is too slow, or otherwise leave.  The proportion increases as the site load approaches 4s.

After 4s, the user attrition rate climbs to 100%.  No one waits 4s for a page to load.

Get out a digital timer and try it.  Open up random pages and see how long it takes to load a page and the satisfaction of seeing it load.   Under 250ms, the page loads "immediately."  Over 4s the page is "sluggish, slow and dated."  
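Instead of a stopwatch, the browser will report these numbers itself through the Navigation Timing API.  A quick snippet to paste into the console after a page finishes loading (the variable names are mine):

// measure page load with the Navigation Timing API
var t = window.performance.timing;
var loadMs = t.loadEventEnd - t.navigationStart;   // full page load
var ttfbMs = t.responseStart - t.navigationStart;  // time to first byte
console.log('load: ' + loadMs + 'ms, first byte: ' + ttfbMs + 'ms');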

Everything in the path for loading a page causes latency on the page loads:

  • Loading a local asset from disk;
  • Executing complex logic;
  • Making a REST call;
  • Running a SELECT (any SELECT) on a database.

Anything that touches the disk causes latency.  Anything that touches the network causes latency.  Anything that causes code to execute causes latency.  In fact, anything that even touches the site causes latency.

The trick is this:

  • To load as much of the site into memory as possible;
  • To do as little work as possible.

So we cache in memory because memory is fast and cheap and disk is slow.  What do we cache?

Everything.

To wit:

  • Sessions (although project butterfly does not yet have sessions);
  • Results of SELECT queries;
  • Entire REST calls to reduce round-trip times;
  • Easy lookup data;
  • Assets (images, javascript and CSS);
  • Entire flat static web pages;
  • Anything that looks even remotely cacheable.

Taking ready-made data out of cache is faster than computing it, looking for it, going to disk for it, or asking for it.  In most production systems, 80% of the work is data reads and 20% is generating data.  Of those reads, ~99% are the same reads repeated ad nauseam.  We take the pre-generated answer and throw it into a Key-Value cache and pull it on request.  Why generate the data when the data is already there?

Big sites mean huge caches.  We can build caching into the system at the:

  • Web page/asset level
  • The web front end logic level
  • The backend application logic before the database level
  • The asynchronous backend worker level

Some of these can be shared.  

Anyway, this is all basic stuff.  There are two standard caching systems -- memcached and redis -- worth discussing.  These are both widely supported, with many available libraries in all languages.  There are some secondary ones like EHCache and Riak which are a bit more specialized.  

Datasets beyond what memcached or redis can handle quickly push us into the memory-only/NoSQL database discussion.  But next, we need to pick a standard for the site based on what will work best for it.

 

Finishing up Sprint #2: A Basic But Horizontally Scalable Thing

As a quick refresher: Project Butterfly is a design for a massively horizontally scalable convention management site.  So far I've:

  • Picked technology for the front and back (Javascript/Node.JS/MariaDB for MySQL)
  • Put together a basic architecture
  • Called out the need for an Administration app up front
  • Established a basic ERD with all of the one-to-one and one-to-many relationships
  • Broken the architecture into front end and service parts
  • Established a REST contract.

At this point we'd have to write code!  But because this is a thought experiment -- and mostly due to extreme time limitations -- I won't be actually writing the code.  In the imaginary universe where code is written, at the end of this sprint I would be delivering components to VMs.  By hand.  Likely in error prone tarballs.  Painful.  Time to add that to the backlog...

Even in simply talking through the first sprint the architecture has gone through a radical set of changes.

Here is what it looks like after fast iterations:

Converting a Fat App into a Scalable Architecture with Microservices

Reviewing what we have and what we know about the system, we add to our current running backlog.  New interesting pieces of the system have appeared!  This needs prioritization from the product owner to put these dots into a usable list for the next sprint.  A business PO would likely prioritize reporting and new flows for the front end website to drive feature development but! Screw usability! 

Instead, this will be about making this system huge and picking off the first interesting technological challenge -- interfacing with a third party system.

The bolded dots are the dots prioritized for the next sprint.  Everything else goes into the backlog.  I'm looking at you, automated deployment.

Things for the backlog

  • Creating a nice flow for the website
  • Adding prices to options
  • Creating reports
  • Creating accounts
  • Session Management
  • Automated configuration
  • Automated deployment
  • Load Balancing
  • Reporting database
  • Interfacing with local lodging options
  • Add Caching
  • Doing Interesting Things with Logs

ORM - the Killer of Scalability

I'm that jerk ready with a presentation to explain why ORM sucks and you should never use it.

Around 2004-05, software engineering was consumed with design pattern madness - not entirely unlike space madness but with code.  Design patterns aren't in themselves a terrible thing -- algorithmic software tends to follow similar patterns and understanding these creates a decent engineering toolbox.  

Design patterns were categorized.  They were named and numbered.  Books were published.  Interview questions devised to show off who could memorize the most patterns and recite them at will.  Java people decided design patterns were the WAY and the LIGHT and shoved as many into their codebase as possible.

During this period, a book called Patterns of Enterprise Application Architecture by Martin Fowler (full disclosure: I have a copy) was published and contained a list of interesting design patterns describing the integrations and communications between systems.  Among them was Object-Relational Mapping.  This posited that rows in database tables were objects in a pure OOD/OOA sense. Code could work on them like objects and generate SQL to reflect the interactions between these objects.  Developers did not need to work directly with SQL to the database, a process that was inherently neither patterny nor objecty.  Developers could instead work in levels of abstraction and let the ORM engine do its job of dealing with the database's personal issues directly.  

There are two views of data in an RDBMS.  Straight SQL posits that databases are repositories of information that answer questions about data.  ORM posits that databases are repositories of information that represent objects in the system.   ORMs aren't terrible from a purely abstract sense except for the core assumption that a database row == a serialized object, and it isn't.  Computer scientists give this fallacy the fancy name of Object-Relational Impedance Mismatch.

Not to be deterred by something minor like "database data is not object oriented," Hibernate, enemy of all mankind, was born.  A giant implacable humorless ORM with complexity layered on complexity, hibernate became a standard tool in the Java developer toolbox.  

Developers ran screaming from hibernate and fled to Ruby on Rails, which re-implemented hibernate in the cuddly and user friendly ActiveRecord.  Between ActiveRecord and ruby's inherently bad garbage collection algorithm (finally getting some love in ruby 2!) the entire Internet came to a crawl, because ORMs, if they are anything, are slow.  That's the great tradeoff -- object manipulation of data vs performance.

ORM turns accessing the database into an epic struggle of DBA vs. Developer.  Some of its greatest sins:

  1. ORM wants to pull in objects so it pulls in entire rows with SELECT * unless explicitly told otherwise causing enormous time and network overheads.
  2. ORM wants objects to work with objects, spawning JOINS and often JOINS with JOINS that do JOINS.
  3. ORM hides the generated SQL code making it difficult to optimize.
  4. ORMs love VARCHARs.  Love them.  Everything is a default VARCHAR(255) field in an ORM.
  5. Lack of flexibility when needing to shard by data.
  6. Lack of flexibility when implementing read-write splitting.
  7. Developers go and generate schemas in code which are often interpreted in a less than optimized data type in the database and almost always miss critical INDEXes.  Default datatypes are rarely the correct datatypes for the job.
  8. Using an ORM does not negate the need to pull slow queries, run them through an EXPLAIN plan, and find the missing indexes.  Yet ORMs hide the SQL leaving the slow queries as a nasty surprise.
  9. Hand written queries almost always run faster than generated ones.  Even simple ones.
  10. In ActiveRecord and Hibernate (at least) giant layers of abstraction and reflection and metaprogramming layered between the code and the database serve as a tangle of execution designed to do nothing but make things slow.  In Java this isn't noticeable but in an interpreted language...

Bad SQL queries that pull back large, unnecessary blobs of data on every read or do not employ indexes are the main source of slow web code.   ORMs generate bad SQL queries.  QED.

Yes, the argument for an ORM is that one can learn the ins and outs of the ORM's DSL and, with enough practice and finesse, can get the ORM to generate decent queries.  But one can spend time learning the DSL or just learn to write standard SQL queries.  
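Writing the query by hand isn't much more code than the ORM call.  A minimal sketch with the node mysql driver -- the connection settings and the users table are placeholders:

// hand-written, parameterized SQL with the node "mysql" driver
var mysql = require('mysql');
var connection = mysql.createConnection({
  host: 'db01', user: 'butterfly', password: 'secret', database: 'butterfly'
});

// select only the columns needed -- no SELECT *, no hidden JOINs
connection.query(
  'SELECT id, email, created_at FROM users WHERE id = ?',
  [42],
  function (err, rows) {
    if (err) { throw err; }
    console.log(rows[0]);
    connection.end();
  });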

The Python and Node.JS communities have eschewed ORM for the most part.  Python's SQLAlchemy has ORM as an optional add-on -- if it gets used at all.  It's mostly a Data Mapper.  Node.JS's Sequelize is already going down the ActiveRecord route -- it has many of the same sins.  Sad, as it's a system designed to kill Node's performance.

The biggest source of poor performance in a system is the mystery queries, some of them enormous, generated by ORM metaprogramming magic.  The best way to speed up code is to ditch the ORM and write the queries directly.  Even stodgy ruby with its garbage collection issues can get enormous benefits simply by moving away from ActiveRecord to a data mapper.  

To sum up: friends don't let friends write easy but slow code.

Of course, if moving to a data mapper or stored procedures doesn't speed things up, then we talk about read-write splitting... but that is another article for a different day. 

Github and the Problem of Sexism in Tech

After 50 years of trying to end endemic sexism in technology, we are still plagued by sexism in technology.  Another explosively public case reared its head: one of harassment, bullying, and stalking of a female employee over at github who has since resigned. According to even more gossip at Valleywag, this is not the first female employee to depart the company under similar circumstances.  For every female employee who leaves, tells all on twitter, gives an email interview with TechCrunch, is covered in Ars Technica and Gigaom, and then gets the front page of Valleywag, another 10,000 are suffering in silence, terrified to leave their jobs while being abused.  

From the standpoint of discussing Github, it calls into question the integrity of the privacy and security the product provides its users on a mass scale.  Also, the company should face some serious legal questions about its practices.  An unauthorized non-employee not only on premises (itself a major issue) but interacting with employees and with corporate assets which belong to paying customers on a sustained basis is not going to hold up to any legal scrutiny.  As a product used by millions of developers, Github should be promoting its integrity as its core value, as its integrity is more valuable than its racks of servers.    

As a community, we need to ask why these situations keep happening.  There's something wrong with the Internet culture that not only creates these situations but perpetuates them.  Either the frathouse culture, the sense of entitlement that comes from being technologically superior to others, or simply the heady air of becoming popular on the Internet feeds a deep bro-culture insecurity that makes it okay to stalk and harass.  Over the years it hasn't gotten better.  It has gotten worse.  

The frat house culture is not okay.  The workplace is not undergrad.  The workplace is a place of business.  It is a place of rules, laws, and legal obligations.  It is not someone's madhouse with 15 people living in it.  And the problem is: we encourage the workplace as a frathouse to make technology "fun," "hip," "cool," and to get young men with online social lives to work long, horrific hours.

As a culture, we need women to enter technology fields.  Having 50% of the population frozen out of a profession is not healthy.  Something has got to give.

Bad culture comes from bad corporate leadership. When bad corporate leadership comes from the same frat house mentality then the company itself has a toxic culture.  If the company has a toxic culture, then the issue perpetuates.

Wish I had some solutions but the only solutions I have are:

  • Discourage a "bro" culture.
  • Focus on team cohesion over team competition.
  • Founders focus on growing the business over personal popularity.
  • Enforce a clean, professional workplace (omg)
  • Turn off the Internet.

And none of this is satisfying.  Until tech becomes a safe place for women, there will continue to be few women in technology and those who are there end up in these situations.

I am in no way saying all men in tech are bad -- very much the opposite.  Most men in technology are made of awesome.  If they weren't, I would have left a long time ago.  But with 20 years of experience in computers, I can say this is an endemic issue, and the Github incidents will continue until something, and something ugly, gives.

 

 

 

Note about Writing Indexes, JOINS and EXPLAIN plans

One big source of poor performance in a system is bad queries, and bad queries stem from two sources: over-normalized data and bad indexes.  Both lead to full table scans -- where the database gets into a state where it needs to read every row of data in a table to find the correct data to return to the caller.  Relational databases work best when they can execute a query at O(log n) speeds across tables.  

The closer an RDBMS can run to pure O(log n), the faster the queries will return data.  When a database returns data at O(n), with a million rows the query can take seconds to run.  

Indexes are a trade-off -- for every index a table has, the database keeps an extra copy of the data in indexed order.  When creating an index, the database will sort and B-Tree a copy of the data in indexed order and keep it.  INSERTS and UPDATES run slower as the database needs to insert into multiple copies of the table for every new data write.  But, disk is cheap, writes are by their nature slow, and most of an application's time is spent in reads.*   Reads need to be fast like bunnies.

Indexes are driven by the needs of the application so it helps developers to understand them, if not be able to write them.  The best place to go to get a good overview and learning experience about indexes is Use the Index Luke.  The site is great and walks through a huge number of indexing scenarios and where indexing helps and where it does not.  The section on JOINS is helpful, especially as the site covers the notorious-in-ActiveRecord-and-Hibernate N+1 problem found at the heart of so many performance related issues.

JOINS themselves are the enemy of performance.  A JOIN forces a database to scrape up all the data off disk for one table, and then scrape up all the data on disk for another table, try to execute any indexes there are in performing the various search actions, compare, merge results based on the type of JOIN (INNER, OUTER, LEFT, RIGHT...) and return the Venn diagram of results.  Three JOINS are slower still... the more JOINS, the slower the performance, the more the application will crawl.  Even Oracle RAC will not save an application from non-indexed multi-level JOINS.

A quick note on EXPLAIN plans.  The database will happily cough up exactly how it wants to run a query by passing it through an explain plan.  Just:

EXPLAIN SELECT 1;

will return exactly how the database (MySQL in this case) will run the SELECT 1 query.  So if the query is:

EXPLAIN SELECT * from CUSTOMERS where customer_id = 1 and customer_type=32; 

the database will return exactly how it will search the CUSTOMERS table.  It will clearly state if it is using an INDEX, if it is using WHERE, or if it is doing a full table scan.  EXPLAIN plans are the MEGAtool for making databases generate good, happy, well tuned queries.

Most DBAs do not allow developers to run EXPLAIN on production databases -- and this is fine and sane.  Developers should not only understand their INDEXes but they should run EXPLAIN in development environments when hunting performance bugs.
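Running an EXPLAIN from Node.JS in a development environment takes just a few lines with the mysql driver.  The connection settings are placeholders and the query is the CUSTOMERS example above:

// run an EXPLAIN plan from Node.JS in development (mysql driver)
var mysql = require('mysql');
var connection = mysql.createConnection({
  host: 'localhost', user: 'dev', password: 'dev', database: 'butterfly_dev'
});

connection.query(
  'EXPLAIN SELECT * FROM CUSTOMERS WHERE customer_id = ? AND customer_type = ?',
  [1, 32],
  function (err, rows) {
    if (err) { throw err; }
    // look at the "type", "possible_keys" and "key" columns --
    // type "ALL" means a full table scan and a missing index
    console.log(rows);
    connection.end();
  });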

Know thy RDBMS and live a happy data-filled life.

  * If the system is doing huge data write volumes, maybe the system should be running a Hadoop cluster instead of an RDBMS.  

 

Project Butterfly: Generating a REST API

Trying to get this project moving again: Project Butterfly needs a REST API to work as a functional example of contract-based web services which can, theoretically, scale to millions.  Using the schema presented earlier, here is a set of REST API contracts between the front end and the backend.

If we were going for serious node.js scaling at the service layer and building this system for production, this contract divides fairly cleanly into a user controller and a conventions controller.  Splitting apart services we could build:

  • An account service focused on user creation and account management.
  • A reservations service focused on convention creation, pass creation, and reserving a seat.  Also, even more interestingly, we do not have to reserve a seat in-line -- but that's a later topic.
  • A search service.

As an example these are all one service.  A quick implementation note: the schema should be updated with a flag to mark deletion but not delete data out of the database.  Deletions lead to lost data and fragmentation.

USERS MODEL API

GET /users -- Get all users
GET /users/:id -- Get user with id = :id
POST /users -- Create user, return ID
PUT /users/:id -- Update user with id = :id
DELETE /users/:id -- Mark user deprecated with id = :id 

CONVENTIONS MODEL API

GET /conventions -- Get all conventions
GET /conventions/:id -- Get all conventions at id = :id
POST /conventions -- Create convention, return ID
PUT /conventions/:id -- Update convention with id=:id
DELETE /conventions/:id -- Mark convention deprecated with id = :id 

CONVENTION PASS TYPES API

GET /conventions/:id/types -- Get all pass types for a convention
GET /conventions/:id/types/:id -- Get information about pass type :id
POST /conventions/:id/types/:id -- Create new pass type :id
PUT /conventions/:id/types/:id -- Update pass type :id
DELETE /conventions/:id/types/:id -- Mark a pass type deprecated

ASSOCIATIONS API

GET /conventions/:id/users -- Get all the users for a convention
GET /conventions/:id/type/:id/count -- Get the total users for a convention's pass type.
GET /conventions/:id/type/:id/max -- Get the maximum number of users (capacity) for a convention's pass type.
GET /conventions/:id/type/:id/users -- Get the list of users for a convention's pass type.
POST /conventions/:id/type/:id/users/:id -- Add a user to a convention's day pass type
PUT /conventions/:id/type/:id/users/:id -- Update a user on a convention's day pass type
DELETE /conventions/:id/type/:id/users/:id -- Remove a user from a convention's day pass type

VERBS:

GET /search?users=:email  -- Search for user with email = :email
GET /search?conventions=:name  -- Search for convention with name = :name

The REST calls inform the schema.  A convention has a set of passes which have users associated with those passes.  Notes:

  • Count and Max calls are performance killers.  They must be cached.
  • Convention ID must be indexed everywhere it is not primary key.
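As a sketch of how one of these contracts might look in Express with that caching in place -- the parameter names, the db.countUsers helper, and the cache client wiring are all assumptions for illustration, and Express needs distinct parameter names where the contract reuses :id:

// cached count endpoint (sketch)
app.get('/conventions/:conventionId/type/:typeId/count', function (req, res) {
  var key = 'count:' + req.params.conventionId + ':' + req.params.typeId;

  cache.get(key, function (err, response) {
    if (!err && response && response[key] !== undefined) {
      return res.json({ count: parseInt(response[key], 10) });  // cache hit
    }
    // cache miss: run the expensive COUNT once, then cache it briefly
    db.countUsers(req.params.conventionId, req.params.typeId,
      function (err, count) {
        if (err) { return res.status(500).json({ error: 'count failed' }); }
        cache.set(key, String(count), { flags: 0, exptime: 30 },
          function () { res.json({ count: count }); });
      });
  });
});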