Big Systems and Logging

Distributed systems need to log, and they need to log everything.

This seems like an obvious statement, but it's not.  In the past, systems rarely logged beyond exception logging and system logging.  They didn't need to: if a website ran on a single monolithic server and customers hit errors, the developer could log into the server, examine the exception logs, and take action.  If the website needed to be patched, the developer could build and roll out a patch to that one server.  Done and done.

If the system has 500 or more backend servers all running the web code and a handful of them are experiencing errors, it's impractical to log into the servers and examine them one by one (1), so we need an entire logging architecture.  And logging architecture leads us right into the wonderful new world of Hadoop and ElasticSearch.

But it's not just about catching errors.  It's about getting analytics.  It's about answering questions about the system as it runs.  We want to know things like:

  • What is the rate (transactions/second) of people buying a widget at any given time?
  • What is the flow of my users from the front page all the way down through the microservices and back?
  • What is the rate of my system errors?
  • What is the rate of my system successes?
  • What is the rate of my system errors in proportion to the number of overall system successes?
  • Are people trying to abuse or hack my system?  How?  In what manner?
  • What percentage of a given action fails versus completes?
  • Can I get an aggregate view of the system working in near real time?
  • What questions can I answer about how my system is used?

Yes, it's possible to ETL (extract, transform, load) this kind of information out of OLAP databases, but nowadays we can do all sorts of incredibly sweet and interesting things with logs.

Before digging into logging architectures and building something basic, here are some guidelines for good logging.  Now, there is a huge gulf between no logging, terrible logging, and good logging.  Most of the time we start at no logging and then add in bad logging.  Good logging is logging that leads to a good overall picture of the system.

Guidelines

  • Always prefix a logline with a time and date stamp.
  • Keep an action to a single logline.  Multi-logline messages may be fine for grep, but they make life difficult when hundreds of loglines appear per second.
  • Log whenever there is an error but also log whenever there is a successful action.
  • Put the following, if possible, into a single logline:
    • The name of the process
    • The name of the overall functionality in this process
    • SUCCESS/FAIL/ERROR
    • The REST call or action taken
    • User-identifiable information
    • The source IP

For example:

2013-05-07 19:40:00 microservice.js:create_account SUCCESS action=create status=ok user_id=awesome_dude src_ip=203.0.113.44 return_infos=0

Here I could parse on the date, the service, the call, the action, the status, the user_id, the source IP, or any combination of the above.
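
To make that concrete, here's a minimal sketch in node.js of a helper that emits lines in this shape and a parser that splits them back into fields.  The function names and the sample FAIL line are my own inventions for illustration:

    // Emit one logline in the "date time process:function STATUS k=v ..." shape.
    function logLine(source, status, fields) {
      // "YYYY-MM-DD HH:MM:SS", matching the example above
      const ts = new Date().toISOString().replace('T', ' ').slice(0, 19);
      const kv = Object.entries(fields).map(([k, v]) => `${k}=${v}`).join(' ');
      return `${ts} ${source} ${status} ${kv}`;
    }

    // Split a logline back into its parts.  Assumes values contain no spaces.
    function parseLine(line) {
      const [date, time, source, status, ...rest] = line.split(' ');
      const fields = {};
      for (const pair of rest) {
        const i = pair.indexOf('=');
        if (i > 0) fields[pair.slice(0, i)] = pair.slice(i + 1);
      }
      return { date, time, source, status, fields };
    }

    // With lines parsed into objects, the rate questions above reduce to
    // counting: here, errors in proportion to successes.
    const lines = [
      logLine('microservice.js:create_account', 'SUCCESS',
              { action: 'create', status: 'ok', user_id: 'awesome_dude', src_ip: '203.0.113.44' }),
      logLine('microservice.js:create_account', 'FAIL',
              { action: 'create', status: 'duplicate_user', user_id: 'awesome_dude', src_ip: '203.0.113.44' }),
    ];
    const parsed = lines.map(parseLine);
    const errors = parsed.filter((p) => p.status !== 'SUCCESS').length;
    const successes = parsed.length - errors;
    console.log(`errors=${errors} successes=${successes}`);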

There's also a more advanced trick: give the user a GUID when they enter the system and pass that GUID through all of their calls, so that a single search on that GUID pulls up every action by that user across all components of the entire system.  Very useful when mining data.
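
Here's a minimal sketch of the GUID trick, assuming a plain node.js HTTP service; the x-request-id header name and the handler itself are assumptions for the example:

    const http = require('http');
    const crypto = require('crypto');

    http.createServer((req, res) => {
      // Reuse the GUID if an upstream component already minted one;
      // otherwise this is the user's entry point, so mint a fresh one.
      const guid = req.headers['x-request-id'] || crypto.randomUUID();

      // Every logline for this request carries the GUID, so one search
      // on it pulls up the whole flow across every component.
      console.log(`${new Date().toISOString()} microservice.js:create_account ` +
                  `SUCCESS request_id=${guid} action=create`);

      // Pass the GUID downstream on calls to other microservices, e.g.
      // http.request({ headers: { 'x-request-id': guid }, ... })

      // ...and hand it back to the caller for good measure.
      res.setHeader('x-request-id', guid);
      res.end('ok\n');
    }).listen(8080);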

There are also some super cool gotchas:

  • This much logging will slowly kill system performance and lead to IOPS-based disasters (one way to soften this is sketched after this list).
  • We need a system to deal with all the logging because this cannot go into a database -- it will kill a database. Any database.  Even the big Oracle data warehouse databases.
  • Disk usage needs to be monitored and alerted on.
  • Except for log4j, most platforms need to roll their own loggers.  For example, there are no good loggers available right now for node.js.  There are acceptable ones, but none that can be considered good.
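
On that first gotcha: one common way to soften the performance/IOPS problem is to buffer loglines in memory and flush them to disk in batches, so heavy logging costs one large sequential write instead of thousands of tiny ones.  A minimal node.js sketch, with the file name, batch size, and flush interval made up for illustration:

    const fs = require('fs');

    // Loglines pile up in memory and hit the disk in batches.
    const buffer = [];

    function log(line) {
      buffer.push(line);
      if (buffer.length >= 1000) flush();   // flush on size...
    }

    function flush() {
      if (buffer.length === 0) return;
      const batch = buffer.splice(0).join('\n') + '\n';
      fs.appendFile('service.log', batch, (err) => {
        if (err) console.error('log flush failed:', err);
      });
    }

    setInterval(flush, 1000).unref();       // ...and on a one-second timer

    // Last-ditch synchronous flush so buffered lines survive shutdown.
    process.on('exit', () => {
      if (buffer.length) fs.appendFileSync('service.log', buffer.join('\n') + '\n');
    });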

This is where web systems and big data meet.  There are arguments that web logging is not the best place to start working with big data, but it's what the system was designed to handle and the most obvious source of data to pump into one.

The next few articles will add a basic logging infrastructure to project butterfly -- probably, like everything else so far, enormous overkill for the needs of the system -- and look at some very baby big data options for doing data mining beyond the normal SQL systems.  

(1) Some of us have done this.  Do not do this.