The original solution to building a centralized logging system for N+1 servers was to:
- Have code write directly to syslog using the syslog library of choice for that particular language, and/or run a syslog collector that tails logs and sends them into syslog;
- Forward the logs over UDP to a centralized collection point;
- Perform grep with a vengeance.
This required a centralized logging server with good log rotation scripts, some LDAP goodness for authentication, and a decent working knowledge of regular expressions. For a cap of somewhere around 40 data sources sending data to the central server, this is not a terrible solution -- especially if the logs are structured and fit within the syslog limitations for a message. (1) Note that this is data sources, not servers: if each server sends application logs, web logs, and syslogs, that's three sources per server, so divide the number of sources by three to get the effective server count.
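At this scale, "grep with a vengeance" really is the analytics layer. As a tiny illustration -- the sample log lines and hostnames below are made up -- counting 502s per host out of a centralized syslog file looks like:

```shell
# Fake sample of centralized syslog lines (stand-in for the real log directory)
cat > messages.sample <<'EOF'
Jun  1 10:02:11 web01 nginx: 10.0.0.5 GET /index.html 200
Jun  1 10:02:12 web02 nginx: 10.0.0.6 GET /login 502
Jun  1 10:02:13 web01 nginx: 10.0.0.7 GET /login 502
EOF

# "grep with a vengeance": pull the 502s, then count them per host (field 4)
grep ' 502$' messages.sample | awk '{print $4}' | sort | uniq -c
```

Every question about the system turns into a variation on this pipeline -- which is exactly why it stops scaling past a few dozen sources.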
The topology of such a system looks much like so.
This works well if we can live with the occasional message lost to UDP weirdness. But if we have more servers than this can handle, and we don't have enough money for the almighty Splunk, do we go straight to Hadoop?
The temptation is to get a jump start on the sexy world of Hadoop to get it on a resume tout de suite. Hadoop is, ultimately, a distributed file system, much like AFS before it, using a combination of clever Java processes and ZooKeeper to turn commodity hardware into a vast sea of storage. Map/Reduce is a software framework, embodied in Hive or Pig, for sifting through that vast sea of unstructured data and finding answers within it. But Hadoop is a varsity-level system that gets better the more nodes are added to the cluster. Building it for a website that turns over 1M log lines or less a day is not worth the overhead of figuring it out and building the batch jobs that span the data store and surface useful data.
We need something a little more junior varsity. Something that does the job, will run in a public or private cloud, and doesn't require a bunch of data scientists standing around and going hmm. Something that will cough up good solid analytics.
This is a thing that maybe kinda sorta exists. It's called the ELK stack: Elasticsearch - Logstash - Kibana.
We need the following components:
- Forwarders.
- Search Indexers.
- Search Engines.
- A query front end.
Forwarders are the most difficult to pick because there are so many to choose from. The best one is Apache Flume, a high-performance Java-based forwarder which has an elegant input-sink-output architecture and bridges the gap between CDH (Cloudera Distribution including Hadoop) (2) and Elasticsearch. But there's also Fluentd and Lumberjack, which are far more Logstash-friendly and designed to be lighter on a system than the memory hog Flume.
This is a your-mileage-may-vary choice -- if the cluster will eventually blow out into a Hadoop cluster, you want to use Flume, because that architectural choice will serve you oh so well in the future. If you love your Elasticsearch cluster -- and you will -- then it's a preference based on the log shape. Fluentd and Lumberjack will be much happier with more structured logs as inputs. Flume shines with unstructured logs, like log4j logs -- although Flume really wants to write in Avro.
They have the same basic functionality: get logs away from the application and into the logging system quickly and efficiently without putting strain on the host system.
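As a sketch of the forwarder side, a minimal Fluentd configuration that tails a web server's access log and forwards events to the indexer tier might look like this (the file paths and the `indexer01.example.com` host are assumptions for illustration; option names vary a bit between Fluentd versions):

```conf
# Tail the web server's access log and tag each event
<source>
  type tail
  path /var/log/nginx/access.log
  pos_file /var/lib/fluentd/nginx-access.pos
  tag nginx.access
  format nginx
</source>

# Forward tagged events off-host to the indexer tier
<match nginx.access>
  type forward
  <server>
    host indexer01.example.com   # hypothetical indexer host
    port 24224
  </server>
</match>
```

The `pos_file` bookkeeping is what keeps the forwarder cheap on the host: it picks up where it left off instead of re-reading the log.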
Indexers are a little touchier here. Logstash does the extract/transform/load work on the data coming in from the forwarders and loads it into Elasticsearch. The Logstash configuration file requires well-defined inputs and outputs, but the system has an enormous list of plugins on both sides of the equation. It uses grok to parse logs into a format Elasticsearch can index.
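A minimal Logstash configuration along those lines -- a syslog input, a grok filter for combined-format access logs, and an Elasticsearch output -- might look like the following. The port and host are illustrative, and plugin option names shift a bit between Logstash versions:

```conf
input {
  syslog {
    port => 5514
  }
}

filter {
  grok {
    # Parse Apache/nginx combined-format access lines into named fields
    match => [ "message", "%{COMBINEDAPACHELOG}" ]
  }
}

output {
  elasticsearch {
    host => "es01.example.com"   # hypothetical Elasticsearch node
  }
}
```

`COMBINEDAPACHELOG` is one of the stock grok patterns; most of the work in practice is writing patterns for whatever your applications actually emit.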
Elasticsearch does all sorts of super interesting things -- it clusters, it grows, it's fault tolerant, it provides a concise query language, it has APIs for inter-operability, and it can be used as a search engine for Hadoop when moving to a full Hadoop cluster -- or alongside it for specific log analytics. It's really a neat piece of software that slices and dices. It will burn through tens of thousands of log lines a second, given enough RAM.
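For a taste of that query language: the JSON body below, sent as an HTTP `_search` request against the daily `logstash-*` indices, counts 502 responses. It assumes Logstash's default `logstash-YYYY.MM.DD` index naming and a grok-extracted `response` field; `"size": 0` suppresses the matching documents themselves, and the count comes back in `hits.total`.

```json
{
  "size": 0,
  "query": {
    "term": { "response": "502" }
  }
}
```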
The query front end is a piece of software called Kibana. Not too much to say here other than it's a graphical query engine that plugs directly into most of these technologies -- Elasticsearch, Logstash, Fluentd, Flume -- and delivers intelligence and analytics about the system. If you have ever run an operation with a few hundred web servers, and you can now get a nice live-updating trend graph of 502s across the entire system to spot when it's going down under load, you will appreciate a system like Kibana. If you are in business development and you'd like to understand how users actually use your system, or if you are in security and want to watch users brute-force your authentication systems, you will also appreciate Kibana. Graphs for everyone!
Here's an upgraded topology using all these technologies in concert to pull together a full working system:
Preferably, all of these components should run on separate VMs in production. In a testing environment, enh. Elasticsearch does want to run alone, though: as a Java-based search head, it works best when it can consume the maximum cores and RAM on a system. If it does not have all the RAM it needs, it will work, but searching will underperform.
It feels like a ton of work to pull together a system one could simply pay for, but being able to spot trends, problems, and instability in a system is gold. This can all be built out of commodity hardware and, with some gotchas, in the public cloud. That Logstash can also forward to Ganglia and Graphite is a golden bonus.
Cool stuff. We didn't get flying cars in the future but we certainly got hyper-fast log indexing, searching and trending.
One thing I did not cover at all is scaling these components. All of them scale horizontally -- though Logstash has some gotchas. If/when I go deeper, I'll cover that in more depth and breadth.
(1) Graylog2 has a format called GELF, a syslog replacement which gets around several limitations of the syslog format. It's another whole way of building this system, with Apache Kafka.
(2) For reasons unknown, Cloudera, who is otherwise amazing, does not tell you what CDH stands for.
(3) This is an aside and a topic for another post: be very careful putting Elasticsearch nodes on cloud systems that charge per I/O operation. ES can easily do 1M I/O operations a month on a moderately sized system. Also, AWS-specific: there's no multicast in AWS, so nodes will not auto-discover. Documented workarounds are available.
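On the AWS auto-discovery point, one documented workaround is unicast discovery: disable multicast and list the cluster's nodes explicitly in each node's `elasticsearch.yml` (the hostnames here are illustrative):

```yaml
# elasticsearch.yml: unicast discovery for networks without multicast (e.g. AWS)
discovery.zen.ping.multicast.enabled: false
discovery.zen.ping.unicast.hosts: ["es01.example.com", "es02.example.com"]
```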