Note about Writing Indexes, JOINS and EXPLAIN plans

One big source of poor performance in a system are bad queries, and bad queries stem from two sources: over data normalization and bad indexes.  Both lead to full table scans -- where the database gets into a state where it needs to read every row of data in a table to find the correct data to return to the caller.  Relational databases work best when they can execute a query at O(logn) speeds across tables.  

The closer an RDBMS can run to pure O(logn), the faster the queries will return data.  When a database returns data at O(n), with a million rows the query can take seconds to run.  

Indexes are a trade-off -- for every index a table has, the database keeps an extra copy of the data in indexed order.  When creating an index, the database will sort and B-Tree a copy of the data in indexed order and keep it.  INSERTS and UPDATES run slower as the database needs to insert into multiple copies of the table for every new data write.  But, disk is cheap, writes are by their nature slow, and most of an application's time is spent in reads.*   Reads need to be fast like bunnies.

Indexes are driven by the needs of the application so it helps developers to understand them, if not be able to write them.  The best place to go to get a good overview and learning experience about indexes is Use the Index Luke.  The site is great and walks through a huge number of indexing scenarios and where indexing helps and where it does not.  The section on JOINS is helpful, especially as the site covers the notorious-in-ActiveRecord-and-Hibernate N+1 problem found at the heart of so many performance related issues.

JOINS themselves are the enemy of performance.  A JOIN forces a database to scrape up all the data off disk for one table, and then scrape up all the data on disk for another table, try to execute any indexes there are in performing the various search actions, compare, merge results based on the type of JOIN (INNER, OUTER, LEFT, RIGHT...) and return the Venn diagram of results.  Three JOINS are slower still... the more JOINS, the slower the performance, the more the application will crawl.  Even Oracle RAC will not save an application from non-indexed multi-level JOINS.

A quick note on EXPLAIN plans.  The database will happily cough up exactly how it wants to run a query by passing it through an explain plan.  Just:

EXPLAIN SELECT 1;

will return exactly how the database (MySQL in this case) will run the SELECT 1 query.  So if the query is:

EXPLAIN SELECT * from CUSTOMERS where customer_id = 1 and customer_type=32; 

the database will return exactly how it wills search the CUSTOMERS table.  It will clearly state if it is using INDEXes, if it is using WHERE, or if it is doing a TABLE FULL ACCESS.  EXPLAIN plans are the MEGAtool for making databases generate good, happy, well tuned queries.

Most DBAs do not allow developers to run EXPLAIN on production databases -- and this is fine and sane.  Developers should not only understand their INDEXes but they should run EXPLAIN in development environments when hunting performance bugs.

Know thy RDBMS and live a happy data-filled life.

  • If the system is doing huge data write volumes, maybe the system should be running a Hadoop cluster instead of an RDBMS.