on ORDER BY optimization

Generally in MySQL we send queries massaged to a point where the optimizer doesn’t have to think about anything. In our major user database environment 99.9% of queries don’t have alternative query plans available (either because of forced indexes or just a straightforward primary key read). We have various other systems though, and from time to time we have to do SQL work there and chase optimizer errors.

There’re multiple places where the optimizer can make a choice even in very basic queries, for example:

  • Which index returns fewer rows
  • Which index can be used for ORDER BY

A query that I was looking at asked a very basic question: on a job instances table, show state and status for the latest-by-ID entry for job name=’Ship Christmas Presents’ (the real name was a bit different ;-). So, it was SELECT c,d FROM t WHERE b=X ORDER BY a DESC LIMIT 1, where the PK is (a) and a possible index is on (b,c).
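For concreteness, here is roughly what that shape looks like – a sketch only, with anonymized placeholder names rather than the real schema:

CREATE TABLE t (
  a BIGINT NOT NULL AUTO_INCREMENT,   -- job instance ID
  b VARCHAR(100) NOT NULL,            -- job name
  c INT NOT NULL,                     -- state
  d INT NOT NULL,                     -- status
  PRIMARY KEY (a),
  KEY b_c (b, c)
) ENGINE=InnoDB;

SELECT c, d FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 1;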

What we were observing was a massive table scan on the PK instead of using the (b, ...) index. The table in question was hundreds of gigabytes, so that did hurt, a bit. If one forced the (b,c) index, queries became really fast. This wasn’t an issue of some intermittent statistics flapping.

The first thing that immediately caught my attention was that LIMIT 2 produced a proper query plan, whereas LIMIT 1 did not. I shipped a few-gigabyte subset onto my test machine and carefully went through what was happening.

EXPLAIN was just telling me that it would be picking the bad query plan, EXPLAIN FORMAT=JSON was telling me the same, so I needed to look at some detailed execution data. MySQL has this extremely promising facility called ‘optimizer trace’, so I got the trace for my test case. Unfortunately, the trace gave me everything that I already knew – that only one index made sense for reducing the dataset and that it switched to the PK to order things better. It told me about the number of rows and some “cost” of the plan – whatever those numbers mean – and it gave super-high numbers for the table scan.
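For reference, pulling such a trace takes just a few statements (shown here with the placeholder query from above):

SET optimizer_trace = 'enabled=on';
SELECT c, d FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 1;
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';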

My optimizer trace did not tell me why it decided to switch from the decent query plan to the absolutely horrible one; it just had a block saying "reconsidering_access_paths_for_index_ordering" and "plan_changed": true, which I already knew. On the other hand, that gave me a quick pointer into the source code – the six thousand lines of sql_select.cc. I could find the above trace marker in test_if_skip_sort_order(), which then calls this:

test_if_cheaper_ordering(..., order, table, usable_keys, ref_key, select_limit, &best_key, ...);

Essentially, this function will look at all the indexes that can be used instead of “filesort” and see what would happen if we used them. This function would say that the ref_key (b,c) – the index that returns the fewest rows – is not the best key, and that the best key is the one on (a). How did it come up with such a conclusion? Short answer – the optimizer is very naïve.

That logic is explained in this comment:

/*
We assume that each of the tested indexes is not correlated
with ref_key. Thus, to select first N records we have to scan
N/selectivity(ref_key) index entries.
...
*/

It makes a naïve and somewhat optimistic calculation for (a) and the most pessimistic possible calculation for (b,c). I’ll start with the pessimistic one. The optimizer assumes that there are 20000 instances of our job, and that for each of them we have to do a separate seek into the PK, so our cost is 20000 seeks (~300MB) of PK reads plus a few more reads on the index itself.

Now the optimism (and the bad query plan) comes from the idea that if we’re selecting only 1 out of 20000 rows, then only 0.005% of a table scan should be enough to satisfy LIMIT 1. If we provided LIMIT 10, it would be only 0.05%. On a 100GB table that means 5MB of data read, which seems cheaper than the very pessimistic calculation of more than 300MB above.
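Roughly, the two estimates boil down to this back-of-the-envelope comparison (assuming the default 16KB InnoDB page and one page fetched per seek):

\[ 20000 \times 16\,\mathrm{KB} \approx 300\,\mathrm{MB} \;\;\text{for the (b,c) plan} \qquad\text{vs.}\qquad \tfrac{1}{20000} \times 100\,\mathrm{GB} \approx 5\,\mathrm{MB} \;\;\text{for the PK scan.} \]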

Unfortunately, the optimizer doesn’t understand that there are humans building these systems, and humans have their own rationale and thinking. One of the ideas a human would have is that if you have a 100GB table, you’d better understand things like data locality and other sorts of efficiencies. That means that in the real world most large tables have various degrees of correlation in their data. In our fictitious example we know that “Christmas” happens, well, at Christmas. Looking through our window we know that it isn’t anywhere near Christmas.

So, database assumes that our data is distributed like this:

----W----H----E----E----E-----

When it is more like this:

-----------W-H-EE-EE----------

With this data distribution the pessimistic estimate for (b,c) becomes way too dark and miserable – there’s a good chance that the “random” seeks into the PK will all hit the same pages, so we will need only a few megabytes of reads. Our best case ends up at ~5MB; our worst case is somewhere above the 300MB mentioned earlier. The optimistic estimate, meanwhile, can vary from the “oh hey, I just started reading and found a row immediately” lottery win, through the “hey, I calculated an average for you” naïveté of 5MB, to the “oh snap, I was wrong” of, say… 30GB.

Though we all agree that the optimizer cannot know everything, it errs in the direction of chasing a much wider range of possibilities and being way more opportunistic rather than somewhat more reliable. Obviously, it is very hard to tell whether changing some of these heuristics is possible in a way that would not affect a million other places, but in this exact case we can look at what could be done better with the information at hand, either by rewriting queries or by implementing somewhat procedural access.

One way is materializing the maximum ID first, simply because it is already hidden in the (b,c) index – internally, InnoDB stores that index as (b,c,a). So if we

SELECT MAX(a) FROM t WHERE b=X

we can get most of the data for the read just by scanning a few index pages. We can then use that value to dive into the primary key for that single row. A very similar approach can be taken with different limits.
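Put together as a single statement, the LIMIT 1 case can look like this (a sketch using the placeholder names from above):

SELECT c, d
FROM t
WHERE a = (SELECT MAX(a) FROM t WHERE b = 'Ship Christmas Presents');

For bigger limits the same trick applies: pull the top-N values of a out of the (b,c) index first, then join them back against the primary key.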

We should also know that the optimizer carries most of this logic from MyISAM times, and it doesn’t realize that it already knows multiple values of (a) just from doing the records-in-range estimation for the (b,c) index – it does two random dives there and already observes primary key values. Simply by noticing that both of these values land nowhere close to the head or tail of the table scan, it could discard the truly expensive query plan.

Of course, that is messy and can suffer from broken sampling, but hey, it may be better than nothing, although it would again assume the opposite – that data is somehow correlated. Which we humans think it is, while the computer sits on the other side of the fence.

So, in summary, you have to understand that the database assumes things you don’t, and sometimes you have to force your query plans or write your queries in a way that leaves no optimization space for the optimizer.

The saddest bug of them all (SQL is dead?)

From time to time I observe servers wasting lots of CPU on batch row operations. In perf top it looks like this:

8.24% mysqld [.] Arg_comparator::compare_int_unsigned()
7.17% mysqld [.] Item_cond_and::val_int()
4.37% mysqld [.] Item_field::val_int()
4.37% mysqld [.] Item_cond_or::val_int()
2.90% mysqld [.] MYSQLparse(void*)
2.64% mysqld [.] Item::val_bool()

Essentially, if you construct queries like (a=1 AND b=2) OR (a=3 AND b=4) …, then at a large enough batch size evaluating the WHERE clause becomes far more expensive than anything else (yes, more expensive than decompressing rows or doing all the InnoDB magic and whatnot).

MySQL has awesome syntax that makes certain batch lookups much faster: WHERE a IN (1,2,3). It constructs a tree that each row can then be compared against, so one does not have to iterate through a list of predicates to check whether a row returned by the batch index lookups needs to be filtered away. One would think that the composite syntax the server also has – WHERE (a,b) IN ((1,2),(3,4)) – might help here.

Unfortunately, if you run a query using this syntax, a full table scan will be done even if there’s a great index for it – the syntax exists, but none of the sane ways to execute it apply. Of course, there had to be bugs filed about that:

Compound (cola,colb) IN ((x,y)) statements do not use indexes. (filed by yours truly).

Well, it was closed by the optimization lead (now at Maria), who decided that this bug is a duplicate of another problem, (a,b,c)=(A,B,C). I probably should’ve contested it there, but there might have been other people who’d do that? Sure there were; in 2007 Mark filed another bug:

Optimizer does not do index range scans using predicates in IN lists

If one looks more closely, there’s this small exchange from 2010:

[7 Jan 2010 21:35] Mark Callaghan
This is the same as http://bugs.mysql.com/bug.php?id=16247, 
but the author of this feature request is much more convincing.
[7 Jan 2010 21:36] Domas Mituzas
hehe.

He wrote a Facebook note back then too.

Apparently Mark’s bug became the master bug for this problem, even though it arrived a bit later – but, I can’t argue, he is far more convincing :-)
There’s a bit of odd banter there: Harrison (ohi!) points out that it is documented in the manual (albeit incorrectly, Sergey Petrunia – ohi – notices). No, it isn’t this exact issue that is noted in the manual.
Duplicates are pouring in:

  • Multi column IN does not use index, says Peter Zaitsev (ohi!) in 2008
  • MySQL not using an index when using tuples in IN clause, says Shlomo (ohi!) in 2010

There are a few more.

Sinisa (ohi!) pronounces:

In my humble opinion, this is still not a bug, but feature request. This feature request, however, should be very high on TODO list, as it would make a feature that is sorely missing, particularly since tuple equality expressions have been already optimized.

Fast forward to 2013: even in 5.6 we still have to use expressions whose evaluation cost blows up with batch size, (a,b) IN ((…)) still causes full table scans, and people are switching to NoSQLs, NewSQLs and OtherSQLs because there’s no way to efficiently express batch predicates in MySQL (besides other reasons). Back in the day MySQL was ahead of everyone else in ease of use for simple expressions, but there have been nearly no changes since then to make developer lives easier. Did you know that it may actually be faster to create a temporary table with all the values and join against it than to use any of the existing SQL methods?
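Here is a sketch of that temporary table workaround – table and column names are illustrative, and it assumes the base table t has an index on (a,b):

CREATE TEMPORARY TABLE batch_keys (
  a INT NOT NULL,
  b INT NOT NULL,
  PRIMARY KEY (a, b)
);
INSERT INTO batch_keys VALUES (1, 2), (3, 4), (5, 6);

SELECT t.*
FROM batch_keys k
JOIN t ON t.a = k.a AND t.b = k.b;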

Where is MySQL innovation nowadays? In this exact case I guess you’re supposed to use the memcached API (forget authorization, common interfaces, transactions and whatnot) – solving this in SQL is too complicated.

A case for FORCE INDEX

I remember various discussions in different mediums where people were building cases against the use of FORCE INDEX in SQL queries. I’ll hereby suggest using it way more often, but first a small explanation.

For ages, the concept of index statistics affecting query plans has been clogging the minds of DBAs, supported by long explanations in the MyISAM and InnoDB manuals. Actually, statistics are used just for determining which index to use for a joined table, as the actual predicate values are not known at ‘optimization’ time.

What happens if you do a simple query like:

SELECT * FROM table WHERE a=5 AND b=6

? If there’s an index that enforces uniqueness on (a,b), it will be used – this is the short path for PRIMARY KEY lookups. Otherwise, the optimizer will go to every index, composite or not, that can satisfy either a or b (or both), and evaluate how many rows it would fetch from each using the provided criteria.

Now, contrary to what people usually think, this row count evaluation has not much to do with cardinality statistics – instead, it builds the range that the known predicate can check on an existing index and does two full B-Tree dives into that index – one at the start of the range and one at the end of it. For each possible index.
This simply means that even if an index is not used to execute the query, two of its leaf pages (and all the tree branches needed to reach them) will end up being fetched from disk into the cache – wasting both I/O cycles and memory.

There’s also quite an interesting paradox here – in some cases, the more similar other indexes are, the more waste they create because of the rows-in-range checks. If a table has indexes on (a,b,c) and (a,b,d), a query on (a,b,d) will be best satisfied by the (a,b,d) index, but range sizes will also be evaluated on the (a,b) prefix of the other index. If that first index were (a,c,b) instead, only the head and tail of (a) could be checked – so far fewer B-Tree positions would be pulled into memory for the check. This makes better indexing sometimes fare worse in benchmarks than it is worth (assuming that people do I/O-heavy benchmarking :)
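A minimal sketch of that paradox, with made-up table and index names:

CREATE TABLE paradox (
  a INT NOT NULL,
  b INT NOT NULL,
  c INT NOT NULL,
  d INT NOT NULL,
  KEY idx_abc (a, b, c),
  KEY idx_abd (a, b, d)
) ENGINE=InnoDB;

-- Executed with idx_abd, but the rows-in-range check also dives into
-- idx_abc for its (a,b) prefix, dragging extra leaf pages into the cache:
SELECT * FROM paradox WHERE a = 1 AND b = 2 AND d = 3;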

The easy way out is using FORCE INDEX. It skips the index evaluation entirely – no B-Tree dives on unneeded indexes.
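With the example query from above, using a placeholder table name and a made-up index name:

SELECT * FROM t FORCE INDEX (idx_a_b) WHERE a = 5 AND b = 6;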

In my edge-case testing with real data and a skewed access pattern, hitting a second index during the ‘statistics’ phase increased execution time by 70% and the number of I/Os by 75%, raised the number of entrances into the buffer pool by 31%, and bloated the buffer pool with data I didn’t need for the read workload.

For some queries, like “newest 10 entries”, this will actually waste space preheating blocks from the other end of the range that will never be shown – there will definitely be a B-Tree leaf page sitting in the buffer pool with edits from a few years ago, purely because of the records-in-range (RIR) check. Unfortunately, the only MySQL-side way around this is the HANDLER interface (or probably HandlerSocket) – but that doesn’t make FORCE INDEX not worth it, it just pushes towards making FORCE INDEX much more forceful.

So, use the FORCE, Luke :)

on primary keys

5.1.46 has this change:

Performance: While looking for the shortest index for a covering index scan, the optimizer did not consider the full row length for a clustered primary key, as in InnoDB. Secondary covering indexes will now be preferred, making full table scans less likely.

In other words, if you have a covering index on * (which is quite common on m:n mapping tables), it will now be used rather than the PK. As I have spent my time getting indexing right – having PKs based on the primary access pattern and secondary keys on the secondary access pattern – I hereby do not welcome the new change that suddenly reverses the behavior in a late GA version.
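A sketch of the kind of table this bites – a hypothetical m:n mapping table where a secondary index covers every column:

CREATE TABLE user_group (
  user_id  INT NOT NULL,
  group_id INT NOT NULL,
  PRIMARY KEY (user_id, group_id),     -- primary access pattern
  KEY group_user (group_id, user_id)   -- secondary access pattern
) ENGINE=InnoDB;

-- A dump-style full scan used to walk the PK (and thus the primary access
-- order); after 5.1.46 the covering secondary index gets preferred instead:
SELECT * FROM user_group;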

Not good when mysqldump queries end up taking 6 days instead of the previous half an hour, not good at all.

Update: Oh, MariaDB has this reverted, from their change log:

mybug:39653: reverted as invalid

If only upstream MySQL would take note ;-)

MySQL 5.0 optimizer: loose scans

MySQL 5.0, among lots of visible features, introduced several neat optimizer improvements that may give surprising performance in some queries. Loose index scan, or rather index jumping, allows fast aggregate max() or min() operations, as well as distinct row queries.

Let’s take a table like this:

+------+------+
| a    | b    |
+------+------+
|    1 |    5 |
|    1 |    3 |
|    2 |    4 |
|    2 |    2 |
|    1 |    1 |
+------+------+

A SELECT MIN(b),MAX(b) FROM ournicelittletable GROUP BY a might be executed in different ways. Usually aggregate functions need to build a temporary table in order to sort results, or use an index for sorting.

In the temporary-table case, aggregation is done by reading all rows from the dataset and updating in-memory grouped results. With our tiny sample dataset a table would be created with a primary key on ‘a’, as well as min() and max() columns. Then a table scan would happen, updating the aggregated values on every in-memory row whenever a bigger (for max()) or smaller (for min()) b value is found. After the full scan is done, this table could be returned directly to the user.

If there is an index on fields (a,b), then a tight index scan may happen – it would read values in this order: (1,1),(1,3),(1,5),(2,2),(2,4). In this kind of execution the engine does not need to build any temporary tables; once it has read all rows with the same ‘a’ value, it can return the minimum and maximum to the user, then continue working on the other ‘a’ values.

When MySQL 5.0 finds an index on (a,b), it can simply jump to each ‘a’ value, read one row for min() and one row for max() – as that information is already in the index tree – and immediately return them to the user.
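A minimal way to see it on the sample table above (the index name here is made up):

ALTER TABLE ournicelittletable ADD INDEX a_b (a, b);
EXPLAIN SELECT MIN(b), MAX(b) FROM ournicelittletable GROUP BY a;
-- the Extra column should say "Using index for group-by"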

I built a bigger table with >200k rows: a unique field a, plus 233 groups, specified in b, with 1000 values in c each. One group didn’t have 1000 values, so I tried to find which one:

mysql> select b,max(c),min(c) from tttable group by b  having max(c)<1000;
+------+--------+--------+
| b    | max(c) | min(c) |
+------+--------+--------+
|  233 |     83 |      1 |
+------+--------+--------+
1 row in set (0.01 sec) (no query cache :)

Then I tried the same on a different engine, which didn’t support loose scans, and the query took 0.2s. I repeated the executions on both engines, so that data would be cached by the OS, but the speed difference stayed quite noticeable. Of course, we can check what was done in there. The 0.01s query was executed this way:

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: tttable
         type: range
possible_keys: NULL
          key: b_2
      key_len: 5
          ref: NULL
         rows: 471
        Extra: Using index for group-by

Please note the ‘Extra’ line, and how many rows were read. As intended, two rows from each group: one for min() and one for max().
The 0.2s query, on the other hand, told me it was reading 230k rows – it actually decided that an in-memory table would be faster than reading the index, and scanned the whole table. Another operation I tried was SELECT DISTINCT b FROM tttable. MySQL 5.0 executed the query in 0.01s, whereas the other engine was again more than 15 times slower, just because it had to deal with more rows.

Conclusion: loose index scans help. Take a look at the manual.