linux memory management for servers

We’ve been learning for many years how to run Linux for databases, but over time we realized that many of our lessons learned apply to many other server workloads. Generally, a server process will have to interact with network clients, access memory, do some storage operations and do some processing work – all under the supervision of the kernel.

Unfortunately, from what I learned, there’re various problems in pretty much every area of server operation. By keeping the operational knowledge in narrow camps we did not help others. Finding out about these problems requires quite intimate understanding of how things work and slightly more than beginner kernel knowledge.

Many different choices could be made by doing empirical tests, sometimes with outcomes that guide or misguide direction for many years. In our work we try to understand the reasons behind the differences we observe when randomly poking at a problem.

In order to qualify and quantify operational properties of our server systems we have to understand what we should expect from them. If we build a user-facing service where we expect sub-millisecond response times from individual parts of the system, great performance from all of the components is needed. If we want to build a high-efficiency archive and optimize data access patterns, any non-optimized behavior will really stand out. A high-throughput system should not operate at low throughput, etc.

In this post I’ll quickly glance over some areas in memory management that we found problematic in our operations.

Whenever you want to duplicate a string or send a packet over the network, that has to go via allocators (e.g. some flavor of malloc in userland or SLUB in the kernel). Over many years the state of the art in userland has evolved to support all sorts of properties better – memory efficiency, concurrency, performance, etc – and some of the added features were there to avoid dealing with the kernel too much.

Modern allocators like jemalloc have per-thread caches, as well as multiple memory arenas that can be managed concurrently. In some cases the easiest way to make kernel memory management easier is to avoid it as much as possible (jemalloc can be made much greedier and not give memory back to the kernel via the lg_dirty_mult setting).
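
For illustration, this is roughly how that knob could be applied from the outside – a sketch assuming an older jemalloc build (lg_dirty_mult was replaced by decay-based options in jemalloc 5) and a hypothetical server binary:

import os, subprocess

# Ask jemalloc to keep dirty pages around instead of returning them to the
# kernel; lg_dirty_mult:-1 disables purging entirely in older jemalloc versions.
env = dict(os.environ, MALLOC_CONF="lg_dirty_mult:-1")
subprocess.check_call(["./my-server"], env=env)  # hypothetical binary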

Just hinting to the kernel that you don’t care about page contents gets them immediately taken away from you. Once you want them back, even if nobody else used the page, the kernel will have to clean it for you, shuffle it around multiple lists, etc. Although that is considerable overhead, it is far from the worst that can happen.
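
A tiny illustration of that hint, assuming Python 3.8+ on Linux:

import mmap

# Map 16MB anonymously, fault it in, then tell the kernel the contents are
# disposable.  The pages can be reclaimed immediately; the next access gets
# freshly zeroed pages back only after the kernel has faulted them in again.
buf = mmap.mmap(-1, 16 * 1024 * 1024)
buf[:] = b"x" * len(buf)
buf.madvise(mmap.MADV_DONTNEED)
_ = buf[0]   # triggers the fault-in of a clean page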

Your page can be given to someone else – for example, the file system cache, some other process, or the kernel’s own needs like the network stack. When you want your page back, you can’t take it from all these allocations that easily, and your page has to come from the free memory pool.

The Linux free memory pool is something that probably works better for desktops and batch processing than for low-latency services. It is governed by the vm.min_free_kbytes setting, which has very scarce documentation and even scarcer resource allocation – on a 1GB machine you can find yourself with 5% of your memory kept free, but there are caps on it at 64MB when it is autosized on large machines.

Although it may seem that all this free memory is a waste, one has to look at how the kernel reclaims memory. This limit sets how much to clean up, but not when to trigger background reclamation – that happens only 25% above the limit – so the pool that can be used for instant memory allocation is a measly 16MB – just two userland stacks.
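
A back-of-the-envelope sketch of that watermark math (assuming an older kernel without vm.watermark_scale_factor, where the low watermark sits at min + min/4):

# Read the floor and derive the kswapd wakeup point; the gap between the two
# is all the cushion you get for instant allocations before direct reclaim.
with open("/proc/sys/vm/min_free_kbytes") as f:
    min_kb = int(f.read())

low_kb = min_kb + min_kb // 4    # background reclaim (kswapd) starts below this
print("min %d MB, kswapd wakes below %d MB, cushion %d MB"
      % (min_kb // 1024, low_kb // 1024, (low_kb - min_kb) // 1024))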

Once you exhaust the free memory limit the kernel has to go into “direct reclaim” mode – it will stall your program and try to get memory from somewhere (thanks, Johannes, for the vmscan/mm_vmscan_direct_reclaim_begin hint). If you’re lucky, it will drop some file system pages; if you’re less lucky it will start swapping, putting pressure on all sorts of other kernel caches, possibly even shrinking TCP windows and what not. Understanding what the kernel will do in direct reclaim has never been easy, and I’ve observed cases of systems going into multi-second allocation stalls where nothing seems to work and fancy distributed-systems failover can declare the node dead.

Obviously, raising free memory reserves helps a lot, and on various servers we maintain a 1GB free memory pool just because the low watermark is too low otherwise. Johannes Weiner from our kernel team has proposed a tunable change in behavior there. That still requires teams to understand the implications of free memory needs and not run with defaults.

Addressing this issue gets servers into much healthier states, but doesn’t always help with memory allocation stalls – there’s another class of issues that has been addressed lately.

I wrote about it before – the kernel has all sorts of nasty behaviors whenever it can’t allocate memory, and certain memory allocation attempts are much more aggressive – atomic contiguous allocations end up scanning (and evicting) many pages because the kernel can’t find readily available contiguous segments of free memory.

These behaviors can lead to an unpredictable chain of events – sometimes TCP packets arrive and are forced to wait until some I/O gets done, as memory compaction ended up stealing dirty inodes or something like that. One has to know the memory subsystem much better than I do in order to build beautiful reproducible test cases.

This area can be addressed in multiple ways – one can soften the allocation needs of various processes on the system (do iptables really need a 128k allocation for an arriving segment to log it via NFLOG to some user-land collection process?), it is also possible to tweak kernel background threads to keep memory less fragmented (like a cronjob I deployed many years ago, sketched below), or, of course, get the memory reclamation order into decent shape instead of treating it as a black box that “should work for you unless you do something wrong” (like using the TCP stack).
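
One blunt way to keep memory less fragmented – an assumption of what such a cronjob might do, on kernels built with CONFIG_COMPACTION – is to periodically ask the kernel to compact all zones so that higher-order allocations find contiguous pages:

# Trigger global compaction (equivalent to: echo 1 > /proc/sys/vm/compact_memory)
with open("/proc/sys/vm/compact_memory", "w") as f:
    f.write("1")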

Some of our quick changes (like net: don’t wait for order-3 page allocation) were addressing this on a case-by-case basis, but it was amazing to see this kind of optimization pulled in to cover many, many more allocations via a wide-reaching change (mm/slub: don’t wait for high-order page allocation). From my experience, this addresses a huge class of reliability and stability issues in Linux environments and makes system behavior way more adaptive and fluid.

There are still many gray areas in the Linux kernel and the desktop direction may not always help in addressing them. I have test cases where the kernel is only able to reclaim memory at ~100MB/s (orders of magnitude away from RAM performance) – and what these test cases usually have in common is “this would happen on a server but never on a desktop”. For example, if your process writes a [transaction] log file and you forget to remove it from cache yourself, Linux will thrash on the inode mutex quite a bit.

There’re various zone reclaim contract violations that are easy to trigger with simple test cases – those test cases definitely expose edge behaviors, but many system reliability issues we investigate in our team are edge behaviors.

In the database world we exacerbate these behaviors when we bypass various kernel subsystems – memory is pooled inside the process, files are cached inside the process, threads are pooled inside the process, network connections are pooled by clients, etc. The kernel ends up being so dumb that it breaks on simple problems like ‘find /proc’ (the directory entry cache blows up courtesy of the /proc/X/task/Y/fd/Z explosion).

Although cgroups and other methods allow constraining some sets of resources within various process groups, that doesn’t help when a shared kernel subsystem goes into overdrive.

There’re also various problems with memory accounting – although the kernel can quickly report how many dirty file system pages it has, it doesn’t give equal treatment to the network stack. Figuring out how much memory is in socket buffers (and how full those buffers are) is a non-trivial operation, and on many of our systems we will have much more memory allocated to the network stack than to many other categories in /proc/meminfo. I’ve written scripts that pull socket data from netlink and try to guess what the real memory allocation is (it is not straightforward math) to produce a very approximate result.
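
For a much cruder but quick lower bound, the per-protocol “mem” counters in /proc/net/sockstat (counted in pages, and not covering all the skb overhead) can be summed up – a small sketch:

import resource

# Print an approximate amount of memory sitting in protocol buffers.
page = resource.getpagesize()
with open("/proc/net/sockstat") as f:
    for line in f:
        fields = line.split()
        if "mem" in fields:
            proto = fields[0].rstrip(":")
            pages = int(fields[fields.index("mem") + 1])
            print("%s buffers: ~%.1f MB" % (proto, pages * page / 1048576.0))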

Lack of proper memory attribution and accounting has been a regular issue – in 3.14 a new metric (MemAvailable) was added, which sums up part of the cache and reclaimable slab, but if you pay more attention to it, there’s lots of guessing about whether your cache or slab is actually reclaimable (or what the costs of reclaiming it are).

Currently when we want to understand what is cached, we have to walk the file system, map the files and use mincore() to get a basic idea of our cache composition and age – and only then can we tell that it is safe to reclaim pages from memory. Quite a while ago I wrote a piece of software that removes files from cache (nowadays vmtouch does the same).
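
A rough sketch of that mincore() walk for a single file, assuming Python 3 on Linux (the real tooling walks whole directory trees and also looks at page age):

import ctypes, mmap, os

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def cached_pages(path):
    """Return (resident, total) page counts for one file, via mincore()."""
    size = os.path.getsize(path)
    if size == 0:
        return 0, 0
    with open(path, "rb") as f:
        # ACCESS_COPY gives a writable buffer object, which ctypes needs below,
        # while still mapping the file's page cache pages.
        m = mmap.mmap(f.fileno(), size, access=mmap.ACCESS_COPY)
    pages = (size + mmap.PAGESIZE - 1) // mmap.PAGESIZE
    vec = (ctypes.c_ubyte * pages)()
    buf = ctypes.c_char.from_buffer(m)        # page-aligned address of the mapping
    rc = libc.mincore(ctypes.c_void_p(ctypes.addressof(buf)),
                      ctypes.c_size_t(size), vec)
    err = ctypes.get_errno()
    del buf                                    # release the export so close() works
    m.close()
    if rc != 0:
        raise OSError(err, "mincore failed")
    return sum(b & 1 for b in vec), pages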

Nowadays on some of our systems we have much more complicated cache management. Pretty much every buffered write that we do is followed by asynchronous cache purge later so that we are not at the mercy of the kernel and its expensive behaviors.
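
A minimal sketch of that write-then-purge pattern, assuming Python 3 on Linux (in reality the purge runs asynchronously, not inline with the write path); the fsync() before the fadvise matters, since dirty pages would otherwise survive the purge:

import os

def write_and_purge(path, data):
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)                                         # make the pages clean first
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)   # then drop them from cache
    finally:
        os.close(fd)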

So, you either have to get the kernel entirely out of your way and manage everything yourself, or blindly trust whatever is going on and lose efficiency along the way. There must be a middle ground somewhere, hopefully, and from time to time we move in the right direction.

In the desktop world you’re not supposed to run your system 100% loaded or look for those 5% optimizations and 0.01% probability stalls. In massively interconnected service fabrics we have to care about these areas and address them all the time, and as long as these kinds of optimizations reach a wider set of systems, everybody wins.

TL;DR: upgrade your kernels and bump vm.min_free_kbytes :-)

on io scheduling again

Most database engines have to deal with the underlying layers – operating systems, device drivers, firmware and physical devices – albeit different camps choose different methods.
In the MySQL world people believe that InnoDB should be handling all the memory management and physical storage operations – maximized buffer pool space, adaptive/fuzzy flushing, crash recovery getting faster, etc. That can result in lots of efficiency wins, as managing everything with the data problem in mind allows tuning for efficiency and performance.

Other storage systems (though I hear it from engineers working on different types of problems too) like PostgreSQL or MongoDB consider the OS to be much smarter and let it do the caching or buffering. Which means that in top Postgres expert presentations you will hear much more about operating systems than in MySQL talks. This results in OS knowledge attrition in the MySQL world (all you have to know is “use O_DIRECT, XFS and the deadline scheduler”), yet Linux virtual memory behaviors and tuning are a constant issue wherever the OS is allowed to cache or buffer.

This leads to very interesting behaviors – both the PostgreSQL and MongoDB worlds have to deal with write starvation at checkpoint spikes – where no other operations can happen while data from dirty buffers is being written out to storage media. To help alleviate that, instead of aiming at better-managed I/O, they aim to keep fewer dirty pages in memory altogether. For example, in the Mongo world you will hear the tuning recommendation to write out dirty pages once a second rather than once a minute (thus causing shorter stalls every second rather than the system going dark every minute). Writing out dirty pages every second means that there is not much write merging going on and the system is consuming expensive flash write cycles much faster than needed.

MySQL (and Flashcache) have demand-driven checkpointing – pages are written out either when they are at the end of the LRU and need to make space for fresh pages, or when too much log capacity is used and pages referred to in the oldest log events have to be synced. I prefer demand-driven checkpointing much more, as it means the system adapts to the workload and optimizes towards efficiency rather than performance workarounds.

So once I saw insert operations stalling for nearly a second in my tests on super-capable modern hardware (PCI-E flash, 144GB of RAM, etc), I started looking more at what can be done to eliminate such stalls. In order to be both efficient and stall-free one has to do two things – reduce the number of writes and prioritize the important ones first. This is where the concept of “I/O scheduling” first comes to mind. As there’s lots of reliance on the OS to do the right work, I looked at what exactly is being done.

From a general perspective, a database workload on the I/O stack looks like this:

[diagram: buffered block layer]

Alright, I probably could’ve done a better job at making a diagram, but my main observation here is that the block layer has no idea what files you are talking to – it only sees page reads and writes coming from various sources (threads/processes), all hitting a single block device. There are some other interesting issues. For example, a dirty page write is attributed not to the thread that modified the page, but to the one that decided to flush it (so it is either some variant of pdflush, or a userland thread calling fsync() or msync()).

So, to properly schedule things we need to inform the operating system better about our intentions. The standard in the database world is the deadline scheduler, which separates reads and writes into separate queues, and that’s about it – it does not try to distinguish different types of writes coming from different sources. CFQ is much more complicated and allows putting different threads into different classes (realtime, best effort or idle) and priorities. Unfortunately, even if I attribute workloads correctly with I/O scheduling properties, I hit another issue:

[diagram: buffered block layer, with the file system log shared by all writers]

Of my two threads, the one that has to get through ASAP in order not to stall system operations is actually stalling – not because it cannot write its data soon enough, but because a shared resource (the file system journal) is being written (and delayed by the scheduler) on behalf of idle or best-effort workloads.

There are two ways to deal with this: one is instructing the file system to behave nicely and write its own journal with realtime scheduling, as otherwise it will stall other parts; the other, much simpler one, is putting the transaction journal onto a separate file system (I can hear millions of “I told you so” voices, all forgetting that e.g. MongoDB does not even have a configuration option for where to put the journal and you have to hack your way around with symbolic links).

Obviously, putting it on a completely independent device would also make I/O scheduling a non-issue, but having multiple independent devices costs money, so we still have to think about how to schedule things properly.

One would think that CFQ should suddenly be much more appealing as it allows specifying workload properties, but the way it behaves is not exactly predictable. What one needs is a much more straightforward rule set – don’t starve logs, don’t starve reads, allow other stuff to be drained eventually.
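
For completeness, this is roughly how workloads get attributed to CFQ classes from userland – a sketch assuming the psutil package and a block device actually running cfq; as noted above, even with correct attribution the shared-journal problem remains:

import psutil

me = psutil.Process()
# Log writer: realtime class, highest priority (needs root / CAP_SYS_ADMIN).
me.ionice(psutil.IOPRIO_CLASS_RT, value=0)
# A background flusher/checkpoint thread would instead use:
# me.ionice(psutil.IOPRIO_CLASS_IDLE)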

Technically, some scheduling decisions could simply be made by the logical block addresses that get accessed (e.g. the file system journal or the DBMS transaction log) – and not by which thread is doing the I/O – but interfaces to do that are nonexistent at the moment.

There has been some interesting I/O scheduler development at Taobao – they created a FusionIO-oriented, cgroups-capable “tiny parallel proportion scheduler” for their MySQL workloads, which makes lots of sense in multi-tenant environments, but doesn’t yet address the problem of I/O starvation within the same process that MongoDB has.

Currently, if one wants to run Mongo or PG with perfect p99 (or pmax) behaviors, CFQ does not provide decent guarantees even with separate filesystems, and deadline is too limited in its scope. There should be more innovation in what user-land / file system / block layer cooperation should look like, rather than assuming that throwing hardware at the problem (or ignoring bad quality) is good enough.

That may be useful in the InnoDB world some day too – I have seen issues where asynchronous batch write or read-ahead I/O coming from many threads was capable of starving other workloads.

I tried an easy way out in some of the cases – currently MongoDB flushes one file at a time sequentially by calling msync() – which means up to 2GB flushes with the default configuration or 512MB with the “smallfiles” option. What one needs is much more predictable behavior, such as “flush 10MB at a time, wait for that to complete”. As it is not tracked internally which pages are dirty and which are not, there is no way to provide that kind of throttling with current OS interfaces.
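
Mechanically, the “flush 10MB at a time” idea would look roughly like the sketch below (Python 3 on Linux, hypothetical file path); note that without dirty-page information each chunk only bounds the address range, not the amount of dirty data, which is exactly the limitation described above:

import mmap, os

CHUNK = 10 * 1024 * 1024   # sync ~10MB of address space per call

def gentle_msync(path):
    size = os.path.getsize(path)
    with open(path, "r+b") as f:
        m = mmap.mmap(f.fileno(), size)
        for off in range(0, size, CHUNK):
            m.flush(off, min(CHUNK, size - off))   # blocking msync() of one chunk
            # a real implementation would pace/sleep here between chunks
        m.close()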

Though MongoDB already uses the mincore() system call to check whether a page is cached in memory, there is no way yet to find out whether a page is actually dirty (there have been LKML threads about providing that information). If there was an easy interface to get maps of dirty data from the OS, database software would be able to schedule less aggressive writes without relying upon perfect I/O scheduler behavior.

P.S. Yes, I just blogged about MongoDB performance, albeit Mark Callaghan has been doing it for a while.

On binlogs and datacenters

Once MySQL is deployed inside a datacenter environment (i.e. forms a cloud ;-), replication becomes a major feature. It is used to maintain hot copies, standby copies, read-only copies, invalidate external systems, replicate to external systems, etc. If this functionality is broken, the datacenter is broken – components are not synchronized anymore, invalidations are not done, data is not consistent.

From a performance perspective, replication not working properly results in unusable slaves, so load cannot be spread. This results in higher load on other machines, including the master (especially the master, if the environment needs stronger consistency guarantees).

Judging by replication’s importance in MySQL deployments, it should attract performance engineering as much as InnoDB and other critical pieces. Though slave replication performance is being improved in 5.6, the master side is not (well, group commit may help a bit, but not as much).

The major problem here is LOCK_log. It is a mutex that guards all MySQL log files (Update: that sentence is wrong – it is a per-log-type mutex, not shared among different log types, as it is a class private variable) (general query log, binary log, etc) – only one read or write can happen to any of them. Other logs aside (their impact can be minimized), the binary log finds this mutex extremely problematic, because:

  • It stalls COMMITs
  • It stalls slave binary log reads

Though one can argue that once there are fewer commits, there will be less binary log and that will result in some sort of equilibrium. Unfortunately, one doesn’t get anywhere close to decent efficiency, as there are too many situations where LOCK_log is held for way too long:

  • Slow binary log read (if they already got purged out of cache, highly likely)
  • Slow binary log write (if controller cache is saturated by page flush workload)
  • Slow memory allocation for larger events (on systems with malloc scalability issues)

Saturating a RAID controller cache isn’t that difficult – and even if one dedicates separate disks to a separate log disk array (what a waste!), it doesn’t mean that off-the-shelf hardware will support supplying a separate cache for that disk array. Without a separate dedicated cache or high-priority log processing one ends up blocking until disks are available for a write.

Read efficiency is a concern here too – MySQL uses a statically sized 8k buffer for all binary log reads, which means that on a highly pressured system read-ahead may not work, and there will be way more I/O operations/seeks while holding a precious mutex. Even if the operation to read the 8k page from memory buffers is very fast, slave threads will be forced to wait for other, much slower operations (a lagging slave reading yesterday’s data? high write workload?) to finish.

If binary log writes are on a separate device one can inspect the utilization on iostat. Usually there will be only one write or read outstanding, so the busy percentage shows really well how contended this resource is. A double-digit number already says that operations aren’t instant, and replication slaves can be subjected to log starvation, that is – not replicating.
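
iostat gets those numbers from /proc/diskstats; a small sketch of the same busy-percentage check, with a hypothetical device name:

import time

def busy_percent(dev="sdb", interval=1.0):
    """Approximate %util for a block device, computed the way iostat does."""
    def io_ticks():
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == dev:
                    return int(fields[12])   # milliseconds spent doing I/Os
        raise ValueError("device not found: " + dev)
    before = io_ticks()
    time.sleep(interval)
    after = io_ticks()
    return 100.0 * (after - before) / (interval * 1000.0)

print("binlog device busy: %.1f%%" % busy_percent())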

From the operations side, the solution to this is somehow obtaining a dedicated cache or a fast device for binary log writes. One can simply disable the write cache for the InnoDB data partition and use lots of parallel InnoDB page write threads, so that they win the fight against read threads. Using flashcache or similar technology to absorb writes can be beneficial in this case.

On the other hand, MySQL can be changed not to have a giant log mutex – binary logs are append-only structures, and just guarding the tail and rotation events could be enough, with an additional one or two atomic variables to make sure that readers are fine.

Also, having compressed binlogs on disk would reduce the I/O pressure caused by writes or reads, and there’s a possible gain from not having to recompress logs for each remote reader.

Then datacenter will be happy, and datacenter users even happier.

Stonebraker trapped in Stonebraker 'fate worse than death'

Oh well, I know I shouldn’t poke directly at people, but they deserve it sometimes (at least in my very personal opinion). Heck, I even gave it a 12h window so this is not a hot-headed opinion.

Those who followed MySQL at Facebook development probably know how much we focus on actual performance on top of mixed-composition I/O devices (flashcache, etc) – not just retreating to the comfortable zone of in-memory (or in-pure-flash) data.

I feel somewhat sad that I have to put this truism out here – disks are way more cost efficient, and if used properly can be used to facilitate way more long-term products, not just real time data. Think Wikipedia without history, think comments that disappear on old posts, together with old posts, think all 404s you hit on various articles you remember from the past and want to read.

Building the web that lasts is completely different task from what academia people imagine building the web is.

I already had this issue with another RDBMS pioneer (there’s something in common among top database luminaries) – he also suggested that disks are a thing of the past and now everything has to be in memory, because memory is cheap. And data can be whatever unordered clutter, because CPUs can sort it, because CPUs are cheap.

They probably missed Al Gore’s message. Throwing more and more hardware at a problem without tuning for actual operational efficiency requirements is wasteful and harms our planet. Yes, we do lots of in-memory efficiency work, so that we reduce our I/O, but at the same time we balance the workload so that the I/O subsystem provides as efficient as possible delivery of the long tail.

What happens in real world if one gets 2x efficiency gain? Twice more data can be stored, twice more data intensive products can be launched.
What happens in academia of in-memory databases, if one gets 2x efficiency gain? A paper.
What happens when real world doesn’t read your papers anymore? You troll everyone via GigaOM.

Though sure, there’s some operational overhead in handling sharding and availability of MySQL deployments, at large scale it becomes a somewhat constant cost, whereas operational efficiency gains are linear.

Update: Quite a few people pointed out that I was dissing a person who has made an incredible amount of contributions, or that I’m anti-academia. I’m not, and I greatly value any work that people do wherever they are, albeit I do apply critical thinking to whatever they say.

In my text above (I don’t want to edit and hide what I said) I don’t mean that “a paper” is useless. Me and my colleagues do read papers and try to understand the direction of computer science and how it applies to our work (there are indeed various problems yet to solve). I’d love to come up with something worth a paper (and quite a few of my colleagues did).

Still, if someone does not find that direction useful, there’s no way to portray them the way the original GigaOM article did.

InnoDB locking makes me sad

Vadim and others have pointed at the index->lock problems before, but I think they didn’t do a good enough job of pointing out how bad it can get (the actual problem was hidden somewhere as some odd edge case). What ‘index lock’ means is essentially that InnoDB has table-level locking which will kill performance on big tables miserably.

InnoDB is a huge pie of layers that have various locking behaviors, are layered on top of each other, and are structured nicely as subdirectories in your innodb_plugin directory. Low-level storage interfaces are done via os/ routines, then on top of that there’s the file space manager, fsp/, which allocates space for btr/ to live in, where individual page/ entities live, with multiple row/ pieces. There’re a few other subsystems around that got quite some attention lately – e.g. the buf/ pool, the transaction log/, and large trx/ transactions are composed of micro transactions living in mtr/.

If you live in memory, you care about buffer pool and transaction log performance; if you write insane amounts of data to in-memory buffers you hit mtr/ problems and depend on how fast you can write out log/ or flush out buf/. If you are in I/O-heavy land, most of the stuff you care about happens in btr/.

Generally InnoDB is quite good about read scalability in I/O-bound environments – nowadays one can saturate really fast I/O devices and there will be plenty of parallel reads done. The major scalability problem in this field was read-ahead, which was funneling all read-ahead activity into a small set of threads, but other than that there can be hundreds of parallel reads issued to underlying devices. The situation changes when writes are added to the mix, though again, there’re a few different scenarios.

There’re two ways for InnoDB to write out updates to pages, “optimistic” and “pessimistic”. Optimism here means that only an in-page (page/row) operation will be needed without changing the tree structure. In that case you can expect quite high parallelism – multiple pages can be read for that operation at a time, multiple of them can be edited at a time, then some serialization will happen while writing out changes to the redo log and undo segments. Expect good performance.

The much worse case is when the B-Tree is supposed to be reorganized and multiple page operations can happen; that’s pessimism. In this case the whole index gets locked (via a read-write lock obtained from dict/), then the B-Tree path is latched, then changes are done, then it is all unlocked until the next row operation needs to hit the tree. Unfortunately, both ‘path is latched’ and ‘changes are done’ are expensive operations, and not only in-core – they do synchronous page read-ins, one at a time, which on busy systems serving lots of read load are bound to be slow. Ironically, as no other operations can happen on the table at that time, you may find out you have spare I/O capacity.. ;-)

What gets quite interesting though is the actual operation needed to latch the B-tree path. Usual wisdom would say that if you want to change a row (read-modify-write), you probably looked up the page already, so there won’t be I/O. Unfortunately, InnoDB uses a slightly more complicated B-tree variant, where pages have links to their neighbors, and tree latching does this (a bit simplified for reading clarity):


/* x-latch also brothers from left to right */
get_block = btr_block_get(space, zip_size, left_page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, page_no, RW_X_LATCH, mtr);
get_block = btr_block_get(space, zip_size, right_page_no, RW_X_LATCH, mtr);

So, essentially in this case, just because InnoDB is being pessimistic, it reads the neighboring blocks to lock them, even if they may not be touched/accessed in any way – and bloats the buffer pool at that time with triple reads. It doesn’t cost much if the whole tree fits in memory, but it is doing three I/Os here, if we’re pessimistic about InnoDB being pessimistic (and I am). So, this isn’t just a locking problem – it is also a resource consumption problem at this stage.

Now, as the dictionary lock is held in write mode, not only updates to this table stop, but reads too – think MyISAM kind of stop. Of course, this ‘table locking’ happens at an entirely different layer than in MyISAM. In MyISAM it is statement-length locking, whereas in InnoDB this lock is held just for a row operation on a single index, but if a statement is doing multiple row operations it can be acquired multiple times.

Probably there exist decent workarounds if anyone wants to tackle this – grabbing read locks on the tree while reading pages into the buffer pool, then escalating the lock to exclusive. A somewhat bigger architectural change would be allowing locks to be grabbed on neighbors (if they are needed) without bringing the page data into memory – but that needs the InnoDB overlords to look at it. Talk to your closest MySQL vendor and ask for a fix!

How do regular workloads hit this? The larger your records are, the more likely you are to have tree changes, and the lower your performance will be. In my edge case I was inserting 7k-sized rows – even though my machine had multiple disks, once the dataset fell out of the buffer pool, it couldn’t insert more than 50 rows a second, even though there were many disks idle and the capacity gods cried. It gets worse with out-of-page blobs – then every operation is pessimistic.

Of course, there’re ways to work around this – usually by taking the hit of sharding/partitioning (this is where common wisdom of “large tables need to be partitioned” mostly comes from). Then, like with MyISAM, one will have multiple table locks and there may be some scalability then.

TL;DR: the InnoDB index lock is a major architectural performance flaw, and that is why you hear that large tables are slower. There’s a big chance that there’re more scalable engines for on-disk writes out there, and all the large InnoDB write/insert benchmarks were severely hit by this.

Update: Filed bugs #61735 and #61736 with MySQL

Logs memory pressure

Warning, this may be kernel version specific, albeit this kernel is used by many database systems

Lately I’ve been working on getting more memory used by the InnoDB buffer pool – besides obvious things like the InnoDB memory tax there were seemingly external factors that were pushing MySQL out into swap (even with swappiness=0). We were working a lot on picking low-hanging fruit like scripts that use too much memory, but those seem to be mostly gone, yet MySQL still has way too much memory pressure from outside.

I grabbed my uncache utility to assist with the investigation and started uncaching various bits on two systems: one that had a larger buffer pool (60G), which was already being sent to swap, and a conservatively allocated (55G) machine, both 72G boxes. Initial findings were somewhat surprising – apparently on both machines most of the external-to-mysqld memory was consumed by two sets of items:

  • binary logs – write once, read only tail (sometimes, if MySQL I/O cache cannot satisfy) – we saw nearly 10G consumed by binlogs on conservatively allocated machines
  • transaction logs – write many, read never (by MySQL), buffered I/O – full set of transaction logs was found in memory

It was remarkably easy to get rid of binlogs from cache, both by calling out ‘uncache’ from scripts, or using this tiny Python class:

import ctypes

libc = ctypes.CDLL("libc.so.6")

class cachedfile (file):
    FADV_DONTNEED = 4
    def uncache(self):
        # hint the kernel that cached pages of this file are no longer needed
        libc.posix_fadvise(self.fileno(), 0, 0, self.FADV_DONTNEED)
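
Usage is as simple as it looks (hypothetical path):

binlog = cachedfile("/var/lib/mysql/binlog.000123", "rb")
data = binlog.read()   # serial re-read by whatever ships the log
binlog.uncache()       # then immediately give the pages back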

As it was a major memory stress source, it was somewhat a no-brainer that binlogs have to be removed from cache – something that can be serially re-read is taking space away from a buffer pool which avoids random reads. It may even make sense to call posix_fadvise() right after writes to them.

Transaction logs, on the other hand, are an entirely different beast. From a MySQL perspective they should be uncached immediately, as nobody ever ever reads them (crash recovery aside, but re-reading them is relatively cheap, as no writes or random reads are done during the log read phase). Unfortunately, the problem lies way below MySQL, and thanks to PeterZ for reminding me (we had a small chat about this at Jeremy’s Silicon Valley MySQL Meetup).

MySQL transaction records are stored in multiple log groups per transaction, then written out as per-log-group writes (each in multiples of 512 bytes), followed by fsync(). This allows the FS to do a transaction log write as a single I/O operation. It also means that it will be doing partial page writes to buffered files – overwriting existing data in part of the page – so the page has to be read from storage first.

So, if all transaction log pages are removed from cache, quite a few of them will have to be read back in (depending on the sizes of transactions, probably all of them in some cases). Oddly enough, when I tried to hit the edge case, single-thread transactions-per-second remained the same, but I saw consistent read I/O traffic on the disks. So, this would probably work on systems that have spare I/O (e.g. flash-based ones).

Of course, as writes are already in multiples of 512 (and it appears that memory got allocated just fine), I could try out direct I/O – it should avoid the page read-in problem and not cause any memory pressure by itself. In this case switching InnoDB to use O_DIRECT was a bit dirtier – one needs to edit source code and rebuild the server, restart, etc, or…

# lsof ib_logfile*
# gdb -p $(pidof mysqld)
(gdb) call os_file_set_nocache(9, "test", "test")
(gdb) call os_file_set_nocache(10, "test", "test")

I did not remove the fsync() call, but as it is somewhat of a no-op on O_DIRECT files, I left it there; it would probably change benchmark results, but not by much.

Some observations:

  • O_DIRECT was ~10% faster in the best-case scenario – lots of tiny transactions in a single thread
  • If group commit is used (without binlogs), InnoDB can have way more transactions with multiple threads using buffered I/O, as it does multiple writes per fsync
  • Enabling sync_binlog makes the difference not that big – even with many parallel writes direct writes are 10-20% slower than buffered ones
  • Same for innodb_flush_log_at_trx_commit=0 – multiple writes per fsync are much more efficient with buffered I/O
  • One would need to do log group merge to have more efficient O_DIRECT for larger transactions
  • O_DIRECT does not have a theoretical disadvantage; the current deficiencies are just an implementation oriented at buffered I/O – and can be resolved by (in some areas – extensive) engineering
  • YMMV. In certain cases it definitely makes sense even right now, in some other – not so much

So, the outcome here depends on many variables – with flash, read-on-write is not as expensive, especially if read-ahead works. With disks one has to see what the better use of the memory is – using it for the buffer pool reduces the amount of data reads, but causes log reads. And of course, O_DIRECT wins in the long run :-)

With this data moved away from cache and the InnoDB memory tax reduced, one could switch from using 75% of memory to 90% or even 95% for InnoDB buffer pools. Yay?

on performance stalls

We quite often say that benchmark performance is usually different from real-world performance – so performance engineering usually has to cover both. Benchmarks allow understanding sustained performance bottlenecks, and real-world analysis usually concentrates on something that would be considered ‘exceptional’ and unimportant in benchmarks – stalls of various kinds. Those are extremely important, as the state when our performance is lowest is the performance we provide to our platform users.

On a machine that is doing 5000qps, stalling for 100ms means that 500 queries were not served as fast as they could, or even hit application timeouts or exceptional MySQL conditions (like 1023 transaction limit). Of course, stalling for a second means 5000 queries were not served in time…

We have multiple methods to approach this – one is our ‘dogpiled’ framework – an agent doing status polling every second and reporting information about I/O state, MySQL/InnoDB statuses, processlists, etc – so we see the scope of stalls in our environment. We try to maintain the threshold between complete information overload and something that reveals problems – so it is always a balancing act, especially with great work done by the engineering team :)

The other approach, usually led to by dogpiled information, is auto-PMP – high-frequency status polling combined with gdb invocations, which allow us to jump into the process whenever we notice something weird is going on. We have some extensions to how we use PMP – but that’s worth another post.

Issues we do find out that harm us most in production environments are ones that are quite often discarded as either “this never happens” or “get better hardware” or “your application is wrong”. Unfortunately, that happens, we do have thousands of machines that aren’t free and our application demands are our application demands :)

Few examples:

  • TRUNCATE stalls the server (oh well, DROP TABLE too) – in this case, truncating a table grabs the dictionary mutex, another transaction blocks while holding LOCK_open, and everything else stops. Though truncating is supposed to be a fast operation, it has to unlink (delete) a file, and with large files such an operation isn’t really instant on any filesystem. Even if one deletes all the data before truncating, the file is still on the filesystem.
  • Extending data files stalls the server – when a data file is being extended, a global mutex is held, which blocks all I/Os (with limited concurrency that is a full server stall). Somewhat more impressive with file-per-table. This is the major reason for mini-stalls at the moment – on machines that grow at a gigabytes-a-day rate this is hit quite often.
  • Updating table statistics stalls the server – we hit this with high-performance task-tracking machines, row churn there is quite amazing, and dictionary statistics are reread more often than one would expect. Updating statistics means locking the table while doing random reads from disk. Once a major workload is hitting that table, it quickly escalates to a full server stall
  • Fuzzy checkpoint stalls the server – this is one of the biggest issues outstanding in stock MySQL – though one would expect that a “fuzzy checkpoint” using async background threads is nonblocking, actually all writes during it will stall, taking all concurrency slots and leading to a server stall. Mark’s fix was just doing this work in a background thread.
  • (no bug filed on this yet) – Purge stalls the server – purge holds the dictionary lock while doing random reads from disk, with a table stall leading to a server stall.

There’re more issues (mostly related to heavier in-memory activities of the server), but these are the most obvious ones – where a single I/O request is escalated to a table or instance lockup, during which no other work is done. Our machines have multiple disks, multiple CPUs and can support multiple SQL queries being executed at once, so any of these lockups effectively limits our available performance or damages the quality of service we can provide.

On the upside, my colleagues are absolutely amazing and I’m sure that we will have all these issues fixed in our deployment in near future, as well as everyone will be able to pick that up via mysqlatfacebook branch.

Read ahead…

Mark wrote about how to find situations where InnoDB read-ahead is a bottleneck. What he didn’t disclose, though, is his trick to disable read-ahead without restart or recompile of MySQL. See, there’s no internal “disable read ahead knob”. But there is…

buf_read_ahead_random(...){ ...
       if (srv_startup_is_before_trx_rollback_phase) {
                /* No read-ahead to avoid thread deadlocks */
                return(0);
        }

This variable is tested in two functions – buf_read_ahead_linear() and buf_read_ahead_random() – and nowhere else. So yeah, “server startup is before transaction rollback phase” is another way of saying “don’t do read-ahead, please please”.

gdb -ex "set  srv_startup_is_before_trx_rollback_phase=1" \
    --batch -p $(pidof mysqld)

And many servers bottlenecked on this became much much much faster (and 2000 concurrent threads running dropped to 10). Of course, this is most visible in high-latency-high-throughput I/O situations, but we’re hitting this contention spot on local disk setups too.

Don’t forget to have the fix if gdb decides to be nasty and locks up your server :)

On file system benchmarks

I see this benchmark being quoted in multiple places, and there I see stuff like:

When carrying out more database benchmarking, but this time with PostgreSQL, XFS and Btrfs were too slow to even complete this test, even when it had been running for more than an hour for a single run. Between EXT3, EXT4, and NILFS2, the fastest file-system was EXT3 and then its successor, EXT4, was slightly behind that. Far behind the position of EXT4 were NILFS2 and then Btrfs and XFS.

There were a few other benchmarks, e.g. SQLite showed ‘bad performance’ on XFS and Btrfs.

*clear throat*

Dear benchmarkers, don’t compare apples and oranges. If you see differences between benchmarks, do some very very tiny research, and use some intellect, which you, as primates, do have. If database tests are slowest on filesystems created by Oracle (who know some stuff about systems in general) or SGI (who, despite giving away their campus to Google, still have lots of expertise in the field), that can indicate that your tests are probably flawed somewhere, at least for that test domain.

Now, you’ve probably heard of such a thing as ‘data consistency’. That is something the database stack tries to ensure, sometimes at higher cost, like not trusting volatile caches, enforcing certain write orders, depending on acknowledgements from underlying hardware.

So, in this case it wasn’t “benchmarking file systems”, it was simply benchmarking “consistency” against “no consistency”. But don’t worry, most benchmarks have such flaws – getting numbers but not understanding them makes results much more interesting, right?

Oh, and… thanks for a few more misguided people.

Linux 2.6.29

2.6.29 was released. I don’t usually write about Linux kernel releases, that’s what Slashdot is for :), but this one introduces write barriers in LVM, as well as ext4 with write barriers enabled by default. If you run this kernel and forget to turn off barrier support on your filesystems (e.g. XFS’s nobarrier option), you will see nasty performance slowdowns (see my recent post about it). Beware.