
Voldemort backups

At VZnet we use a number of different storage technologies, including both relational and NoSQL databases.  One database that powers a number of our services is Voldemort, an open-source key-value store developed at LinkedIn.  Among the features we really like are its very high scalability and its automatic partitioning and replication of data across storage nodes.

When I talk about backing up Voldemort, many might question whether that is really needed.  Data is already replicated across nodes: if one node dies or becomes corrupted, the data should safely exist somewhere else in the system.  This is true; with data replication you don’t need to worry about backing up to protect yourself against hardware failure.  However, that’s not the only reason to back up.

All software contains bugs, and our services at VZnet are no exception.  Voldemort itself is no exception either.  We can do our best to eliminate bugs through testing, but they will always pop up, and we need to be prepared to deal with that.  One type of bug that could occur is one that corrupts data.  What if we pushed an update that, though it passed the tests, contained a race condition that under load associated the wrong data with the wrong keys?  Replication won’t help you there; Voldemort will, by design, replicate the data corruption your code has caused across its nodes.  There are many other scenarios where replication won’t help you: all your storage nodes share the same SAN and the SAN overheats, destroying all its disks; or a dangerously under-skilled business owner decides to try to query the database er… key-value store and in the process deletes all your data.  But potential corruption due to bugs in our code was the issue we were most worried about.

So, how do we go about backing up a Voldemort datastore?  The first question to answer is how Voldemort stores its data.  Voldemort has pluggable back-end storage support, the two primary back-ends being MySQL and the Java implementation of Berkeley DB (BDB).  We’re using BDB, so the next question is how to back up a BDB database.  Some answers on web forums point out that since BDB is an append-only database, backups can be done safely by just copying its files.  This is only partially true.  BDB is append-only, but it also has a clean-up thread.  That thread periodically scans the database, removes outdated entries, and compacts the files, which stops the database from growing every time a record is updated.  It also means that if you copy the files while the clean-up thread is running, you risk a corrupt backup.

Fortunately, the Java BDB implementation comes with a tool that can be used to assist with backups.  In short, the tool pauses the clean-up thread, finishes the current file being appended to, and then gives you a list of files that are safe to copy.  It supports incremental backups too, which is great for large databases.
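For the curious, the helper in question is BDB JE’s `DbBackup` class.  A rough sketch of the pattern (the surrounding plumbing here is illustrative; inside Voldemort you would use the `Environment` the server already has open rather than opening your own):

```java
import java.io.File;

import com.sleepycat.je.Environment;
import com.sleepycat.je.EnvironmentConfig;
import com.sleepycat.je.util.DbBackup;

public class BdbBackupSketch {
    public static void main(String[] args) throws Exception {
        // Open the BDB JE environment that holds the store's data files.
        Environment env = new Environment(new File(args[0]), new EnvironmentConfig());

        // DbBackup pauses the clean-up (cleaner) thread and closes off
        // the log file currently being appended to.
        DbBackup backupHelper = new DbBackup(env);
        backupHelper.startBackup();
        try {
            // These files are now immutable and safe to copy with any tool.
            for (String name : backupHelper.getLogFilesInBackupSet()) {
                System.out.println("copy " + name);
            }
            // Remember this number: passing it to the two-argument
            // DbBackup constructor next time yields an incremental backup
            // that skips files you already copied.
            System.out.println("last file: " + backupHelper.getLastFileInBackupSet());
        } finally {
            // Resumes the clean-up thread.
            backupHelper.endBackup();
        }
        env.close();
    }
}
```

The important part is the `startBackup()`/`endBackup()` bracket: only files listed between those two calls are guaranteed not to change underneath your copy.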

My first naive attempt to use this tool was to write a small Java program that ran independently of Voldemort.  This was a failure: an external process can’t pause the clean-up thread, and it can’t even access the database, because Voldemort holds a lock on it.  So I needed to build something into Voldemort itself.  Controlling this feature is most sensibly done through the Voldemort admin tool, which I found very easy to extend.  The result can be found in the bdb-backup branch of my fork of Voldemort on GitHub.  I’ve initiated a pull request, and hopefully the folks at LinkedIn will be happy to accept it.

The feature is used by passing a new --native-backup command-line option; information about how to use it is built into the admin client’s help.  With this new feature you can now create guaranteed-consistent backups of your BDB Voldemort nodes, and you can rest assured that you can recover from the many data corruption issues that replication won’t save you from.


Tomorrow morning (Tuesday, 10 February 2009) meinVZ, studiVZ and schülerVZ will go offline from 01:00 until roughly 08:00 for maintenance work on the databases.


Among other things, we are partitioning the group membership records, which were previously stored together.  There is a presentation from Flickr about data sharding that describes pretty well what we are going to do.  In addition, after days of work, our new LVS/Heartbeat 2 cluster is going online with a new version.  In the backend, LVS handles all load balancing across the systems; thanks to Direct Server Routing the throughput stays within bounds, since only the incoming frames have to pass through the system.
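The essence of the Flickr-style approach is that each record’s shard is derived from the owning user’s ID, so all of a user’s group memberships land on the same shard.  A minimal illustration (the function name and the simple modulo scheme are assumptions for the sake of the example, not our actual mapping):

```java
public class ShardLookup {
    /** Map a user ID to one of shardCount group-membership shards. */
    static int shardFor(long userId, int shardCount) {
        // Modulo keeps every record belonging to one user on one shard,
        // so a "groups of user X" query only ever touches a single shard.
        return (int) (userId % shardCount);
    }

    public static void main(String[] args) {
        System.out.println(shardFor(1234567L, 16)); // prints 7
    }
}
```

In practice a lookup table is often preferred over a bare modulo, because it allows shards to be rebalanced without rehashing every user.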

More on that soon in another blog post.

P.S. The blog stays online.