BookKeeper - Yahoo’s Distributed Log Storage - is an Apache Top Level Project

Apache Software Foundation recently announced that BookKeeper has been promoted as an Apache Top Level Project. BookKeeper is near and dear to Yahoo: it was actually created here in 2009 to provide horizontally scalable, reliable, replicated log storage on commodity hardware. We have been actively developing it ever since, and it was opened up for community development in 2011.

In this post, we’ll be sharing an inside look at BookKeeper from the developers who’ve worked on it, and what we’ll be looking to achieve now that it is an Apache Top Level Project.

Overview

BookKeeper is a highly scalable, reliable, replicated log storage based on commodity hardware. BookKeeper abstracts out certain complexities, such as replication, failure recovery, and consistency, while building web scale applications by providing simple constructs to store and retrieve sequential log entries.

Logging is a common pattern in many applications such as databases and file systems in the form of journals, and in messaging applications in the form of persistent queues. At scale, the logging service has to support thousands of logs with millions of transactions per second without compromising throughput and latency. BookKeeper was initially developed to serve as the Write Ahead Log for the Hadoop File System (HDFS). It was later used in our messaging system called Hedwig, for message queues, and was eventually adopted into the Yahoo Cloud Messaging Service discussed later in this blog.

BookKeeper provides capabilities important for Yahoo applications

  • Horizontal Scalability - BookKeeper scales in both IO and storage capacity seamlessly with addition of new commodity servers, thus enabling incremental elastic growth
  • High Throughput - BookKeeper architecture for reads and writes is optimized for sequential disk IOs which leads to higher throughput on a single server
  • Consistency - In case of failures and errors common in large scale distributed systems, BookKeeper provides a consistent view of the log entries to all readers

Architecture

BookKeeper abstracts out replication of sequential logs to multiple storage servers called Bookies. Application use a construct called “Ledger” that supports simple operations such as create, open, add entry, and read entry. Ledger entries are replicated by BookKeeper on a minimum number of Bookies using a quorum protocol. The entries are striped across Bookies to provide high read/write throughput. BookKeeper also transparently manages high availability during Bookie failures, connectivity loss etc. Finally, BookKeeper’s metadata about bookies and ledgers is maintained in ZooKeeper.

The following diagram depicts a schematic representation of how the sequential logs are maintained on a Bookie:

image

Ledger entries are first written to the journal and then to the BookKeeper Entry Log data structure. Writes to the journal are synchronously committed to disk for durability of entries. Journal writes are sequential and supported by high write throughput medium such as SSD.  Writes to the Entry Log data structure is cached in the File System before committing to the storage medium. In case of failures, Entry Logs are recovered from the journal.

For fast lookup in the read path, BookKeeper maintains an index of entries. The index maps entries of the ledger in the Entry Log data structure, thus enabling fast entry lookup for high read throughput.

BookKeeper in Yahoo Hosted Messaging Service

Let’s take a look at an important application of BookKeeper in Yahoo. BookKeeper provides the persistent storage required for Yahoo’s multi-tenant distributed messaging service called Cloud Messaging Service (CMS). CMS is used by over 60 applications in Yahoo including mobile notifications, Weather feeds, the Gemini Ad platform, personalization platform, Homepage, storage systems such as Sherpa  etc.

CMS provides both Best-Effort and Guaranteed message delivery. Guaranteed delivery has to be highly resilient in case of network, disk, and machine failures. CMS uses BookKeeper to store messages as a reliable queue. Additionally, BookKeeper maintains the position of each receiver in the queue to ensure at-least-once delivery. This offloads applications from maintaining queue positions, thus simplifying application development.

Motivation for CMS to use Bookkeeper:

  • CMS has been deployed in 10+ datacenters with full mesh replication
  • Approximately 10 Billion messages/day get exchanged via BookKeeper, which is forecasted to go up to 100 Billion/day by the end of the year 2015
  • CMS is deployed on 250+ servers today. We expect to scale this to about 1500+ servers by end of 2015
  • CMS will grow to support a million queues from the current 25,000 queues
  • Today, CMS supports more than 60 applications in Yahoo’s serving stack
  • BookKeeper has served CMS well so far and gives us enough confidence to hit the above goals in 2015.

Looking to the Future

BookKeeper has few challenges for the scale and reliability that we anticipate. Here are the types of improvements we think can be done:

  • 10x higher throughput per Bookie through cache optimization
  • Increase the number of ledgers in a bookie by 5x
  • Effectively isolate tenants during high read workloads
  • Reduce publish latency to less than 1 ms