September 14, 2011
Developers building an application on Riak typically have a love/hate
relationship with Riak's simple key/value-based approach to storing
data. It's great that anyone can grok the basics (3 simple operations,
get/put/delete) quickly. It's convenient that you can store anything
imaginable as an object's value: an integer, a blob of JSON data, an
image, an MP3. And the distributed, scalable, failure-tolerant
properties that a key/value storage model enables can be a lifesaver
depending on your use case.
But things get much less rosy when faced with the challenge of
representing alternate keys, one-to-many relationships, or
many-to-many relationships in Riak. Historically, Riak has shifted
these responsibilities to the application developer. The developer is
forced to either find a way to fit their data into a key/value model,
or to adopt a polyglot storage strategy, maintaining data in one
system and relationships in another.
This adds complexity and technical risk, as the developer is burdened
with writing additional bookkeeping code and/or learning and
maintaining multiple systems.
That's why we're so happy about Secondary Indexes. Secondary Indexes
are the first step toward solving these challenges, lifting the burden
from the backs of developers, and enabling more complex data modeling
in Riak. And the best part is that it ships in our 1.0 release, just a
few weeks from now.
How Do Secondary Indexes Work?
From an application developer's perspective, Secondary Indexes allow
you to tag a Riak object with some index metadata, and later retrieve
the object by querying the index, rather than the object's primary
key.
For example, let's say you want to store a user object, accessible by
username, twitter handle, or email address. You might pick the
username as the primary key, while indexing the twitter handle and
email address. Below is a curl command to accomplish this through the
HTTP interface of a local Riak node:
curl -X POST \
-H 'x-riak-index-twitter_bin: rustyio' \
-H 'x-riak-index-email_bin: rusty@basho.com' \
-d '...user data...' \
http://localhost:8098/buckets/users/keys/rustyk
Previously, there was no simple way to access an object by anything
other than the primary key, the username. The developer would be
forced to "roll their own indexes." With Secondary Indexes enabled,
however, you can easily retrieve the data by querying the user's
twitter handle:
# Query the twitter handle...
curl localhost:8098/buckets/users/index/twitter_bin/rustyio
# Response...
{"keys":["rustyk"]}
Or the user's email address:
# Query the email address...
curl localhost:8098/buckets/users/index/email_bin/rusty@basho.com
# Response...
{"keys":["rustyk"]}
You can change an object's indexes by simply writing the object again
with the updated index information. For example, to add an index on
Github handle:
curl -X POST \
-H 'x-riak-index-twitter_bin: rustyio' \
-H 'x-riak-index-email_bin: rusty@basho.com' \
-H 'x-riak-index-github_bin: rustyio' \
-d '...user data...' \
http://localhost:8098/buckets/users/keys/rustyk
That's all there is to it, but that's enough to represent a variety of
different relationships within Riak.
Above is an example of assigning an alternate key to an object. But
imagine that instead of a twitter_bin field, our object had an
employer_bin field that matched the primary key for an object in our
employers bucket. We can now look up users by their employer.
Or imagine a role_bin field that matched the primary key for an
object in our security_roles bucket. This allows us to look up all
users that are assigned to a specific security role in the system.
Design Decisions
Secondary Indexes maintains Riak's distributed, scalable, and failure
tolerant nature by avoiding the need for a pre-defined schema, which
would be shared state. Indexes are declared on a per-object basis, and
the index type (binary or integer) is determined by the field's
suffix.
Indexing is real-time and atomic; the results show up in queries
immediately after the write operation completes, and all indexing
occurs on the
partition where
the object lives, so the object and its indexes stay in sync. Indexes
can be stored and queried via the HTTP interface or the Protocol
Buffers interface. Additionally, index results can feed directly into
a Map/Reduce operation. And our Enterprise customers will be happy to
know that Secondary Indexing plays well with multi data center
replication.
Indexes are declared as metadata, rather than an object's value, in
order to preserve Riak's view that the value of your object is as an
opaque document. An object can have an unlimited number of index
fields of any size (dependent upon system resources, of course.) We
have stress tested with 1,000 index fields, though we expect most
applications won't need nearly that many. Indexes do contribute to the
base size of the object, and they also take up their own disk space,
but the overhead for each additional index entry is minimal: the
vector clock
information (and other metadata) is stored in the object, not in the
index entry. Additionally, the LevelDB backend (and, likely, most
index-capable backends) support prefix-compression, further shrinking
index size.
This initial release does have some important limitations. Only single
index queries are supported, and only for exact matches or range
queries. The result order is undefined, and pagination is not
supported. While this offers less in the way of ad-hoc querying than
other datastores, it is a solid 80% solution that allows us to focus
future energy where users and customers need it most. (Trust me, we
have many plans and prototypes of potential features. Building
something is easy, building the right thing is harder.)
Behind The Scenes
What is happening behind the scenes? A lot, actually.
At write time, the system pulls the index fields from the incoming
object, parses and validates the fields, updates the object with the
newly parsed fields, and then continues with the write operation. The
replicas of the object are sent to
virtual nodes where
the object and its indexes are persisted to disk.
At query time, the system first calculates what we call a "covering"
set of partitions. The system looks at how many replicas of our data
are stored and determines the minimum number of partitions that it
must examine to retrieve a full set of results, accounting for any
offline nodes. By default, Riak is configured to store 3 replicas of
all objects, so the system can generate a full result set if it reads
from one-third of the system's partitions, as long as it chooses the
right set of partitions. The query is then broadcast to the selected
partitions, which read the index data, generate a list of keys, and
send them back to the coordinating node.
Storing index data is very different from storing key/value data: in
general, any database that stores indexes on a disk would prefer to
be able to store the index in a contiguous block and in the desired
order--basically getting as near to the final result set as
possible. This minimizes disk movement and other work during a query,
and provides faster read operations. The challenge is that index
values rarely enter the system in the right order, so the database
must do some shuffling at write time. Most databases delay this
shuffling, they write to disk in a slightly sub-optimal format, then
go back and "fix things up" at a later point in time.
None of Riak's existing key/value-oriented backends were a good fit
for index data; they all focused on fast key/value access. During the
development of Secondary Indexes we explored other
options. Coincidentally, the Basho team had already begun work to
adapt LevelDB--a low-level storage library from Google--as a storage
engine for Riak KV. LevelDB stores data in a defined order, exactly
what Secondary Indexes needed, and it is actually versatile enough to
manage both the index data AND the object's value. Plus, it is very
RAM friendly. You can learn more about LevelDB from
this page on Google Code.
Want To Know More?
If you want to learn more about Secondary Indexes, you can read the
slides from my talk at OSCON Data 2011:
Querying Riak Just Got Easier. Alternatively,
you can watch the video.
You can grab a pre-release version of Riak Version 1.0 on the
Basho downloads site to try the
examples above. Remember to change the storage backend to
riak_kv_eleveldb_backend!
Finally keep an eye out for documentation that will land on the newly
re-organized Basho Wiki within the next two
weeks.
-- Rusty