Anti-RDBMS: A list of distributed key-value stores
Please Note: this was written January 2009 – see the comments for updates and additional information. A lot has changed since I wrote this.
- RJ
Perhaps you’re considering using a dedicated key-value or document store instead of a traditional relational database. Reasons for this might include:
- You’re suffering from Cloud-computing Mania.
- You need an excuse to ‘get your Erlang on’
- You heard CouchDB was cool.
- You hate MySQL, and although PostgreSQL is much better, it still doesn’t have decent replication. There’s no chance you’re buying Oracle licenses.
- Your data is stored and retrieved mainly by primary key, without complex joins.
- You have a non-trivial amount of data, and the thought of managing lots of RDBMS shards and replication failure scenarios gives you the fear.
Whatever your reasons, there are a lot of options to chose from. At Last.fm we do a lot of batch computation in Hadoop, then dump it out to other machines where it’s indexed and served up over HTTP and Thrift as an internal service (stuff like ‘most popular songs in London, UK this week’ etc). Presently we’re using a home-grown index format which points into large files containing lots of data spanning many keys, similar to the Haystack approach mentioned in this article about Facebook photo storage. It works, but rather than build our own replication and partitioning system on top of this, we are looking to potentially replace it with a distributed, resilient key-value store for reasons 4, 5 and 6 above.
This article represents my notes and research to date on distributed key-value stores (and some other stuff) that might be suitable as RDBMS replacements under the right conditions. I’m expecting to try some of these out and investigate further in the coming months.
Glossary and Background Reading
- Distributed Hash Table (DHT) and algorithms such as Chord or Kadmelia
- Amazon’s Dynamo Paper, and this ReadWriteWeb article about Dynamo which explains why such a system is invaluable
- Amazon’s SimpleDB Service, and some commentary
- Google’s BigTable paper
- The Paxos Algorithm – read this page in order to appreciate that knocking up a Paxos implementation isn’t something you’d want to do whilst hungover on a Saturday morning.
The Shortlist
Here is a list of projects that could potentially replace a group of relational database shards. Some of these are much more than key-value stores, and aren’t suitable for low-latency data serving, but are interesting none-the-less.
| Name | Language | Fault-tolerance | Persistence | Client Protocol | Data model | Docs | Community |
| Project Voldemort | Java | partitioned, replicated, read-repair | Pluggable: BerkleyDB, Mysql | Java API | Structured / blob / text | A | Linkedin, no |
| Ringo | Erlang | partitioned, replicated, immutable | Custom on-disk (append only log) | HTTP | blob | B | Nokia, no |
| Scalaris | Erlang | partitioned, replicated, paxos | In-memory only | Erlang, Java, HTTP | blob | B | OnScale, no |
| Kai | Erlang | partitioned, replicated? | On-disk Dets file | Memcached | blob | C | no |
| Dynomite | Erlang | partitioned, replicated | Pluggable: couch, dets | Custom ascii, Thrift | blob | D+ | Powerset, no |
| MemcacheDB | C | replication | BerkleyDB | Memcached | blob | B | some |
| ThruDB | C++ | Replication | Pluggable: BerkleyDB, Custom, Mysql, S3 | Thrift | Document oriented | C+ | Third rail, unsure |
| CouchDB | Erlang | Replication, partitioning? | Custom on-disk | HTTP, json | Document oriented (json) | A | Apache, yes |
| Cassandra | Java | Replication, partitioning | Custom on-disk | Thrift | Bigtable meets Dynamo | F | Facebook, no |
| HBase | Java | Replication, partitioning | Custom on-disk | Custom API, Thrift, Rest | Bigtable | A | Apache, yes |
| Hypertable | C++ | Replication, partitioning | Custom on-disk | Thrift, other | Bigtable | A | Zvents, Baidu, yes |
Why 5 of these aren’t suitable
What I’m really looking for is a low latency, replicated, distributed key-value store. Something that scales well as you feed it more machines, and doesn’t require much setup or maintenance – it should just work. The API should be that of a simple hashtable: set(key, val), get(key), delete(key). This would dispense with the hassle of managing a sharded / replicated database setup, and hopefully be capable of serving up data by primary key efficiently.
Five of the projects on the list are far from being simple key-value stores, and as such don’t meet the requirements – but they are definitely worth a mention.
1) We’re already heavy users of Hadoop, and have been experimenting with Hbase for a while. It’s much more than a KV store, but latency is too great to serve data to the website. We will probably use Hbase internally for other stuff though – we already have stacks of data in HDFS.
2) Hypertable provides a similar feature set to Hbase (both are inspired by Google’s Bigtable). They recently announced a new sponsor, Baidu – the biggest Chinese search engine. Definitely one to watch, but doesn’t fit the low-latency KV store bill either.
3) Cassandra sounded very promising when the source was released by Facebook last year. They use it for inbox search. It’s Bigtable-esque, but uses a DHT so doesn’t need a central server (one of the Cassandra developers previously worked at Amazon on Dynamo). Unfortunately it’s languished in relative obscurity since release, because Facebook never really seemed interested in it as an open-source project. From what I can tell there isn’t much in the way of documentation or a community around the project at present.
4) CouchDB is an interesting one – it’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. Read the CouchDB Technical Overview if you are curious how the web’s trendiest document database works under the hood. This article on the Rules of Database App Aging goes some way to explaining why document-oriented databases make sense. CouchDB can do full text indexing of your documents, and lets you express views over your data in Javascript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldly adding and changing columns in a database, and updating versions of your application code to match. Although many people are using CouchDB in production, their FAQ points out they may still make backwards-incompatible changes to the storage format and API before version 1.0.
5) ThruDB is a document storage and indexing system made up for four components: a document storage service, indexing service, message queue and proxy. It uses Thrift for communication, and has a pluggable storage subsystem, including an Amazon S3 option. It’s designed to scale well horizontally, and might be a better option that CouchDB if you are running on EC2. I’ve heard a lot more about CouchDB than Thrudb recently, but it’s definitely worth a look if you need a document database. It’s not suitable for our needs for the same reasons as CouchDB.
Distributed key-value stores
The rest are much closer to being ’simple’ key-value stores with low enough latency to be used for serving data used to build dynamic pages. Latency will be dependent on the environment, and whether or not the dataset fits in memory. If it does, I’d expect sub-10ms response time, and if not, it all depends on how much money you spent on spinning rust.
MemcacheDB is essentially just memcached that saves stuff to disk using a Berkeley database. As useful as this may be for some situations, it doesn’t deal with replication and partitioning (sharding), so it would still require a lot of work to make it scale horizontally and be tolerant of machine failure. Other memcached derivatives such as repcached go some way to addressing this by giving you the ability to replicate entire memcache servers (async master-slave setup), but without partitioning it’s still going to be a pain to manage.
Project Voldemort looks awesome. Go and read the rather splendid website, which explains how it works, and includes pretty diagrams and a good description of how consistent hashing is used in the Design section. (If consistent hashing butters your muffin, check out libketama – a consistent hashing library and the Erlang libketama driver). Project-Voldemort handles replication and partitioning of data, and appears to be well written and designed. It’s reassuring to read in the docs how easy it is to swap out and mock different components for testing. It’s non-trivial to add nodes to a running cluster, but according to the mailing-list this is being worked on. It sounds like this would fit the bill if we ran it with a Java load-balancer service (see their Physical Architecture Options diagram) that exposed a Thrift API so all our non-Java clients could use it.
Scalaris is probably the most face-meltingly awesome thing you could build in Erlang. CouchDB, Ejabberd and RabbitMQ are cool, but Scalaris packs by far the most impressive collection of sexy technologies. Scalaris is a key-value store – it uses a modified version of the Chord algorithm to form a DHT, and stores the keys in lexicographical order, so range queries are possible. Although I didn’t see this explicitly mentioned, this should open up all sorts of interesting options for batch processing – map-reduce for example. On top of the DHT they use an improved version of Paxos to guarantee ACID properties when dealing with multiple concurrent transactions. So it’s a key-value store, but it can guarantee the ACID properties and do proper distributed transactions over multiple keys.
Oh, and to demonstrate how you can scale a webservice based on such a system, the Scalaris folk implemented their own version of Wikipedia on Scalaris, loaded in the Wikipedia data, and benchmarked their setup to prove it can do more transactions/sec on equal hardware than the classic PHP/MySQL combo that Wikipedia use. Yikes.
From what I can tell, Scalaris is only memory-resident at the moment and doesn’t persist data to disk. This makes it entirely impractical to actually run a service like Wikipedia on Scalaris for real – but it sounds like they tackled the hard problems first, and persisting to disk should be a walk in the park after you rolled your own version of Chord and made Paxos your bitch. Take a look at this presentation about Scalaris from the Erlang Exchange conference: Scalaris presentation video.
The reminaing projects, Dynomite, Ringo and Kai are all, more or less, trying to be Dynamo. Of the three, Ringo looks to be the most specialist – it makes a distinction between small (less than 4KB) and medium-size data items (<100MB). Medium sized items are stored in individual files, whereas small items are all stored in an append-log, the index of which is read into memory at startup. From what I can tell, Ringo can be used in conjunction with the Erlang map-reduce framework Nokia are working on called Disco.
I didn’t find out much about Kai other than it’s rather new, and some mentions in Japanese. You can chose either Erlang ets or dets as the storage system (memory or disk, respectively), and it uses the memcached protocol, so it will already have client libraries in many languages.
Dynomite doesn’t have great documentation, but it seems to be more capable than Kai, and is under active development. It has pluggable backends including the storage mechanism from CouchDB, so the 2GB file limit in dets won’t be an issue. Also I heard that Powerset are using it, so that’s encouraging.
Summary
Scalaris is fascinating, and I hope I can find the time to experiment more with it, but it needs to save stuff to disk before it’d be useful for the kind of things we might use it for at Last.fm.
I’m keeping an eye on Dynomite – hopefully more information will surface about what Powerset are doing with it, and how it performs at a large scale.
Based on my research so far, Project-Voldemort looks like the most suitable for our needs. I’d love to hear more about how it’s used at LinkedIn, and how many nodes they are running it on.
What else is there?
Here are some other related projects:
- Hazelcast – Java DHT/clustering library
- nmdb – a network database (dbm-style)
- Open Chord – Java DHT
If you know of anything I’ve missed off the list, or have any feedback/suggestions, please post a comment. I’m especially interested in hearing about people who’ve tested or are using KV-stores in lieu of relational databases.
UPDATE 1: Corrected table: memcachedb does replication, as per BerkeleyDB.
148 Comments to Anti-RDBMS: A list of distributed key-value stores
Leave a comment
About Me
Tags
Recent Posts
- Rewriting Playdar: C++ to Erlang, massive savings
- Erlang talk at London Hackspace
- Anti-RDBMS: A list of distributed key-value stores
- How we use IRC at Last.fm
- Getting to know ejabberd and writing modules
- ssh hack: connect directly to machine via a firewall box
- A Million-user Comet Application with Mochiweb, Part 3
- A Million-user Comet Application with Mochiweb, Part 2
- A Million-user Comet Application with Mochiweb, Part 1
- On bulk loading data into Mnesia
Excellent overview, thanks!
For CouchDB: We are working on partitioning but it is currently not a priority as we are working on the 0.9 release. This is definitely a planned feature, though. In the meantime, a consistent-hashing HTTP proxy or storage layer in your will get you partitioning really easily, just not as convenient as if CouchDB would do it for you.
Cheers
Jan
–
If the documentation is accurate, then memcachedb supports replication via BerkleyDB’s replication features.
See this pdf starting at slide 39: http://memcachedb.org/memcachedb-guide-1.0.pdf
Thanks for the nice breakdown of services. We want Thrudb to be the efficient layer that glues storage and indexing backends together since each one has it’s own strength and weaknesses.
Did you take a look at MongoDB?
http://www.mongodb.org/
Looks promising…
Hi Richard,
We only just got the site up for project voldemort a week ago and started getting linked to before it was finished, so you are right that a lot of data is missing about how LinkedIn uses it, basic performance numbers, and especially about operational necessities like JVM settings, data archival, etc. Of course all of this information is quite critical if you are thinking of storing your precious data in it. I am going to work on getting all of this in shape over the next few weeks. We will probably do a LinkedIn engineering blog post soon which has more specifics about how we are using it.
I would also add that, as you mention, having a JVM-only API is a big weakness right now. The plan for this is to move all the network serialization to Protocol Buffers or Thrift and support first class clients in popular languages in our source tree. This will use the ability to push all the routing and conflict resolution to the server so that the non-java clients need only implement the code necessary to put the message on the wire. I will post the branch as soon as I get a chance to start development, in hopes that people will be willing to work on a client in their favorite language.
Werner Vogels, Amazon’s CTO, seems to have some disparaging things to say about Scalaris:
http://twitter.com/Werner/statuses/1008722501
What are your thoughts on his thoughts?
You may want to add M/DB to your list: a free API-compatible alternative to SimpleDB (http://www.mgateway.com/mdb.html)
The Google stack uses Paxos, Amazon’s Dynamo uses vector clocks. I’m not sure what these systems use (I know Voldemort uses vector clocks). It’s important to see the difference.
Paxos is a consensus protocol that requires the majority of the nodes to be up. For example, Google’s Chubby systems use Paxos as a primitive to reach consensus on the value to append to a replicated log (a new round of Paxos is run for each entry in the replicated log). The entries are treated as commands, which are then applied to the database. This is called the distributed state machine approach. Because of Paxos, roughly speaking, you can be sure that all nodes see the same entries in the same order in their replicated log – there is no chance of inconsistency due to Paxos. (Your implementation could of course be buggy..) This also means that in case of a netsplit, or enough nodes failing, the minority will not be able to make progress.
Dynamo, on the other hand, uses vector clocks as the distributed primitive. In this approach, clients select some nodes to write to. This selection uses a hash table, but in terms of consistency guarantees, this is not that important. As Werner Vogels describes on his blog, you can tune how many nodes you write to, this is usually denoted W. Usually W = 2 or 3. If nodes fail, other nodes can accept the write requests. This means that in this approach, in case of a netsplit, the disjoint parts of the network will accept diverging versions. This is what Amazon calls the “always-on” experience. When the disjoint parts join up, these versions have to be reconciled. This is usually handled by writing application-specific code, or doing something primitive like relying on the node’s local time and keeping the newer versions. Due to this weak consistency semantic, I think Paxos-based systems are more likely to get a foothold in the real world: although they require a majority to be up, their consistency semantics are far easier to grasp, and additionaly, they don’t require application-specific code for reconciliation. AFAIK, Amazon only uses Dynamo where the reconciliation can be handled naturally, like the shopping cart use-case.
As an interesting sidenote, both Paxos and vector-clocks were introduced by Leslie Lamport.
For a collection of papers, see http://bytepawn.com/readings-in-distributed-systems/
It’s possible for the distributed system to be made up of many Paxos cells, in which case, if a Paxos cell is down (not enough nodes to make progress), the controller redirects the write request to a live cell, thus, at a higher level, introducing inconsistent versions and emulating the reduced availability requirements of a Dynamo-like architecture.
There are additional design-options to make Paxos more attractive, but I can’t say more…
What about Tokyo Cabinet and Tyrant? It looks like Tryrant (the network interface) speaks memcached’s protocol and allows for replication. Might be worth a look but probably falls into your second category of distributed key-value stores.
Is SimpleDB itself not a viable alternative?
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. – [...]
@Russ SimpleDB is tied to one vendor (Amazon) and the only way to use it is on their cloud. If you could download and run SimpleDB on your own server/workstation or on another cloud vendor, then it *might* be viable. FWIW, I’m using CouchDB in production at my startup and really liking it so far. Though I do miss SQL for simple queries sometimes.
Hi, thanks for the Dynomite mention. I just wanted to let you know that Dynomite currently does read-repair using vector clocks. And there’s a small community starting to form around it. And I agree, I really need to write up some more comprehensive documentation.
At Powerset we’re using it for scraping and serving images from our data sources (currently only wikipedia). I’ve been doing a lot of work on performance recently and will do some posts about getting the 99.9% latency under control.
@Toby DiPasquale, @Marton Trencseni
The reason Werner Vogels doesn’t like Scalaris is that Amazon values write availability over consistency. In a sales-oriented environment like Amazon, they’re obviously pretty keen on availability, but Google’s writers are primarily backend tasks, so it makes sense that they prefer consistency.
I think you have to be running a fairly massive (and very distributed) network in order for this availability vs. consistency argument to mean anything in real terms. Last.fm certainly isn’t anywhere near there yet.
Do you have any numbers to back up your claims? For example, I haven’t seen you asking any related questions in Hypertable mailing lists/forums. Do you know Hypertable can support 10K+/s (1 KB size) random reads/s *per node* (1 4-core 2.33Ghz Xeon) using a single threaded client with tables configured as “in memory” (still persisted, checksumed and replicated 3-ways on disks for recovery)?
As for random write throughput, nobody on the list comes close to Hypertable (1TB injection replicated 3-ways at 1M cells/s on a 8 node cluster).
OTOH, serious web services should still use memcached (as a recent facebook engineerng note pointed out can serve up to 400k ops/s per node) for application specific caching (as only the application has most of the information to do intelligent invalidation.) and use a distributed DB for transactions and persistence.
I was just looking into MonetDB http://monetdb.cwi.nl/
It didn’t make your list…
Great post!
@vicaya
We have done testing on HBase, Hypertable, and Cassandra at Last.fm and we’ve generally found that their latency isn’t with our acceptable limits for services which directly service our site (we’re looking at variance as much as mean latency here).
Some of these benchmarks are a bit old, and we’re always re-evaluating them. Part of what we’re doing at the moment is trying to take stock of the situation and find out which projects have the most merit to throw our technical weight (and possible developer time) behind.
Russ, your system doesn’t have to be very complicated for consistency to be an issue. With n=3 nodes, if you have a netsplit between them and you get divergent versions, then your application will have to deal with it. Of course, depending on the application, this may be trivial. (For example, in the Amazon shopping cart case, you can keep the maximal item count per item in the divergent shopping carts.)
Anyway, my point was, if you’re designing a distributed system where you want wide applicability then systems with stronger consistency guarantees (= easier to understand) win out. This is why RDBMS/SQL is king.
@Marton:
Of course. What I’m trying to highlight here is that there are two schools of thought of how to handle this problem: the Paxos/Google/Scalaris method, which will refuse to accept writes if the system is not quorate, and the Vector Clock/Amazon method, which will always accept writes at the expense of having to make a potentially tricky judgement call later on if there were network troubles at the time of write.
Which one of these two methods you chose has got to be based on the type of data you’re storing, your anticipated downtime between data nodes, and many other factors. If you’re small, these considerations probably don’t even matter. (i.e. it’s not as black and white as Werner Vogels makes it out to be)
I totally agree with your last point though. Relational databases still sit at the core of Last.fm and it doesn’t look like they’re going anywhere in a hurry.
Russ, that’s just what I was saying in my original comment. Anyways, to me, the table in the article is really missing a ‘consistency’ column that would tell me what I can expect from these KV-stores [what we're discussing, ie. do I have to write reconciliation code or not].
[...] 19-Jan-2009 at · Filed under Uncategorized http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ [...]
@Russ, please consult our community if you have performance problems. Many problems are discovered and fixed all the time. Hypertable has many knobs to turn. There are R&D underway (with UC Berkeley etc.) to use machine learning to let users to specify declaratively, consistency, latency & throughput requirements and let an adaptive model to pick the right implementation strategy.
[...] good source of Wordpress, jQuery, CSS, and other (e.g. Greasemonkey) information. Digg is good too. Cloud computing. Layer upon layer of aggregators. Anonymous, temporary webpage (in a wiki sandbox…). Mono/C# [...]
Hi, thanks for your mention of MemcacheDB. Currently, it does supports HA, which is a Paxos compatible master-client replication with electing(we build it on top of the BerkeleyDB’s replication framework). And there are couple of users, including Chinese biggest portal – Sina.com.cn, and digg? reddit? and other dotcom company and startups.
[...] 早上看到博客Anti-RDBMS: A list of distributed key-value stores(抵制关系型数据库:分布式key-value存储列表),最近的项目中也有用到类似Key-Value的存储。 [...]
RedTea의 생각…
분산 key-value 저장소에 관한 포스팅 나중에 진지하게 도입을 검토해봐야겠다….
[...] Anti-RDMBS? Richard Jones gives the lowdown Richard Jones has an excellent breakdown of your alternatives if you do decide to Anti-RDBMS (or if you just want [...]
How would one index all data available in distributed k/v store? Just as the data, the index should be distributed as well to allow scalability. So far I’ve only seen BigTables and SimpleDB having support for this. (Well, CouchDB has it with it’s map/reduce queries and views, but it’s not distributed in my sense. Yet.)
> Carl: How would one index all data available in
> distributed k/v store?
You don’t. You use another tool.
what exactly were the reasons why couchdb didnt make the final cut?
i ask because we are looking to start development on a system that would use a product like this for building output for WS calls in real time.
it looks like the only product on the list that is today being used heavily in real world settings is haddop, but couchdb is going to start to see production deployments soon. am i mistaken about this?
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. (tags: storage list programming performance scalability hacking store) [...]
[...] January 20, 2009 at 5:20 pm · Filed under Cloud computing [From Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq.] [...]
Umm, there’s also Neptune, I guess: http://www.openneptune.com.
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. [...]
You missed IBM Lotus Domino. The document database at its core does all that stuff.
http://www-01.ibm.com/software/lotus/products/domino/
푸디딕의 생각…
관계형 데이터베이스에 질리신 분들을 위한 분산된 키-값 저장소의 목록. 고놈 참 재미있네….
Hi, thanks for the Kai mention.
In Kai, data is replicated like Amazon’s Dynamo. Same as Dynamite, Kai currently does read-repair using vector clocks. And there’s a small community found in sourceforge.net, where mainly spoken in Japanese as you pointed out.
Kai is under testing at goo.jp for production use, which is Japanese 2nd large portal site. We’re going to write up more comprehensive documentation.
[...] not the string of bits, lets not go that far (yet). It’s the hashtable. The hashtable is the anti-relational database. You might think it’s highly impractical or limited, but that’s where you’re [...]
[...] Scaling, Software Development — François Schiettecatte @ 3:45 pm Interesting list of distributed key-value stores, Perhaps you’re considering using a dedicated key-value or document store instead of a [...]
CouchDB is not really distributed at the moment, although it is certainly designed to be. Currently, replication is done by manually running a command-line tool that synchronizes stuff between databases. I don’t think they even have the conflict resolution stuff worked out.
[...] Anti-RDBMS: A list of distributed key-value stores – good, if superficial, survey. [...]
Hope some Erlang masters check out dets bottleneck for mnesia and we can get changeable cool dht/db for Erlang.
Peace.
Forget what I said: CouchDB apparently has some replication now: http://wiki.apache.org/couchdb/Frequently_asked_questions#how_replication.
I’m not sure I understand why you disapprove of Cassandra and CouchDB. I don’t see a clear reason listed for either.
Have you got any performance statistics?
Hi Richard,
it has been said a lot of times, but Great Site.
You mentioned hazelcast once without any comment. I tried it out but not took the chance evaluating it for very high amount of data or throughput. It has a very simple API and the developer made it open source lately. It should be very fast and reliable, but you are much more knowlege is much deeper to judge this.
It would be very interesting to hear opionons.
[...] If you are producing lots of small files, then, depending on the access pattern, a different type of storage might be more appropriate. HBase stores data in MapFiles (indexed SequenceFiles), and is a good choice if you need to do MapReduce style streaming analyses with the occasional random look up. If latency is an issue, then there are lots of other choices — see Richard Jones’ excellent survey of key-value stores. [...]
[...] de datos hace que aun estén disponibles. Vía Diigo. Richard Jones, Weblog, January 20, 2009 [Liga] [etiquetas: Redes, [...]
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. storage scalability database [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. (tags: scalability distributed database keyvalue properties) [...]
[...] 以前就知道hypertable,没有深入研究,最近随着云计算兴起–hadoop的逐步进入商业应用。各种开源数据库也随之颇受关注,我想这都由于google的bigtable而起。从hypertable.org中可知,baidu已经是其的新赞助商了。另外一个想研究hypertable的原因是它是使用c++实现,同时,它可以支持本地文件系统,感觉其扩展、易用。hypertalbe入门文档,值得一读。 归类于: 开源系统, hypertalbe [...]
I 2nd Tokyo Tyrant… at least initially; we’re in the middle of evaluating it but have no concrete test results as yet. The administration of master/master[/master...][/slave.../] is about as easy as it gets, as is general operational administration. I’ve found in the past (as I’m sure many have) that operational maintenance is a major pain point and one reason I’m leaning towards it.
And thanks to your post we have some other avenues to pursue as well.
If you want the benefits of a higher abstraction level than key/value stores provide, a graph database is an interesting alternative to RDBMS as well. Most data fits naturally into a graph representation. http://neo4j.org/ is an open source database of this new breed. – I’m part of the team behind it.
Hi,
What do you think about Tokyo Cabinet, Tyrant and Dystopia tripod ?
[...] Copyright © 2008 Richard Jones Оригинал (Английский): Anti-RDBMS: A list of distributed key-value stores Перевод: © orangeudav Лицензия: All code licensed under GPLv2 unless otherwise [...]
Hi Richard,
I have also implemented a version of Chord in C# (nChord.sourceforge.net) in case you’d like to include it in your list.
In terms of key-value stores, around the same time many of these other systems were being built, I also had the opportunity to build, work on, deploy and test at scale another similar storage system (Resonance) based on Chord. It is no longer in use but I wrote up a short paper on it with so-so results if you’d be interested in taking a look.
Thanks,
–andrew
I went through and documented (at a high level) API of Cassandra. It turns out that you either get a map from (key1,key2) to a value, or a map from (key1,key2) to a list of values.
http://www.brightyellowcow.com/blog/Evaluating-the-API-of-Cassandra-BigTable-.html
[...] http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ : une liste de bases de données de type clé/valeur [...]
[...] distributed key value stores, except that unfortunately it was only mentioned in the comments of Richard’s post. Anyway, because I’ve known Geir through friends in the Boston tech scene, I decided to take [...]
And what of CloudBase? (https://sourceforge.net/projects/cloudbase/) How does that fit in?
I’m working on this: http://code.google.com/p/redis/
That looks like a lot what you need but it is still
in beta (actually alpha to be honest).
In a few weeks and after a few of more beta releases we
hope to have Redis stable and ready for prime time since
I’m actually using it inside my company.
Hi!
Thanks for the great list!
Memcachedb does do sharding. This is built into any of the drivers. Also, you may want to take a look at the DB that Mixi.jp uses :
http://tokyocabinet.sourceforge.net/tyrantdoc/
Cheers,
-Brian
Hey,
I’m a developer that has been contributing to hbase – specifically I closed out HBASE-61 – new fileformat for HBase. It’s been recently integrated into HBase’s mainline, and is on track for the 0.20 release (next major release).
This is lots of project management stuff for what I am really trying to say – hbase’s latency has been worked on. The goal is to serve websites straight from hbase. Which means as faster than mysql is the goal.
Lots of exciting things are happening in hbase, new cache systems, new file formats, zookeeper (a paxos lock/data system) integration, all of which should make 0.20 more stable, less prone to outage and faster than ever.
Join us on IRC, #hbase (freenode), or via email if you’d like to talk more!
Oddly you missed (strange given last.fm’s heritage) probably the longest established object store there is. ZODB.
http://en.wikipedia.org/wiki/Zope_Object_Database
Given it is over a decade old, supports MVCC, clustering, multiple backend storages, etc etc it is a very worthy contender for your list.
-Matt
[...] distributed computing, network, storage by Nicolas Bonvin on March 6th, 2009 According to the blog entry of Richard Jones, the most mature distributed stores [...]
[...] Una lista de los diversos sistemas de bases de datos clave-valor: Anti-RDBMS [...]
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq.Overview and commentary on several non-relational databases. [...]
what about Microsoft’s project Velocity?!
[...] anyone who has missed this there is a survey from Richard Jones at last.fm of some of the main software projects out there – although missing some like Tokyo [...]
have you try LightCloud its based on tokyotyrant multimaster replication and consistent hashing.
http://opensource.plurk.com/Overview/
Excellent article!
But my concerns are on writing speed. It seems that the comparison is always based on the latency that a database gives you when you want to serve content picking docs from the database.
I started some tests with Tokyo cabinet+Tyrant and CouchDB but I am now at the research point.
Does anyone know which one of this databases is the fastest at WRITING? The reading will be launched in background cron processes, no matter how long it takes. The application will receive thousands of connections per second (requesting write), modifying the database through HTTP requests would be the best option.
IDEAS?
Thanks for all the comments and feedback – since I posted this I discovered many more interesting non-relational databases that already existed, and several new projects have surfaced too (mentioned in the comments already). The rate they are cropping up at the moment means I don’t have time to keep the blog post up-to-date with the latest and greatest.
Most of the interesting projects I find are added to my delicious.
This comments thread is invaluable tho, and is serving as an excellent reference.. so please keep adding to it :)
RJ
[...] you’re interested in distributed key-value stores here is a great post on what is out there: Anti-RDBMS: A list of distributed key-value stores Related Articles: Baby is almost [...]
Are there any good Java open source column based databases out there.
Thanks
-Pete
[...] post is a good write-up of the more well-known ones. I’ve already posted about [...]
Great write-up. One overlooked distributed solution is
to use a commercial, fully supported, federated database.
Objectivity DB is one of these. It is massively scalable,
fully replicated, and supports not only Java but also C++
and C#, all working concurrently against the same data.
Cheers
Roy
[...] other recent useful starting points may be Richard Jone’s Anti-RDBMS roundup and Bob Ippolito’s Drop ACID and think about data Pycon [...]
[...] about (distributed) key-value storage, for more info I recommend a nice article by Richard Jones Anti-RDBMS: A list of distributed key-value stores. HighScalability blog also has a lot of references and [...]
Have you thought about LDAP? If you are looking things up by key (i.e., base scope search), then it is hard to find anything faster, more accurate, and more reliable than LDAP. Take a look at OpenDS (version 2 will be out in June). On my little MacBook I was running OpenDS 1.3 (build 4) with 2.3 million records and averaging 6.5 requests/second. LDAP is super simple to scale, too. (But you’ll want to use Sun’s LDAP Proxy Server for that, which is free, but not open source.)
What about RDF based Quad Stores, especially those implemented using hybrid database engines (by this I mean RDBMS and Entity-Attribute-Value model combos).
An example of a hybrid database engine is: OpenLink Virtuoso.
Links:
1.http:/virtuoso.openlinksw.com – Virtuoso Home page
2.http://delicious.com/kidehen/virtuoso_whitepaper – white paper collection re. Virtuoso
3. http://bit.ly/euz2O – My post about end of RDBMS primacy and why
4. http://www.readwriteweb.com/archives/is_the_relational_database_doomed.php#comments – Interesting discussion that needs to be linked to this article (what I am doing by commenting)
5. http://bit.ly/bRwDm – Virtuoso 6.0 FAQ.
Kingsley
아시모프의 생각…
key-value store에 대해 자세히 아시는분 있나요? 아니면 링크라도…..
@Lucas LDAP may be an option if you have a very high read/write ratio, but it’s not really designed for heavy writing. And when you do need to write, LDAP clients are generally pretty unwieldy.
There is also Schemafree.
http://code.google.com/p/schemafree/
[...] Anti-RDBMS a list of distributed key-value stores [EN]: Un buen artículo en el que se presenta una lista con algunas de las alternativas existentes a bases de datos relacionales. Particularmente interesantes son Project Voldemort y CouchDB. [...]
[...] Project Voldemort, Cassandra, and CouchDB. An overview of some key-value stores may be found here. The rest of this post will deal specifically with CouchDB, though many of the concepts may be [...]
[...] HBASE, Tokyo Cabinet, LightCloud and Cassandra, which we discussed here. These projects are concisely summarized here and here. In addition, LinkedIn has made most of the information on Project Voldermort, the [...]
[...] also like to state that the majority of alternate key-value store databases listed in Richard Jones’ article and in Lenoard Lin’s blog are really not ready for high production loads (with maybe the [...]
Hi
You are ommitting a “public secret” which is pretty widely used: The NDB database/storage engine also known as *MySQL Cluster*.
While it is commonly used through an SQL interface, the architecture and performance is exactly what you want: Cloud-like sharding, very good performance on key-value lookups, etc… And if you don’t want the SQL, you can use the NDB API directly, or REST through mod_ndb Apache module (http://code.google.com/p/mod-ndb/).
This would score high on your list if you evaluated it:
* Transparent sharding: Data is distributed through an md5sum hash on your primary key (or user defined key), yet you connect to whichever MySQL server you want, the partitions/shards are transparent behind that.
* Transparent re-sharding: In version 7.0, you can add more data nodes in an online manner, and re-partition tables without blocking traffic.
* Replication: Yes. (MySQL replication).
* Durable: Yes, ACID. (When using a redundant setup, which you always will.)
* Commit’s to disk: Not on commit, but with a 200ms-2000ms delay. Durability comes from committing to more than 1 node, *on commit*.
* Less than 10ms response times: You bet! 1-2 ms for quite complex queries even.
* …
* … so I forgot performance: I routinely get 10-30+k writes per second on modern HW with only 2 nodes, with a larger cluster we have production systems with 100k+ per second.
* Yes, it scales linearly unlike some stuff you tried.
The only caveat: Network latency affects performance a lot (due to ACID commit to 2 nodes). So would be interesting to see how this works on EC2.
[...] size , but we’d love to spruce things up with perhaps squid servers, memcached , merle , or a distributed key-value store [...]
It’s not free, but I’d be curious how you find Oracle (previously Tangosol) Coherence to stack up ..
Peace,
Cameron.
Have you revisited this issue in the last 5 months?
What did you end up choosing/doing?
Have you revisited this issue in the last 5 months?
or what did you end up with?
Люди в таких вот случаях говорят – Белую ворону и свои заклюют. ;)
I’d like to add my project – SubRecord to the list which is however not stable nor finished yet (alpha), but a lot of thing have been already done. This is a Java project, to some extent similar to Voldemort, but still very different in a few aspects what makes it kind of unique. Check it out and if you are looking for a project to contribute in – please let me know.
We haven’t settled on anything yet, and since posting this there has been an explosion of new projects. Tokyo Cabinet keeps coming up, sorry I missed that the first time around. There are many, many more.
I’m going to be in San Francisco in June for “nosql” meetup, which is all about distributed kv stores: http://nosql.eventbrite.com/
This should help us narrow our options! Ping me on twitter if you will be there and want to meet up.
Действительно интересно написанно, я наверное бы так не смог.
Разное
Хотя еще полностью непонятно, что там такое происходит, но на 100% могу сказать, что это не в лучшую сторону!
[...] и Last.fm присматривает key-value storage для себя http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ пишет Richard Jones, бывший Last.fm CTO. Коментарии к посту тоже [...]
[...] for our server monitoring application, Server Density, uses MySQL. We are investigating using a non-RDBMS for the metrics data but plan to continue using MySQL for less intensive use such as storing users. [...]
[...] Richard Jones | Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] is another overview available online as well as a set of slides from the NOSQL [...]
[...] Jones (from Last.fm) has compiled an useful comparison of the various distributed key-values stores out there including MemcacheDB, Cassandra and CouchDB. He seems most interested in Scalaris and Project [...]
[...] Jones, co-founder of Last.fm, has written up an excellent overview of distributed key-value stores. Also Tony Bain gives an introduction to the conceptual differences between relational databases [...]
Видел что-то наподобие в англоязычном инете, в Русскоязычном интернете про такие вещи как-то не особо часто посты увидишь.
[...] how I am going to store my data, but then what would I use? I came across a really nice blog post “Anti-RDBMS” that did a break down of a few front running candidate systems… it got me spinning off [...]
[...] quiser se aprofundar um pouco mais, recomendo este link e mais este, todos em inglês. E visitar os sites dos [...]
The guys from Scalaris explain in their FAQ why persistence to disk makes little sense for Scalaris consistency strategy, crash-stop, and will not be implemented, as it would *reduce* robustness.
“If a single failed node does crash and recover, which is not foreseen in the crash-stop model, but might happen if we have local persistent storage, we have three choices:
1. drop the persistent storage and start as new node (crash-stop model)
2. get some inconsistencies as another node already took over. For a short timeframe there might be more replicas in the system than allowed, which destroys the proper functioning of our majority based algorithms.
3. friendly join the system and update the stored peristent state with the current state in the system (one way to implement that would be (1)).
So, persistent storage does not help in improving the availability or robustness of the system.
…
Persistency as in traditional replicated databases is not intended by our system model and thereby not possible. The best alternative would be periodic snapshots, that can be done also without interrupting the service.”
http://code.google.com/p/scalaris/wiki/FAQ#Is_the_store_persisted_on_disk?
[...] Here is a great blog from the cofounder of Last.fm on the multitude of alternatives to a traditional RDBMS for heavy distributed write based applications http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
Great insight into technologies that are new to me.
Working with customers who have existing database scalability and performance problems. I am interested to hear about any case studies where an existing RDBMS was replaced with a data store?
It seems to me that data stores provide excellent scalability and performance, but only in the context of applications written from scratch due to the tight coupling of the data and application layers. For now, I cannot see their ready acceptance to solve companies existing RDBMS’s problems as I point out here:
http://bigdatamatters.com/bigdatamatters/2009/07/nosql-vs-rdbms.html
Just out: Keyspace, a consistently replicated, fault-tolerant key-value store.
http://scalien.com/keyspace
Люди в таких случаях говорят – Без косы сена не накосишь. ;)
[...] are several good blog posts around that go into more detail for each [...]
[...] Anti RDBMS : a list of distributed key value stores http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] Shared Anti-RDBMS: A list of distributed key-value stores [...]
[...] that is going on about suitability of traditional DBMS as Cloud Databases and an emergence of key-value pair data stores as an anti-RDBMS movement of sorts. August 25th, 2009 | Tags: Azure, cloud computing, cloud [...]
[...] Jones, co-fundador de Last.fm, ha escrito una excelente visión de conjunto de los almacenamientos de clave-valor. También, Tony Bain ofrece una introducción a las diferencias conceptuales entre las bases de [...]
[...] A list of distributed key value stores [...]
[...] It’s a “distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API”. Data is stored in ‘documents’, which are essentially key-value maps themselves, using the data types you see in JSON. CouchDB can do full text indexing of your documents, and lets you express views over your data in Javascript. I could imagine using CouchDB to store lots of data on users: name, age, sex, address, IM name and lots of other fields, many of which could be null, and each site update adds or changes the available fields. In situations like that it quickly gets unwieldy adding and changing columns in a database, and updating versions of your application code to match. (from: Anti RDBMS a List of Distributed Key Value Stores) [...]
[...] Richard Jones | Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. (tags: article performance scalability database denormalization nosql) [...]
[...] Richard Jones | Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ [...]
[...] http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] Store Database,而应当叫做Anti-RDBMS或NOSQL [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] Anti-RDBMS: A list of distributed key-value stores | Richard Jones, Esq. (tags: programming erlang dht store comparison couchdb storage mysql data performance scalability architecture db cloud key-value key) Interesting Links [...]
[...] one thinks of the capabilities and limitations of distributed key-value stores relative to relational databases, one thing is clear – the stranglehold that SQL has held [...]
[...] database engine is a direct competitor to BerkeleyDB, and other key-value stores: one key, one value, no duplicates, and crazy fast. However, being faster is one thing, and [...]
[...] ou du moins proposant des concepts similaires sont également disponibles sous la forme d’outils Open Source que l’on peut déployer dans nos [...]
are you looking for low latency, replicated key-value? no exist!
CAP Theorem :
Consistency == low latency
Partition tolerance == replicated
key-value is high Availability.
so you break CAP Theorem !
[...] información sobre el mundo NOSQL. Un ejemplo para comenzar a hacer clicks y perderse podría ser este artículo/comparativa en el blog de Richard Jones -sí, otra vez él-, o este otro artículo/resumen de un encuentro de NOSQLeros en San Francisco. [...]
Great resource.
Cassandra is on apache incubator now. The page :
http://incubator.apache.org/cassandra/ has good documentation and the IRC channel(#cassandra on freenode) has good population.
An update to the article would help a lot.
I’ve added a note to the top of the article to draw attention to the fact that *lots* has changed since I wrote this. I won’t be updating the article tho.
I gather that people here mostly compare the key-value store system to a relational database. And indeed some KV stores resemble RDBMS – indexing, processing on data… But generally key-value stores have much more common with file systems. If you do a complexity list than RDBMS would be sitting at the top (semi-structured namespace, complex queries, procedures, ACID), file systems would occupy the second place (hierarchical namespace, rather simple metadata, POSIX) and then there are key-value stores (flat namespace, no other metadata, no security…). The simplicity is advantage here as the speed and scalability comes with it. Anyway, project FAWN has interesting (experimental) key-value store – http://www.cs.cmu.edu/~fawnproj/
Oh, and another KV store that does not seem to be mentioned here is Riak – http://riak.basho.com/arch.html Pretty similar to Dynamo (consistent hashing, vector clocks, merkle tree).
great resource!
Most of the resources I have found, lightly touch on BI aspects – i.e. doc based stores (like Mongodb etc.) are not best suited for BI especially adhoc reporting.
We are considering moving part of oracle data to mongodb. I am wondering how to best address the BI needs for our end users? Do we setup an ETL to create a relational datawarehouse that can use traditional BI tools like business objects/cognos etc.?
If we use our developers to create adhoc reports on docbase datastore then the cost adds up.
Any other approach?
Since BI (Adhoc reporting, dashboarding, olap etc.) is an important byproduct of the database, I wonder how do these docbased data store handle it?
Could you please recompile this post with the new data, adding those that were not in the list.
Thanks in advance
[...] Another nice general comparison: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ [...]
[...] Anti-RDBMS: A list of distributed key-value stores [...]
[...] Anti-RDBMS: A list of distributed key-value stores von Richard Jones, vergleicht hauptsächlich Key-value Stores miteinander [...]
[...] databases have become pretty trendy lately – especially with the news that big players like Facebook and Twitter are using the [...]