采访《七周七数据库》作者 Eric Redmond - 数据库的故事
本期音频文本非常感谢@吴怡村的整理。本文是 Teahour 第 18 期 『Interview with Eric Redmond about Database』 的录音文本，欢迎大家订阅 Teahour，iTunes URL 是 http://itunes.apple.com/cn/podcast/teahour.fm/id608387170。Android 用户可以使用 AntennaPod 来订阅。同时，欢迎加 Teahour 好友，微博和 Twitter。
Part 1: Introduction and the CAP theorem####
Eric: Sure, as you've mentioned, I'm Eric Redmond. I'm relevant to this episode. I co-authored a book with Jim Wilson called Seven Databases in Seven Weeks. And the idea was that with all the changes that have been recurring in the database marketplace over the past few years, it was probably the single best way we knew to try and communicate all of these changes: by just teaching different databases that correspond to different styles. So in the book, as an example of relational databases, we had Postgres, and as an example of key-value storage of two different kinds; we had Riak and Redis; for column-oriented data stores we had HBase and a couple document data stores with Couch and Mongo, and a graph data store with Neo4j. Other than that, I actually work for a company Basho that makes one of the databases: Riak. Strangely enough, until I wrote this book, I had not even heard of it and I was so impressed with the design. I even mentioned in the book that out of all of these databases it was the one that struck me as one of the most architecturally elegant of the distributed styles that I just couldn't say enough positive things. Finally I was just saying so many positive things I guess they decided to pay me for it. So now I work at Basho.
Kevin: OK, so maybe (let's talk about) something about your background. Why are you all of a sudden interested in databases?
Eric: Well, like most things. It seems like all of a sudden but there's a long and boring history behind it. No, I've actually been working in databases since college at Purdue. I worked in (of course at that time, relational databases were the only real options anybody had) specifically Postgres; that's where I did a lot of my work and even more specifically than that, high-dimensional indexing: the ability to index multimedia data by mapping it to a point in n-dimensional space was the focus of our research at the time. It's interesting now in retrospect how almost silly that was. There are so many more efficient ways of mapping high-dimensional data than simply plotting a point in 5,000 dimensional features of vector space and doing a nearest neighbour search on it but it was still really interesting. It was sort of my first foray into this world as an undergrad.
Kevin: OK, so you actually did some research on database in university.
Eric: Whoa, whoa (let's) back up. My name is not on any papers. I definitely wouldn't go as far as saying I did research. I was an undergrad that helped with a lot of this research though.
Kevin: OK, just to be accurate on the record.
Dingding: This is Dingding. So (about) your book Seven Database in Seven Weeks, I'm just curious that why it's seven, not six and not eight?
Eric: That's a good question. Actually, there was a book preceeding this by an author Bruce Tate called Seven Languages in Seven Weeks and the theory was very similar that if you learn all these different types of programming languages that will give you a very good overview of the programming language ecosystem. He's the one that picked the number and we just stuck with it.
Kevin: That actually is a very nice book and very mind-expanding. I think your book will be following a very similar pattern: just seven representative databases in different genres.
So when you graduated, did you start as a database guy?
Eric: No, I graduated right at the end of the dotcom boom: right when the dotcom era crashed (at) end of 2002, so I got whatever job I could. As much as I would have love to get into data management, I actually try to interview with Google. At that time, they had less than 100 employees but since I didn't have a PhD or anything, they weren't really interested. Although big data management, even at that time, was something I was very interested in. But I just took whatever job I could and I ended up being a software engineer for a few years and finally I was able to fall back into the database world: originally working for a company that does Mongo hosting called MongoHQ then from there moved on over to Basho to work on Riak.
Kevin: OK. I want to start this discussion of databases with the CAP theorem; I actually heard this from you. I think that would lay a pretty good foundation when we talk about each genre of databases. Why don't you talk about this CAP theorem?
Eric: Sure. There's a lot of theory to know about distributed systems in general and those that manage data in particular. A lack of knowledge of these details have been bringing people to make wild claims like 100% up-time guarantees or what I think is even crazier that is that CAP doesn't matter. You're starting to hear this trope quite often now. Really, the cornerstone of all distributed databases is the CAP theorem. The CAP theorem: C stands for consistency, A stands for availability and P is partition-tolerance. I'll start with partition-tolerance first. This means a system that can tolerate a network partition, meaning lost network packets, which is always a possibility. Effectively, partition-tolerance means your system is distributed. If it's possible to have two computers on a network where one tries to communicate to another one and that signal can be lost. If it can be lost, that's a network partition. In the simplest case where you have two computers and one cable running between them, if you just take a pair of scissors and you cut that cable, you now have two computers that are effectively their own little sub-networks because they can't talk to each other any more. You can heal that partition eventually (take a bunch of electrical tape and fix that cable) but that possibility is always there. So as far as the P (partition-tolerance), this is something that's not optional if it's a distributed system. So the layman definition of the CAP theorem is: between consistency, availability and partition-tolerance, you can only have two of them. But really, what it actually is, is that if you have a distributed system, you have to choose between consistency or availability. Now, not to belabour the P part but P really is the keystone of all distributed system problems. If it weren't for the possibility of partitions, creating a consistent and available distributed database would actually be trivial. For concept of such painful importance, you would think it would deserve some fancy Greek letter like ∏, but now it's just P. Now I sort of glossed over consistency and availability, I'll try and explain using just a quick little story about why you can only have one of the two. So imagine you have three men that are sitting at a bar and the bartender gives each of the men a whiskey, then if you walk up to any of the man and ask what was your last drink, they would of course say whiskey. Now let's say one of the three men gets up to go to the bathroom. In distributed system's speak, he's been partitioned from the group. He can't communicate with them because he's in the hallway walking to the bathroom. Now while he's gone, walking away, the bartender gives the two men that are still sitting at the bar a beer but now we have a problem because depending on who I ask, either a man at the bar or the one on the way to the bathroom, I'll get a different answer to the question of what was your most recent drink. So what this means is the three men would give inconsistent answers because as a system what you want is a system where no matter who you ask, you always get the same answer. That's what consistency means. But the man on his way to the bathroom could say I'm a minority here, I can't reach consensus with my friends so I'll just refuse to answer. So in other words, he's now unavailable. He's unavailable to answer your question. Not that he's down or he's crashed or anything. It's a conscious decision on his part to not answer that question because that way he's not inconsistent: he's just not answering and that's the decision that must be made. You can either be consistent but unavailable, or you can be available but inconsistent in the face of a partition, but you can't be both at the same. So that's sort of my parable of the CAP theorem and on why you have to pick consistency or availability but you can't have both.
Kevin: So using your analogy, if that man chooses to be available, then his answer may not be correct.
Eric: Right, it could be inconsistent because the last drink he had was whiskey. But (for) the other guys, the last drink they had was beer so they would give you inconsistent answers if you ask the wrong one. The problem is you never know which one you're asking. Now remember I've said that the man can decide that he's a minority and can't reach consensus so he'll just refuse to answer. This is actually how some databases like Mongo work: they will vote on a master. The master is the one that gets to communicate. But a master can only be elected to be the one to answer questions if it's in a majority. So if a network partition occurs, the minority has just lost the ability to be elected. In the CAP theorem sense, hypothetically Mongo could remain consistent. I'm not going to speak to whether it will remain consistent because this is a technical issue and working for a competitor I probably shouldn't exactly go into any sort of rant about whether it's consistent but it could be.
Kevin: OK. But it can't be both consistent and available.
Eric: Right. That said, that doesn't mean that a system can't be neither. It could be neither consistent nor available and there are databases that are like that.
Kevin: What's the point for that?
Eric: Well, there are ones that were designed this way. PNUTS is one of the more famous examples and they do it for lower latency because they don't actually need either. They just want to be as fast as possible. But there are others that are neither consistent nor available just by broken design basically. These are really hard problems and it's one of those things. We're talking about edge cases so it really depends on the size of your clusters. Some people will run three machines and they'll be fine most of the time. I actually spoke with a potential customer a few weeks ago. When I even said what if one of your current setup breaks and they said, well we'll be down for a minute and we'll fire up another one and we'll be fine. That was acceptable to them. That was not acceptable to everybody: some people can't be down for even a minute. So what I find a lot of times in these cases where you have some of these databases that aren't as consistent as they lead on to be. People never in practice really run into them anyway because they are just not that big. Now, when you start talking about dozens of nodes or hundreds of nodes, that becomes a different case: failure is no longer an edge case. Failure is just common because if you think about the fact that if you have some small percentage, say even 1% chance that one of your nodes goes down and you have 100 nodes, there's pretty good odds that one of those nodes is going to be down at any given time. So different systems, like Riak, is designed with that sort of high availability in mind. Now of course, what it gives up is consistency. It's an eventually consistent system but it's not a fully consistent system.
Kevin: OK. When you talk about availability and consistency, it seems like most databases will optimize for one of them. Is it like a continuum between the two?
Eric: It's very much a continuum and it's not just a continuum between consistency and availability for design. In some cases, it's how you configure your network or how you configure your servers. HBase, for example, you can make it either a consistent or available system. Something like Riak, in theory, could be written in such a way that you could have a fully consistent but not available system. Cassadra could be possibly be written that way too. The reason it's not done that way is because people just don't want it. We kind of have different customers. People use different databases for different reasons generally speaking, and to answer your question, they do tend to lean one way or the other. But whether they succeed is kind of a different question.
Part 2: Relational Databases####
Kevin: OK. I think this is a good time for us to dive into different genres of databases and see where they fall into this theorem and also what are the trade-offs that they made. Or in other words, when we pick those databases, why do you want to choose one of the databases? We'll start with relational database cause that's been around since forever. Where do you think relational databases fall into this CAP theorem?
Eric: Well, the CAP theorem only applies to distributed systems generally. It's one of those thing that technically applies to non-distributed systems because if you give up partition-tolerance, you'll have consistency and availability, but who cares? At that point, it's like why are you even discussing an inherent problem of distributed systems in a non-distributed sense.
We can talk about Postgres for example. In any non-distributed systems, you have ACID compliance. It's Atomic, Consistent (completely different meaning than the CAP theorem of consistent by the way, which makes it very confusing to talk about), Isolated and Durable. These are what you want out of a transaction. It either succeeds or it doesn't and entirely linear isolability is another term for it. Now, when you distribute them. This is what people don't necessarily think about. If you have a backup of your relational database (which you should), you have effectively created a distributed system unwittingly and unknowingly. If you've ever had a relational database that has a backup and then your primary database crashes, you have to load from the backup and you've lost data. You have effectively lost consistency in the CAP theory sense. You've also lost availability because you were down for a while when you were trying to update. I'm not sure if that technically counts as being unavailable because your entire system is down then so there's nothing up to even answer requests. There is another case, when people have relational databases and they don't think about the CAP theorem does apply here, is cache: if you throw Memcached in the front of your relational database to more quickly answer requests, you again run into the CAP theorem: you have a consistency problem. Cache expiry is an eventually consistent issue. It's a tough thing to avoid. Simply claiming that I have one big server and I don't have to worry about these issues is completely missing the point. That's of course still assuming only one server (but you can distribute relational databases, of course master/slave applications really come and set up). Again, this is one of those cases where you're making some sort of consistency-availability trade-off. Very often it's still eventually consistent. You can lock and say okay we want full consistency but now you have lost your availability. When I say lock, I mean lock across the network, not just lock within a single database because that's effectively what you need: you need some sort of lock for full consistency. You need a consensus between your nodes. Usually what you're giving up at that time is latency. You're increasing your latency because you have to wait for all of these responses to happen.
Kevin: But latency is not part of the CAP trade-off.
Eric: It is not. This is actually something that has been pointed out several times. Eric Brewer has mentioned this fact as well. Like I said before about PNUTS for example, they gave up consistency and availability for lower latency.
Kevin: You talked about master/slave and replication type of scenario. What about sharding relational database? I know it's kind of ... You can also do that, but it comes with trade-off.
Eric: It does come with trade-off. When people talk about distributed, that's sort of a catch-all term for just having multiple computers. But when you're talking about databases, there's generally only two reasons you distribute: either you're looking to increase availability by just being up more often and being safe if a node crashes (redundancy) or if you're looking to expand resources so rather than just having one DB server, you horizontally scale. When you split one table to live on multiple servers, that's sharding, that's expanding the resources, whether it's disk or CPU or RAM or whatever. You're expanding the resources horizontally across multiple servers. Although that's a good thing, unfortunately you've introduced an element of danger now because remember when I mentioned earlier that the more servers you add, the odds of any one of those going down remains the same so for each server, out of the entire group, it's a high chance that at least one of them will be down. If you have 1000 nodes, one of them will probably go down. But the problem is, if you only have one copy, you have corrupted your data because say you've divided your table across 10 different servers and one of them goes down, you've lost 1/10 of your data. So you deal with that by also replicating and that's the redundancy; that's the safety part. But also, it's good for reads as well: you can read from multiple servers so the resource requirements of any individual one has gone down because presumably you are dividing your reigns across all of these redundant servers as well. That of course introduces the problems of having to choose between consistency and availability though because once you've replicated a value to multiple servers, you'll have to keep them all in sync: you'll have to reach consensus. But reaching consensus is a very difficult thing and a lot of times what you pay in is latency. A quick comment, I keep throwing this word consensus around. The idea of consensus is in two parts: consensus is safely and liveness. Safety is the promise that nothing bad will happen that liveness is the promise that something good will eventually happen. There's an old paper from the 80s called the FLP Impossibility Proof that shows you can't actually guarantee both full safety and full liveness in a distributed system. Before this, people just threw out two-phase commit, thinking that it will solve their consensus problems. But if a network partition occurs, again as I said partition is the crux of all distributed system problems, you can lose liveness and actually you can even lose safety. So then they created three-phase commit. It has better liveness guarantees but it's still potentially unsafe in the face of network partition. That's why, maybe it's been 10 years or more, Leslie Lamport created a consensus system algorithm called Paxos and it's very powerful and it's very flexible and it's also very, very complex. There are very few good implementations out there. The importance of these consensus algorithms are unless you're routing all of your traffic through one main master server, you can't really have consistency in your distributed system, not in the way that we want liveness in terms of availability. What we live with is eventual consistency, which itself is a liveness (in terms of availability) property. Peter Bailis came up with this concept called PBS (Probabilistically Bounded Staleness) and the point of it is to answer two questions about eventual consistency: first question is how eventual is eventual consistency and the second one is how consistent is eventual consistency. Again this is important because, if you've noticed, the very first word in there is probabilistically: what's the probability of you values being stale because that's really what it comes down to in most of these distributed systems. It's when you have an eventually consistent system, you're talking about probabilities now. This just cuts right through the heart of distributed databases because you want to be as consistent as possible but you also want to be as available and with the lowest latency, and to get all of those things is just not in the cards: you're giving up something at some point. So the best thing to do is just figure out what exactly it is you need from a business case and make a decision that way.
Kevin: When we do get to the specific parts, we'll ask you to give us some examples of use cases on how to choose. So we're still on relational database?
Eric: Oh yes.
Kevin: I kind of want to ask the question of open-source database like MySQL and Postgres v.s. proprietary database such as Oracle, DB2, Sybase and so on. Is anybody there still using non-open-source databases? Is there a reason not to use open-source databases?
Eric: Not really. That's about as boring and straight-forward as it can be: there's no good reason not to use an open-source database. There once was but these aren't true anymore. The fear of vendor lock-in is pretty well abated. It's not problem in the way that it was. If you use MySQL and you don't like it, try Postgres. If you don't like that, try Volt. There are so many relational databases out there that are open-source. You may as well just go for it and use them. I have been in charge of three conversions to MySQL or Postgres from a proprietary database. Two of them were Oracle and one was SQL server. In every case, we saw a significant improvement in performance largely around just because as we were doing this port we've been able to refactor everything but the point being that it was successful, it worked well and it was cheap. There's often a fear of using open-source systems as who's going to support this. I can't think of a single open-source database that doesn't have a company backing it right now. They all do. Maybe Couch doesn't, I don't know. There's Couchbase but that's kind of its own database. They all do. There's always somebody to call if you have problems. They all have books; they all have communities. You can always just email Stack Overflow. The reason I bring that up is the fear of using an open-source database I find is generally the fear of the lack of support like who do I call if something goes wrong. However, I've had problems with industrial commercial support as well. If it means that much to you to have someone in the blind to be paid $500,000 a year in licensing fee then you have more money to throw around than I do because I would rather just hire five engineers for that price or unless if you're in Silicon Valley then two engineers and just have them on the job.
Kevin: So out of the open-source relational databases, I know you have experience with Postgres so that's sort of front-runner nowadays.
Eric: That's my front-runner. I'll be honest, in the past three or four years, I haven't really been involved much in the relational database world. I pretty much spent my lifetime (10,000 hours) in front of my computer on relational databases and it's not something I'm interested in going back to. I still think they're amazing and I still think relational databases will solve the majority of people's problems. I think that far too many people rush to alternatives for no good reason. But I definitely still throw out Postgres because I just prefer the design and now again this is my entire biased opinion. It's the one that I'm the most comfortable with. It's the one that most of my knowledge of the internals peaked around 2004 and at the time its internals were far cleaner and much more flexible than something like MySQL at the time: you could actually write your own indexes rather painlessly. They have just implemented it so you could write your own B-tree style indexes with even less trouble than just flexible scripting almost. I just enjoyed the flexibility and the community.
Part 3: NoSQL databases####
Kevin: Let's move away from relational databases. There's this NoSQL basket for all of the non-relational ones. My first question is why did NoSQL emerge in the last 10 years? What has changed to give birth to this set of databases?
Eric: My theory is that there are really two reasons why NoSQL emerged: one is just the fact that as more and more people have got on the web, more and more people have constantly connected devices, we're just collecting more data and that amount of data was something that was never envisioned for relational databases. Relational databases can handle lots of data. They can go into TB range but it's one thing to be able to collect TBs of data and it's another thing to be able to respond quickly to requests once you've collected that data. So one is just the amount of data and two is that over the years, relational databases have been very specifically tuned for business intelligence: for doing analytics and data warehousing and things like that. Because they weren't dealing with such massive data sets most of the time and if they were it was some sort of just data that you could just put in the background and just archive it and then build a data warehouse star pattern (schema). I think those two reasons, more than any other, are responsible for companies having to effectively go along. The first few NoSQL databases that popped up were Amazon and Google because databases like Oracle just couldn't keep up. They needed something different. As more and more people started having these same problems, at the same time, virtualization was getting easier and cheaper to scale out horizontally. This is something that relational databases historically have been very bad at so they took advantage of all these virtualized systems and were able to scale out. So these databases were designed to do that. HBase, Dynamo were designed to do this. That's sort of my pet theory on why NoSQL has emerged recently. But you're starting to see in some cases relational (databases) try to catch up. I've mentioned VoltDB. It's made to be a horizontally-scalable database. I don't know anyone that uses it. It may well be great. Stonebraker is brilliant so I would definitely never bet against him. But I don't know if people have just decided that they don't need all of the relational trappings or if they believe that they can't have them. But for some reason, people have been flocking to NoSQL database world.
Dingding: So, you can build a distributed relational database with distributed nodes so you can have hundred of Postgres nodes and just build an upper layer to make it distributed. What's your opinion of that against a NoSQL solution?
Eric: It's interesting. If you think of that architecture and you work with the architecture of something like Mongo, you'll see a lot of similarities where you have a mater node that replicates and then you'll have your Mongo server that accesses a router and your configuration that track where different keys go, they'll feed Mongo to decide this shard goes to this replica set. Very similar to the way you would do manually in a relational database. But again your problem comes down to the CAP theorem. It comes down to the fact that you still have to make a consistency-availability trade-off. You can use something like Prosgres but you don't really get all the benefits anymore. You can't really easily or realistically join, for example, across a network. Really, the power of relational databases come down to the fact that you can normalize all of your data structures and if it's normalized, you can query it in pretty much any way you want and more specifically joining. You can join values and create new table types and relations that you get as response. You have effectively lost the ability to do both of those once you start distributing your relational database unless you're just doing straight-up replication. I know people do this a lot but sometimes I get the feeling they do it just because they just don't know any better. If you're doing that, I would recommend just trying something else. Just try a system that was designed to be distributed from the ground up and not shoehorn a system that was not designed to be distributed into becoming a distrusted system. The important thing to note about all of the databases that we've mentioned so far, with enough code, you can make them all do anything if you want to hack enough around it. The question is how much effort you really want to put into designing your own kind of database ecosystem, which effectively what you're doing. You also have to think about operational costs. Yes, you can create these sort of custom clusters of relational database to do the things you want but now you effective have to worry about is this a master node or is this a secondary or is this a configuration or is this a router. You've sort of offloaded your development costs and put it squarely on your operational costs. This is something developers don't often think about when they are coding: coding is the cheapest and fastest phase of your process. The more expensive and much long part of it is operations. Assuming you're successful as a developer, someone is going to be running and maintaining that code for years. If you design it in such a way that is complex to scale and maintain, you've effectively just given some future person a whole lot of work. If you're not distributing, just use relational database. They're easy and they have a ton of research behind them for years. SQL is an amazing language. It's the most successful language ever. Think of any language that's been around that long. You're talking about C and SQL in some form, obviously not in the SQL specification. That's much older than 40 years but the whole relational concept. If you're distributing, go after a NoSQL solution.
Part 4: Document Databases and Column Databaes####
Kevin: Talking about NoSQL, a lot of people equate them to document databases. That seems to be most popular one nowadays. So why don't we start with document databases? What are the trade-offs and maybe in the CAP theorem context?
Eric: Again, I wouldn't necessarily want to frame this as a CAP theorem thing because there's no reason we couldn't design a document database that's a consistent v.s available. If you look at Mongo for example. It's designed to be a consistent database. One thing I will say is that Kyle Kingsbury recently wrote a series of blog posts about many of the databases about many of the databases that we're talking about here where he stress-tests and he creates artificial partitions and queries them in different scenarios and looks at the failure rates. What he uncovered is that a lot of the claims that are made by some of these implementations don't necessarily hold up in a very tough scenario. Again, fairness, being what it is, in the real world, these may not come up quite as often. He really hammed it. One of the losers in his initial attempt was the default settings that come with Riak which was kind of painful for some of us at Basho to have to see but there are settings you can make it better and most of the databases did. There were configurations that you could follow. I will try to find that link and send it to you because they're definitely a fascinating read. As far as document datastores, the two that I covered in my book are Mongo and Couch. Mongo, for many people, is synonymous with NoSQL: if you say NoSQL, it's the first thing that pops into their head. There's a lot of reasons for this, one of which is it's the first popular one to come out and it still remains the most popular as far as just sheer number of users. Mongo, in the CAP theorem senses, is slated to be consistent and of course that means it's beat is availability, in the technical sense, if there's a partition. Although that used to be a selling point, Mongo actually is from the middle of the word humongous because it was supposed to deal with humongous data sets. By design it's very similar to what you would do manually if you took a relational database and wanted to create a master/slave setup yourself and shard keys and things like this. Part of the popularity is just the fact that it's really easy to use if you are used to a relational system. It comes with a very simple query language where you just put values in. (When) I say document, I don't mean a document like a PDF or Word file or anything like that. A document, in this case, is just JSON and is keyed by an id. Other than that, you can query by any value. This is actually really important if you are designing something and you're not entirely sure where you're going with it because unlike a relational database where you have a very structured schema and you have to say I'm going to have a person table and a person is going to have a first name and a last name then later on you decide oh I want to add middle name too. Now you've got to create a new column called middle name and you've got to actually tell the database what that schema is. In a document data store, like Mongo and Couch, you don't actually have to tell it anything in advance: you just give it a key-value, just like any JSON,
last:last_name. And then if you just decide you want to start adding middle name or initial, you just start adding middle name fields. You don't have to migrate your old data in any way. This obviously has a cost on retrieval: it's easy to put values in but when you pull values out, you code has to deal with the fact that there might be nothing there. So over time, your code could get crufty. But if you're going very fast, this is a trade-off that people, especially developers, don't mind paying. Couch has the same benefit. The difference being though the way you query Couch is you actually write MapReduce and that MapReduce is a view over your data in some way. For example, if you want to be able to query by last names, you would write a MapReduce that would extract last names out of your values and any time you would do a search, you search for values that have been extracted by this MapReduce view. Mongo doesn't require you to do that: for its query mechanism, it's much more like a relational database where you'll just say give me very last name that match this name or give me everything between this last name range. It feels much more like a relational database. That said, if not a relational database, you don't normalize it. It's almost inherently denormalized. Since it's JSON, you can nest values: your person could contain an array of pets. So in relational databases you might have to a separate pets table and then join it with a person. In this case, your person will just contain pets. That makes it really easy to build and it makes it really easy to put data in. It makes it slightly more difficult to get data out because if you just for example want the names of everyone's pets, you'll have to do a more complex query than you would in a relational database where you just say
select * from pets. You would have to say, get all the users, find all the pets, extract those names and then give them to me as a list or as a collection. That adds a little more complexity when you write your queries. This is something you'll find with most NoSQL databases: it's much easier to get data in than it is to get data out. And this is part of the reason in my opinion you're seeing popularity of other third-party analytics tools like Hadoop, which is like a MapReduce engine where it can perform queries across the system predicated on the idea that when you have TBs of data spread across multiple servers. It's cheaper to take the algorithm and just send it to the servers, let them compute the value and then give it to you than what you would do in a relational side where you stream data and then compute it because it doesn't matter because it's all on the same box anyway.
Kevin: So it's easier to ship your queries or the algorithm themselves to the data because you have a distributed system than to getting pieces of data from each place and try to join them together or get a result.
Eric: Yes, it's just the matter of reducing the amount of data going over the wire. When you're dealing with big data, you very often have to invert your way of thinking and this is one of those cases where you invert the order of operations you are normally used to doing. One of those inversions I mentioned before where very often you don't necessarily think about how you're querying the data. You just put it in there and you worry about querying it later. That obviously has its own cost. You're going to be paying these costs anyway. In a relational database, you just pay them in a very different way: you pay them upfront, in the design phase. While you're sitting down designing something, how many times have you done an application that you're using a relational database and you start with a whiteboard and you start drawing tables and saying okay we're going to want to do this and this is what our scheme is going to look like and you start coding and say oh crap, I need a different table or I need join tables to sit in-between these because the join needs to actually have another value hanging off of it, so I need to create a whole new table. These design decisions are very trivial when you're dealing with a lot of these NoSQL solutions but the querying is more difficult.
Kevin: Yeah, I guess the compliment with relation databases was querying is really powerful, really super easy. Whereas when you would start to add data, then you start to face the CAP theorems. I still want to talk a little bit more about Mongo because that seems to be the popular one. We talked about availability and some of the trade-offs but from a data modelling point of view, if you just compare relational database with a Mongo data, I feel like Mongo is a very opinionated way. It's like if you know exactly you want to traverse the data, then Mongo is perfect. In your example, a person and cat. Maybe cat has toys. If you know you always go from person to cat to get toys, you never would just aggregate some toys that way, then it's perfect. You can always go that way. But you're losing the flexibility. You may not anticipate you do need to do another scan.
Eric: Yes. The flexibility you lose is the query flexibility of a relational database. That said, no database has the query flexibility of a relational database. The relational database designed SQL is a query language. It's a declarative language. It's structured. But that's exactly its strength and largely that's its biggest strength. And relational databases are the only ones where you have a somewhat structured schema. Column-oriented databases like HBase, you define your schema upfront. Cassadra is another: it's topologically Dynamo-based like Riak is but its data structure is column-oriented like HBase is where you define what your column families are. They do have a little more flexibility than a relational database. The problem for relational databases is you put one value per row for example whereas in a column-oriented data store you can have as many individual discrete values as you want without adding rows: you're just adding that value because data is stored in columns rather than rows. A simple example would be a wiki where the key might be the title of the wiki page and you might have multiple revisions. In a relational database, unless you denormalized it, if you just said okay my page table will have one column for the title and one column for the contents of the page. What you'll find is the title never changes, so you'll just be replicating that a lot and the contents of the page change quite often. Whereas in a column data store, you would have just one column family page and the title would never change. That would be one column and another column would be the contents of the page and it would change. When you do a query, it's a row but it's almost like a pseudo row. You're just saying okay give me the most recent title and the most recent page contents. You actually can get a lot out of this: you can give them a time to live, which is very nice. There's a reason that Facebook's messaging system runs on HBase. The ability for messages to have a time to live is something I presume they are able to leverage and also it scales up crazy.
Kevin: Right. I guess if you do this on a SQL database, it'll be non-trivial or you'll have to write a lot of code to run it like that.
Eric: Yes, or you'll have to have some sort of custom extension. I wouldn't be surprise if something like Postgres had a time to live extension or a timestamp. Generally what you'll do is you'll just timestamp a value and then as part of query you'll just say give me this range and then maybe manually delete everything that's outside of that range. But that would be one trivial way of doing it but it's nice to have these things built-in, that's for sure. Again, like I said before, most of these databases, with enough code, you can make them do them to anything but how much code do you want to write?
Dingding: When we talked about document DB, you mentioned some column DB like HBase and Cassadra. Document DB to me is similar to column DB but with few limitation and more improvement. So what's your opinion with these two and the difference between them?
Eric: Actually a document DB is much closer to a key-value store in this way because Mongo or Couch, whether its queries or views, by default is not actually indexing the values. The only thing that's indexed is the id, which is just a key look-up. But if you query against a column that's not indexed, it's just a full table scan. The same is true with Mango and Couch. The same is not true with something like HBase where these are sparsely ordered. You're actually always scanning effectively. I guess you can do key-value look-ups as well but generally you scan ranges of values. To index, you'll effectively index manually. Again, depending on how much code you want to write, you can index with a key-value store as well. If you have Redis for example, which is a key-value store and you have a deeply nested value and you want to be able to query by certain values, you can just create another key type and say okay I'm going to be a quick-up and point to the, for example, say you have a
person:social_security_number and that'll contain all of the person data that we talked about earlier with our Mongo example: searched by last name. You can just create a key like
last:last_name and then it can point to the correct
person:social_security_number and then it's just a look-up. You can do this with a key-value store. In relational databases, obviously to do it effectively, you need some sort of ability to scan. So you could effective write your B-tree if you can't scan. It depends on how much code you want to write but I would say generally speaking, I actually find much more similararity between a key-value store and a document data store than I would in a column-oriented data store and a document data store.
Kevin: I would say that as well. I feel like one way you look at document databases, they're also key-value stores but the value can be nested.
Eric: Riak, for example, all values are opaque, meaning it doesn't care what the value is so you can actually put JSON as a value inside of Riak. It has a secondary indexing feature so if you want to index against some of those values, you just make it index and you query against that index. That would make it closer to Mongo in that respect as far as querybility. Or if you want to use MapReduce, you can do that too, which case it'll be something more close to Couch although Couch is considerably more efficient on the way that it builds its views cause it pre-builds them and just keeps them updated using partial MapReduce as you make updates. You're absolutely right. There's much more similarity than differences in that respect.
Part 5: Key-value Stores####
Kevin: We talked about document databases and column databases already and you mentioned Mongo, Couch, HBase and Cassandra. Let's move to key-value stores. We already touched on Redis already. Maybe you can talk about Redis and Riak in comparison? Those two key-value stores. What are the characteristics of each and so on?
Kevin: We talked Redis in the caching context. It's mostly read. I've seen some people that use Redis as an intermediary to their database so instead of all the operation touching database they can just put in memory and it's fast.
Eric: It's actually faster to write to Redis than it is to read from it. I just experimented once before and I was able to get 100,000 operations writing and 80,000 reading. I don't remember what the time frame was. Maybe a second. Part of it is because it has to convert the value into something when it reads. Actually, I often recommend this a lot. It's for data transformation using Redis as an intermediary. I actually did a sample project: in the last chapter of the Seven Databases book I used Redis as an intermediate data transform from a tab-separated value into a document that's suitable to be stored in Couch. And it's because it's much faster to find duplicates in a Redis than it was to try and do a write to Couch that may have conflicted version-wise. This is a multi-threaded application to go faster. If two writes are going at the same time then Couch would reject the attempt at one of them so it has to go back to the application and say okay Couch rejected this and it has to read Couch again to get the newest revision number and then attempt to write again and sometimes that round-trip will push it back again because another write has succeeded in the interim so you can fill up your write effort rather quickly. So I 100% agree with that as well. It's not just caching. That's just the most common use case. Data transform is fantastic for it. The other key-value store that I wrote about is Riak which we should definitely talk about. Riak is actually designed to be a highly available system. It's made for the case where you can't be down for any amount of time. We've actually have customers that have had 100% up-time as crazy as that might seem. Comcast mostly gave a talk. They said since they've installed Riak, for the past three years, that system has been up 100%, which is unheard of in the database world. Part of the reason is because of the way it's designed: it has built-in sharding and replication. You can lose many servers and still be up. One of our customers had 30 nodes, and if I recall correctly, 13 of the 30 nodes went down and they didn't realize it. It took them like a couple days to notice a lot of the servers were down and they spin them back up and everything was fine. This is really Riak sweet spot. It's not made to necessarily be the fastest store or the largest scale. It's really made to just be up all the time and you could survive a nuclear blast and Riak will be fine as long as you've got Multi-Datacenter Replication and you've replicated all your values in multiple datacenters.
Kevin: So does the replications happen automatically?
Eric: Well, Multi-Datacenter Replication is actually the one thing that we charge for. The single datacenter replication is built-in. When you write a value to Riak, it will by default replicate to three nodes. This is tunable, you can set its value to whatever you want. We usually recommend 3: it's usually sufficient for most cases. That means at least 2/3 of that value's nodes could go down and you still have data available. Then there's Multi-Datacenter Replication which means that you have multiple datacenters that themselves replicate through each other to keep themselves in sync by various means. It's very tunable. This is for a lot of reasons, either data locality: so you can have a cluster in the U.S and a cluster in China and you can have data local to your Chinese customers cause they can access it faster v.s other data that's local to your U.S customers or you can have two datacenters and one is just used for backup or whatever other reasons people have for choosing to replicate at multiple datacenters.
Kevin: By the way, I'm just curious, are you aware of the companies using Riak in China?
Eric: I'm not. Actually that's definitely something I'm very interested in. If anyone's interested in helping spread the word about Riak in China, I'd love to hear from them or if anybody that knows of any companies in China that are using Riak, I would love to hear from them. We are a company that was found in the U.S. All of our first customers are in the U.S, then we went to Europe and six months ago we opened an office in Japan so we're slowly spreading internationally. Any way to speed up that trend would be amazing cause we have not really spread to South America as far as I know as well. We may be everywhere. I just haven't heard about them.
Kevin: Is Riak itself an open-source database?
Eric: Riak is 95% open-source. Riak itself is entirely open-source and Riak CS, which is like sort of Amazon S3, object storage. You can run Riak CS where CS stands for Cloud Storage which acts like your own personal S3. It has an S3 interface so you can actually use all of your S3 code. There's a lot of regulatory reasons or just financial reasons that people might want to run their own S3 system and that's entirely free. The only thing we charge for is Multi-Datacenter Replication and of course if anybody wants to buy support. That's true for Riak and that's true for every other database I've mentioned. There's all sorts of companies behind these databses: Mongo has 10gen; Postgres has a consortium of independent contractors; Neo4j has Neo Technology.
Kevin: I was just thinking because when it comes to China, a lot of things are big data and it's typical that you walk into a bank and you have 10,000,000 customers or 100,000,000 customers very easily. Coming back to Riak, you talked about Riak automatically replicates the data records, does it hurt its consistency or how does it make that choice?
Eric: As we've mentioned previously about the CAP theorem being kind of a spectrum: consistency and availability are in some ways tunable. It doesn't have to be fixed. This comes back to what I've mentioned about Peter Bailiss and PBS (Probabilistically Bounded Staleness, how eventual is eventual consistency and how consistent is eventual consistency), you can tune your eventual consistency. You can tune your availability and your consistency and in some ways your latency as well is effectively this way. The way Riak does it is it has three values called N, R and W. N is the number of nodes you will replicate a value to eventually. By default, it's 3. W is the number of nodes that will return a successful affirmative before you return back to the client and say yes this write woks so what happens is your write to one of the Riak notes will coordinate the replication. So if you set W to 2, what it'll do is even though eventually all three nodes will have the value replicated that you want, it will only wait until two of them returned a success before it says the write is successful. R is the last one, which is the same thing but for reads. You'll attempt to read from all three nodes but if you set R to 2, only two of them need to return a success for value to get a result. You're not going to wait for all three. Now if you want to make a more consistent system, the consistency can be either write consistent or read consistent. So for example, say that I have my end value and I want to be certain that my write has been replicated to all three so I can set W to 3 and say don't even return until every value has been replicated. At this point, you can be pretty certain that successive reads are going to contain that value but you've of course slowed down. You paid in latency and availability because if one of those three replicas is down, your write will fail because what you told it is I need all three to work. One of them is down, you only two up, it can fail. That's what's called the quorum. The quorum means if your R(Read) value + your W(Write) value > number of nodes, then you're probably going to be consistent, assuming everything goes according to plan because there's going to be an overlap. It's an easy thought experiment. Say you have nodes A, B and C, and you write successfully to nodes A and C and you read successfully from nodes B and C, then even though only C has the most recent value, you've at least got the most recent value and you know this. There are details here. I'm not going to all of the problems why that's not exactly truly consistent, not in the linearizable sense but it's consistent enough for being a highly available system, which is the whole point. You can flip it over: if you want to write quickly but you don't care to wait, you can set W to 1 and say just write to one node, I don't care which one, and then return. Your latency would go down because you're not waiting for all three replicas to return: whenever the fastest won the return race, then you're just back and you say okay I won. Then when you do a read, if you want the most recent value, you can then set R to the number of nodes and you're effectively reading from all of them cause you're saying at least one of these is going to have the most recent write. But you've slowed down your read at that time and again you've paid in latency and if one of those nodes is down, you've paid in availability. I'm oversimplifying this because there actually is a little more detail to it than that: there's durable writes; there's primary writes, primary reads. Because Riak by default does something, and this is something Dynamo does as well called sloppy quorum where if a node you would normally write to or read from isn't available, it will then go to the secondary node, which is the next from the list. It's not a strict quorum in the fact that it would fail. It would actually try to do that right any way. And it would just elect another node to act as a temporary storage. This is like, if you've gone on vacation and while you're on vacation, you have your neighbour collect your all your mails and then once you've come back from vacation, your neighbour hands you all your mail. In Riak, this is called the hinted handoff. So once the node has rejoined because it had crashed or because there was a network partion, all of that data gets given back to the primary node, the one that should had it all along. It's tricks like this that allowed Riak to be highly available system. This is why you get these kinds of crazy up-times that you don't necessarily get with all other databases that choose consistency over this kind of availability.
Kevin: I guess the use cases for Riak is if the business requirement is such that availability is paramount. No way this can be down, money is at stake or ...
Eric: Yes, and that's generally the way I would pitch it: if being down costs you something, whether it's money or users, if people would just get tired and leave. This is actually why Amazon designed Dynamo. So Riak is based off the Amazon Dynamo database design as is Cassandra topologically, whereas HBase is designed off Google BigTable. Amazon designed it in this way because they had done research that found that, they had it down to a dollar value, that was for every millisecond of latency, it actually costs this many dollars because people would just get fed up and leave because they do so many transactions so if you're doing a lot of these transactions, you can't have latency, you can't be down. It's interesting because EC2 and S3 famously go down like once a year and since half of the internet is built on EC2, everybody suffers these consequences and everybody freaks out but what's interesting is: Amazon itself doesn't go down if you ever noticed this. A lot of it is because of their internal architecture, not the one that they give everyone else but their own.
Kevin: So DynamoDB is hosted right? It's Database as a Service where you can just spin up but Riak is a little bit different. It's not hosted.
Eric: I'm not entirely sold on what DynamoDB is for because my general feeling is if you require such extreme amount of up-time where you need to use something like Riak. Just completely handing that off to a third-party seems weird to me. Just giving someone else control over one thing, in exchange you get a lot of control over it. I think you can set your own R, W, N values but who cares; I mean you don't own these servers anyways. If I were going to use a host solution because I didn't care that much, I would just use a database that's easier to use. There's MongoHQ and MongoLab if you want to do that; There's Redis To Go if you want to use hosted Redis; There's a Horoku Postgres. There's a lot of other options for hosted databases. I'll probably going to get flagged for saying that but I'm not telling you that you don't use it. If you get a good use case, go ahead and use it, but it definitely wouldn't be my first choice.
Kevin: One scenario of getting this kind of infrastructure is if your infrastructure is solely based on EC2 then that's great; if you have infrastructure away not on Amazon's cloud then you'll have to pay extra latency to retrieve the data.
Eric: That's true. Of course they designed it that way. They absolutely want you to use all of their infrastructure for everything and they've very much set it up in that way. There are outline cases: if you have a lot of money and you're a huge customer you can use Amazon Direct Connect which costs like six grand a month or something and they'll co-locate a fat pipe for relatively low latency. You can do all of these things but if you're to that level, why are you using EC2 anyway? Why are you letting them host anything? There's a lot of customers that love it and I can't speak too poorly because people are using it successfully; Maybe I'm a little more paranoid than everyone else. We haven't covered Neo4j at all.
Kevin: It's coming up. It's the next one. But I want to ask the last question for Riak because I think you're more familiar with that. Other than e-commerce sites such as Amazon, what are other use cases or scenarios that can benefit from high availability?
Eric: There are all sorts: we have video games companies that are using our stuff to be the back-end for user data and session data, etc or even switching devices where for example on a lot of games you can play on multiple devices: you can save state in one and pick up another and continue playing. These kinds of things are at a very large scale. (They) seem trivial but they're very hard to get right and you can't really have a lot of latency in those cases. So it goes beyond just simple shopping carts. There's also Riak CS which is great for asset storage. People store videos and images. As long as your values are small, you can store anything in Riak. Like I said, values are opaque so a value can be an image, just basics before encoded in whatever. It could just be small thumbnails if you'd like although I would recommend (Riak) CS but you could even use regular Riak