DBPedias

Your Database Knowledge Community

Jeremiah Peschka NoSQL

  1. Copy, Paste, Cloud

    One of my favorite features of EC2 is the ability to create virtual machine templates and re-use them to create fresh copies of a virtual machine. This is great but things rapidly get onerous when you’re trying to duplicate infrastructure.

    Amazon recently unveiled a new service called AWS CloudFormation. There are currently many Amazon cloud offerings available: S3, Elastic Block Storage, EC2, and Elastic Beanstalk are just a few. AWS CloudFormation is more than just another member of the family: it ties them all together.

    The idea behind AWS CloudFormation is to make it easy to create a collection of AWS resources and then deploy them the same way every time. AWS CloudFormation is similar to using Chef recipes to deploy software configuration. In this case we’re deploying an entire infrastructure stack to multiple virtual machines via a recipe. We can design our infrastructure on AWS. As our business grows we will be able to quickly and easily duplicate crucial parts of our infrastructure.

    AWS CloudFormation makes it simpler to manage all of your infrastructure. Deployments of new infrastructure become a matter of pushing out a template. If there are problems with a deployment, the changes can be rolled back and a clean up happens to make sure you aren’t charged for anything that you’re not using.

    Deployments

    In a traditional IT department, there is a design, purchase, deploy cycle that can potentially take a very long time. In previous jobs, we’ve had to design the infrastructure based on obscure internal capacity planning metrics. Once we’d made predictions/guesses about our future growth, we would then wait for weeks or even months to acquire new hardware. Once we had the hardware, it might even sit around for days or weeks before we were finally able to provision, configure, and deploy the servers on the network. That doesn’t even include deploying our own software on the server.

    On the flip side of that coin, by combining AWS CloudFormation plus Chef/Puppet, we can now push a new batch of servers out into the cloud in a matter of minutes and have them running in a few hours. Our software can be automatically installed and configured with Chef or Puppet. While we still need to write templates, once we’ve created and tested our templates for specific purposes (blog, database, community site, whatever), we’re able to fully automate deployments.

    Scaling Out

    AWS CloudFormation can also ease the pain of scaling out our applications.

    Typically when we scale out an application, we’re starting from a monolithic application stack. All of the assets in our stack have been scaled up to a point where it’s cost prohibitive to keep scaling. At this point, we’d examine each layer of our application and determine the best place to add caching or scale out to use multiple application or database servers. As we keep scaling our stack, we need to add more load balancers, caching servers, and database read slaves until we’ve exhausted our options and have to revisit our application design.

    Rather than engaging in the exercise of attempting to scale all of our customers at once, why don’t we start out by sharding all application resources at the customer level? While this increases the overall cost of operating our business, it makes it easy to scale elastically in response to a changing customer base. The busiest customers will get larger servers and increased performance that meets their needs. It also becomes possible to locate our data close to your customer in one of several Amazon zones.

    For businesses offering software as a service, this makes a great deal of sense. They get an easy way to monitor usage per customer and can scale appropriately within known guidelines and with well known costs.

    Wrap Up

    AWS CloudFormation makes it possible to provision and deploy infrastructure using a set of templates. When you combine CloudFormation with Chef or Puppet, it becomes very easy to deploy infrastructure and then deploy additional configuration changes on top of the infrastructure. Ultimately, AWS CloudFormation makes it easy to quickly and easily deploy new infrastructure in response to changes in load or customer demand.

    If you’re interested in some of the discussion around AWS CloudFormation, be sure to check out the Hacker News thread on the subject.

  2. Introduction to Riak at Columbus Ruby Brigade

    I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak on February 21 at 6:30PM.

    Riak: An Overview

    This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

    Add it to your calendar!

  3. Querying Riak – Key Filters and MapReduce

    A while back we talked about getting faster writes with Riak. Since then, I’ve been quiet on the Riak front. Let’s take a look at how we can get data out of Riak, especially since I went to great pains to throw all of that data into Riak as fast as my little laptop could manage.

    Key filtering is a new feature in Riak that makes it much easier to restrict queries to a subset of the data. Prior to Riak 0.13, it was necessary to write MapReduce jobs that would scan through all of the keys in a bucket. The problem is that the MapReduce jobs end up loading both the key and the value into memory. If we have a lot of data, this can cause a huge performance hit. Instead of loading all of the data key filtering lets us look at the keys themselves. We’re pre-processing the data before we get to our actually query. This is good because 1) software should do as little as possible and 2) Riak doesn’t have secondary indexing to make querying faster.

    Here’s how it works: Riak holds all keys in memory, but the data remains on disk. The key filtering code scans the keys in memory on the nodes in our cluster. If any keys match our criteria, Riak will pass them along to any map phases that are waiting down the pipe. I’ve written the sample code in Ruby but this functionality is available through any client.

    The Code

    We’re using data loaded with load_animal_data.rb. The test script itself can be found in mr_filter.rb. Once again, we’re using the taxoboxes data set.

    The Results

                 user     system      total        real
    mr       0.060000   0.030000   0.090000 ( 20.580278)
    filter   0.000000   0.000000   0.000000 (  0.797387)
    

    MapReduce

    First, the MapReduce query:

    {"inputs":"animals",
     "query":[{"map":{"language":"javascript",
                      "keep":false,
                      "source":"function(o) { if (o.key.indexOf('spider') != -1) return [1]; else return []; }"}},
              {"reduce":{"language":"javascript",
                         "keep":true,
                         "name":"Riak.reduceSum"}}]}
    

    We’re going to iterate over every key value pair in the animals bucket and look for a key that contains the word ‘spider’. Once we find that key, we’re going to return a single element array containing the number 1. Once the map phase is done, we use the built-in function Riak.reduceSum to give us a sum of the values from the previous map phase. We’re generating a count of the records that match our data – how many spiders do we really have?

    Key Filtering

    The key filtering query doesn’t look that much different:

    {"inputs":{"bucket":"animals",
               "key_filters":[["matches","spider"]]},
     "query":[{"map":{"language":"javascript",
                      "keep":false,
                      "source":"function(o) { return [1]; }"}},
              {"reduce":{"language":"javascript",
                         "keep":true,
                         "name":"Riak.reduceSum"}}]}
    

    It’s not that much different – the map query has been greatly simplified to just return [1] on success and the search criteria has been moved into the inputs portion of the query. The big difference is in the performance: the key filter query is 26 times faster.

    This is a simple example, but a 26x improvement is nothing to scoff at. What it really means is that the rest of our MapReduce needs to work on a smaller subset of the data which, ultimately, makes things faster for us.

    A Different Way to Model Data

    Now that we have our querying basics out of the way, let’s look at this problem from a different perspective; let’s say we’re tracking stock performance over time. In a relational database we might have a number of tables, notably a table to track stocks and a table to track daily_trade_volume. Theoretically, we could do the same thing in Riak with some success, but it would incur a lot of overhead. Instead we can use a natural key to locate our data. Depending on how we want to store the data, this could look something like YYYY-MM-DD-ticker_symbol. I’ve created a script to load data from stock exchange data. For my tests, I only loaded the data for stocks that began with Q. There’s an a lot of data in this data set, so I kept things to a minimum in order to make this quick.

    Since our data also contains the stock exchange identifier, we could even go one step further and include the exchange in our key. That would be helpful if we were querying based on the exchange.

    If you take a look at [mr_stocks.rb][8] you’ll see that we’re setting up a query to filter stocks by the symbol QTM and then aggregate the total trade volume by month. The map phase creates a single cell array with the stock volume traded in the month and returns it. We use the Riak.mapValuesJson function to map the raw data coming in from Riak to a proper JavaScript object. We then get the month that we’re looking at by parsing the key. This is easy enough to do because we have a well-defined key format.

    function(o, keyData, arg) {
      var data = Riak.mapValuesJson(o)[0];
      var month = o.key.split('-').slice(0,2).join('-');
      var obj = {};
      obj[month] = data.stock_volume;
      return [ obj ];
    }
    

    If we were to look at this output we would see a lot of rows of unaggregated data. While that is interesting, we want to look at trending for stock trades for QTM over all time. To do this we create a reduce function that will sum up the output of the map function. This is some pretty self explanatory JavaScript:

    function(values, arg) {
      return [ values.reduce(function(acc, item) {
                   for (var month in item) {
                     if (acc[month]) { acc[month] += parseInt(item[month]); }
                     else { acc[month] = parseInt(item[month]); }
                   }
    
                   return acc;
                 })
      ]
    }
    

    Okay, so that might not actually be as self-explanatory as anyone would like. The JavaScript reduce method is a newer one. It will accumulate a single result (the acc variable) for all elements in the array. You could use this to get a sum, an average, or whatever you want.

    One other thing to note is that we use parseInt. We probably don’t have to use it, but it’s a good idea. Why? Riak is not aware of our data structures. We just store arrays of bytes in Riak – it could be a picture, it could be text, it could be a gzipped file – Riak doesn’t care. JavaScript only knows that it’s a string. So, when we want to do mathematical operations on our data, it’s probably wise to use parseInt and parseFloat.

    Where to Now?

    Right now you probably have a lot of data loaded. You have a couple of options. There are two scripts on github to remove the stock data and the animal data from your Riak cluster. That’s a pretty boring option. What can you learn from deleting your data and shutting down your Riak cluster? Not a whole lot.

    You should open up mr_stocks.rb and take a look at how it works. It should be pretty easy to modify the map and reduce functions to output total trade volume for the month, average volume per day, and average price per day. Give it a shot and see what you come up with.

    If you have questions or run into problems, you can hit up the comments, the Riak Developer Mailing List, or hit up the #riak IRC room on irc.freenode.net if you need immediate, real time help with your problem.

  4. Data Durability

    A friend of mine half-jokingly says that the only reason to put data into a database is to get it back out again. In order to get data out, we need to ensure some kind of durability.

    Relational databases offer single server durability through write-ahead logging and checkpoint mechanisms. These are tried and true methods of writing data to a replay log on disk as well as caching writes in memory. Whenever a checkpoint occurs, dirty data is flushed to disk. The benefit of a write ahead log is that we can always recover from a crash (so long as we have the log files, of course).

    How does single server durability work with non-relational databases? Most of them don’t have write-ahead logging.

    MongoDB currently has limited single server durability. While some people consider this a weakness, it has some strengths – writes complete very quickly since there is no write-ahead log that needs to immediately sync to disk. MongoDB also has the ability to create replica sets for increased durability. There is one obvious upside to replica sets – the data is in multiple places. Another advantage of replica sets is that it’s possible to use getLastError({w:...}) to request acknowledgement from multiple replica servers before a write is reported as complete to a client. Just keep in mind that getLastError is not used by default – application code will have to call the method to force the sync.

    Setting a w-value for writes is something that was mentioned in Getting Faster Writes with Riak. Although, in that article we were decreasing durability to increase write performance. In Amazon Dynamo inspired systems writes are not considered complete until multiple clients have responded. The advantage is that durable replication is enforced at the database and clients have to elect to use less security for the data. Refer to the Cassandra documentation on Writes and Consistency or the Riak Replication documentation for more information on how Dynamo inspired replication works. Datastores using HDFS for storage can take advantage of HDFS’s built-in data replication.

    Even HBase, a column-oriented database, uses HDFS to handle data replication. The trick is that rows may be chopped up based on columns and split into regions. Those regions are then distributed around the cluster on what are called region servers. HBase is designed for real-time read/write random-access. If we’re trying to get real-time reads and writes, we can’t expect HBase to immediately sync files to disk – there’s a commit log (RDBMS people will know this as a write-ahead log). Essentially, when a write comes in from a client, the write is first written to the commit log (which is stored using HDFS), then it’s written in memory and when the in-memory structure fills up, that structure is flushed to the filesystem. Here’s something cunning: since the commit log is being written to HDFS, it’s available in multiple places in the cluster at the same time. If one of the region servers goes down it’s easy enough to recover from – that region server’s commit log is split apart and distributed to other region servers which then take up the load of the failed region server.

    There are plenty of HBase details that have been grossly oversimplified or blatantly ignored here for the sake of brevity. Additional details can be found in HBase Architecture 101 – Storage as well as this Advanced HBase presentation. As HBase is inspired by Google’s big table, additional information can be found in Chang et al. Bigtable: A distributed storage system for structured data and The Google File System.

    Interestingly enough, there is a proposed feature for PostgreSQL 9.1 to add synchronous replication to PostgreSQL. Current replication in PostgreSQL is more like asynchronous database mirroring in SQL Server, or the default replica set write scenario with MongoDB. Synchronous replication makes it possible to ensure that data is being written to every node in the RDBMS cluster. Robert Haas discusses some of the pros and cons of replication in PostgreSQL in his post What Kind of Replication Do You Need?.

    Microsoft’s Azure environment also has redundancy built in. Much like Hadoop, the redundancy and durability is baked into Azure at the filesystem. Building the redundancy at such a low level makes it easy for every component of the Azure environment to use it to achieve higher availability and durability. The Windows Azure Storage team have put together an excellent overview. Needless to say, Microsoft have implemented a very robust storage architecture for the Azure platform – binary data is split into chunks and spread across multiple servers. Each of those chunks is replicated so that there are three copies of the data at any given time. Future features will allow for data to be seamlessly geographically replicated.

    Even SQL Azure, Microsoft’s cloud based relational database, takes advantage of this replication. In SQL Azure when a row is written in the database, the write occurs on three servers simultaneously. Clients don’t even see an operation as having committed until the filesystem has responded from all three locations. Automatic replication is designed into the framework. This prevents the loss of a single server, rack, or rack container from taking down a large number of customers. And, just like in other distributed systems, when a single node goes down, the load and data are moved to other nodes. For a local database, this kind of durability is typically only obtained using a combination of SAN technology, database replication, and database mirroring.

    There is a lot of solid technology backing the Azure platform, but I suspect that part of Microsoft’s ultimate goal is to hide the complexity of configuring data durability from the user. It’s foreseeable that future upgrades will make it possible to dial up or down durability for storage.

    While relational databases are finding more ways to spread load out and retain consistency, there are changes in store for MongoDB to improve single server durability. MongoDB has been highly criticized for its lack of single server durability. Until recently, the default response has been that you should take frequent backups and write to multiple replicas. This is still a good idea, but it’s promising to see that the MongoDB development team are addressing single server durability concerns.

    Why is single server durability important for any database? Aside from guaranteeing that data is correct in the instance of a crash, it also makes it easier to increase adoption of a database at the department level. A durable single database server makes it easy to build an application on your desktop, deploy it to the server under your desk, and move it into the corporate data center as the application gains importance.

    Logging and replication are critical technologies for databases. They guarantee data is durable and available. There are also just as many options as there are databases on the market. It’s important to understand the requirements of your application before choosing mechanisms to ensure durability and consistency across multiple servers.

    References

    Item Information

    Published
    Contributor
    peschkaj
    Comments
    0 comments
    Tags
    mongodb, sql-server, hbase, postgresql, nosql
    Content Type
    Entry
  5. Three Mistakes I Made With MongoDB

    When I initially started working with MongoDB, it was very easy to get started. I could create schemas, create data, and pretty much do everything I wanted to do very quickly. As I started progressing through working with MongoDB, I started running into more and more problems. They weren’t problems with MongoDB, they were problems with the way I was thinking about MongoDB.

    I Know How Stuff Works

    The biggest mistake that I made was carrying over ideas about how databases work. When you work with any tool for a significant period of time, you start to make assumptions about how other features will work. Over time, your assumptions get more accurate. On the whole, it’s a good thing. It gets tricky when you start to switch around your frame of reference.

    As a former DBA and alleged database expert, I’m pretty comfortable with relational databases. Once you understand some of the internals of one database, you can make some safe assumptions about how other databases have been implemented. One of the advantages of MongoDB is that it’s a heck of a lot like an RDBMS (sorry guys, that’s just the truth of it). It has collections, that look and act a lot like tables. There are indexes, there’s a query engine, there’s even replication. There are even more features and functionality that map really well between MongoDB and relational databases. It’s close enough that it’s painless to make a switch. The paradigms don’t match up exactly, but there are similarities.

    Unfortunately for me, the paradigms and terminology matched up enough that I felt comfortable making a large number of assumptions. Boy was I ever wrong. I’ve had to stop thinking that I know anything about MongoDB and look up the answers to questions, rather than assume I know anything. It’s been frustrating, but it’s also been educational.

    I Know How Data is Modeled

    It’s really easy to make assumptions about modeling data. This is another area where I made huge assumptions that caused a lot of problems. In a relational database, we know how to model data. Normalization is really well understood. I know how to design database structures to take advantage of best practices and techniques. I know how to work with O/R-Ms and I understand where there are tradeoffs to be made between normalization, denormalization, and the software that talks to the database. We all learn these things as we progress in our careers. Once you get used to normalization, it’s easy to fall into that pattern.

    Document database, like MongoDB, don’t work to their full potential when you normalize your data. Document databases, in my experience, do the opposite; they work best when the data is stored as a document. Document data is similar to the way data is used in the application child data is stored with a parent. An order’s line items are just a child collection of the order, there is no OrderHeader and OrderDetails table. Years of working within the same set of rules made it easy to slip into old habits.

    Let’s say I have a number of user created documents in a Documents collection. Documents have an author as well as a set of tags created by the author. In a relational database, we’d have something like:

    CREATE TABLE users (
        user_id INT PRIMARY KEY,
        username VARCHAR(30) NOT NULL,
        password VARCHAR(50) NOT NULL
    );
    
    CREATE TABLE documents (
        document_id INT PRIMARY KEY,
        title VARCHAR(30) NOT NULL,
        body VARCHAR(MAX) NOT NULL,
        user_id INT NOT NULL REFERENCES users(id)
    );
    
    CREATE TABLE tags (
       tag_id INT PRIMARY KEY,
       name VARCHAR(30) NOT NULL
    );
    
    CREATE TABLE document_tags (
       document_id INT NOT NULL,
       tag_id INT NOT NULL
    );
    

    For a lot of things, this makes perfect sense. With MongoDB, we’d have a documents collection and a users collection. We might have a separate tags collection as well, but that would be used as an inverted index for searching. The users collection would be used for validating logins and populating your user profile, but when a document is saved, we wouldn’t store a pointer to the appropriate record in the users collection. The appropriate thing to do would be to cache the data locally as well as store a pointer to the users collection. The same hold true for the document’s tags – why create a join construct between the two collections when we can store all of the tag data we need in the appropriate document? If we decide that we need to find documents by their tags we have two choices:

    1. Create an index on the document.tags property
    2. Create a tags inverted index.

    If you’re interested in indexes, the MongoDB documentation on the subject is a great place to start. There is a specific kind of index called a multikey that allows you to index arrays of values. Inverted indexes are interested and I covered them while talking about building secondary indexes in Riak.

    With a document database we’re trying to minimize the number of reads that we’re performing at any given time. A document should be a logical construct of whatever application entity we’re saving. A document would be a record of the document at the time it was saved – there will be cached information from the user, the document and its associated metadata, and a list of tags.

    Of course, since we’re talking about data modeling, there are arguably n(n+1) ways to accomplish this and at least n+1 correct ways to accomplish this. If you really don’t like it, feel free to comment about it.

    I Can Just Drop This In ###

    Despite appearances and claims to the contrary, MongoDB is not a drop in replacement for an RDBMS. A relational database provides a phenomenal number of features for free – indexing, declarative referential integrity, transaction support, multi-version concurrency control, and multi-statement transactions to name a few. These features come with a price. Likewise, MongoDB provides a different set of features and they also have a price.

    Believing the hype caused some sticky problems for me, not because I destroyed important data or anything like that, but because I assumed that I could use the same tools and tricks. I used MongoMapper to handle my database access. MongoMapper is a fine piece of code and it made a lot of things very easily. Using an O/R-M made it feel like I was using a relational database. It made things trickier, especially when I ran into situations where I was running into some of the blurry places I’ve already talked about. In hindsight, I should have used the stock drivers, built my own abstractions, and the replaced that when it became necessary. I don’t say that because I think I could do a better job, but because it makes more sense to build something from scratch yourself for the first time and then replace it when you’re building more plumbing than functionality.

    Would I Use MongoDB Again?

    Sure, if the project needed it. I don’t believe that MongoDB is a drop in replacement for an RDBMS. Thinking about MongoDB and RDBMSes that was does a disservice to both MongoDB and the RDBMS: they both have their own strengths and weaknesses.

  6. Querying Hive with Toad for Cloud Databases

    I recently talked about using the Toad for Cloud Databases Eclipse plug-in to query an HBase database. After I finished up the video, I did some work loading a sample dataset from Retrosheet into my local Hive instance.

    This 7 minute tutorial shows you brand new functionality in the Toad for Cloud Databases Eclipse plug-in and how you can use it to perform data warehousing queries against Hive.

  7. Querying HBase with Toad for Cloud Databases

    We recently released a new version of Toad for Cloud Databases as an Eclipse plug-in. While this functionality has been in Toad for Cloud since the last release, this video shows Toad for Cloud running in Eclipse and demonstrates some basic querying against HBase using Toad for Cloud’s ability to translate between ANSI compliant SQL code and native database calls.

    http://facility9presentations.s3.amazonaws.com/HBase%20Basic%20Querying.mov
  8. Open Sourcing Sawzall – What Does It Mean?

    For Data Analytics or automotive modification, you will find no finer tool.

    While perusing twitter, I saw that Google has open sourced Sawzall, one of their internal tools for data processing. WTF does this mean?

    Sawzall, WTF?

    Apart from a tool that I once used to cut the muffler off of my car (true story), what is Sawzall?

    Sawzall is a procedural language for analyzing excessively large data sets. When I say “excessively large data sets”, think Google Voice logs, utility meter readings, or the network traffic logs for the Chicago Public Library. You could also think of anything where you’re going to be crunching a lot of data over the course of many hours on your monster Dell R910 SQL Server.

    There’s a lengthy paper about how Sawzall works, but I’ll summarize it really quickly. If you really want to read up on all the internal Sawzall goodness, you can check it out on Google code – Interpreting the Data: Parallel Analysis with Sawzall.

    Spell It Out for Me

    At its most basic, Sawzall is a MapReduce engine, although the Google documentation goes to great pains to not use the word MapReduce, so maybe it’s not actually MapReduce. It smells oddly like MapReduce to me.

    I’ll go into more depth on the ideas behind MapReduce in the future, but here’s the basics of MapReduce as far as Sawzall is concerned:

    1. Data is split into partitions.
    2. Each partition is filtered. (This is the Map.)
    3. The results of the filtering operation are used by an aggregation phase. (This is the Reduce.)
    4. The results of the aggregation are saved to a file.

    It’s pretty simple. That simplicity makes it possible to massively parallelize the analysis of data. If you’re in the RDBMS world, think Vertica, SQL Server Parallel Data Warehouse, or Oracle Exadata. If you are already entrenched and in love with NoSQL, you already know all about MapReduce and probably think I’m an idiot for dumbing it down so much.

    The upside to Sawzall’s approach is that rather than write a Map program and a Reduce program and a job driver and maybe some kind of intermediate aggregator, you just write a single program in the Sawzall language and compile it.

    … And Then?

    I don’t think anyone is sure, yet. One of the problems with internal tools is that they’re part of a larger stack. Sawzall is part of Google’s internal infrastructure. It may emit compiled code, but how do we go about making use of those compiled programs in our own applications? Your answer is better than mine, most likely.

    Sawzall uses something called Protocol Buffers – PB is a cross language way to efficiently move objects and data around between programs. It looks like Twitter is already using Protocol Buffers for some of their data storage needs, so it might only be a matter of time before they adopt Sawzall – or before some blogger opines that they might adopt Sawzall ;).

    So far nobody has a working implementation of Sawzall running on top of any MapReduce implementations – Hadoop, for instance. At a cursory glance, it seems like Sawzall could be used in Hadoop Streaming jobs. In fact, Section 10 of the Sawzall paper seems to point out that Sawzall is a record by record analytical language – your aggregator needs to be smart enough to handled the filtered records.

    Why Do I Need Another Language?

    This is a damn good question. I don’t program as much as I used to, but I can reasonably write code in C#, Ruby, JavaScript, numerous SQL dialects, and Java. I can read and understand at least twice as many languages. What’s the point of another language?

    One advantage of a special purpose language is that you don’t have to worry shoehorning domain specific functionality into existing language constructs. You’re free to write the language the way it needs to be written. You can achieve a wonderful brevity by baking features into the language. Custom languages let developers focus on the problems at hand and ignore implementation details.

    What Now?

    You could download the code from the Google Code repository, compile it, and start playing around with it. It should be pretty easy to get up and running on Linux systems. OS X developers should look at these instructions from a helpful Hacker News reader. Windows developers should install Linux on a VM or buy a Mac.

    Outside of downloading and installing Sawzall yourself to see what the fuss is about, the key is to keep watching the sky and see what happens.

  9. Syndicated Bloggers at CloudDBPedia

    I’ve been quiet about the bloggers we’ve been adding over at CloudDBPedia, I’m going to fix that. Today I’m happy to announce that we are syndicating 11 RSS feeds.

    • 2 Blokes Marketing – Christian Hasker and Andy Grant should be familiar to fans of SQLServerPedia, they’ve been instrumental in SQLServerPedia’s growth and they bring an interesting marketing slant to the conversation.
    • Brent Ozar – Brent hasn’t been talking much about NoSQL recently, but when he does he comes to the table with the viewpoint of a SQL Server DBA.
    • Buck Woody – Buck Woody is a recent addition to the Azure team at Microsoft. He brings 20+ years of experience to the table with databases and software design.
    • Guy Harrison – Guy is the Director of Research and Development at Quest. He lives in Australia and creates some interesting software like Toad for Cloud Databases. Guy has also authored two books, one on Oracle and one covering MySQL.
    • Jeremiah Peschka – hooray for me! I like data. I’m interested in NoSQL and cloud solutions because they make some interesting changes to how we all think about information storage.
    • Kevin Kline – Kevin is another Database Expert at Quest. He is the author of SQL in a Nutshell and is an all around good guy.
    • Kyle Banker – Kyle is a developer at 10gen, makers of MongoDB. Kyle maintains the Ruby driver for MongoDB.
    • Ayende Rahien – Ayende is well known as a founding or contributing to a number of open source projects (NHibernate, Castle, Rhino Mocks, NHibernate Query Analyzer, Rhino Commons) and is a prolific blogger. In addition, he is also writing RavenDB – a document database implemented in .NET.
    • Randy Guck – Randy works at Quest Software and has worked on a variety of platforms.
    • Kristina Chodorow works for 10gen, makers of MongoDB. She works on the core MongoDB server as well as the Perl and PHP drivers and has recently published a book about MongoDB.
    • The Basho Blog is the collective blog of the makers of Riak. In addition to blogging about their own database, there are a lot of links to presentations that Basho team members are doing at user groups, conferences, and online.

    I want to say thank you to the bloggers who have generously agreed to syndicate their blog at CloudDBPedia.

  10. Cassandra Tutorial – Installing Cassandra

    I was feeling a little bit crazy the other day and I installed Cassandra on my home computer. Twice. Just because. It wasn’t really a “just because” moment. I installed Cassandra twice so I could get the hang of it in case anyone asked me how they could go ahead and get started. The process was relatively painless, but there are some prerequisites that you need to have installed before you can get started. I thought that it would be good to create this short Cassandra installation tutorial to help you get started.

    This video walks through everything you need to do get Cassandra installed and running on your computer. Apart from installing the OS. If you don’t have an OS, I don’t know how you’re reading this, probably witchcraft.

    [See post to watch Flash video]
  1. 1
  2. Next ›
  3. Last »