DBPedias

Your Database Knowledge Community

Jeremiah Peschka

  1. Building Our Own Federated Database

    We’ve already talked about The Promise and Failure of Federated Databases and Why Don’t We Have Federated Databases. At the end of the second post I concluded that the only real way to solve this problem is to build the federated database ourselves. Before you ask, “Does he really want us to roll our own database” take a deep breath and relax; nobody is going to be writing a database.

    What is a Federated Database?

    When I was looking at the definition for a federated database, it dawned on me that a federated database doesn’t have to be an actual RDBMS, or any other type of existing DBMS. A federated database is a meta-database management system (or so Wikipedia claims). Looking at the other major explanation of a federated database we end up with “A federated database architecture is described in which a collection of independent database systems are united into a loosely coupled federation in order to share and exchange information.”

    I’ve already talked about the implications of the first definition – it’s leading us down the path of a monolithic master server that must be aware of the other servers in the federation. New servers won’t be a part of the federation until we make the federating server aware of them. The other road, a loose collection of independent servers, is beginning to gain ground as companies bring more databases online in their data centers. When I say “more databases” I’m not just talking about a larger number of databases from a vendor, I’m also talking about databases from different vendors. Many people are exploring this route right now, some of them have attached the name of polyglot persistence to this approach.

    Why Do I Want a Federated Database?

    There are a couple of reasons that you’d want to roll your own federated database. I touched on them in the first part of this series: you may want to query across databases, you might have legacy systems, you may have merged with another company, or you might be using the most appropriate database for the job. Whatever you’re doing, you probably have a number of databases and you need to stitch them together.

    Where Do I Get Started?

    There are a number of ways we could go about creating a federated database. A lot of the ways to build a federated database solution are incredibly complex and involve creating meta-data databases as well as devising ways to link the databases together in an easily query-able way. I’m going to propose something different. Instead of designing something on your own, use the technology you already have and that your programming platform already comes with and understands: TCP/IP.

    Nearly every programming language is capable of talking to other programs over TCP/IP. Instead of creating custom databases and worrying about meta-data management and cross server querying, create common services that answer common questions. Break your monolithic application down into manageable services and write those services using the most appropriate technology. Over my career, I’ve found that very few users need the ability to run ad hoc reports over the entire corporate data set. If users don’t need to be able to interactively query the entirety of their corporate data set, what do they need?

    Almost all users need a small set of reports and data. Even when we expand the definition of “users” to include applications, services, APIs, and protocols, most activities are incredibly limited. Our users are asking the same sets of questions: How many accounts receivable have aged more than 30 days? What do the sales figures for the New England region look like for the last three years? Even when users are adding data to our databases they’re still performing a limited set of actions like saving an entire order, signing up for a new account, or adding a new accordion to their shopping cart. The activities that users perform data are very limited.

    Knowing that users only perform a few activities with our live data, we can safely make some assumptions about the type of data access people will need. Keeping that in mind, it’s a lot easier to see how we can build our own federated database: we’re not going to. We’re going to build our own system using what many people call polyglot persistence.

    Designing for Polyglot Persistence

    The idea behind polyglot persistence is that we keep our data in the best database for storing that particular kind of data. Achieving this goal is achievable, but if that were the end game, it wouldn’t be the most useful goal for the business – business users want to see reports and combine data across applications and business units.

    Going one step beyond the basics of polyglot persistence, we want to add another layer – a caching/service layer. It’s in this layer that we can start to really add rich functionality to the data that the business needs. Instead of having to replicate data across multiple data sources, we can query two separate servers and combine the data together before we return it to the client.

    We’ve been doing this for years – it’s nothing new. The only thing that is remotely new is storing our data in the most suitable database. Well, that and telling our caching/service layer to cache as much data as possible while writing in the background. If we keep most of our data in cache, we don’t have to worry as much about write performance in the back end. We can queue writes to make sure they commit during idle times, we can spread them across many servers, and we can write to many reporting databases at once to make sure that reports are up to date. By moving application and reporting logic into an application and reporting tier, we free the database to focus on the tasks that databases excel at: storing and retrieving data. Complex logic and strange data mucking can be handled in the application layer by simple (or highly specialized) algorithms.

    Polyglot persistence becomes incredibly valuable when we build mechanisms to load data from all of our disparate line of business systems into a single enterprise data warehouse. Once we have all of our data in a single warehouse, we’re able to write queries across business boundaries. The enterprise warehouse doesn’t need to be in a single monolithic RDBMS server; it could use Microsoft SQL Server Parallel Data Warehouse, Oracle ExaData, Postges-XC, HBase, Cassandra, or any other database that is up to the task.

    Wrapping it up

    Polyglot persistence seems to be the best answer to building a federated database. It doesn’t provide any kind of automated meta-data management or support for distributing queries automatically across many servers. Instead, polyglot persistence makes it easier to build a robust system that answers the questions business users both want and need while remaining fast and flexible. Is it the be all end all solution? No. Is it a step in the right direction? Yes.

  2. I’m Presenting at SQL Saturday 67

    No, this isn’t a re-run! I’ll be presenting about Refacatoring SQL at SQL Saturday 67 in Chicago this coming Saturday.

    I’m really excited about this opportunity. I had a blast presenting in Chicago last year and I’m looking forward to doing it again this year. There’s a great line up of speakers. If you’re in the Chicago area and want to get your learn on, I suggest you swing on by the DeVry Addison campus and check it out.

    Here’s the title and abstract:

    Refactoring SQL

    Refactoring SQL is not like refactoring application code. This talk will demonstrate proven SQL refactoring techniques that will help you identify where performance gains can be made, apply quick fixes, improve readability, and help you quickly locate places to make sweeping performance improvements. Jeremiah Peschka has years of hands on experience tuning SQL applications for performance, throughput, and concurrency.

  3. Why Don’t We Have Federated Databases?

    Federated databases are a dream that have not materialized. The SQL/MED extension to the ANSI/ISO SQL specification is a step in the right direction. In addition, both SQL Server and Oracle have proprietary extensions that make it possible to query external data sources. If all of this technology is available today, why aren’t more people using it?

    Why Don’t We Have a Federated Database?

    If federated databases are such a powerful thing, why hasn’t anyone built one? Surely this is something that many businesses are clamoring for, or at least in need of.

    There are a number of problems facing anyone attempting to implement a federated database. Frankly, federated databases are fraught with technical difficulties. Every database vendor supports a different subset of the ANSI/ISO SQL standard, different vendors use different data types and metadata, concurrency is a huge concern with dealing with remote resources, and technology is a moving target.

    Different Dialects of SQL

    Anyone who has attempted to port an application from one database engine to another knows about the pain involved in translating queries. Different vendors adhere to the SQL standard to varying degrees. This problem can be solved by creating wrappers to translate sub-queries between different querying languages, but it’s still a problem that exists. This problem could be partially solved in a federated database by limiting the database vendors to a small subset of the ANSI/ISO SQL standard, but this doesn’t solve the problem, it merely avoids it by limiting functionality.

    Different Metadata

    Different databases may have different metadata – different data types are used to represent the same data and different structures are used to describe data. To solve this problem it becomes necessary to create elaborate mappings between columns that represent the same data. There may be situations where such a mapping becomes computationally intensive or functionally impossible. SQL Server allows the creation of .NET data types with custom methods for data searching, access, and manipulation. PostgreSQL features several data types (notably tsvector, hstore, and arrays) that would be difficult to convert to data types in other databases.

    Metadata differences don’t just stop at the data type level. It’s possible to model data in a number of different ways; the type of an address could be indicated using an integer key value that references a lookup table in one database or as a VARCHAR column with values constrained by the database. It’s even possible for something as simple as Unicode text encodings to cause problems: SQL Server uses the NVARCHAR data type for storing Unicode strings while other databases do not use a separate data type.

    Concurrency

    Concurrency, depending on your database, may be a concern. Managing concurrent operations within a single database is a difficult task, much less managing concurrency across multiple databases. Unfortunately, correct handling of concurrency across all components of a federated database is critical.

    Many potential problems of a federated database can be solved through different trade offs. However, managing concurrency is a nearly impossible task. To properly and effectively manage concurrency across multiple databases is to ask the impossible. Not only would this require the federated database vendor be able to account for all possible concurrency issues in relational databases, but they would need to be able to account for potential concurrency issues in any database that integrates with the federated database.

    The Moving Target

    Even were a database vendor to take on this task, they’d be consistently aiming for a moving target. New features are added to relational databases all the time, and there are enough major players in the market to make it difficult for users to keep up to date, much less a federated database vendor. Once you factor in the wealth of other, non-relational databases, the idea of creating a federated database system to handle metadata mapping, concurrency control, and query language resolution trends toward impossible.

    The State of The Industry

    Where we stand now, there is almost no chance of any independent software vendor creating a true federated database. There is hope, but not from where you would expect it.

    Enterprise data warehouses can fulfill much of the function of a federated database, but they still require complex ETL and data mapping to be truly useful. Adding additional information to an enterprise data warehouse can require extra work to prepare the data warehouse and ETL processes for the new data. Unfortunately, enterprise data warehouses require too much manual intervention to be a candidate for a federated database.

    As we’ve discussed, SQL Server and Oracle provide ways to reference remote database servers. These methods have their own problems. SQL Server linked tables are prone to problems with some objects not being remoteable. When we’re querying a remote server, we need to make sure that the parts of our query going to the remote server are handled on the remote server. This is difficult to get right. On some occasions we might even see an entire remote table be streamed across the wire to be filtered on the originating server. This is something that we don’t want to see happening. For a federated database to be a tenable product there must be an easy way to offload queries to a remote table and a guarantee of adequate performance.

    Properly remoting queries is incredibly complex. Assume, for a moment, that we have a report that queries data on the sales department’s database server and we also need to include data from human resources data. Our query might look something like this:

    SELECT sp.first_name,
           sp.last_name,
           eh.employment_duration,
           ts.year,
           ts.total_sales_by_year,
           r.average_review_score
    FROM   public.sales_person sp
           JOIN HRDB01.employee_info.public.employee_history eh
                ON sp.employee_id = eh.employee_id
           JOIN ( SELECT o.employee_id,
                         o.year,
                         SUM(o.total) total_sales_by_year
                  FROM   public.orders AS o
                  GROUP BY o.employee_id, o.year
                ) AS ts ON sp.employee_id = ts.employee_id
           JOIN ( SELECT r.year,
                         r.employee_id,
                         AVG(r.score) AS average_review_score
                  FROM   HRDB01.employee_info.reviews.review r
                  GROUP BY r.year, r.employee_id
           ) AS rv ON sp.employee_id = rv.employee_id
                      AND ts.year = rv.year;
    

    Looking at this query we’re hitting two separate remote objects in one remote database. In order for this query to be effective, our query optimizer must b able to re-write the query in a way that lets it build an intelligent query for the two remote tables HRDB01.employee_info.public.employee_history and HRDB01.employee_info.reviews.review. But, in order to effectively re-write the local query to properly reference remote objects, we need to know everything about the remote objects – our calling server must be aware of as much metadata as possible so the remoted query can be re-written before being sent to the remote server. While this is doable, it puts additional load on the calling server. This server now has to maintain information about remote database objects. But that’s not all! If we want our queries to be truly optimal, our federating sever will need to be aware of how data types will behave on the remote servers and how the remote data types will interact with local data types.

    Once you examine the intricacies of a federated database, it becomes obvious why the federated database, as a boxed product, is beyond our reach. It’s not that the task is impossible; on the contrary such a task is very possible. The difficulty lies in coordinating all of the information available and using it to deliver data quickly. There are enough moving and potentially unknown parts that it’s non-trivial to create heterogeneous systems capable of filling out the promise of federated databases. Faced with this situation, the only viable solution is to build your own solution that answers the needs of the business.

    http://railstips.org/blog/archives/2011/01/27/data-modeling-in-performant-systems/

  4. The Promise and Failure of Federated Data

    One of the biggest problems facing businesses today is integrating data from multiple sources. The holy grail of data integration is called a federated database system. Basically, a federated database stores meta data about other databases and makes it easier to integrate them through a single interface. Many relational databases have features that support integrating with other relational databases through SQL Server’s linked servers or Oracle’s database links. One of the problems with these features is that they only allow relational databases to talk to other relational databases. As the volumes of data we collect every day increase, more and more of that data is being stored outside of relational databases in CSVs, spreadsheets, log files, PDFs, and plain text as well as in a variety of non-relational databases like MongoDB, HBase, Riak, and Cassandra.

    If the only thing we’re looking for is access to meta data for external data, SQL Server will provide the [FileTable][8] data type in SQL Server 2011. Admittedly, FileTable isn’t an acceptable solution because it’s really intended to make it possible to reference files in the database that are being managed by external applications through the filesystem and Win32 APIs. Clearly, this doesn’t suit our need for querying external data.

    It’s also possible to use ETL tools to move data into relational databases. One of the problems with ETL tools (SSIS, Pentaho Data Integration/Kettle, or Oracle Data Integrator) is that they are effectively batch operations. New data insertions will have to be triggered by some external event; the data isn’t available until it’s migrated into some master system.

    Luckily, there’s an extension to the SQL Standard to help us: SQL/MED. MED stands for Management of External Data. This is a way to link up any external data source to a database server. It doesn’t have to be another relational database – there’s already a twitter foreign data wrapper library. Unfortunately, PostgreSQL is currently the only major database player on the market with any potential for an implementation for SQL/MED.

    The implementation of SQL/MED just isn’t here, yet.

    Another promising project is HadoopDB. HadoopDB is a project coming out of Yale University. The aim is to make it possible to run analytical workloads in parallel across many commodity RDBMS servers. One of the goals of HadoopDB is to excel in areas where parallel data warehouses simple do not perform well. Many of these situations are outline in the paper HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. To summarize, parallel data warehouses provide near linear scaling up to several hundred nodes running on homogeneous hardware. Parallel data warehouse also operate under the assumption that failures are rare. Google and others have demonstrated that hardware failure is inevitable at scale. HadoopDB presents a phenomenal way to scale databases and integrate disparate technologies.

    Despite its promise of scaling databases, HadoopDB still doesn’t solve the problems that we face when trying to build a federated database system. The truth is a depressing one – there is currently no solution for building federated databases that incorporate data from across the enterprise. Database vendors, DBAs, and more traditional corporate IT departments will tell you that this is a Good Thing™. I’ve mentioned before that you should choose the database that is best suited for the task at hand.

    Where do we go from here? SQL/MED doesn’t meet its own promises – only one vendor is implementing the SQL/MED standard and that support is going to depend on third parties releasing drivers. HadoopDB isn’t a federated database so much as it is a way to avoid scaling a relational database into thousands of cheap nodes and paying millions of dollars in licensing fees for Teradata, Microsoft’s Parallel Data Warehouse, or Oracle’s Exadata. The unfortunate truth is that if we want a federated database we’re going to have to build it ourselves.

    What sounds like a Sisyphean task isn’t as difficult as it sounds. If we’re collecting data in multiple databases, the problem is already almost solved. Some of those technologies are already here. LINQ lets us treat all data sources equally; we can query an array of objects as easily as we can query a database. ARel is a relational algebra for Ruby. While ARel is specifically focused on working with relational databases, it could be extended to work with many different data sources. Business intelligence vendor Jaspersoft recently announced support for a number of non-relational databases to complement their existing business intelligence products. Quest Software, makes Toad for Cloud Databases – a tool for querying both relational and non-relational databases.

    A federated database may never materialize, but it’s already possible to build a hybrid database solution today.

    References

  5. Copy, Paste, Cloud

    One of my favorite features of EC2 is the ability to create virtual machine templates and re-use them to create fresh copies of a virtual machine. This is great but things rapidly get onerous when you’re trying to duplicate infrastructure.

    Amazon recently unveiled a new service called AWS CloudFormation. There are currently many Amazon cloud offerings available: S3, Elastic Block Storage, EC2, and Elastic Beanstalk are just a few. AWS CloudFormation is more than just another member of the family: it ties them all together.

    The idea behind AWS CloudFormation is to make it easy to create a collection of AWS resources and then deploy them the same way every time. AWS CloudFormation is similar to using Chef recipes to deploy software configuration. In this case we’re deploying an entire infrastructure stack to multiple virtual machines via a recipe. We can design our infrastructure on AWS. As our business grows we will be able to quickly and easily duplicate crucial parts of our infrastructure.

    AWS CloudFormation makes it simpler to manage all of your infrastructure. Deployments of new infrastructure become a matter of pushing out a template. If there are problems with a deployment, the changes can be rolled back and a clean up happens to make sure you aren’t charged for anything that you’re not using.

    Deployments

    In a traditional IT department, there is a design, purchase, deploy cycle that can potentially take a very long time. In previous jobs, we’ve had to design the infrastructure based on obscure internal capacity planning metrics. Once we’d made predictions/guesses about our future growth, we would then wait for weeks or even months to acquire new hardware. Once we had the hardware, it might even sit around for days or weeks before we were finally able to provision, configure, and deploy the servers on the network. That doesn’t even include deploying our own software on the server.

    On the flip side of that coin, by combining AWS CloudFormation plus Chef/Puppet, we can now push a new batch of servers out into the cloud in a matter of minutes and have them running in a few hours. Our software can be automatically installed and configured with Chef or Puppet. While we still need to write templates, once we’ve created and tested our templates for specific purposes (blog, database, community site, whatever), we’re able to fully automate deployments.

    Scaling Out

    AWS CloudFormation can also ease the pain of scaling out our applications.

    Typically when we scale out an application, we’re starting from a monolithic application stack. All of the assets in our stack have been scaled up to a point where it’s cost prohibitive to keep scaling. At this point, we’d examine each layer of our application and determine the best place to add caching or scale out to use multiple application or database servers. As we keep scaling our stack, we need to add more load balancers, caching servers, and database read slaves until we’ve exhausted our options and have to revisit our application design.

    Rather than engaging in the exercise of attempting to scale all of our customers at once, why don’t we start out by sharding all application resources at the customer level? While this increases the overall cost of operating our business, it makes it easy to scale elastically in response to a changing customer base. The busiest customers will get larger servers and increased performance that meets their needs. It also becomes possible to locate our data close to your customer in one of several Amazon zones.

    For businesses offering software as a service, this makes a great deal of sense. They get an easy way to monitor usage per customer and can scale appropriately within known guidelines and with well known costs.

    Wrap Up

    AWS CloudFormation makes it possible to provision and deploy infrastructure using a set of templates. When you combine CloudFormation with Chef or Puppet, it becomes very easy to deploy infrastructure and then deploy additional configuration changes on top of the infrastructure. Ultimately, AWS CloudFormation makes it easy to quickly and easily deploy new infrastructure in response to changes in load or customer demand.

    If you’re interested in some of the discussion around AWS CloudFormation, be sure to check out the Hacker News thread on the subject.

  6. Introduction to Riak … TONIGHT!

    I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak tonight at 6:30PM!

    There will be pizza and soda and Ruby and me. You can even stick around afterwards while we all go next door for drinks (you can buy my Diet Cokes all night if you really like the presentation).

    Riak: An Overview

    This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

    Add it to your calendar!

  7. Protecting Your Content – Copyright, Licensing, and You

    Why Should I Worry About Licensing?

    You probably just have a blog, or maybe you haven’t even started blogging yet. Maybe you’re just sharing your thoughts on Facebook Notes or Google Pages. However you look at it you’re probably certain that you don’t need to worry about your writing on the web. It is, after all, your content.

    Think again.

    In all fairness, Facebook’s terms of use are some of the more consumer friendly terms of service out there. Facebook does not claim copyright over your content, but using Facebook immediately grants Facebook “a non-exclusive, transferable, sub-licensable, royalty-free, worldwide license to use any IP content that you post on or in connection with Facebook (‘IP License’)”. Basically, Facebook can use any picture or thought you’ve posted at any time with no notice to you. Their ability to use your content continues as long as you have an account and you have not shared your content with anyone.

    Google’s terms of service are not as friendly as Facebook’s. To start with, Google’s terms use legal language while Facebook’s terms are in something reasonably close to plain English. Google’s terms get worse from there. Instead of allowing you to remove your content by permanently deleting it (and all copies), Google’s terms state that you’re giving them the right to use your thoughts until the end of time, or until you stop using all of Google’s services (see sections 11 and 13.2). Both companies’ terms of service contain my favorite legal provision: the terms are subject to change at any time. In short, if you aren’t hosting your own content, you don’t own it. Not completely. You can claim you’re copyrighting it, but someone else can use it because you’ve implicitly given them permission, and that permission may change.

    Back to my earlier question: why should you worry about licensing? You should worry about licensing because you want to be able to control your own content. There’s nothing that stops a third party from changing the terms of service to require their permission if you republish something. If you wanted to republish a blog post on another site, syndicate your content, or print something you wrote in a book you could suddenly find yourself in a legal mess. What if you don’t want pictures of yourself, your friends, or your children to appear in ads?

    Licensing comes down to control over your content and maintaining that control into the future. If you want to keep control, you need to examine the license that you have chosen for your content. This doesn’t just apply to the written word, it applies to your presentations, your photographs, and your code samples.

    Written Licensing

    What’s the best way to protect content you’ve written? That all depends on how you want people to be able to use and re-use your content.

    The Well Worn Path: Copyrighting Your Work

    The strictest way to protect your content is to copyright it. Stanford University have compiled a great list of copyright resources and it’s important to understand the rights around your work. A copyrighted work doesn’t need to be marked as such, but it will make it much easier to enforce your copyright. Very few people actively want to steal your work, by including a copyright notice with contact information you are making it easier for other authors to track you down and get your permission to use part of your work. The best part is that because of international treaties, there is very little difference in copyright laws between different countries.

    Keep in mind that copyrighting your work does not prevent others from reusing portions of your work under fair use principles. Fair use is a tricky thing and is subject to some vague criteria. If you aren’t competing with the author, copying wholesale, building a new work, and are not motivated by a desire for commercial gain, you’re on the way to falling under fair use rules. When in doubt, ask the original author for permission. If you can’t find the original author, find an attorney.

    While there are some nuances to copyright law, it is fairly straightforward. You mark your work as copyrighted and that’s it. Others can make use of portions of your work under fair use guidelines and they should ask for permission, but it isn’t strictly necessary.

    Flexible Designs for the Future: Creative Commons

    A Creative Commons license is, on the surface, not so different from traditional copyright. It’s a more flexible copyright. Rather than have a single, restrictive agreement between the copyright holder and the rest of the world, the Creative Commons license makes it easy for copyright holders to expressly allow certain behaviors.

    It all boils down to a few questions:

    • Do you want to allow commercial use of your work?
    • Do you want to allow adaptations of your work?
    • Do you want the terms of the license to continue?

    Saying “no” to any of these questions doesn’t prevent anyone from using your work in those ways, they just need to obtain your express permission. A Creative Commons license page clearly explains the terms of a particular license making it very easy for readers and other authors to learn how they can or cannot use your work.

    One of the most important aspects of the Creative Commons license is the ability to require future authors to share alike. Adding the share alike provision to your Creative Commons license requires future collaborators to distribute their derivative work under the same license; your work and all work that builds on it will always be available under the same terms you envisioned when you created the content.

    Learn more.

    The Public Domain

    I’ll admit it freely: when I started writing this, I didn’t know a whole lot about how the public domain worked and what it meant. I knew that everything in the public domain was free and couldn’t be taken under someone else’s control, but I didn’t know much more than that.

    A work enters the public domain when the intellectual property rights on the work expire or when those rights are forfeited. Basically, I can take anything I’ve previously written and decree that my work is now in the public domain. It belongs to everyone at that point. Work that has entered the public domain is free for anyone else to build upon. In many ways, the public domain is crucial for the advancement of science and the arts. It makes it possible to build on earlier works, to examine and expand upon the work of Isaac Newton or to re-arrange a symphony to be performed by kazoos and barking dogs. Works in the public domain carry no restrictions on their use.

    Unfortunately, the definition of public domain varies from country to country so there’s no reliable guarantee or best guess that you can make about how something can be used or re-used, even if the author states their work is in the public domain. When in doubt, consult an attorney (or Google).

    The biggest thing to remember about putting your own work in the public domain is that it’s out there for anyone to use and re-use. A less scrupulous person could collect your blog posts and arrange them into a coherent narrative and the publish it as a book. They could also make as many changes as they wanted and there would be nothing you could do to correct the situation.

    Software Licensing

    Why should we even be talking about software licensing? Software licensing is important if you want to release software for people to use, or even if you want to put sample code on your blog for others to re-use. Of course, you could state in your blog’s copyright that all of your source code is covered under the same restrictive copyright as the rest of your blog, but where’s the fun in that?

    Proprietary Software

    This is software that is exclusively licensed by the copyright holder. The copyright holder says “here, you can use this because you gave me money, but you have to abide by these rules.” After which they drop a license document the size of a phone book on your desk with an invoice stapled to the top.

    So it’s not really like that. How does it work?

    With proprietary software, the copyright holder grants you the right to use their software within certain conditions – you can’t modify it or sell it along to your buddies or reverse engineer it to make your own version. License terms vary from vendor to vendor. Some are incredibly permissive and others are very strict. It’s important to look at your software license if you’re ever in doubt of what you can or can’t do with your software.

    Likewise, if you’re going to be creating software, you need to be aware of what the terms of your license mean. Commercial software is best licensed under a proprietary license. After all, if I can download and compile your source code free of charge, why should I pay you for your software?

    Open Source Software

    There are some people who will take issue and say that I should call this section Free and Open Source Software (FOSS) or Free, Libre, and Open Source Software (FLOSS). To these people I say, “get your own blog!”

    Open Source Software (OSS) is a contentious area of software. In practice, OSS is software that is released under a specific license and the source code is distributed with the software. In fact, OSS software usually comes as nothing but source code with a license attached. Helpful people often provide compiled versions of the software for various hardware and software platforms.

    One of the greatest strengths of open source software is that future users of the software and given specific rights that are normally reserved for copyright holders. This is a lot like Creative Commons licensing in some ways. There are far too many open source licenses to examine them in any detail, so be sure to do your homework if you ever need to choose one.

    Now, why should you choose an open source license? I choose to release my demo code under an open source license because I want people to be able to use it, re-use it, and feel free to contribute back. Demo code should stand on its own, but it’s important to remember that your demo code is part of your reputation – keep it safe.

    Some good options for open source licenses are the Apache License, the MIT License, or the LGPL. Make sure you read the licenses and understand them before using them. Some licenses have more provisions than others, some restrict future commericial use, and some have almost no provisions at all (the MIT and BSD licenses are like this).

    Public Domain

    Public domain isn’t specific to the written word, art, and music – software can be covered under the public domain as well. The same legal ramifications apply to software released under the public domain. One of the more famous pieces of public domain software is SQLite.

    Why Should I Be Worried About This?

    Anyone worried about maintaining control of their own work should be worried about copyright and software licensing. Maintaining control of your work and how it can be distributed is an important part of producing content. If you want to be permissive about ho`w your work is used, you can grant rights to people in advance through the Creative Commons or through open source licensing. If you want people to request permission, you can use stricter copyright requirements and proprietary software licensing. There are many choices available.

  8. Database Restores – Where’s my Transaction Log Backup?

    Developers! DBAs! Has this ever happened to you?

    Surprise! It's a database migration error!

    You’re chugging along on a Friday night getting ready for your weekend deployment. Your 2 liter of Shasta is ice cold, you have your all Rush mix tape, and you’re wearing tube socks with khakis. Things are looking up. You open up your deployment script. You’re confident because you’ve tested it in the QA environment and everything worked. You press F5 and lean back in your chair, confident that the script is going to fly through all of the changes. Suddenly, there’s an error and you’re choking in surprise on Shasta.

    In an ideal world, you could pull out your trusty log backups and do a point in time restore, right? What if you’ve never taken a transaction log backup? What if you only have full database backups? Can you still recover from this situation? The answer, thankfully, is yes.

    Let’s break something!

    USE ftgu;
    GO
    
    -- at midnight, we took our initial back up
    BACKUP DATABASE ftgu TO DISK = 'C:\ftgu-1.bak'
    GO
    
    -- customer data from the business is inserted
    -- more customer data is inserted
    
    -- some kind of migration goes here
    
    -- insert a bad value
    INSERT INTO Bins (Shelf, Bin)
    VALUES ('B', 9)
    GO
    
    SELECT GETDATE();
    
    SELECT * FROM Bins WHERE Shelf = 'B' ORDER BY BinID DESC;
    GO
    
    -- wait for a bit
    WAITFOR DELAY '00:01:00';
    GO
    
    -- do something dumb
    DELETE p
    FROM Products p
    JOIN Bins b ON p.BinID = b.BinID
    WHERE b.Shelf = 'B';
    
    DELETE FROM Bins WHERE Shelf = 'B'
    GO
    
    SELECT GETDATE();
    GO
    

    We have a starting backup, no t-log backups, and we’ve gone and deleted some important data from the production database. How do we get it back? If we restored the database from our first backup we might lose a lot of data. Who knows when the last database backup was taken? Oh, midnight. So, in this case, we’d lose a day of data. Well, bugger. In a panic, we save the state of our broken database.

    -- ack!
    BACKUP DATABASE ftgu TO DISK = 'C:\ftgu-2.bak';
    

    And then we realize that we also need our transaction log:

    -- ah crap, I need to back up my log to get point in time recovery!
    BACKUP LOG ftgu
    TO DISK = 'C:\ftgu-log-1.trn';
    GO
    

    Here’s the kicker – the transaction log has never been backed up. (In my experience, this is all too common.) This database has been running for a week or a year or three years without any kind of transaction log backups. We’re screwed right? I mean, wouldn’t we have to apply all of the transactions from the log to the very first full backup we have? No.

    Let’s get started and restore our last good backup. We always have our backup with missing data, just in case we need it for some reason.

    -- switch to master (need to make sure nobody else is using that database)
    USE master;
    GO
    
    -- restore the last full backup with known good data
    -- make sure to specify NORECOVERY so we can
    -- apply our transaction log backup
    RESTORE DATABASE ftgu
    FROM DISK = 'C:\ftgu-1.bak'
    WITH REPLACE, NORECOVERY;
    GO
    

    SQL Server is cunning and records the log sequence number (LSN) from the last full backup (technically it’s the start and end LSN from the last full backup). If we have a log backup that encompasses the relevant LSNs, we’re good to go. Since our transaction logs were never backed up before today, we’re safe.

    We’re going to use something called

    -- restore the log backup until right before we started
    -- this is called "point in time recovery"
    RESTORE LOG ftgu
    FROM DISK = 'C:\ftgu-log-1.trn'
    WITH STOPAT = '2011-02-13 10:03:55.653';
    

    Even though we never took a transaction log backup before today, we’re able to take a backup and recover from what initially seemed like a bad situation.

  9. Introduction to Riak – Next Monday

    I’ll be speaking at the Columbus Ruby Brigade and giving an introduction to Riak next Monday, February 21, at 6:30PM.

    Riak: An Overview

    This presentation will lead you through an overview of Riak: a flexible, decentralized key-value store. Riak was designed to provide a friendly HTTP/JSON interface and provide a database that’s well suited for reliable web applications.

    Add it to your calendar!

  10. SQL Saturday 60 Resources

    SQL Saturday 60 was a week ago and I completely failed to post resources from the presentation in a timely manner.

    The SQL Server Internals resources have been available for a while: http://facility9.com/resources/sql-server-internals… You just had to know to look for them.

    The Modeling Muddy Data talk is available on GitHub: https://github.com/peschkaj/Muddy-Data. This presentation is released under a Creative Commons Attribution-ShareAlike license which means that we can all make things better by collaborating on the presentation materials. I’ll slowly be adding more information to the write up of the talk that is in the README.

  1. « First
  2. ‹ Previous
  3. 2
  4. Next ›
  5. Last »