In our previous post, we discussed Mad Men and how to design a data warehouse in the space age of Big Data. This post will take another step forward, or rather up, and examine how to design a data warehouse on the cloud.
Cloud Data Warehouses
One option is to use a DWH-as-a-Service such as Amazon Redshift. Another option is to setup your own relational database on a virtual cloud instance services such as with Amazon EC2, IBM Softlayer, or Rackspace. There are quite a few advantages to having a DWH on the cloud.
Scalability & Elasticity
Scaling a data warehouse on-premise means buying and installing new hardware which takes plenty of time. Cloud instances, however, can scale up immediately, so it’s easy to upgrade dedicated DWH services such as Redshift.
Unfortunately, this advantage doesn't apply to more traditional solutions such as an Oracle database on the cloud - while it’s relatively simple to setup a new instance, transferring the data there could be laborious, possibly more than on premise.
Cost
Due to virtualization and less hardware maintenance, cloud data warehouse prices are much cheaper than their bare-bones brothers. With pay-as-you-go pricing models and great elasticity, you need only pay for actual instance usage rather than having machines working 24/7 and eating up electricity. How much cheaper does it get? Amazon claims that Redshift costs around $1,000 per terabyte per year, while on-premise data warehouses cost $19,000-$25,000 per terabyte per year - that’s 19-25 times less money spent.
Operational Data Store, ETL, and Data Integration
Operational data stores, ETL, and data integration can enjoy the same benefits of high scalability, elasticity, and low costs on the cloud as discussed in the previous section. When using Apache Hadoop, they can also run on top of Hadoop-as-a-Service solutions such as Amazon EMR, or our very own Integrate.io.
Related reading: The Top Data Integration Software Tools and Platforms
One of the main issues with Hadoop, is that it requires specialized and complex skills. Not all companies have these skills at their disposal, or only a small number of experts within them do. This too has a solution with services such as Integrate.io. Integrate.io runs Hadoop clusters for you in the cloud. You literally run these clusters by clicking a button - no knowledge of setting up and managing Hadoop is required.
Data Storage
Instead of using local or virtual Hadoop instances to store data, file storage services such as Amazon S3 are available. They have better scalability, durability, and persistence than HDFS, not to mention extremely low prices - see our review of S3 vs. HDFS. If your company’s data is being kept on premise, tools such as Attunity’s CloudBeam could help to get it up on the cloud. Nonetheless, these days more and more data from different sources, such as web logs or social data, is already being stored on the cloud.
Reporting
The reporting department is not on par with the rest of its cloud counterparts at this time. You’ll probably need to setup servers on the cloud with the relevant tools and connect to them via remote desktop. Some companies though, such as Tableau and Chartio are starting to offer hosted versions of their products. Other BI and reporting platforms worth mentioning are GoodData, BIME, and Reporting Services that runs on Azure.
Summary
Designing your data warehouse infrastructure on the cloud can help to reduce costs while increasing scalability and elasticity. A wide array of solutions is available - from setting up your own virtual instances, to using hosted services. Data storage is extremely useful on the cloud, and even reporting tools are starting to make their way up there.
Related reading:
5 Big Data Problems and How to Solve Them
Processing Unstructured Data 101