Five Key Points Regarding Big Data Cloud Storage
- Big data refers to the large sets of data generated by a variety of programs, often too large to store on a regular computer.
- Cloud computing refers to the processing of big data in the cloud.
- The cloud refers to a set of high-powered servers that can store and query large data sets.
- Many companies are turning from traditional storage systems to cloud storage services to handle their big data.
- Integrate.io makes moving large amounts of data from traditional storage systems to cloud solutions and cloud storage providers simple through ETL capabilities and data management.
In the 21st century, data sits at the center of every business. Therefore, how companies handle, optimize, and automate their valuable business data is becoming more and more important. To make data storage a bit easier, many companies are moving their big data over to cloud-based data centers.
There are many advantages to moving your business data to the cloud. However, one of the biggest concerns when evaluating cloud storage service providers is often the price. Unfortunately, calculating the exact cost of cloud computing services can prove challenging due to complicated pricing models and the overall difficulty of predicting how many resources your business will need for complete cloud data management.
When examining the cloud storage costs associated with logging big data, there are essentially three ways to store big data on the cloud. These include storing it directly in the database, uploading log files, or logging via S3/CloudFront.
All things considered, it’s challenging to come up with an accurate one-size-fits-all figure. To understand cloud storage costs a little better, this article will highlight how much storing big data in the cloud will really cost your organization. Read on to learn more.
Storing Big Data Pricing Assumptions
We first need to set some parameters for our storage pricing model and outline some use cases. Of course, these parameters won't reflect every situation out there—but for the purposes of this exercise, we want to establish a few simple assumptions.
Here's what we'll be working with:
- Amazon Web Services, US East region
- 24/7 usage, 1-year reserved instances, heavy utilization
- 1 billion log lines per day, with an average of 1,000 bytes per log line; a total of 1 terabyte per day, or 30 terabytes per month
- Only storage costs (processing not included)
- Prices calculated using the AWS Pricing Calculator (all prices in USD)
Again, a disclaimer: these are only estimates. Your needs may be different from those listed here, performance tweaks might change the required hardware, and Amazon can modify prices at any given time. If you collect data in the cloud, please feel free to let us know which method you use and how much it costs.
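As a rough sanity check, the short sketch below reproduces the volume figures these assumptions imply. It is plain arithmetic, not a query against AWS pricing, and the constants simply restate the bullets above.

```python
# A quick sanity check of the volume assumptions above.
# Plain arithmetic only; no AWS APIs involved.

LOG_LINES_PER_DAY = 1_000_000_000   # 1 billion log lines per day
BYTES_PER_LINE = 1_000              # average size of one log line

bytes_per_day = LOG_LINES_PER_DAY * BYTES_PER_LINE
tb_per_day = bytes_per_day / 1e12                 # ~1 TB per day
tb_per_month = tb_per_day * 30                    # ~30 TB per month
lines_per_second = LOG_LINES_PER_DAY / 86_400     # ~11,600 lines/second on average

print(f"{tb_per_day:.1f} TB/day, {tb_per_month:.0f} TB/month, "
      f"{lines_per_second:,.0f} log lines/second")
```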
Storing Big Data Directly in the Database
AWS provides two options for running a relational database in the cloud: Relational Database Service (RDS) and a custom installation on Elastic Compute Cloud (EC2). In both cases, you'll need a log server to collect, generate, and store the logs.
Log Server
The log server should be able to handle 1 billion logs per day or roughly 11,000 logs per second on average. Some companies use in-house solutions, but they take plenty of time and money to develop and maintain. Let's go with an off-the-shelf logger like Fluentd and use a plugin to integrate it with a database.
4 Amazon EC2 r4.xlarge instances with Intel Xeon E5-2686 v4 (Broadwell) processors (4 vCPUs) and 30.5 gigabytes of RAM should be more than enough to handle logging, even at peak times, and to write the data into the database.
The hourly rate for a 1-year reserved instance is $0.168; with four instances, that's 4 × $0.168 × 24 × 365, or roughly $5,900 per year.
There are several possible extra charges, however:
- Transferring 30 terabytes of data per month from US East to another AWS region costs $0.02 per gigabyte, or $0.02 × 30 × 1,000 = $600 per month (about $7,200 per year).
- You'll need an Elastic Load Balancer to balance traffic between your instances. According to Amazon, a load balancer costs about $0.0225 per hour, or about $16 per month if running full-time.
4 Amazon EC2 r4.xlarge instances: $5,900 / year
Data transfers: $7,200 / year
Elastic Load Balancer: $200 / year
Total: $13,300 / year
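To make the arithmetic explicit, here is a small sketch that reproduces the log server total from the rates quoted above; the hourly rates are the figures assumed in this post, not live AWS prices.

```python
# Rough yearly cost of the log server tier, using the rates quoted above.

R4_XLARGE_HOURLY = 0.168        # 1-year reserved instance, per hour
INSTANCES = 4
TRANSFER_PER_GB = 0.02          # US East to another AWS region
TB_TRANSFERRED_PER_MONTH = 30
ELB_HOURLY = 0.0225             # Elastic Load Balancer, running full-time

instances = INSTANCES * R4_XLARGE_HOURLY * 24 * 365                   # ~$5,900
transfers = TRANSFER_PER_GB * TB_TRANSFERRED_PER_MONTH * 1_000 * 12   # ~$7,200
load_balancer = ELB_HOURLY * 24 * 365                                 # ~$200

print(f"Log server total: ~${instances + transfers + load_balancer:,.0f} per year")  # ~$13,300
```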
Amazon RDS
First, a note on Amazon RDS: RDS databases have a storage limit of 64 terabytes, which means they'll be too small for our needs in this example: we'd run out of room in just over two months! There is an option to use Amazon RDS for sharding (see this link for details), but in this post, we'll focus on other, more appropriate options.
Elastic Compute Cloud (EC2)
Running MySQL on EC2 requires a lot of space. A storage-optimized d2.8xlarge instance with a total of 48 terabytes of storage should do. With 30 terabytes of data generated each month, you'll need another instance every month and a half, for a total of 8 instances throughout the year. It's cheaper to reserve them in advance for one year than to run on-demand instances and keep scaling. A single d2.8xlarge instance costs around $3.216 per hour, or about $2,300 per month and $28,000 per year; with eight instances, that comes to roughly $225,000 per year.
Unfortunately, you lose all your data when stopping instances (although the data stays when rebooting the virtual machine). To make sure it stays put, you'll need an Amazon EBS Provisioned IOPS SSD volume, and these don't come cheap.
A more affordable option is to keep only a month's worth of raw data and aggregations for older data while archiving the rest on S3. You can gzip log files at a ratio of roughly 1:4, which means that 7.5 terabytes of local storage per month will suffice. On S3, each 7.5-terabyte batch costs around $170 per month, which adds up to about $13,700 for the year (the first batch is stored for 12 months, the next for 11 months, and so on). We'll only need one d2.8xlarge instance, which costs $28,000 per year, as mentioned above.
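Because S3 keeps billing for everything already archived, the yearly figure is a running sum rather than a flat twelve times the monthly cost. Here is a minimal sketch of that accumulation, using the rates assumed above:

```python
# Year-one cost of archiving compressed logs on S3.
# Each month adds ~7.5 TB (30 TB of raw logs, gzipped at roughly 1:4),
# and every month S3 bills for everything stored so far.

S3_PER_GB_MONTH = 0.023               # S3 standard storage, per GB-month
GB_ADDED_PER_MONTH = 7.5 * 1_024      # ~7.5 TB of compressed logs per month

total = 0.0
stored_gb = 0.0
for month in range(12):
    stored_gb += GB_ADDED_PER_MONTH
    total += stored_gb * S3_PER_GB_MONTH

print(f"S3 archive, year one: ~${total:,.0f}")   # ~$13,800, close to the ~$13,700 above
```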
Together, these costs come to an estimated $55,000/year for storing big data directly in the database.
Log Server: $13,300 / year
S3 Storage: $13,700 / year
d2.8xlarge instance: $28,000 / year
Total: $55,000 / year
Uploading Log Files
In this case, data is stored as big log files that are continuously uploaded into S3, DynamoDB, or Redshift.
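As a minimal sketch of the upload step, assuming a log file already on local disk and an S3 bucket you control (the bucket name, key, and file path below are hypothetical), the compress-and-upload flow could look like this with boto3:

```python
import gzip
import shutil

import boto3  # requires AWS credentials to be configured separately

def upload_compressed_log(local_path: str, bucket: str, key: str) -> None:
    """Gzip a local log file and upload the compressed copy to S3."""
    compressed_path = local_path + ".gz"
    with open(local_path, "rb") as src, gzip.open(compressed_path, "wb") as dst:
        shutil.copyfileobj(src, dst)   # text logs typically compress around 1:4

    s3 = boto3.client("s3")
    s3.upload_file(compressed_path, bucket, key)

# Hypothetical usage:
# upload_compressed_log("/var/log/app/2024-01-01.log",
#                       "my-log-archive", "raw/2024/01/01.log.gz")
```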
Log Server
The requirements for the log server are the same as in the previous method, except that you'll be saving the data as files rather than writing it to the database. See the prices above.
S3
S3 standard storage costs $0.023 per gigabyte per month; for the 7.5 terabytes added each month, that's about $170 per month, or $13,700 per year, as previously calculated. Added to the cost of running the log server, the total comes to about $30,000/year. (The cost of storing the transparent image files themselves is marginal, since each file is only 68 bytes.)
DynamoDB
DynamoDB charges $0.25 per gigabyte per month for data storage. Adding 30 terabytes per month works out to roughly $7,700 for the first month, and because the stored data accumulates, the total comes to roughly $600,000 for the year. Combined with the cost of the log server, that's about $613,300/year.
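The same accumulation effect drives the DynamoDB figure: every month's 30 terabytes keeps billing for the rest of the year. A back-of-the-envelope sketch (storage only; read/write capacity is not included):

```python
# Year-one DynamoDB storage cost under the stated assumptions.

DYNAMO_PER_GB_MONTH = 0.25            # DynamoDB storage, per GB-month
GB_ADDED_PER_MONTH = 30 * 1_024       # 30 TB of new data each month

total = 0.0
stored_gb = 0.0
for month in range(12):
    stored_gb += GB_ADDED_PER_MONTH
    total += stored_gb * DYNAMO_PER_GB_MONTH

print(f"DynamoDB storage, year one: ~${total:,.0f}")   # roughly $600,000
```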
Redshift
To have 360 terabytes available for the entire year, you'll need 23 instances of ds2.8xlarge (16 TB of space each). Reserved for one year and paid upfront, that will cost you around $790,000.
A more cost-effective option is to only save one month's worth of data using two ds2.8xlarge instances while archiving the rest on Amazon S3. The cost will be roughly $69,000, paid upfront for a 1-year term, plus the costs of S3 calculated above.
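To show how the two Redshift layouts compare, here is a small sketch; the per-node yearly cost is inferred from the $69,000 quoted for two reserved ds2.8xlarge nodes, so treat it as an approximation rather than a published price.

```python
# Comparing full retention in Redshift with one-month retention plus S3 archiving.

DS2_8XLARGE_PER_YEAR = 69_000 / 2   # assumed per-node cost, inferred from the figure above
NODE_CAPACITY_TB = 16

# Option A: keep the full year's 360 TB in Redshift.
nodes_full = -(-360 // NODE_CAPACITY_TB)          # ceil(360 / 16) = 23 nodes
cost_full = nodes_full * DS2_8XLARGE_PER_YEAR     # ~$790,000

# Option B: keep only ~1 month (30 TB) in Redshift and archive the rest on S3.
nodes_month = -(-30 // NODE_CAPACITY_TB)          # 2 nodes
cost_month = nodes_month * DS2_8XLARGE_PER_YEAR   # ~$69,000 (plus S3 archive costs)

print(f"Full retention:    {nodes_full} nodes, ~${cost_full:,.0f}/year")
print(f"1-month retention: {nodes_month} nodes, ~${cost_month:,.0f}/year + S3")
```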
The total cost for uploading log files to Redshift comes to about $96,000/year.
Log Server: $13,300 / year
S3 Storage: $13,700 / year
ds2.8xlarge instances: $69,000 / year
Total: $96,000 / year
S3/CloudFront Logging
This method tracks events via HTTP requests to images from S3 directories, which automatically generate logs. It needs no extra logging servers and only 7.5 terabytes per month of storage. As previously calculated, 7.5 terabytes of storage per month on S3 is roughly $13,700 per year.
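To make the mechanism concrete, here is a minimal sketch of how an event could be encoded in the query string of a request for a transparent pixel. The distribution domain and parameter names are hypothetical; as the next paragraph notes, it is CloudFront, placed in front of the bucket, that actually records the query string in its access logs.

```python
from urllib.parse import urlencode

# Hypothetical CloudFront distribution serving a 1x1 transparent GIF stored in S3.
PIXEL_URL = "https://d1234example.cloudfront.net/pixel.gif"

def tracking_url(event: str, user_id: str, **extra: str) -> str:
    """Build a pixel URL whose query string carries the event data."""
    params = {"event": event, "uid": user_id, **extra}
    return f"{PIXEL_URL}?{urlencode(params)}"

# A client fetches this URL (for example via an <img> tag), and each request
# shows up as one line, query string included, in the CloudFront access logs.
print(tracking_url("page_view", "42", page="/pricing"))
```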
You'll need to use CloudFront as well, or features like logging via the query string won't work. CloudFront GET requests cost $0.0075 per 10,000 requests (see pricing). One billion HTTP requests will cost $750 per day or around $270,000 per year.
Normally there are charges for requests to S3 as well, but as long as you set caching headers, these charges will be minimal. Serving the transparent images incurs a data transfer of 68 bytes × 1 billion requests: 68 gigabytes per day, or about 2,040 gigabytes per month. Outbound data transfers from S3 cost $0.02 per gigabyte, which comes to about $40 per month, or roughly $500 per year.
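Putting the pieces together, the sketch below adds up the three line items for this method; the request cost is computed before rounding, so it comes out slightly above the ~$270,000 used in the running totals.

```python
# Yearly cost of S3/CloudFront logging under the stated assumptions.
# Request pricing dominates; the pixel's data transfer is almost negligible.

CLOUDFRONT_PER_10K_REQUESTS = 0.0075
REQUESTS_PER_DAY = 1_000_000_000
PIXEL_BYTES = 68
TRANSFER_PER_GB = 0.02
S3_STORAGE_PER_YEAR = 13_700          # log storage figure from above

requests = CLOUDFRONT_PER_10K_REQUESTS * REQUESTS_PER_DAY / 10_000 * 365   # ~$274,000
transfer = PIXEL_BYTES * REQUESTS_PER_DAY / 1e9 * TRANSFER_PER_GB * 365    # ~$500

print(f"Requests: ~${requests:,.0f}/year")
print(f"Transfer: ~${transfer:,.0f}/year")
print(f"Total:    ~${S3_STORAGE_PER_YEAR + requests + transfer:,.0f}/year")
```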
Adding these costs together, we get a total price of $284,200/year for storing big data via S3/CloudFront logging.
S3 Storage: $13,700 / year
CloudFront Requests: $270,000 / year
CloudFront Data Transfers: $500 / year
Total: $284,200 / year
Storing Big Data: What’s the Verdict?
As a reminder, here are the total costs of all the methods we've discussed:
- Directly in the database: $55,000 / year
- Uploading log files to S3: $30,000 / year
- Uploading log files to DynamoDB: $613,000 / year
- Uploading log files to Redshift: $96,000 / year
- S3/CloudFront logging: $284,000 / year
Based on these analyses, uploading log files to S3 is the cheapest way to store big data in the cloud. Contrary to some assumptions, S3/CloudFront logging is quite expensive.
Of course, this is just one example of how to calculate the cost of storing big data in the cloud; your mileage may vary based on your own business needs and objectives. Most importantly, there are a few big unknowns in the equation: DBA and developer costs for implementation and maintenance are not included, and neither are the costs of processing the data.
Nonetheless, we hope that this overview has helped you figure out a decent way to estimate the cheapest way to store data in the cloud.
Related Reading: Best 17 Data Warehouse Tools
How Integrate.io Can Help With Big Data Storage
Is your company in need of a little help when it comes to navigating the wild and complicated world of big data? If so, the Integrate.io platform is here to lend a helping hand.
Along with being a very fast CDC platform, Integrate.io is a top-level ETL platform with reverse ETL capabilities and deep e-commerce capabilities too. Whether your organization needs guidance with cloud computing, APIs, SQL, storage space/storage solutions, workloads, scalability, object storage, real-time data analytics, or data management, Integrate.io is here to help.
Are you ready to discover more about the many benefits the Integrate.io platform can provide to your company? If so, contact our team to schedule an intro call today. We look forward to helping you reach your cloud computing goals.