The data lake concept has come back into popular focus with Amazon Athena, a serverless query service. But does it fit into your organization’s data stack? This article covers Amazon Athena's capabilities, pros and cons, competitors, and use cases.
What Is Amazon Athena?
Amazon Athena is an interactive query service designed to run standard SQL queries directly against data stored in Amazon Simple Storage Service (Amazon S3). The biggest differentiator with Athena is that it’s a serverless solution. You don’t install or deploy anything, and pricing is based on the amount of data each query scans.
The foundation of Amazon Athena's data lake functionality is Presto, an open-source distributed SQL query engine for big data that is fast, powerful, and scalable.
The service is set up to be quick and easy to use and only requires a few simple steps:
- Point Athena at the data you want to query in Amazon S3.
- Define your database schema.
- Use standard SQL to query this data.
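The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not a definitive setup: the database, table, column names, and bucket paths are all hypothetical, and `submit_query` assumes the separately installed `boto3` package plus configured AWS credentials. The DDL builder itself uses only the standard library.

```python
def build_create_table_ddl(database, table, columns, s3_location, stored_as="PARQUET"):
    """Return a CREATE EXTERNAL TABLE statement describing data already in S3.

    `columns` is a list of (name, type) pairs. The statement only registers
    a schema over the files at `s3_location`; it does not move any data.
    """
    column_defs = ",\n  ".join(f"{name} {col_type}" for name, col_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {column_defs}\n"
        f")\n"
        f"STORED AS {stored_as}\n"
        f"LOCATION '{s3_location}'"
    )


def submit_query(sql, database, output_location):
    """Submit a statement to Athena and return its query execution ID.

    Assumes boto3 is installed and AWS credentials are configured.
    The import is local so the rest of this sketch runs without boto3.
    """
    import boto3

    client = boto3.client("athena")
    response = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Steps 1 and 2: point Athena at the S3 data and define the schema.
# All names below are hypothetical placeholders.
ddl = build_create_table_ddl(
    database="sales_lake",
    table="orders",
    columns=[("order_id", "string"), ("amount", "double"), ("order_date", "date")],
    s3_location="s3://example-bucket/orders/",
)

# Step 3: query the data with standard SQL.
query = (
    "SELECT order_date, SUM(amount) AS revenue "
    "FROM sales_lake.orders GROUP BY order_date"
)

print(ddl)
```

To run the statements for real, each string would be passed to `submit_query` along with an S3 location where Athena can write its results, for example a hypothetical `s3://example-bucket/athena-results/`.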
Amazon Athena also includes native integration with AWS Glue Data Catalog to expand its functionality. When you use Athena and Glue together, you can develop a unified metadata repository and open up other powerful capabilities.
Athena's ability to query data in place goes a long way toward making Amazon S3 an ideal environment for an organizational data lake. Data from all sources lands in S3 and then gets queried on an ad hoc basis in a performant, scalable, and accessible manner. This service is a data professional's dream come true.
One important thing to remember: you still need to get that data into Amazon S3. It can come from other services you use on Amazon, such as RDS and EMR; services outside AWS, such as Salesforce and Google Analytics; and data stores on other platforms.
The Unified Stack for Modern Data Teams
Get a personalized platform demo & 30-minute Q&A session with a Solution Engineer
Benefits of Amazon Athena
- No infrastructure setup, configuration, or management: The serverless approach that Athena takes is a game-changer and reduces infrastructure complexity in your data analytics ecosystem. Since Athena is designed specifically to work with S3, its operations are also highly optimized for this data store.
- Query S3 data almost instantly: You can go from setup to querying in a few short minutes. The response time for most queries is measured in seconds.
- Cost-effective: The pay-per-query model of Athena allows your costs to scale based on your usage. Since Athena works directly with S3 data without needing to store it elsewhere or go through data preparation processes, you avoid additional fees for computing or storage capabilities.
- Highly available: Since it’s an AWS managed service, it’s backed by highly available infrastructure that distributes the workload across multiple devices and executes queries in parallel.
- ANSI SQL support: You can continue to use SQL queries for working with your Amazon S3 data. Your data team doesn't need to learn a new query language, so onboarding is a simple process.
- Broad unstructured, semi-structured, and structured data support: The formats supported on Athena include CSV, Parquet, ORC, Avro, and JSON. You can work with relational, object, custom, and non-relational sources for this data.
- Supports ad hoc and complex queries: Athena offers advanced capabilities that include arrays, large joins, and window functions.
- User-friendly interface: Amazon Athena is intuitive to use, allowing the data team to spend more time focusing on the data and less time trying to get this service working properly.
- Integrate your favorite business intelligence tool: Athena has a JDBC driver that allows you to bring this powerful query engine into your BI solutions. You can equip your data team with faster insights that allow them to react in the moment.
- Strong security: You have many data access control options included in Amazon Athena, as well as support for querying encrypted data and then writing encrypted results to keep sensitive data safe.
- Works with other AWS services: Some Amazon tools that Amazon Athena easily works with include Amazon Redshift, Amazon DynamoDB, and Amazon CloudWatch Metrics.
Amazon Athena Pros
The serverless infrastructure is Amazon Athena’s greatest strength, especially if you’re already using Amazon S3 for your data. It’s quick and easy to implement, making it excellent for businesses that have limited development resources for setting up more complex solutions.
Being able to work with virtually any data source in a wide variety of formats is another feature that’s particularly helpful. You can keep all of this data in its native format, reducing the steps it takes to go from data storage to insight.
If you compress your data and store it in a columnar format such as Parquet or ORC, each query scans fewer bytes, which keeps costs relatively low. You also realize cost savings by not needing to manage or maintain the system.
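A rough back-of-the-envelope model shows why columnar storage and compression matter when billing follows bytes scanned. The sketch below is an illustration under simplifying assumptions, not a measurement: it treats every column as the same size, and the column counts and compression ratio are hypothetical numbers chosen for the example.

```python
def estimated_scan_bytes(raw_bytes, columns_read, total_columns, compression_ratio):
    """Estimate bytes scanned when a query reads only some columns.

    Simplifying assumptions: all columns are the same size, and
    compression_ratio is uncompressed size divided by compressed size.
    """
    column_fraction = columns_read / total_columns
    return raw_bytes * column_fraction / compression_ratio


RAW_BYTES = 1_000_000_000_000  # hypothetical dataset: about 1 TB uncompressed

# Row-oriented, uncompressed text: every query reads all 20 columns.
row_scan = estimated_scan_bytes(RAW_BYTES, columns_read=20, total_columns=20,
                                compression_ratio=1.0)

# Columnar and compressed: the query touches 3 of 20 columns,
# with an assumed 4:1 compression ratio.
columnar_scan = estimated_scan_bytes(RAW_BYTES, columns_read=3, total_columns=20,
                                     compression_ratio=4.0)

savings = 1 - columnar_scan / row_scan
print(f"Scan reduction: {savings:.1%}")
```

Real savings depend on the data and the queries, so actual scan statistics from your own workload are a better guide than this model.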
Amazon Athena Cons
Amazon Athena shines with ad hoc, smaller-scale queries, but it can fall short with large data sets compared to other query engines. The service has several limitations that make it more difficult to scale seamlessly, and if you want to increase your capacity, you have to submit a request to AWS and wait for a response. The amount of data you work with plays a key role in whether this solution fits your needs.
You can only use Amazon S3 as your data store for this tool. If you prefer a different provider, you are out of luck. Vendor lock-in is a serious concern with Athena, which could be a large enough disadvantage to skip over the platform entirely.
Predicting your Athena costs can be challenging, as the pricing is based on how much data you scan with each query. Since this service works best for ad hoc queries, your data scientists may be querying vastly different data set sizes with each project.
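Because pricing follows bytes scanned, a small estimator can make the spread between queries concrete. The sketch below is hedged accordingly: the $5 per TB rate, the 10 MB per-query minimum, and the binary (1024-based) definition of a terabyte are assumptions to check against current AWS pricing for your region, and the query sizes are hypothetical.

```python
PRICE_PER_TB_USD = 5.00               # assumed rate; verify for your region
BYTES_PER_TB = 1024 ** 4              # assumption: binary terabyte
MIN_BYTES_PER_QUERY = 10 * 1024 ** 2  # assumed 10 MB minimum charge per query


def estimate_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost of one query from the bytes it scanned."""
    billable_bytes = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable_bytes / BYTES_PER_TB * price_per_tb


# Hypothetical ad hoc workload: three queries of very different sizes.
GB = 1024 ** 3
scans = [50 * GB, 200 * GB, 2 * 1024 ** 2]

for scanned in scans:
    print(f"{scanned:>15,} bytes -> ${estimate_query_cost(scanned):.4f}")

total = sum(estimate_query_cost(s) for s in scans)
print(f"Estimated total: ${total:.4f}")
```

Feeding the estimator with the bytes-scanned figure that Athena reports for each completed query, rather than guessed sizes, would turn it into a simple way to track spend per project.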
What Is Amazon Athena Comparable To?
Athena is most comparable to Google BigQuery, as it has a similar focus. However, Athena requires that you handle the underlying data files, format, and directory structure. You get more flexibility with this approach.
Another service that’s similar to Athena is Amazon Redshift Spectrum. Like Athena, Redshift Spectrum is a serverless query processing engine. However, it’s designed to join data between an S3 bucket and an Amazon Redshift relational OLAP database. If you don't plan on using Redshift, then you won't get a lot of value out of this service.
Who Should Use Amazon Athena?
Athena is most suitable for businesses that already use Amazon S3 and have relatively basic data query needs. SMBs don’t have the same scale of data sets as enterprises, and they also don’t have the same technical resources. Serverless architecture removes the overhead of managing the platform, so the data team and other users can focus on running a variety of ad hoc queries.
Another use case for Athena is quickly getting insights out of your data. Since the data is analyzed as-is, where-is, you can query it without needing to wait for it to become available. Minutes and seconds can make a huge difference when it comes to taking advantage of a new trend or opportunity.
How Integrate.io Can Help
Get the most out of Amazon Athena with the help of Integrate.io’s ETL solution. Our platform can write data to Amazon S3 in all the data formats Athena supports. It also connects Amazon S3 to more than 100 data sources and destinations, so you can regularly pump data into your data lake from external data stores.
Like Athena, Integrate.io is delivered as a service. Companies using our technology don’t have to worry about maintenance or administration. Learn more about how we can help with your data lake and AWS services with a fourteen-day demo of our platform.