The requirement for big data analytics has become commonplace in most companies today. Data has always been an essential part of any business, but the need to extract useful information from that data has been on a steady rise in just the past decade.
With the amount of data coming in from numerous sources growing exponentially every day, it’s easy to understand why big data analytics can be a challenge. It’s not only about getting the information you need, it’s also about doing it in a way that will not exceed your budget. And then there’s security to think about, too.
As with anything data-related, there are other, simpler issues at hand—such as storing all that data coming in. At least this part has undergone a tremendous change with the introduction of public clouds like AWS, which allows for almost unlimited storage capabilities. More importantly, AWS offers managed services that can actually help us analyze data.
So let’s take a look at two such AWS services, and see which one would be a better fit for your business needs.
AWS Services for Big Data Analytics
When it comes to data analytics, especially when working with large data sets, there are two AWS services you should consider: AWS Redshift and AWS Athena. So let’s take a closer look at each one.
AWS Redshift
AWS Redshift is a cloud-based data warehouse and analytics service offered by Amazon’s public cloud and based on PostgreSQL. It’s a column-oriented database designed to work with SQL-based business intelligence tools, providing data to users in real time.
AWS Redshift is also a fully managed petabyte-scale solution that allows for various use cases. One of the most common of these is business intelligence, as Redshift provides high-performance queries on petabytes of semi-structured and structured data—and does so very cost-efficiently.
Analytics
More interestingly, AWS Redshift allows for operational analytics, so you can, for example, pull massive amounts of data from S3 (e.g., logs) and get real-time operational insights. You can also use Redshift for predictive analytics, whether to run your own machine learning models or to integrate it with other AWS services like SageMaker.
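As a rough illustration of pulling logs from S3 into Redshift, the sketch below builds a `COPY` statement, the command Redshift uses for bulk loads. The table, bucket, and IAM role names are hypothetical placeholders, and it assumes the logs are JSON files.

```python
def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str) -> str:
    """Return a Redshift COPY statement that bulk-loads JSON files from S3."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_uri}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        "FORMAT AS JSON 'auto';"
    )


# Hypothetical table, bucket, and role names, for illustration only.
copy_sql = build_copy_statement(
    table="app_logs",
    s3_uri="s3://example-log-bucket/2021/",
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftCopyRole",
)
print(copy_sql)
```

You would run the resulting statement from any SQL client connected to the cluster. The role it names needs read access to the bucket.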
Redshift can even combine multiple data sources (structured data, semi-structured data, and unstructured data).
Fully Managed, Fast, Familiar
And the fact that Redshift is a fully managed service means there is almost no overhead required for maintenance. You let AWS take care of the Redshift cluster, while you focus on the data. If you require more storage capacity, you can scale Redshift as much as desired—a flexibility that separates cloud computing from the old days of bare metal hardware.
Not only will you get extra resources based on your needs, but the cluster will also scale down when necessary to keep costs as low as possible. AWS Redshift is fast and offers encryption for your data. And since it’s based on PostgreSQL, it feels familiar: most standard SQL queries are supported, although not every PostgreSQL feature is available.
Pricing
With Redshift, you can choose between on-demand and reserved instance pricing. On-demand pricing lets you pay by the hour (based on the node type you choose) and is very flexible, as there are no upfront commitments.
Reserved instance pricing, on the other hand, has you commit to a 1- or 3-year contract, with no upfront, partial upfront, or all upfront options and associated discounts. This can reduce your pricing by up to 75% and is an absolute must-have when you’ve already tested everything and know exactly what hardware is needed to handle your required tasks.
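To see what that discount means in practice, here is a minimal sketch of the arithmetic. The hourly rate and node count are made-up figures, not actual AWS prices. The 75% figure is the maximum discount quoted above, so real savings may be lower.

```python
HOURS_PER_YEAR = 24 * 365


def yearly_cost(hourly_rate: float, nodes: int, discount: float = 0.0) -> float:
    """Yearly cost of a cluster that runs around the clock.

    discount is a fraction between 0 and 1 (0.75 means 75% off).
    """
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_rate * nodes * HOURS_PER_YEAR * (1.0 - discount)


# Hypothetical figures: two nodes at $0.25 per node-hour.
on_demand = yearly_cost(hourly_rate=0.25, nodes=2)
reserved = yearly_cost(hourly_rate=0.25, nodes=2, discount=0.75)

print(f"On-demand: ${on_demand:,.2f} per year")
print(f"Reserved (best case): ${reserved:,.2f} per year")
```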
AWS Athena
AWS Athena works differently compared to AWS Redshift, as it is not a data warehouse but an interactive query service. With it, you can easily analyze data that is already stored in S3 buckets. It is also a serverless product, meaning there is no infrastructure to manage or even think about. Athena is very easy to use, too: all you have to do is choose the data, define the schema to be used, and start querying with SQL.
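The "choose the data, define the schema, start querying" flow can be sketched in Python. The function below assumes a boto3 Athena client is passed in. It uses the `start_query_execution` and `get_query_execution` calls, and the database, table, and bucket names in the usage comment are hypothetical.

```python
import time


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query and poll until Athena reports a terminal state.

    client is expected to behave like boto3.client("athena").
    Returns a (query_execution_id, final_state) tuple.
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(poll_seconds)


# Example usage (needs AWS credentials; all names are placeholders):
#
#   import boto3
#   client = boto3.client("athena")
#   run_athena_query(
#       client,
#       "SELECT status, COUNT(*) FROM access_logs GROUP BY status",
#       database="example_db",
#       output_location="s3://example-athena-results/",
#   )
```

Athena writes the results to the S3 output location, so the caller only needs the returned query ID to fetch them later.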
Fast Analytics on Simple Queries
There are a few visible benefits of using Athena right from the start. For example, you don’t need to think about data and whether it is prepared or not—you simply start querying. Also, Athena is very fast, with results returned sometimes within seconds.
Athena is used for various kinds of data analytics, including ad-hoc analytics, some streaming analytics, and data lake analytics. But where AWS Redshift is designed with complex queries coming from multiple sources in mind, Athena is used more for simple queries run on a single data source.
Security
As AWS Athena pulls data from S3 buckets, security is something to think about as well. Thankfully, Athena provides both encryption at rest (with server-side encryption, client-side encryption, and KMS encryption in place) and encryption in transit (TLS-level encryption is used for transit between S3 and Athena, and KMS is supported to encrypt various data sets from Athena query results).
Limitations
When it comes to AWS Athena, there are some limitations that apply. For example, you can submit only one query at a time, and each account can have no more than five queries running concurrently. Additionally, Athena does not support queries across all available regions, so make sure you are aware of the location of your data before you start using it.
Also, Athena does not yet support data queries for objects stored in AWS Glacier. For the full list of limitations, you can visit the AWS documentation page.
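One way to stay under a concurrency limit like this on the client side is to cap how many queries are in flight at once. The sketch below uses a thread pool for that. The limit of five is the figure quoted in this article, so check your account's current quota. `fake_run` is a stand-in for whatever function actually submits a query.

```python
from concurrent.futures import ThreadPoolExecutor

# Figure quoted in this article; your account's quota may differ.
MAX_CONCURRENT_QUERIES = 5


def run_all(queries, run_one, max_concurrent=MAX_CONCURRENT_QUERIES):
    """Run every query through run_one, with at most max_concurrent in flight.

    Results come back in the same order as the input queries.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_one, queries))


def fake_run(sql):
    """Placeholder for a real submit-and-wait function."""
    return f"done: {sql}"


results = run_all([f"SELECT {i}" for i in range(12)], fake_run)
print(len(results), "queries finished")
```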
Pricing
AWS Athena is priced only for the data scanned by the query you’re performing. This can be a significant advantage, as you don’t need to have a cost strategy in place. The cost of using Athena in the US-East-1 region (North Virginia) is $5.00 per TB of data scanned.
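A quick way to estimate the bill is to multiply the data scanned by the per-TB rate. The sketch below uses the $5.00 us-east-1 rate quoted above and made-up scan volumes. It ignores any rounding or per-query minimums that AWS may apply.

```python
PRICE_PER_TB_USD = 5.00  # us-east-1 rate quoted in this article


def athena_query_cost(tb_scanned: float, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimated cost in USD of scanning tb_scanned terabytes."""
    if tb_scanned < 0:
        raise ValueError("tb_scanned must not be negative")
    return tb_scanned * price_per_tb


# Hypothetical workload: 30 queries a day, each scanning 0.2 TB, for 30 days.
single_query = athena_query_cost(0.2)
monthly = single_query * 30 * 30

print(f"One query: ${single_query:.2f}")
print(f"One month: ${monthly:,.2f}")
```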
But do note that by compressing, partitioning, and converting your data into columnar formats, you can achieve not only significant savings on Athena queries, but also better performance.
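One common way to do that conversion is an Athena `CREATE TABLE AS SELECT` (CTAS) statement that rewrites a table as partitioned Parquet. The helper below only builds the SQL text. The table, column, and bucket names are hypothetical. Partition columns go last in the select list, which is why the helper appends them after the regular columns.

```python
def build_parquet_ctas(source_table, target_table, external_location,
                       columns, partition_column):
    """Return a CTAS statement that rewrites source_table as partitioned Parquet."""
    # The partition column must come last in the SELECT list.
    select_list = ", ".join([*columns, partition_column])
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        f"  external_location = '{external_location}',\n"
        f"  partitioned_by = ARRAY['{partition_column}']\n"
        ")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table};"
    )


# Hypothetical names, for illustration only.
ctas_sql = build_parquet_ctas(
    source_table="raw_logs",
    target_table="logs_parquet",
    external_location="s3://example-data-lake/logs_parquet/",
    columns=["request_id", "status", "bytes_sent"],
    partition_column="event_date",
)
print(ctas_sql)
```

Queries that filter on `event_date` then read only the matching partitions, and only the columns they reference, so less data is scanned.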
AWS Redshift and AWS Athena Comparison Chart
Feature | AWS Redshift | AWS Athena |
Service type | Fully managed service, using nodes | Fully managed service, serverless model |
Use cases | Business intelligence, data analytics; designed for complex queries for multiple data sources | Data analytics; designed for querying a single resource |
Infrastructure | Runs on nodes within a cluster | Runs on serverless infrastructure, not visible or accessible to users |
Storage used | Nodes | S3 |
Pricing model | Cost per node used | Cost per amount of data scanned by the query |
Data types used | Wide variety supported | Wide variety supported |
Query engine and language | Based on PostgreSQL; generally faster than AWS Athena | Uses Presto with standard SQL for queries and Apache Hive DDL for table definitions |
Scaling capabilities | Resizing options available | Managed by AWS |
Security | Data encryption supported | Data encryption supported |
Performance | Depends on the node size and whether data is sorted | Depends on the queries you are performing |
User-defined function (UDF) support | Supports scalar UDFs written in SQL or Python | Supports scalar UDFs backed by AWS Lambda |
N2WS Backup & Recovery
N2WS Backup & Recovery is a cloud-native product that can help you with your data backup needs. It has been around for a while now and is constantly evolving—with the 4.0 release even bringing support for Microsoft Azure Cloud.
N2WS Backup & Recovery can (among many other things) be used for Redshift backups, so if you’re looking for a quality product that will ensure business continuity, make sure you give it a try.
Tracking Big Data Analytics in AWS
AWS Redshift and AWS Athena approach data analytics from different angles. Redshift is primarily a data warehouse service, which—depending on your business needs—can be a benefit or an unnecessary overhead.
- Athena is designed for data analytics and requires no data preparation nor carries additional costs for storage, as the data is already located in S3 buckets.
- AWS Redshift is overall faster when it comes to performance and offers more complex query capabilities, but that additional speed comes at the price of having various instances running (and being paid for).
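To put rough numbers on that trade-off, the sketch below estimates how many terabytes Athena would have to scan in a month before it costs as much as an always-on Redshift cluster. The node rate and node count are made-up figures. The Athena rate is the $5.00 per TB quoted earlier. Storage costs, discounts, and performance differences are left out.

```python
def break_even_tb_per_month(node_hourly_rate: float, nodes: int,
                            athena_price_per_tb: float = 5.00,
                            hours_per_month: float = 730.0) -> float:
    """TB scanned per month at which Athena costs the same as the cluster."""
    cluster_monthly_cost = node_hourly_rate * nodes * hours_per_month
    return cluster_monthly_cost / athena_price_per_tb


# Hypothetical cluster: two nodes at $0.25 per node-hour.
threshold = break_even_tb_per_month(node_hourly_rate=0.25, nodes=2)
print(f"Break-even at about {threshold:.0f} TB scanned per month")
```

Below that volume, paying per query is the cheaper option on these assumptions. Above it, the fixed cluster cost starts to pay off.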
All this means that there are multiple factors to consider when trying to find the best solution for you. Query capabilities should steer your decision, but you also need to consider price. More importantly, your choice should depend on whether you’re already using Redshift (or even have data warehouse needs) or not.
Or do you want to spend additional time working with a full data warehouse at all, instead of issuing simple queries that can do the job for you?
If you’re not sure which way to go, your best option is to test them both. Athena charges only for the data your queries scan, so you can easily try it out. On the other hand, AWS Redshift provides a free 2-month trial option (for those creating their first cluster).
Laurent is a Senior System Engineer at N2WS and AWS Certified Solutions Architect with more than 10 years of experience. (He's also both bilingual and the lead singer of a French rock band in the UK, making him très cool.)