Data partitioning speeds up your Amazon Athena queries and reduces your cost, because each query scans less data.
Partitioning means splitting the data into related groups. Possible partition keys include date (time-based), zip code, or type (context). When a query filters on a partition key, it is constrained to the matching partitions and does not read the whole dataset. Because Amazon Athena charges by the number of bytes scanned, partitioning has a big impact on both the speed and the cost of your queries.
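To make the effect concrete, here is a minimal Python sketch that models partition pruning. The Hive-style `year=/month=/day=` layout, the partition sizes, and the price per terabyte are all assumptions chosen for illustration; they are not taken from our setup, and you should check current Athena pricing for your region.

```python
from datetime import date

# Hypothetical data sizes in bytes, keyed by Hive-style partition prefix.
PARTITION_SIZES = {
    "year=2019/month=01/day=14/": 40_000_000_000,
    "year=2019/month=01/day=15/": 50_000_000_000,
    "year=2019/month=01/day=16/": 60_000_000_000,
}

# Assumed price per terabyte scanned (1 TB treated as 10**12 bytes for simplicity).
PRICE_PER_TB_USD = 5.0


def partition_prefix(day: date) -> str:
    """Build a Hive-style key prefix (key=value/) for one day of data."""
    return f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


def bytes_scanned(sizes, wanted_prefixes=None):
    """Sum the bytes a query reads; None means no partition filter (full scan)."""
    if wanted_prefixes is None:
        return sum(sizes.values())
    return sum(size for prefix, size in sizes.items() if prefix in wanted_prefixes)


def cost_usd(num_bytes):
    """Estimate query cost from bytes scanned under the assumed price."""
    return num_bytes / 1e12 * PRICE_PER_TB_USD


full_scan = bytes_scanned(PARTITION_SIZES)
one_day = bytes_scanned(PARTITION_SIZES, {partition_prefix(date(2019, 1, 15))})
print(f"full scan: {full_scan} bytes, about ${cost_usd(full_scan):.2f}")
print(f"one day:   {one_day} bytes, about ${cost_usd(one_day):.2f}")
```

In this toy model a query that filters on a single day reads one third of the bytes of a full scan, and the estimated cost drops in the same proportion.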
To start using Amazon Athena, you need to define your table schemas in AWS Glue. AWS Glue is a managed ETL (extract, transform, and load) service that prepares and loads data for analytics. To define the data schema, you can either write a static schema yourself or use a crawler. A crawler scans the data, determines its schema automatically, and uses the result to populate the AWS Glue Data Catalog with tables.
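As an illustration of the static-schema option, the sketch below renders a `CREATE EXTERNAL TABLE` statement with partition columns, which can be run in Athena to register the table in the Data Catalog. The table name, column names, bucket, and the Parquet storage format are hypothetical placeholders, not the schema used in this post.

```python
def create_table_ddl(table, columns, partition_keys, location, storage="PARQUET"):
    """Render an Athena CREATE EXTERNAL TABLE statement with partition columns.

    columns and partition_keys are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"{name} {dtype}" for name, dtype in columns)
    parts = ", ".join(f"{name} {dtype}" for name, dtype in partition_keys)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        f"STORED AS {storage}\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table definition, for illustration only.
ddl = create_table_ddl(
    table="events",
    columns=[("event_id", "string"), ("payload", "string")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
    location="s3://example-bucket/events/",
)
print(ddl)
```

Note that the partition keys are declared separately from the regular columns: their values come from the S3 key prefix rather than from the data files themselves.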
In our use case we had two problems to solve:
- Incoming event data from Firehose needs to be partitioned in S3 to achieve lower costs and faster queries in Athena;
- When we partition our data, we need to make Athena aware of the new partitioning scheme in an automated way.
Below we explain these two problems and their solutions in detail.