Querying S3 with Presto

This post assumes you have an AWS account and a running Presto instance (standalone or cluster). We'll use the Presto CLI to run queries against the Yelp dataset, a JSON dump of a subset of Yelp's data covering businesses, reviews, checkins, users, and tips.

Configure Hive metastore

Configure the Hive metastore to point at our data in S3. We are using the…


Creating a Presto Cluster

I first came across Presto when researching data virtualization: the idea that all of your data can be integrated regardless of its format or storage location. One can use scripts or periodic jobs to mash up data or create regular reports from several independent sources. However, these methods don't scale well, especially when the queries change frequently or the data…
