Using R with AWS Athena

Summary:

Complete guide to setting up AWS S3 and Athena for use with #RSTATS



AWS User Account Creation:



User Account Step 1:

  • Log into the AWS site and navigate to the IAM Management Console, found under the Security, Identity, & Compliance section
  • Next, go to Users and choose Add user
  • Enter a name for the new user and select the “Programmatic access” type




User Account Step 2:

Create a group so that you can assign the same permissions to additional users in the future




User Account Step 3:

  • In this example I created a group named “blogger”
  • Attach the following policies:
    • AWSQuicksightAthenaAccess
    • AmazonS3FullAccess
    • AmazonAthenaFullAccess




User Account Step 4:

Make sure to add the new user to the group you just created



User Account Step 5:

  • The last step shows that you successfully created the new user and provides you with the account’s Access key ID and Secret access key
  • Write these down: the access key will remain visible in the user account details screen, but the secret key is only shown here and will need to be reset if you lose it after leaving this screen.





AWS S3 bucket creation:



S3 Bucket Step 1:

  • Navigate to the S3 Console, found under the Storage section
  • Choose Create bucket



S3 Bucket Step 2:

  • Give the bucket a unique and informative name
  • Take note of the region you are working in; in this case, US West (Oregon)
  • For the R configuration later, this is “us-west-2”



S3 Bucket Step 3:

You can choose versioning, logging, etc., but I’ve left them all blank for this example



S3 Bucket Step 4:

Enable the four check boxes to keep the bucket private, if they are not already set



S3 Bucket Step 5:

Review the settings you chose and click Create bucket to complete the process



S3 Bucket Step 6:

Back on the S3 console you will now see the new bucket, listed as private and in the US West region

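
As an aside, once your AWS credentials are configured in R (next section), you could also create the bucket straight from R with the aws.s3 package instead of the console; a minimal sketch:


aws.s3::put_bucket("blog-s3-test-example", region = "us-west-2")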



Let’s get to the fun part in R:



  • For the remainder of this post we will use R to:

    • create a csv file
    • connect to AWS and populate the S3 bucket we just created with the csv
    • load the csv we added to S3 into a table in Athena
    • then pull that data out of Athena back into R
  • We will use the following packages; install commands are shown just after this list:

    • tidyverse (CRAN)
    • here (CRAN)
    • knitr (CRAN)
    • kableExtra (CRAN)
    • mobstr (GitHub): remotes::install_github("themechanicalbear/mobstr")
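
If any of these packages are missing, a minimal install looks like the following (remotes is only needed for the GitHub install of mobstr):


install.packages(c("tidyverse", "here", "knitr", "kableExtra", "remotes"))
remotes::install_github("themechanicalbear/mobstr")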


knitr::opts_chunk$set(message = FALSE, eval = FALSE, 
                      tidy.opts = list(width.cutoff = 100))

# Load all of the required packages quietly
suppressWarnings(suppressMessages(suppressPackageStartupMessages({
  library_list <- c("tidyverse", "here", "mobstr", "knitr", "kableExtra")
  lapply(library_list, require, character.only = TRUE)})))



Add AWS Credentials to .Rprofile:



  • Open your .Rprofile and add the following, with the values specific to the account we created above. The AWS_S3_REGION, AWS_ATHENA_DRIVER_FILE, and AWS_ATHENA_S3_OUTPUT entries are read by the mobstr functions shown later in this post.

  • This will allow your scripts to connect to AWS resources without having to include your keys in the scripts themselves.


Sys.setenv("AWS_ACCESS_KEY_ID" = "YOUR_AWS_ACCESS_KEY_ID",
           "AWS_SECRET_ACCESS_KEY" = "YOUR_AWS_SECRET_ACCESS_KEY",
           "AWS_DEFAULT_REGION" = "us-west-2", # Whatever region the S3 bucket you created is in
           "AWS_S3_REGION" = "us-west-2", # Read by mobstr::write_2_S3 below
           "AWS_ATHENA_DRIVER_FILE" = "PATH_TO_YOUR_ATHENA_ODBC_DRIVER", # Read by mobstr::athena_connect below
           "AWS_ATHENA_S3_OUTPUT" = "s3://YOUR_ATHENA_QUERY_RESULTS_LOCATION") # Where Athena writes query results


  • The mobstr package includes the write_2_S3 function to write stock data to AWS S3 for analysis.
    • We can use this function to write any csv file we want.
    • First, let’s create a sample csv to upload; here we’ll use the mtcars dataset.
    • I am saving this file to the static data directory so it is included in the blog folder structure; you don’t need to do that.

Athena reads every file in a table’s folder, so each table needs its own folder; the write_2_S3 function therefore creates a new folder inside our blog-s3-test-example bucket to hold our csv


readr::write_csv(mtcars, paste0(here::here(), "/static/data/mtcars.csv"))

mobstr::write_2_S3(file = "mtcars", 
                   path = paste0(here::here(), "/static/data/"),
                   bucket = "blog-s3-test-example")



Check AWS S3 for our file!



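
The bucket now contains an mtcars folder with mtcars.csv inside. If you would rather verify from R than the console, aws.s3::get_bucket() lists a bucket’s contents:


# The mtcars csv uploaded above should appear in this listing
aws.s3::get_bucket("blog-s3-test-example")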



Create the table in Athena



  • The mobstr package includes the athena_connect function to connect to the Athena service, given a database name. We are going to use the default database here.
  • The mobstr package also includes the athena_load function, which uses the connection to populate our table.
  • athena_load takes the following arguments:
    • conn - The odbc connection created with athena_connect
    • database - The database name
    • s3_bucket - The name of our S3 bucket
    • name - The name we want to assign to the table we are creating
    • df - The dataframe we are extracting the column names and types from
      (if you don’t have mtcars in your environment, load it with data(mtcars), or read in whatever csv you uploaded to S3 instead so they match)


athena <- mobstr::athena_connect("default")

mobstr::athena_load(athena, "default", "blog-s3-test-example", "mtcars", mtcars)
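
To sanity-check that the table was created, you can query it through the same connection with a plain DBI call (nothing mobstr-specific here):


DBI::dbGetQuery(athena, "SELECT * FROM default.mtcars LIMIT 5")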


mobstr Function Code:

  • This package is in development as the Mechanical Options Backtesting System w/ Tidy R (mobstr), so I’ll include the code here so that you will not need to load the package to follow this process.


# mobstr::write_2_S3
write_2_S3 <- function(file, path, bucket) {
  file_path <- paste0(path, file, ".csv")   # Local path of the csv to upload
  file_name <- paste0(file, ".csv")         # Object name inside the bucket folder
  bucket_name <- paste0(bucket, "/", file)  # Bucket plus a folder named after the file
  region_name <- Sys.getenv("AWS_S3_REGION")

  aws.s3::put_bucket(bucket_name)

  aws.s3::put_object(file = file_path,
                     object = file_name,
                     bucket = bucket_name,
                     region = region_name)
}

# mobstr::athena_connect
athena_connect <- function(database) {
  DBI::dbConnect(
    odbc::odbc(),
    driver = Sys.getenv("AWS_ATHENA_DRIVER_FILE"),  # Full path to the installed Athena ODBC driver
    Schema = database,
    AwsRegion = Sys.getenv("AWS_DEFAULT_REGION"),
    UID = Sys.getenv("AWS_ACCESS_KEY_ID"),
    PWD = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
    S3OutputLocation = Sys.getenv("AWS_ATHENA_S3_OUTPUT")  # S3 location for Athena query results
  )
}

# mobstr::athena_load
athena_load <- function(conn, database, s3_bucket, name, df) {
  # Build the column definitions from the dataframe's names and R types
  # (typeof() returns R type names; for mtcars every column is "double",
  # which Athena accepts as a column type)
  table_vars <- paste0("`", names(df), "` ", sapply(df, typeof), ",", collapse = '')
  table_vars <- sub(",$", "", table_vars)

  # SQL create table statement in Athena (note the trailing spaces so the
  # pasted clauses don't run together into invalid SQL)
  DBI::dbExecute(conn, paste0("CREATE EXTERNAL TABLE IF NOT EXISTS ",
                              database, ".", name, " (", table_vars, " ) ",
                              "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' ",
                              "WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',' ) ",
                              "LOCATION 's3://", s3_bucket, "/", name, "/' ",
                              "TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');"))
}
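
As a quick illustration of what athena_load builds, here is the column-definition string evaluated for mtcars (every mtcars column is a double, which Athena accepts as a column type; output truncated for space):


data(mtcars)
table_vars <- paste0("`", names(mtcars), "` ", sapply(mtcars, typeof), ",", collapse = '')
sub(",$", "", table_vars)
# "`mpg` double,`cyl` double,`disp` double, ... ,`carb` double"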



That’s it, did it work?



We can now connect, read the table, and bring the data back into R to view it


athena_cars <- mobstr::athena_connect("default") %>%
  dplyr::tbl("mtcars") %>%
  dplyr::filter(mpg >= 25) %>%
  dplyr::collect()

knitr::kable(athena_cars, digits = 2, format = "html",
             caption = "Athena Cars",
             col.names = c("MPG", "CYL", "DISP", "HP", "DRAT", "WT", "QSEC",
                           "VS", "AM", "GEAR", "CARB"),
             escape = FALSE,
             align = rep("r", 11)) %>%
  kableExtra::kable_styling(., bootstrap_options = "striped", position = "center",
                            full_width = FALSE) %>%
  kableExtra::column_spec(., 1:11, width = "1.5in")
Table 1: Athena Cars

  MPG  CYL   DISP   HP  DRAT    WT   QSEC  VS  AM  GEAR  CARB
 32.4    4   78.7   66  4.08  2.20  19.47   1   1     4     1
 30.4    4   75.7   52  4.93  1.62  18.52   1   1     4     2
 33.9    4   71.1   65  4.22  1.84  19.90   1   1     4     1
 27.3    4   79.0   66  4.08  1.94  18.90   1   1     4     1
 26.0    4  120.3   91  4.43  2.14  16.70   0   1     5     2
 30.4    4   95.1  113  3.77  1.51  16.90   1   1     5     2


If you have suggestions or any other feedback, please reach out via the contact links in the sidebar

Best,

Jason