Using R with AWS Athena

Summary:

Complete guide to setting up AWS S3 and Athena for use with #RSTATS



AWS User Account Creation:



User Account Step 1:

  • Log into the AWS site and navigate to the IAM Management Console, found under the Security, Identity, & Compliance section
  • Next, go to Users and choose Add user
  • Enter a name for the new user and select the “Programmatic access” type




User Account Step 2:

Create a group so that you can assign the same permissions to additional users in the future




User Account Step 3:

  • In this example I created a group named “blogger”
  • Attach the following policies:
    • AWSQuicksightAthenaAccess
    • AmazonS3FullAccess
    • AmazonAthenaFullAccess




User Account Step 4:

Make sure to add the new user to the group you just created



User Account Step 5:

  • The last step shows that you successfully created the new user and provides you with the account’s Access key ID and Secret access key
  • Write these down: the access key will remain visible in the user account details screen, but the secret key is only shown here and will need to be reset if you lose it after leaving this screen.





AWS S3 bucket creation:



S3 Bucket Step 1:

  • Navigate to the S3 Console, found under the Storage section
  • Choose Create bucket



S3 Bucket Step 2:

  • Give the bucket a unique and informative name
  • Take note of the region you are working in; in this case, US West (Oregon)
  • For the R configuration later, this is “us-west-2”



S3 Bucket Step 3:

You can choose versioning, logging, etc., but I’ve left them all blank for this example



S3 Bucket Step 4:

Enable the four check boxes to keep the bucket private, if they are not already set



S3 Bucket Step 5:

Review the settings you chose and click Create bucket to complete the process



S3 Bucket Step 6:

Back on the S3 console you will now see the new bucket, listed as private and in the US West region

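
As an aside, once your AWS credentials are configured in R (next section), you could also create the bucket straight from R with the aws.s3 package instead of the console; a minimal sketch:


aws.s3::put_bucket("blog-s3-test-example", region = "us-west-2")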



Let’s get to the fun part in R:



  • For the remainder of this post we will use R to:

    • create a csv file
    • connect to AWS and populate the S3 bucket we just created with the csv
    • load the csv we added to S3 into a table in Athena
    • then pull that data out of Athena back into R
  • We will use the following packages; install commands are shown just after this list:

    • tidyverse (CRAN)
    • here (CRAN)
    • knitr (CRAN)
    • kableExtra (CRAN)
    • mobstr (GitHub): remotes::install_github("themechanicalbear/mobstr")
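
If any of these packages are missing, a minimal install looks like the following (remotes is only needed for the GitHub install of mobstr):


install.packages(c("tidyverse", "here", "knitr", "kableExtra", "remotes"))
remotes::install_github("themechanicalbear/mobstr")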


knitr::opts_chunk$set(message = FALSE, eval = FALSE, 
                      tidy.opts = list(width.cutoff = 100))

# Load all of the required packages quietly
suppressWarnings(suppressMessages(suppressPackageStartupMessages({
  library_list <- c("tidyverse", "here", "mobstr", "knitr", "kableExtra")
  lapply(library_list, require, character.only = TRUE)})))



Add AWS Credentials to .Rprofile:



  • Open your .Rprofile and add the following, with the values specific to the account we created above. The AWS_S3_REGION, AWS_ATHENA_DRIVER_FILE, and AWS_ATHENA_S3_OUTPUT entries are read by the mobstr functions shown later in this post.

  • This will allow your scripts to connect to AWS resources without having to include your keys in the scripts themselves.


Sys.setenv("AWS_ACCESS_KEY_ID" = "YOUR_AWS_ACCESS_KEY_ID",
           "AWS_SECRET_ACCESS_KEY" = "YOUR_AWS_SECRET_ACCESS_KEY",
           "AWS_DEFAULT_REGION" = "us-west-2", # Whatever region the S3 bucket you created is in
           "AWS_S3_REGION" = "us-west-2", # Read by mobstr::write_2_S3 below
           "AWS_ATHENA_DRIVER_FILE" = "PATH_TO_YOUR_ATHENA_ODBC_DRIVER", # Read by mobstr::athena_connect below
           "AWS_ATHENA_S3_OUTPUT" = "s3://YOUR_ATHENA_QUERY_RESULTS_LOCATION") # Where Athena writes query results


  • The mobstr package includes the write_2_S3 function to write stock data to AWS S3 for analysis.
    • We can use this function to write any csv file we want.
    • First, let’s create a sample csv to upload; here we’ll use the mtcars dataset.
    • I am saving this file to the static data directory so it is included in the blog folder structure; you don’t need to do that.

Athena reads every file in a table’s folder, so each table needs its own folder; the write_2_S3 function therefore creates a new folder inside our blog-s3-test-example bucket to hold our csv


readr::write_csv(mtcars, paste0(here::here(), "/static/data/mtcars.csv"))

mobstr::write_2_S3(file = "mtcars", 
                   path = paste0(here::here(), "/static/data/"),
                   bucket = "blog-s3-test-example")



Check AWS S3 for our file!



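
The bucket now contains an mtcars folder with mtcars.csv inside. If you would rather verify from R than the console, aws.s3::get_bucket() lists a bucket’s contents:


# The mtcars csv uploaded above should appear in this listing
aws.s3::get_bucket("blog-s3-test-example")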



Create the table in Athena



  • The mobstr package includes the athena_connect function to connect to the Athena service, given a database name. We are going to use the default database here.
  • The mobstr package also includes the athena_load function, which uses the connection to populate our table.
  • athena_load takes the following arguments:
    • conn - The odbc connection created with athena_connect
    • database - The database name
    • s3_bucket - The name of our S3 bucket
    • name - The name we want to assign to the table we are creating
    • df - The dataframe we are extracting the column names and types from
      (if you don’t have mtcars in your environment, load it with data(mtcars), or read in whatever csv you uploaded to S3 instead so they match)


athena <- mobstr::athena_connect("default")

mobstr::athena_load(athena, "default", "blog-s3-test-example", "mtcars", mtcars)
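
To sanity-check that the table was created, you can query it through the same connection with a plain DBI call (nothing mobstr-specific here):


DBI::dbGetQuery(athena, "SELECT * FROM default.mtcars LIMIT 5")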


mobstr Function Code:

  • This package is in development as the Mechanical Options Backtesting System w/ Tidy R (mobstr), so I’ll include the code here so that you will not need to load the package to follow this process.


# mobstr::write_2_S3
write_2_S3 <- function(file, path, bucket) {
  file_path <- paste0(path, file, ".csv")   # Local path of the csv to upload
  file_name <- paste0(file, ".csv")         # Object name inside the bucket folder
  bucket_name <- paste0(bucket, "/", file)  # Bucket plus a folder named after the file
  region_name <- Sys.getenv("AWS_S3_REGION")

  aws.s3::put_bucket(bucket_name)

  aws.s3::put_object(file = file_path,
                     object = file_name,
                     bucket = bucket_name,
                     region = region_name)
}

# mobstr::athena_connect
athena_connect <- function(database) {
  DBI::dbConnect(
    odbc::odbc(),
    driver = Sys.getenv("AWS_ATHENA_DRIVER_FILE"),  # Full path to the installed Athena ODBC driver
    Schema = database,
    AwsRegion = Sys.getenv("AWS_DEFAULT_REGION"),
    UID = Sys.getenv("AWS_ACCESS_KEY_ID"),
    PWD = Sys.getenv("AWS_SECRET_ACCESS_KEY"),
    S3OutputLocation = Sys.getenv("AWS_ATHENA_S3_OUTPUT")  # S3 location for Athena query results
  )
}

# mobstr::athena_load
athena_load <- function(conn, database, s3_bucket, name, df) {
  # Build the column definitions from the dataframe's names and R types
  # (typeof() returns R type names; for mtcars every column is "double",
  # which Athena accepts as a column type)
  table_vars <- paste0("`", names(df), "` ", sapply(df, typeof), ",", collapse = '')
  table_vars <- sub(",$", "", table_vars)

  # SQL create table statement in Athena (note the trailing spaces so the
  # pasted clauses don't run together into invalid SQL)
  DBI::dbExecute(conn, paste0("CREATE EXTERNAL TABLE IF NOT EXISTS ",
                              database, ".", name, " (", table_vars, " ) ",
                              "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' ",
                              "WITH SERDEPROPERTIES ('serialization.format' = ',', 'field.delim' = ',' ) ",
                              "LOCATION 's3://", s3_bucket, "/", name, "/' ",
                              "TBLPROPERTIES ('has_encrypted_data'='false', 'skip.header.line.count'='1');"))
}
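
As a quick illustration of what athena_load builds, here is the column-definition string evaluated for mtcars (every mtcars column is a double, which Athena accepts as a column type; output truncated for space):


data(mtcars)
table_vars <- paste0("`", names(mtcars), "` ", sapply(mtcars, typeof), ",", collapse = '')
sub(",$", "", table_vars)
# "`mpg` double,`cyl` double,`disp` double, ... ,`carb` double"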



That’s it, did it work?



We can now connect, read the table, and bring the data back into R to view it


athena_cars <- mobstr::athena_connect("default") %>%
  dplyr::tbl("mtcars") %>%
  dplyr::filter(mpg >= 25) %>%
  dplyr::collect()

knitr::kable(athena_cars, digits = 2, format = "html",
             caption = "Athena Cars",
             col.names = c("MPG", "CYL", "DISP", "HP", "DRAT", "WT", "QSEC",
                           "VS", "AM", "GEAR", "CARB"),
             escape = FALSE,
             align = rep("r", 11)) %>%
  kableExtra::kable_styling(., bootstrap_options = "striped", position = "center",
                            full_width = FALSE) %>%
  kableExtra::column_spec(., 1:11, width = "1.5in")
Table 1: Athena Cars

  MPG  CYL   DISP   HP  DRAT    WT   QSEC  VS  AM  GEAR  CARB
 32.4    4   78.7   66  4.08  2.20  19.47   1   1     4     1
 30.4    4   75.7   52  4.93  1.62  18.52   1   1     4     2
 33.9    4   71.1   65  4.22  1.84  19.90   1   1     4     1
 27.3    4   79.0   66  4.08  1.94  18.90   1   1     4     1
 26.0    4  120.3   91  4.43  2.14  16.70   0   1     5     2
 30.4    4   95.1  113  3.77  1.51  16.90   1   1     5     2


If you have suggestions or any other feedback, please reach out via the contact links in the sidebar

Best,

Jason