Offloading R jobs to cloud infrastructure using analogsea

September 01, 2020

Being able to offload intensive data processing or analysis jobs to cloud infrastructure can come in handy when you are limited by the resources of your local machine or when you need a clean and reproducible execution environment. For most of my projects I rely on DigitalOcean infrastructure, and the analogsea package offers an easy way to interact with the DigitalOcean API from R. In this post I will show how to create and take down servers on demand, how to transfer data, and how to remotely execute your R code.

All code in this post is available at https://github.com/pieterprovoost/notebook-analogsea.

Getting started

To use the analogsea package, you will need a DigitalOcean account. DigitalOcean does not have a free plan, but clicking this link will get you $100 of credit, which goes a long way if you only spin up servers when you actually need them.

Once you have created your account, go to https://cloud.digitalocean.com/settings/api/tokens/new to generate a new personal access token. This token needs to be stored in the DO_PAT environment variable. Environment variables can be set using a .Renviron file, or using Sys.setenv(). For example:

Sys.setenv(DO_PAT = "MY SUPER SECRET PERSONAL ACCESS TOKEN")
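
Alternatively, add the token to your .Renviron file and restart your R session:

DO_PAT=MY SUPER SECRET PERSONAL ACCESS TOKEN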

You will also need to add your public SSH key to https://cloud.digitalocean.com/ssh_keys. If you don't have a key or are not sure where to find it, try this guide.
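
If you prefer to stay in R, you can also register your key through the API. A minimal sketch using analogsea's key_create(), assuming your public key lives at the default ~/.ssh/id_rsa.pub location (the key name is arbitrary):

library(analogsea)

# register an existing public SSH key with your DigitalOcean account
key_create(
  name = "my-laptop",
  public_key = readLines("~/.ssh/id_rsa.pub")
)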

Installing analogsea

At the time of writing the CRAN version of analogsea is a bit outdated, so I'm using the latest development version:

devtools::install_github("sckott/analogsea")

Spinning up a server

To spin up a server (or "droplet" in DigitalOcean lingo) we can use droplet_create(). By default this will start a 1 CPU, 1 GB RAM, 25 GB storage server running Ubuntu 18.04 in the San Francisco region, which costs just $0.007 per hour to run. To get an overview of all available server specifications, use sizes(per_page = 1000).

Once the server is running, we want to update our package lists and install R. However, droplet_create() tends to not return a server IP address right away, and another call to the DigitalOcean API is required to get one. This can be done by passing the droplet ID to the droplet() function. While that will get you an IP address, the server may not be accepting SSH connections yet, so I'm using the helper function below to first get an IP address and then wait for port 22 to be available:

library(analogsea)

# wait until a new droplet has an IP address and accepts SSH
# connections, then return the refreshed droplet object
wait_for_droplet <- function(d) {
  repeat {
    # poll the API until the network information is populated
    d <- droplet(d$id)
    if (length(d$networks$v4) > 0) {
      message("Network up...")
      break
    }
    Sys.sleep(1)
  }
  repeat {
    # try to open a TCP connection to port 22 (SSH)
    con <- try(socketConnection(d$networks$v4[[1]]$ip_address, 22, blocking = TRUE, timeout = 1), silent = TRUE)
    if (!inherits(con, "try-error")) {
      message("Port 22 listening...")
      close(con)
      return(d)
    }
    Sys.sleep(1)
  }
}

Now we can spin up a server and install R in one go:

d <- droplet_create() %>%
  wait_for_droplet() %>%
  droplet_ssh("apt-get update") %>%
  debian_install_r()

This will give you a droplet ready to accept R jobs:

> d
<droplet>FormativeScenery (207731869)
  IP:        138.197.200.42
  Status:    active
  Region:    San Francisco 2
  Image:     18.04 (LTS) x64
  Size:      s-1vcpu-1gb
  Volumes:   
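
The defaults are fine for this post, but droplet_create() also accepts name, size, region and image arguments if your job needs more resources. A quick sketch with illustrative slugs; sizes() and regions() list the valid values:

# inspect the available configurations and their prices
sizes(per_page = 1000)

# a beefier droplet in the Amsterdam region (slugs are illustrative)
d_big <- droplet_create(
  name = "r-worker",
  size = "s-2vcpu-4gb",
  region = "ams3",
  image = "ubuntu-18-04-x64"
)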

Remotely executing R code

Just as an example, I will upload a CSV version of the Boston housing prices dataset to the server and fit a regression tree to predict the median housing price (read more about regression trees here). I'm also using the tictoc package to log the execution time of my analysis.

results <- d %>%
  droplet_upload("boston.csv", "boston.csv") %>%
  droplet_execute({
    install.packages("rpart")
    install.packages("tictoc")
    library(rpart)
    library(tictoc)
    tic()
    fit <- rpart(medv ~ ., data = read.csv("boston.csv"), method = "anova", model = TRUE)
    timing <- toc()
  })

droplet_execute() will run our R code on the droplet and return the resulting R environment as a list, so both the timing and the fitted model are available locally:

> results$timing
$tic
elapsed 
 10.848 

$toc
elapsed 
 10.883 

$msg
logical(0)

> results$fit
n= 506 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 506 42716.3000 22.53281  
   2) rm< 6.941 430 17317.3200 19.93372  
     4) lstat>=14.4 175  3373.2510 14.95600  
       8) crim>=6.99237 74  1085.9050 11.97838 *
       9) crim< 6.99237 101  1150.5370 17.13762 *
     5) lstat< 14.4 255  6632.2170 23.34980  
      10) dis>=1.5511 248  3658.3930 22.93629  
        20) rm< 6.543 193  1589.8140 21.65648 *
        21) rm>=6.543 55   643.1691 27.42727 *
      11) dis< 1.5511 7  1429.0200 38.00000 *
   3) rm>=6.941 76  6059.4190 37.23816  
     6) rm< 7.437 46  1899.6120 32.11304  
      12) lstat>=9.65 7   432.9971 23.05714 *
      13) lstat< 9.65 39   789.5123 33.73846 *
     7) rm>=7.437 30  1098.8500 45.09667 *

Back on our local machine, we can visualize the fitted regression tree with the rpart.plot package:

library(rpart.plot)
prp(results$fit)

[Plot: regression tree fitted to the Boston housing data]

Once we get the results back from the server, we can take it down using droplet_delete():

droplet_delete(d)
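
To make sure nothing is left running (and billing you), droplets() lists all droplets on your account; after deletion our server should no longer show up:

droplets()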