Being able to offload intensive data processing or analysis jobs to cloud infrastructure can come in handy when you are limited by the resources of your local machine or when you need a clean and reproducible execution environment. For most of my projects I rely on DigitalOcean infrastructure, and the analogsea package offers an easy way to interact with the DigitalOcean API from R. In this post I will show how to create and take down servers on demand, how to transfer data, and how to remotely execute your R code.
All code in this post is available at https://github.com/pieterprovoost/notebook-analogsea.
Getting started
To be able to use the analogsea package, you will need a DigitalOcean account. DigitalOcean does not have a free plan, but clicking this link gives you $100 of credit, which goes a long way if you only spin up servers when you actually need them.
Once you have created your account, go to https://cloud.digitalocean.com/settings/api/tokens/new to generate a new personal access token. This token needs to be stored in the DO_PAT environment variable, which can be set using a .Renviron file or with Sys.setenv(). For example:
Sys.setenv(DO_PAT = "MY SUPER SECRET PERSONAL ACCESS TOKEN")
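If you prefer the .Renviron route, the equivalent is a single line in that file (this is a plain key-value pair, not R code; substitute your actual token and restart your R session so the variable gets picked up):
DO_PAT=your_personal_access_token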
You will also need to add your public SSH key to https://cloud.digitalocean.com/ssh_keys. If you don't have a key or are not sure where to find it, try this guide.
Installing analogsea
At the time of writing the CRAN version of analogsea is a bit outdated, so I'm using the latest development version:
devtools::install_github("sckott/analogsea")
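To check that both the package and your token are working, you can query your account details. This is just a quick sanity check and not required for the rest of the post; account() should return your account information rather than an authentication error:
library(analogsea)
account()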
Spinning up a server
To spin up a server (or "droplet" in DigitalOcean lingo) we can use droplet_create(). By default this starts a 1 CPU, 1 GB RAM, 25 GB storage server running Ubuntu 18.04 in the San Francisco region, which costs just $0.007 per hour to run. To get an overview of all available server specifications, use sizes(per_page = 1000). Once the server is running, we want to update the package lists and install R. However, droplet_create() tends not to return a server IP address right away, and another call to the DigitalOcean API is required to get one. This can be done by passing the droplet ID to the droplet() function. While that will get you an IP address, the server may not be accepting SSH connections yet, so I'm using this helper function to first get an IP address and then wait for port 22 to be available:
# wait until a new droplet has a public IP address and accepts SSH connections
wait_for_droplet <- function(d) {
  repeat {
    # fetch the droplet again until the API reports an IPv4 network
    d <- droplet(d$id)
    if (length(d$networks$v4) > 0) {
      message("Network up...")
      break
    }
    Sys.sleep(1)
  }
  repeat {
    # try to open a connection to port 22 until SSH is up
    con <- try(socketConnection(d$networks$v4[[1]]$ip_address, 22, blocking = TRUE, timeout = 1), silent = TRUE)
    if (!inherits(con, "try-error")) {
      message("Port 22 listening...")
      close(con)
      return(d)
    }
    Sys.sleep(1)
  }
}
Now we can spin up a server and install R in one go:
d <- droplet_create() %>%
  wait_for_droplet() %>%
  droplet_ssh("apt-get update") %>%
  debian_install_r()
This will give you a droplet ready to accept R jobs:
> d
<droplet>FormativeScenery (207731869)
IP: 138.197.200.42
Status: active
Region: San Francisco 2
Image: 18.04 (LTS) x64
Size: s-1vcpu-1gb
Volumes:
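The defaults shown above are fine for small jobs, but for heavier workloads you can request a bigger machine by passing an explicit size to droplet_create(). The slug values in this sketch are only examples; look up the current sizes and regions with sizes() and regions():
d_big <- droplet_create(
  name = "r-worker",
  size = "s-4vcpu-8gb",
  region = "sfo2",
  image = "ubuntu-18-04-x64"
) %>%
  wait_for_droplet() %>%
  droplet_ssh("apt-get update") %>%
  debian_install_r()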
Remotely executing R code
Just as an example, I will upload a CSV version of the Boston housing prices dataset to the server and fit a regression tree to predict the median housing price (read more about regression trees here). I'm also using the tictoc package to log the execution time of my analysis.
results <- d %>%
  droplet_upload("boston.csv", "boston.csv") %>%
  droplet_execute({
    # everything in this block is executed on the droplet, not locally
    install.packages("rpart")
    install.packages("tictoc")
    library(rpart)
    library(tictoc)
    tic()
    fit <- rpart(medv ~ ., data = read.csv("boston.csv"), method = "anova", model = TRUE)
    timing <- toc()
  })
droplet_execute() will run our R code and return the resulting R environment as a list.
> results$timing
$tic
elapsed
10.848
$toc
elapsed
10.883
$msg
logical(0)
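The tic and toc values are elapsed times in seconds, so the run time of the model fit itself is just their difference:
> results$timing$toc - results$timing$tic
elapsed 
  0.035 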
> results$fit
n= 506
node), split, n, deviance, yval
* denotes terminal node
1) root 506 42716.3000 22.53281
2) rm< 6.941 430 17317.3200 19.93372
4) lstat>=14.4 175 3373.2510 14.95600
8) crim>=6.99237 74 1085.9050 11.97838 *
9) crim< 6.99237 101 1150.5370 17.13762 *
5) lstat< 14.4 255 6632.2170 23.34980
10) dis>=1.5511 248 3658.3930 22.93629
20) rm< 6.543 193 1589.8140 21.65648 *
21) rm>=6.543 55 643.1691 27.42727 *
11) dis< 1.5511 7 1429.0200 38.00000 *
3) rm>=6.941 76 6059.4190 37.23816
6) rm< 7.437 46 1899.6120 32.11304
12) lstat>=9.65 7 432.9971 23.05714 *
13) lstat< 9.65 39 789.5123 33.73846 *
7) rm>=7.437 30 1098.8500 45.09667 *
The fitted tree is now available in our local session, so we can plot it with the rpart.plot package:
library(rpart.plot)
prp(results$fit)
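Because the fitted model is an ordinary object in our local session, we can also persist it for later use (boston_rpart.rds is just a file name picked for this example):
saveRDS(results$fit, "boston_rpart.rds")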
Once we get the results back from the server, we can take the droplet down using droplet_delete():
droplet_delete(d)
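Since droplets keep billing for as long as they exist, it is worth double checking that nothing is left running. A small sketch that lists all remaining droplets on the account and deletes them (careful: this removes everything):
lapply(droplets(), droplet_delete)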