Remote terminal environment using VS Code for Windows and Mac

On Windows machines, MobaXterm is a valuable tool for computing on virtual machines and working over SSH. David Gold’s blog post walks through the installation and use of this app, which works well in Windows environments.

Working remotely on my Mac laptop, I have been struggling to achieve the same workflow I had in the office on a Windows machine. Unfortunately, MobaXterm is not available for macOS. Looking for alternatives, I discovered that VS Code with the “Remote – SSH” extension is a great replacement, with significant advantages over MobaXterm: it is an SSH client interface and code editor in one.

A screenshot from my VS Code remote interface, with the graphical file browser on the left panel, the SSH server terminal on the bottom-right, and the VS Code editor on the top-right.

Here’s how you can set up a remote session on Mac (and Windows) using VS Code: 

  1. Install the VS Code application here. For installation help and a brief overview of the app, check out this video.
  2. With VS Code opened, go to View -> Extensions, and search “Remote – SSH.” Click on the extension and press the green “Install” button. You should see the message “This extension is enabled globally” appear. Check out this extension’s description below (I’ll run through the basics in this post).
  3. On the bottom left of your screen, there should be a small green box with two opposite-pointing arrowheads. Click this.
The green box is the Remote – SSH extension.
  4. Choose the first pop-up option “Remote-SSH: Connect to host…” and then select “Add New SSH Host…”.
Click the first box and then the “Add New SSH Host” button to connect to your SSH client.
  5. Here, enter your remote SSH username@serverid (here at Cornell, this would be yournetid@thecube.cac.cornell.edu to connect to our remote computing cluster, the Cube). The extension saves this host to your SSH configuration file (see the sketch after this list).
  6. In the same pop-up window, click the remote server that you just added. A new window will open and prompt you to enter your password for the server.
  7. Now you are in your remote SSH environment. Click “Open folder…” and select “OK” to see your remote directory on the left. You can navigate through the files on your remote machine the same way as in MobaXterm. Click View -> Terminal to see your SSH command line at the bottom of the screen (this is where you can actually run programs on your cluster).
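When you add a new host, the Remote – SSH extension writes an entry to your SSH configuration file (typically ~/.ssh/config). A minimal sketch of what that entry looks like, with placeholder values:

Host thecube
    HostName thecube.cac.cornell.edu
    User yournetid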

Now using VS Code, you can install other extensions to aid in code editing in different languages (here’s an article with a few good ones for various uses). This environment has the same functionality as MobaXterm, without having to switch applications for editing code. Run your cluster programs in the terminal window and edit the code in the main VS Code editor!

Publication-Quality Illustrations in R

There are many ways to create high-quality figures for scientific publications, such as user-friendly, code-free tools like Adobe Illustrator. There are also some informative posts on the Water Programming blog that go over the basics of Illustrator and different tricks for making decent figures in that program (for example, here and here). One issue with Adobe Illustrator is that it’s not free, although there are open-source design alternatives such as Inkscape. A broader problem with using these types of programs for figure design, though, is that figures can be challenging to produce and time-consuming to reproduce with code-free methods.

The ggpubr package in R provides a script-based way to create and combine high-quality figures. The advantages of this method are that it is free, the figures are easily reproducible, and combining figures is not very time-consuming if you are comfortable coding in R.

A few drawbacks do exist, however. For one, there are several ways to make figures in R, but when using ggpubr, all of the figures need to be in ggplot format, which rules out other figure formats in R. The other disadvantage is that ggpubr might not be as flexible as Illustrator for customizing the details of figures. In this blog post, I explain some of the main features of ggpubr and show some examples of how figures can be created using this method. Since ggpubr builds on ggplot2, if you are not familiar with ggplot2 it might be helpful to take a look at some ggplot2 basics first (for example, here, here, and here).

In this blog post, I use R’s “airquality” dataset to showcase some of the applications of ggpubr. You can use the following code to load the airquality dataset.

# The airquality dataset ships with base R (the datasets package), so no extra package is needed
data(airquality)

# You can use the head() function to look at the first 6 rows of our dataset
head(airquality)

I used this dataset to create seven different ggplot2 figures and then I used ggpubr to merge them into a combined figure.

library(ggplot2)
# uncomment the next line if you haven't installed ggpubr yet 
# install.packages("ggpubr") 
library(ggpubr) 

p1<-ggplot(airquality, aes(x=as.factor(Month), y=Ozone, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() + theme(legend.title = element_blank()) +
  labs(x="Month", y="Ozone concentration (PPM)", title = "a. Monthly Ozone concentration in New York")

p2<-ggplot(airquality, aes(x=as.factor(Month), y=Solar.R, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() +theme(legend.title = element_blank()) +
  labs(x="Month", y="Solar Radiation (W/m2)", title = "b. Monthly Solar Radiation in New York")

p3<-ggplot(airquality, aes(x=as.factor(Month), y=Temp, fill=as.factor(Month))) + geom_boxplot() + theme_bw() + 
  scale_fill_hue() +theme(legend.title = element_blank()) +
  labs(x="Month", y="Temperature (Fahrenheit)", title = "c. Monthly Tmperature in New York")


p4<-ggplot(airquality, aes(x=Solar.R, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() +
scale_color_hue() +theme(legend.title = element_blank()) +
  labs(x="Solar Radiation (W/m2)", y="Ozone concentration (PPM)", title = "d. Ozone concentration vs. solar radiation")

p5<-ggplot(airquality, aes(x=Wind, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Wind Speed (Miles per Hour)", y="Ozone concentration (PPM)", title = "e. Ozone concentration vs. wind speed")

p6<-ggplot(airquality, aes(x=Temp, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Temperature (Fahrenheit)", y="Ozone concentration (PPM)", title = "f. Ozone concentration vs. temperature") 



p7<-ggplot(airquality, aes(x=Ozone)) + geom_density(fill="gold") + theme_bw() + theme(legend.title = element_blank()) +
  labs(x="Ozone concentration (PPM)", y="Density", title = "g. PDF of Ozone concentration in New York") +
  annotate("this is an example")

Here is an example of these plots (p1):

The main function of ggpubr (in my opinion, at least) is ggarrange. This function provides a nice and fast way to merge different graphs and tables: if you use it, you don’t need to worry about whether the different panels of your plot are properly aligned. If it’s applicable to your plot, you can also include a common legend.

# ggarrange is used to combine p1 to p7
combined_plot<-ggarrange(
  ggarrange(p1, p2, p3, p4, p5, p6, ncol = 2, nrow = 3,
            common.legend = TRUE),  # TRUE: remove the individual legends and add a common legend
  p7, ncol = 1, nrow = 2,
  heights = c(2, 0.7))

# ggsave can be used to save ggplot and ggpubr figures.
ggsave("YOUR FAVORITE DIRECTORY/CombinedPlot.png", combined_plot, width = 12, height = 14) 

Keep in mind that ggpubr provides some extra plotting functions that are slightly different from the ggplot2 functions. For example, instead of geom_density, you can use ggdensity.
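As a minimal sketch of the ggpubr equivalent of the density plot above (the fill color and axis labels here are just illustrative):

# ggpubr's ggdensity() as an alternative to the geom_density() call used for p7
p7_alt <- ggdensity(airquality, x = "Ozone", fill = "gold",
                    xlab = "Ozone concentration (PPM)", ylab = "Density")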

Also, although ggplot2 provides several different plot themes, ggpubr has its own publication-style theme. The following can be used to apply that theme to a graph. Note that ggpubr’s own plotting functions use this theme by default.
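For example, a minimal illustration applied to the first panel from above:

# apply ggpubr's publication-style theme to an existing ggplot2 figure
p1_pub <- p1 + theme_pubr()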

Sometimes we need to draw our readers’ attention to a specific part of a graph or add some extra information to the figure. You can use the annotate() function to do that.

p6<-ggplot(airquality, aes(x=Temp, y=Ozone, color=as.factor(Month))) + geom_point(size=2) + theme_bw() + 
  scale_color_hue() + theme(legend.title = element_blank()) +
  labs(x="Temperature (Fahrenheit)", y="Ozone concentration (PPM)", title = "f. Ozone concentration vs. temperature") 

p6<-p6+annotate( "rect", xmin = 80, xmax = 95, ymin = 50, ymax = 100, alpha = .3, color="black", fill="grey")+
  annotate("text", x=67, y=80, label="This is how you can add extra \n information to the plot", fontface=2)

Determining the appropriate number of samples for a sensitivity analysis

Sensitivity analysis aims to assess the relative contributions of the different sources of uncertainty to the variability in the output of a model. Several mathematical techniques are available in the literature, with variance-based approaches being the most popular, and variance-based indices the most widely used, especially the “total effects” indices. The literature also provides several estimators/approximators for these indices (a review can be found in [1]), which typically need N = n × (M + 2) model evaluations (and the resulting outputs), where M is the number of uncertain inputs and n is some factor of proportionality that is selected ad hoc, depending on the complexity of the model (e.g., linear or not, monotonic or not, continuous or not). [Note: the literature typically refers to n as the number of samples and to N as the number of model evaluations, and this is how I’ll be using them as well.]

The generation of this set of N model runs is by far the most computationally demanding step in the calculation of variance-based indices, and a major limitation to their applicability, as n is typically on the order of thousands [2] (typically n > 1000). For example, consider a model with 5 uncertain inputs that takes 1 minute to run. To appropriately estimate sensitivity indices for such a model, we would potentially need about N = 1000 × (5 + 2) = 7000 model evaluations, which would take almost 5 days on a personal computer, excluding the time for the estimator calculation itself (which is typically very short).

The aim is therefore to pick the minimum n needed to ensure our index calculation is reliable. Unfortunately, there is no hard-and-fast rule for how to do that, but this post attempts to shed some light on the question and provide some guidance. I am going off the work presented in [3] and elsewhere, and the idea is to perform the estimation of the sensitivity indices repeatedly, using an increasing n, until the index values converge.

I will be doing this using a fishery model, which is a nonlinear and nonmonotonic model with 9 parameters. Based on previous results suggesting that 3 of these parameters are largely inconsequential to the variability in the output, I’ll be fixing them to their nominal values. I’ll be using SALib to perform the analysis. My full script can be found here, but I’ll only be posting the most important snippets of code in this post.

First, I set up my SALib ‘problem’ and create arrays to store my results:

# Set up dictionary with system parameters
problem = {
  'num_vars': 6,
  'names': ['a', 'b', 'c','h',
            'K','m'],
  'bounds': [[ 0.002, 2],
             [0.005, 1],
             [0.2, 1],
             [0.001, 1],
             [100, 5000],
             [0.1, 1.5]]
}

# Array with n's to use
nsamples = np.arange(50, 4050, 50)

# Arrays to store the index estimates
S1_estimates = np.zeros([problem['num_vars'],len(nsamples)])
ST_estimates = np.zeros([problem['num_vars'],len(nsamples)])
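As a quick check of the N = n × (M + 2) relationship mentioned above (an illustrative aside; this sample set is not used further):

from SALib.sample import saltelli

# with calc_second_order=False, saltelli.sample returns n*(M+2) parameter sets
test_set = saltelli.sample(problem, 1000, calc_second_order=False)
print(test_set.shape)  # expected: (8000, 6), i.e. N = 1000*(6+2) evaluations for 6 parameters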

I then loop through all possible n values and perform the sensitivity analysis:

# Loop through all n values, create sample, evaluate model and estimate S1 & ST
for i in range(len(nsamples)):
    print('n= '+ str(nsamples[i]))
    # Generate samples
    sampleset = saltelli.sample(problem, nsamples[i],calc_second_order=False)
    # Run model for all samples (fish_game is defined in the full script linked above)
    output = [fish_game(*sampleset[j,:]) for j in range(len(sampleset))]
    # Perform analysis
    results = sobol.analyze(problem, np.asarray(output), calc_second_order=False,print_to_console=False)
    # Store estimates
    ST_estimates[:,i]=results['ST']
    S1_estimates[:,i]=results['S1']

I can then plot the evolution of these estimates as n increases:
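The plotting code itself is not reproduced in this post, but a minimal sketch of how these estimates could be visualized looks something like this:

import matplotlib.pyplot as plt

# one panel per index type, one line per parameter
fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharex=True)
for p in range(problem['num_vars']):
    axes[0].plot(nsamples, S1_estimates[p, :], label=problem['names'][p])
    axes[1].plot(nsamples, ST_estimates[p, :], label=problem['names'][p])
axes[0].set_title('First order indices (S1)')
axes[1].set_title('Total order indices (ST)')
for ax in axes:
    ax.set_xlabel('Number of samples (n)')
    ax.set_ylabel('Index estimate')
axes[1].legend()
plt.savefig('index_convergence.png')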

Evolution of first order and total order indices with increasing number of samples (n)

What these figures tell us is that choosing an n below 1000 for this model could lead us to misestimate our indices, especially the first-order ones (S1). As n increases, we see the estimates become less noisy and converge to their final values. For more complex models, say with more interactive effects, the minimum n before convergence could be a lot higher. A similar experiment by Nossent et al. [3] found that convergence was reached only after n = 12,000.

An observation here is that the values of the total-order indices (ST) are higher than those of the first-order indices (S1), which makes sense, as ST includes the first-order effects (captured by S1) plus the effects of interactions between the factors. Another observation is that the parameters with the most significant effects (m and K) converge much faster than the parameters with less impact on the output (a and b). This was also observed by Nossent et al. [3].

Finally, sensitivity analysis is often performed for the purposes of factor prioritization, i.e., determining (and often rank-ordering) the most important parameters, for example to decide where to devote most calibration effort in the model, or which parameter’s uncertainty to reduce through further analysis. The figures below show the evolution of that rank-ordering as we increase n.

Evolution of parameter ranking based on first order and total order indices with increasing number of samples (n)
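For reference, one way these rankings can be derived from the stored index estimates (a sketch; not necessarily how the original script computes them):

# rank parameters at each n (1 = most influential) based on the stored estimates
S1_ranks = np.zeros_like(S1_estimates, dtype=int)
ST_ranks = np.zeros_like(ST_estimates, dtype=int)
for i in range(len(nsamples)):
    # argsort of the negated estimates orders parameters from largest to smallest index
    S1_ranks[np.argsort(-S1_estimates[:, i]), i] = np.arange(1, problem['num_vars'] + 1)
    ST_ranks[np.argsort(-ST_estimates[:, i]), i] = np.arange(1, problem['num_vars'] + 1)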

These figures show that with too small a number of samples, we could misclassify a factor as important when it actually is not, or as unimportant when it actually is.

Now, one might ask: how is this useful? I’m trying to minimize my n, but you’ve just made me run way too many model evaluations, multiple times, just to determine how I could have done it faster? Isn’t that backwards?

Well, yes and no. I’ve devoted this time to running this bigger experiment to gain insight into the behavior of my model, and I have established confidence in my index values and factor prioritization. Further, I now know that using an n larger than about 1500 would probably be unnecessary for this system, even if the model itself or my parameter ranges change somewhat. As long as the parameter interactions and the model complexity remain roughly the same, I can leverage this information to perform future sensitivity analyses with a known minimum n.

References:
[1] A. Saltelli, P. Annoni, I. Azzini, F. Campolongo, M. Ratto, and S. Tarantola, “Variance based sensitivity analysis of model output. Design and estimator for the total sensitivity index,” Computer Physics Communications, vol. 181, no. 2, pp. 259–270, Feb. 2010, doi: 10.1016/j.cpc.2009.09.018.
[2] F. Pianosi and T. Wagener, “A simple and efficient method for global sensitivity analysis based on cumulative distribution functions,” Environmental Modelling & Software, vol. 67, pp. 1–11, May 2015, doi: 10.1016/j.envsoft.2015.01.004.
[3] J. Nossent, P. Elsen, and W. Bauwens, “Sobol’ sensitivity analysis of a complex environmental model,” Environmental Modelling & Software, vol. 26, no. 12, pp. 1515–1525, Dec. 2011, doi: 10.1016/j.envsoft.2011.08.010.

Parallel processing with R on Windows

Parallel programming can save you a lot of time when you are processing large amounts of data. Modern computers provide multiple processors and cores as well as hyper-threading, and R can take advantage of this to run multiple computations simultaneously. There is some discussion about when it is worth parallelizing, because there is no linear relationship between the number of processors and cores used simultaneously and computational efficiency. In this blog post, I am going to use two R packages that allow parallelization, for a basic example in which each instance of the computation is standalone, so there is no need for communication between the cores that are being used in parallel.

# "parallel" ships with base R; only doParallel needs to be installed
install.packages("doParallel")
library(doParallel)
library(parallel)

If you press Ctrl+Shift+Esc and click on the Performance tab in the Task Manager window, you will see how many logical processors (the combination of physical cores and hyper-threading) are available on your local Windows machine and can be used simultaneously for your analysis. We can also detect this number with the following command:

no_cores <- detectCores(logical = TRUE)  # returns the number of available hardware threads, and if it is FALSE, returns the number of physical cores

Now we need to create a cluster with this number of cores and register it so R can use it. If you dedicate all of the cores to R, you may have trouble doing anything else on your machine, so it is better not to use all of them.

cl <- makeCluster(no_cores-1)  
registerDoParallel(cl)  

First, we are going to create a list of files that we want to analyze. You can download the example dataset here.

all_samples<- as.data.frame(list.files("your directory/R_Parallel_example/"))
seq_id_all<- seq_along(1:nrow(all_samples))

Then, we will create a function that we will use to process our data. Each file in this set contains daily values of several variables over 31 years. Columns 1, 2, and 3 are year, month, and day, respectively. I am going to extract the yield value for each year from the column “OUT_CROP_BIOMYELD” and calculate the average yield over the entire period. All the libraries that you are going to use for processing your data should be loaded inside the function. I am going to use the “data.table” library to read my data into R efficiently. At the end of the function, I put the two outputs that I am interested in (“annual_yield” and “average_yield”) into one list and return it.

myfunction<- function(...) {
  i<-(...)
  library(data.table)
  setwd(paste("your directory/R_Parallel_example/"))
  sample<- fread(paste(all_samples[i,]))
      annual_yield<- subset(sample,sample$OUT_CROP_BIOMYELD>0)  # this column (OUT_CROP_BIOMYELD) is always zero except when the yield is reported which should be a value above zero.
      annual_yield$No<-  as.numeric(gsub("_47.65625_-117.96875","",all_samples[i,]))  # extract some part of the file name, use it as an identification for this dataset
      annual_yield<- annual_yield[,c(1,13,17)]  # extract just “Year”,”Yield” and “No” columns.
      colnames(annual_yield)<- c("Year","Yeild","No")
  average_yield<- colMeans(annual_yield[,c("Yeild","No")])  #calculate average year for each dataset
  return( list(annual_yield,average_yield))
}
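Before sending the function to the cluster, it can be worth testing it on a single index locally (an optional sanity check, not part of the original workflow):

# quick local test on the first file before running everything in parallel
test_result <- myfunction(1)
str(test_result)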

Now, we need to export our function to the cluster. Because the function uses the “all_samples” data frame, which was created outside the function, that object also needs to be exported to the cluster:

clusterExport(cl,list('myfunction','all_samples'))

With the command below, we run the function across the cluster that we registered earlier, and with “system.time” the processing time is printed at the end:

system.time(
  results <- parLapply(cl, seq_id_all, fun = myfunction)
)

The function outputs are saved in the list “results” that we can extract:

k_1<- list()
k_2<- list()
for (k in 1: nrow(all_samples)){
  k_1[[k]]<- results[[k]][[1]]
  k_2[[k]]<- results[[k]][[2]]
}
annual_yield<- data.table::rbindlist(k_1)
period_yield<- data.table::rbindlist(k_2)

Using Docker with Virtual Machines

This post is a continuation of my previous post on accessing virtual machines on RedCloud. In this post, you will learn how to run a Docker container on an Ubuntu VM image, but you can also follow this tutorial with Ubuntu on a local machine.

Why Containerize Code?

Let’s use an example: say you have three Python applications that you want to run on your computer, but they all use different versions of Python or its packages. You cannot host these applications at the same time using the Python installed on your computer… so what do you do? Consider another case: you want to share neural network code with a collaborator, but you don’t want to deal with how difficult it is for them to get TensorFlow and all of its dependencies installed on their machine. Containerization allows us to isolate the installation and running of our application without having to worry about the setup of the host machine. In other words, everything someone needs to run your code is provided in the container. As shown in Figure 1, each container shares the host OS’s kernel but has a virtual copy of some of the components of the OS. This allows the containers to be isolated from each other while running on the same host. As a result, containers are exceptionally lightweight and take only seconds to start [1].


Fig 1. Docker Container Example [1]

Docker

Docker started as a project to build single-application LXC (Linux) containers that were more portable and flexible, but it is now its own container runtime environment. It is written in Go and was developed by dotCloud (a PaaS company). The Docker engine is used to build and manage a Docker image, which is just a template that contains the application and all of the dependencies required for it to run. A Docker container is a running instance of a Docker image. The Docker engine is composed of three main parts: a server known as the Docker daemon, a REST API, and a client. The Docker daemon creates and manages images and containers, the REST API links the server and applications, and the client (user) interacts with the Docker daemon through the command line (Figure 2).


Fig.2 Docker Engine [2]


Running a Docker Container on Ubuntu

  1. We will be setting up and running a Docker container that contains code to train a rainfall-runoff LSTM model built in Python using TensorFlow. The GitHub repo with the LSTM code and data can be downloaded here.
  2. Spin up a VM instance (Ubuntu 18.04), or use your own machine if you have an Ubuntu partition.
  3. I use MobaXterm to SSH into the VM instance and drag the HEC_HMS_LSTM folder into my home directory.
  4. Within the directory, you will see 2 additional files that are not part of the main neural network code: requirements.txt and jupyter.dockerfile. A requirements.txt file contains a list of only the packages that are necessary to run your application. You can make this file with just two lines of code: pip install pipreqs and then pipreqs /path/to/project. The created file will look like this:


requirements.txt file

The jupyter.dockerfile is our Dockerfile. It is a text file that contains all the commands a user could call on the command line to assemble an image.


Dockerfile

Here in our Dockerfile, we are using the latest Ubuntu image and setting our working directory to the HEC_HMS_LSTM folder that contains the neural network code, Hourly_LSTM.py, that we would like to execute. We start by checking for updates and then install Python, pip, and Jupyter. Then we need to copy our working directory contents into the container: we copy the requirements.txt file and then add the whole HEC_HMS_LSTM folder. We then use pip to install all of the packages listed in requirements.txt. Finally, we instruct the Docker daemon to run the Python script, Hourly_LSTM.py.
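Based on that description, the contents of jupyter.dockerfile look roughly like the sketch below (a reconstruction of the screenshot above; the exact package names and paths may differ from the original):

# reconstructed sketch of jupyter.dockerfile; details may differ from the original
FROM ubuntu:latest
WORKDIR /HEC_HMS_LSTM
RUN apt-get update && apt-get install -y python3 python3-pip
RUN pip3 install jupyter
COPY requirements.txt .
ADD . /HEC_HMS_LSTM
RUN pip3 install -r requirements.txt
CMD ["python3", "Hourly_LSTM.py"]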

5. Next, we have to download Docker onto Ubuntu using the command line. This is the only thing anyone will have to download to run your container. The steps to download Docker for Ubuntu through the command line are pretty easy using this link. You may be asked to allow automatic restarts during the installation; you should choose this option. When Docker is done downloading, you can check that the installation was successful by verifying that the service is active.


Successful Docker Installation

6. Now that Docker is downloaded, we can build our Docker image and then run the container. We build the image using: 

sudo docker build -f jupyter.dockerfile -t jupyter:latest .

In this command, the -f flag denotes the name of the Dockerfile and -t is the name that we would like to tag our image with. If you’re running multiple containers, tagging helps distinguish between them. The build will go through each step of the Dockerfile. Be mindful of how much disk space you have to store the downloaded Docker image, the Python libraries, and the data you will generate; you may have to go back and resize your VM instance. We can then run our image using:

sudo docker run jupyter:latest

You’ll see the neural network training and the results of the prediction.


Successfully training the neural network in a Docker Container on a virtual machine


Credits:

[1] https://www.freecodecamp.org/news/docker-simplified-96639a35ff36/

[2] https://docs.docker.com


Make your Git repository interactive with Binder

Have you ever tried to demo a piece of software you wrote, only to have the majority of participants get stuck trying to configure their computational environment? Difficulty replicating computational environments can prevent effective demonstration or distribution of even simple codes. Luckily, new tools are emerging that automate this process for us. This post will focus on Binder, a tool for creating custom computing environments that can be distributed and used by many remote users simultaneously. Binder is a language-agnostic tool that can be used to create custom environments for R, Python, and Julia. Binder is powered by BinderHub, an open-source service in the cloud. At the bottom of this post, I’ll provide an example of an interactive Python Jupyter Notebook that I created using BinderHub.

BinderHub

BinderHub combines two useful libraries: repo2docker and JupyterHub. repo2docker is a tool to build, run, and push Docker images from source code repositories. This allows you to create copies of custom environments that users can replicate on any machine. These copies can be stored and distributed along with the remote repository. JupyterHub is a scalable system that can be used to spawn multiple Jupyter Notebook servers. JupyterHub takes the Docker image created by repo2docker and uses it to spawn a Jupyter Notebook server on the cloud. This server can be accessed and run by multiple users at once. By combining repo2docker and JupyterHub, BinderHub allows users to both replicate complex environments and easily distribute code to large numbers of users.

Creating your own BinderHub deployment

Creating your own BinderHub deployment is incredibly easy. To start, you need a remote repository containing two things: (1) a Jupyter notebook with supporting code and (2) configuration files for your environment. Configuration files can either be an environment.yml file (a standard configuration file that can be generated with conda, see example here) or a requirements.txt file (a simple text file that lists dependencies, see example here).
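For reference, a minimal environment.yml might look something like the sketch below (the name and packages are purely illustrative, not the actual Rhodium configuration):

# illustrative environment.yml; package names and versions are examples only
name: my-binder-demo
channels:
  - conda-forge
dependencies:
  - python=3.8
  - numpy
  - pandas
  - matplotlib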

To create an interactive BinderHub deployment:

  1. Push your code to a remote repository (for example, GitHub)
  2. Go to mybinder.org and paste the repository’s URL into the dialog box (make sure to select the proper hosting service)
  3. Specify the branch if you are not on the master branch
  4. Click “Launch”

The website will generate a URL that you can copy and share with users. I’ve created an example for our Rhodium tutorial, which you can find here:

https://mybinder.org/v2/gh/dgoldri25/Rhodium/master?filepath=DMDU_Rhodium_Demo.ipynb

To run the interactive Jupyter Notebook, click on the file titled “DMDU_Rhodium_Demo.ipynb”. Happy sharing!