Apply Functions in R and Python

In this post, I will go over some simple tools which can be used to create more efficient and concise R and Python scripts. First, I will explain the apply function in R and Python. Then, I will briefly go over anonymous functions in Python.

The Apply Function in R

The apply family of functions is used to manipulate data frames, matrices, and lists. The apply function itself takes a data frame or matrix and a function as inputs, and applies that function to each row or column. In essence, apply is an alternative to “for” loops.

The apply function has three main inputs: a data object, a MARGIN argument, and a function (FUN). As mentioned earlier, the data object can have different formats. The MARGIN argument specifies whether the function applies to rows (MARGIN = 1) or columns (MARGIN = 2). The function can be either a built-in R function (e.g., sum or max) or one that the user defines, and it can be defined either inside or outside the apply call.

Example Problem

Here I will define a simple problem as our test case. The task is to find the maximum of each column and divide all the elements of that column by the maximum. We will use the iris data set, because it is available in R and in Python’s seaborn package.

# Load the iris data set
data(iris)

# Assign first four columns of the iris data set to a data frame
iris_df <- as.data.frame(iris[, 1:4])

# Use the apply function to do the calculations of the example problem
output_max <- as.data.frame(apply(iris_df, MARGIN = 2, FUN = function(x) x / max(x)))

Sometimes there are other, easier ways to do these calculations. However, when what you want to do is more complicated, this method comes in handy. The apply function has some other variants such as lapply, sapply, and mapply. Refer to this post (here) for more information about these functions.

The Apply Function in Python

The pandas package for Python also has a function called apply, which is analogous to its R counterpart; the following code illustrates how to use it. In pandas, axis=0 applies the function to each column and axis=1 applies it to each row. Note that in this example I have defined a function outside of the apply call and passed it in to calculate the maximum and the ratio-to-maximum. In the next section, I will present an alternative way of defining in-line functions in Python.

# The iris data set is available in the seaborn package in python
import seaborn as sns
import pandas

# The following script loads the iris data set into a data frame
iris = sns.load_dataset('iris')

# Define an external function to calculate the ratio-to-maximum 
def ratio_to_max(data):
    maximum = max(data)
    print(maximum)
    ratio = data / maximum
    return ratio

# Use the built-in apply function in Python to calculate the ratio-to-maximum for all columns
output_df = iris.iloc[:, 0:4].apply(ratio_to_max, axis=0)


Anonymous Functions in Python

Python provides an easy alternative to external functions like the one used above. This method is called an anonymous or “lambda” function. A lambda performs a specific task on a data object, just like a regular function; however, it can be defined inline within other expressions and doesn’t need to be assigned a name. Therefore, in many cases, lambdas offer a cleaner and more concise alternative to regular functions. A history of the lambda function can be found in this post (here), which also provides a comprehensive list of lambda’s functionalities. Here is an example of a lambda used in place of the regular function defined before:

# The iris data set is available in the seaborn package in python
import seaborn as sns
import pandas

# The following script loads the iris data set into a data frame
iris = sns.load_dataset('iris')

# Here we use a lambda to create an anonymous function within pandas' apply function
output_df = iris.iloc[:, 0:4].apply(lambda x: x / max(x), axis=0)

Note that, although R does not have a dedicated lambda keyword, it does support anonymous functions, such as the one defined inline within the apply call earlier. Also, there are other widely used Python functions that work nicely with lambdas. For example, the built-in map and filter functions, as well as functools.reduce, can take advantage of lambda’s simplicity in complex data mining tasks. You can refer to here and here for more information about these functions.
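As a quick illustration, here is a short sketch (using a made-up list of numbers) of lambdas used with map, filter, and functools.reduce:

# map, filter, and reduce each take a function as their first argument,
# which makes them a natural fit for short lambdas
from functools import reduce

nums = [3, 1, 4, 1, 5, 9, 2, 6]

squares = list(map(lambda x: x**2, nums))         # square every element
evens = list(filter(lambda x: x % 2 == 0, nums))  # keep only the even values
total = reduce(lambda a, b: a + b, nums)          # fold the list into a single sum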

PyCharm and Git for productive multi-project workflows

I wanted to write this blog post because I’ve seen great improvements in my workflow since I transitioned to this system and thought others might benefit as well. My everyday research tasks require the following:

  • a Python development environment on my local machine
  • management of project-specific dependencies
  • version control for my changes
  • execution on some high-performance computing resource.

My local machine runs on Mac OS, but everything I show here should be directly translatable to Windows or other operating systems. My setup is the following:

  • Anaconda – to manage my Python environments and packages
  • PyCharm – the Python development environment
  • Git(Hub) – for version control

These are the steps I follow every time I start a new project:

  1. Create an empty repository on GitHub
  2. Clone the empty repository on my local machine
  3. Open PyCharm and select the directory of the repository I just created
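For steps 1 and 2, the repository is created on the GitHub website, and the clone is a single command in the terminal (a sketch; the user and repository names are placeholders):

git clone https://github.com/<your-username>/<new-repo>.git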

When PyCharm opens, the project will be empty and will have a default Python interpreter associated with it. I create a separate Conda environment for each of my projects, so there’s a clean separation between the packages used by each.

4. Create a Python environment specific to this project by going to Preferences and selecting your current project. There, you can define your project’s (Python) interpreter. Clicking on it just shows the default Python 2.7 interpreter, which we would like to change.

As you can see, I have a separate Conda environment for each of my projects, so I manage packages and dependencies for each one.

Here I create a new environment for my new project.

5. Manage the packages you need. There are two ways to do this: through PyCharm or through Anaconda. Through PyCharm, you can use the same page to install, uninstall, or update packages as needed.

Through Anaconda, you can use the Navigator, which also allows you to customize several other things about your environment, like which applications you’d like to work with.
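If you prefer the command line over either GUI, the same environment creation and package management can be done directly with conda (a sketch; the environment name, Python version, and packages are placeholders):

# create and activate a project-specific environment
conda create --name my-project python=3.8
conda activate my-project

# install, update, or remove packages as needed
conda install numpy pandas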

6. Set up version control and use code on other computing resources. PyCharm has Git features integrated (overviewed already in this blog here and here) and creating a project the way I showed also ensures that PyCharm knows which repository you’re working with, without you having to set it manually. I use the built-in PyCharm functionality to commit my changes to my repository, but you can also do it through the Terminal or other means.

7. Set up the project on computing resources. To do so, you need two main components: a clone of your repository on the cluster you’re working on, and an environment .yml file (I explain what this is and how to generate it with one command here) listing all of your environment’s dependencies. Create a virtual environment for the project on the cluster and pull any updates from your local machine.
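For reference, one common way to generate such a .yml file with conda and recreate the environment on the cluster looks like this (a sketch; see the linked post for details):

# on your local machine: write the environment's dependencies to a file
conda env export > environment.yml

# on the cluster, after cloning/pulling the repository: recreate the environment
conda env create -f environment.yml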

This is more or less all I do. I have virtual environments for each of my projects both locally and on the clusters I am working on and use PyCharm and Git to manage all the dependencies and versions. I have been using this setup for the past 5-6 months and I have seen a lot of improvements in my organization and productivity, so hopefully others will find it helpful also.

Introduction to Google Cloud Platform

This post is meant to serve as an introduction to Google Cloud Platform (GCP), a set of cloud services offered by Google. GCP is worth becoming familiar with, as it is one of the three major cloud platforms currently available (alongside Microsoft Azure and Amazon Web Services). In this post, I will go through a brief background of the services offered by GCP and how one goes about accessing a virtual machine to get started with cloud computing. I recently wrote a post introducing Red Cloud, which is Cornell’s own cloud computing service. I found that my basic familiarity with Red Cloud allowed me to easily transition to using GCP.

GCP contains over 90 different services, but perhaps the most relevant to our group is Compute Engine, which provides access to a variety of Windows and Linux virtual machines that can be customized for your computing needs. GCP also offers Kubernetes Engine, a platform that can automate many of the processes associated with running and scaling containerized code, including Docker containers. There are also engines for app development and deployment.

How do I get credits?

I’m currently using GCP for a class, for which our professor applied for computing credits through one of Google’s educational grants, but GCP also has a Google Cloud Free Program in which a user can acquire $300 worth of cloud credits over a 90-day trial period. All GCP products also have a free usage tier, provided you stay under the monthly usage limits.

Accessing a VM

The first step of accessing a VM is heading to the Google Cloud Console, which brings up a dashboard of your cloud activity.

Google Cloud Console Dashboard

First and foremost, you should establish a project, as this is necessary in order to access any of the resources. On the dashboard you will also see a panel pertaining to your billing information and the ability to access Google Cloud Shell, which is a lightweight terminal (a micro VM instance) with 5 GB of persistent storage. You can do small development tasks right in Google Cloud Shell, but you will likely want to utilize a larger instance. To do this, select the navigation menu (the three parallel bars in the upper left-hand corner) and scroll down to “Compute Engine”. Here you can click on “VM instances” and then “Create”.

Creating an Instance

A page is pulled up that allows you to customize your instance. Here we keep everything that is selected by default, but we can change the machine type. The micro instance is the cheapest (priced at $6.25 a month), and the E2 type is best for general-purpose computing. There are also M instances and C instances for memory-optimized and compute-optimized performance, respectively, though these are significantly more expensive. If it suits you, you can also choose the region from which you access your VM, and you can specify certain traffic rules. While these rule specifications are mandatory in Red Cloud, they are optional in GCP. Finally, we click “Create”. The instance will go through a period of spinning up, and then you can SSH into the VM.
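If you prefer the command line, a comparable instance can also be created from Cloud Shell or a local terminal with the gcloud CLI (a sketch; the instance name, machine type, and zone are placeholders to adjust):

gcloud compute instances create my-instance --machine-type=e2-micro --zone=us-east1-b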

Accessing your instance

It’s important to note that a new VM will be completely devoid of programs, so you will need to install all the packages that you will use. This can include Git; build-essential, which installs GCC, Make, and other compiler packages; clang for the Clang compiler; libomp-dev for OpenMP with Clang; Python; and google-perftools for profiling. If you want to configure Git, you can do the following:

git config --global user.email "you@example.com"

git config --global user.name "Your Name"
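For the package installation mentioned above, a minimal sketch on a Debian/Ubuntu image (assuming apt is the package manager; exact package names can vary slightly by distribution) would be:

sudo apt-get update
sudo apt-get install git build-essential clang libomp-dev python3 google-perftools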

And that’s how you set up a VM in Google Cloud Platform. Happy computing!

Automate unit testing with Github Actions for research codes

This is the first Water Programming post on GitHub Actions, a relatively new feature of the popular software hosting and source control platform. We assume a basic understanding of Git and source control in this post; there are many great introductions and tutorials to this area of software development, one of which can be found here.

What exactly is Github Actions? From the Actions documentation, this feature automates tasks within a software development cycle. DevOps, the continuous delivery and integration of software products, is the most common use case for Actions. From a research and scientific computing perspective, Actions is extremely valuable for automated testing with collaborative development. For example, if a remote team is developing a model or package that requires constant testing, defining a Github workflow to automatically test your code’s accuracy and portability can save time in testing, debugging, and coordination.

In this post, I will use an example Github project to illustrate the fundamentals of Actions and the value in automated testing for research codes. This project implements the Lake Model in Python; it can be found here. The structure is simple:

lake/ # implementation of the lake model
   - __init__.py
   - lakemodel.py

test.py  # unit tests

requirements.txt  # package requirements for Python

.github/ # GitHub specific 
   - workflows/
       - test.yml  # actions specific file for automated testing

First, we’ll take a look at test.py. This file uses the built-in Python module unittest to implement unit tests for the lake model code. Unit testing, the practice of testing individual small units of software independently, is an important aspect of development and writing good code. test.py includes two TestCases, defined as Python classes that inherit from unittest.TestCase. These classes individually test the correctness of the radial basis functions and the lake model mass balance, the two major components of the Lake Problem DPS formulation. These unit tests perform a calculation and compare it to what is expected. In code,

data = ...      # input fixture
expected = ...  # known result, worked out independently
result = ...    # run some calculation on data

assert result == expected
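As a concrete, purely illustrative sketch, a test case built this way could look like the following; the function under test and the fixture values are hypothetical stand-ins, not the actual contents of test.py:

import unittest

import numpy as np


def annual_mean(concentrations):
    # hypothetical function standing in for one small unit of the lake model
    return float(np.mean(concentrations))


class TestAnnualMean(unittest.TestCase):
    def test_known_input(self):
        data = np.array([1.0, 2.0, 3.0, 4.0])    # input fixture
        expected = 2.5                            # result worked out by hand
        result = annual_mean(data)
        self.assertAlmostEqual(result, expected)  # float-safe comparison


if __name__ == "__main__":
    unittest.main()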

To run this test file, we use the pytest module, which automatically runs unit tests. Pytest looks for classes and methods that begin with the word test, so taking advantage of this package’s functionality requires class names like TestLake and method names like test_accuracy. Running pytest from the command line is simple and powerful.

python -m pytest test.py

This call will execute the unit tests starting with “test” and either pass or fail. An example output might look like this:

collected 2 items                                                                                                                                            

test.py ..                                                                                                                                             [100%]

================ 2 passed in 1.82s ================

At this point, we have:

  1. a model we are working on
  2. unit tests with input data and expected results
  3. an easy and concise way to execute these tests

If this project were a solo effort, this might be sufficient, assuming the tests we were running are relatively quick. However, models and projects get considerably more complex with the addition of collaborators: code is changed on a regular basis, the test cases themselves may change as the scale increases, and debugging becomes much more time-intensive and involved. GitHub Actions presents a simple solution to these issues of scaling and collaboration.

The concept is simple: we will use Github’s remote compute resource to run our tests each time the code is changed, and we will grow the project based on these continual results.

Utilizing these free compute resources and Actions is extremely straightforward, thanks to the growing collection of open-source sample workflows and documentation. First, the main folder must contain a .github/ directory, which itself includes a workflows/ directory, to tell GitHub we are defining an action. The workflow itself is defined in a YAML file, which we call test.yml.

name: Lake model

on: [push]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.8]

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v2
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install flake8 pytest
        if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        python -m pytest test.py

There are a few important aspects to this file:

  1. on: [push] specifies that this action should occur every time code is pushed to the repository. Another common workflow runs the tests on pull requests instead, with on: [pull_request].
  2. runs-on: ubuntu-latest specifies the operating system of the virtual machine you are running on. Including more or different operating systems can test the portability of your project.
  3. each run: specifies a command to execute. Here, we have two main steps: Install dependencies (downloading the required Python packages specified in requirements.txt) and Test with pytest (executing the actual unit tests with pytest).

Including this workflow in your Python project is straightforward and powerful. Github provides a visual breakdown of each test/action.

To demonstrate the value of this continual testing workflow, I’ll use Actions to support code optimization in Python. Numba is a just-in-time compiler that can drastically speed up numerical NumPy code. Once the accuracy and correctness of the Python lake model are established, I want to decrease runtime for larger experiments while maintaining that correctness. Incorporating this new optimization effort into our workflow is simple.

  • Write the Numba optimized lake model (see lakemodel_fast.py)
  • Write a unittest file that compares the optimized model to the original and ensures they produce the same results (a sketch of such a test is shown after this list)
  • Add another workflow to .github/workflows to execute this test
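A minimal sketch of that comparison test might look like the following; the import paths and the simulate function are hypothetical placeholders for however the original and Numba-optimized models are actually exposed:

import unittest

import numpy as np

from lake import lakemodel, lakemodel_fast  # hypothetical import paths


class TestOptimizedModel(unittest.TestCase):
    def test_matches_reference(self):
        # hypothetical inputs; a real test would use the project's fixtures
        decisions = np.linspace(0.0, 0.1, 100)
        reference = lakemodel.simulate(decisions)       # original implementation
        optimized = lakemodel_fast.simulate(decisions)  # Numba-accelerated version
        # require the two implementations to agree within floating-point tolerance
        np.testing.assert_allclose(optimized, reference, rtol=1e-10)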

This method of testing is robust. If a teammate changes the original model to reflect an overlooked variable or external factor, the optimized version will no longer be correct, and the comparison test will automatically fail. The entire team is then aware of the incompatibility; there is little room for communication errors when tests are continually executed and reported.

Actions provides a powerful, flexible framework for automating computation in your project. From a research perspective, this feature is valuable in ensuring diligent unit testing and coordination.