Automate remote tasks with Paramiko

This is a short blogpost to demonstrate a the Paramiko Python package. Paramiko allows you to establish SSH, SCP or SFTP connections within Python scripts, which is handy when you’d like to automate some repetitive tasks with on remote server or cluster from your local machine or another cluster you’re running from.

It is often used for server management tasks, but for research applications you could consider situations where we have a large dataset stored at a remote location and are executing a script that needs to transfer some of that data depending on results or new information. Instead of manually establishing SSH or SFTP connections, those processes could be wrapped and automated within your existing Python script.

To begin a connection, all you need is a couple lines:

import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='remotehose',username='yourusername',password='yourpassword')

The first line creates a paramiko SSH client object. The second line tells paramiko what to do if the host is not a known host (i.e., whether this host should be trusted or not)—think of when you’re setting up an SSH connection for the first time and get the message:

The authenticity of host ‘name’ can’t be established. RSA key fingerprint is ‘gibberish’. Are you sure you want to continue connecting (yes/no)?

The third line is what makes the connection, the hostname, username and password are usually the only necessary things to define.

Once a connection is established, commands can be executed with exec_command(), which creates three objects:

stdin, stdout, stderr = ssh_client.exec_command("ls")

stdin is write-only file which can be used for commands requiring input, stdout contains the output of the command, and stderr contains any errors produced by the command—if there are no errors it will be empty.

To print out what’s returned by the command, use can use stdout.readlines(). To add inputs to stdin, you can do so by using the write() function:

stdin, stdout, stderr = ssh.exec_command(“sudo ls”)
stdin.write(‘password\n’)

Importantly: don’t forget to close your connection, especially if this is an automated script that opens many of them: ssh_client.close().

To transfer files, you need to establish an SFTP or an SCP connection, in a pretty much similar manner:

ftp_client=ssh_client.open_sftp()
ftp_client.get('/remote/path/to/file/filename','/local/path/to/file/filename')
ftp_client.close()

get() will transfer a file to a local directory, put(), used in the same way, will transfer a file to a remote directory.

PyCharm and Git for productive multi-project workflows

I wanted to write this blogpost because I’ve seen great improvements to my workflow when I transitioned to this system and thought others might benefit also. My everyday research tasks require the following:

  • a Python development environment on my local machine
  • management of project-specific dependencies
  • version control my changes
  • execution on some high-performance computing resource.

My local machine runs on Mac OS, but everything I show here should be directly translatable to Windows or other operating systems. My setup is the following:

  • Anaconda – to manage my Python environments and packages
  • PyCharm – the Python development environment
  • Git(Hub) – for version control

These are the steps I follow every time I start a new project:

  1. Create an empty repository on GitHub
  2. Clone the empty repository on my local machine
  3. Open PyCharm and select the directory of the repository I just created

When it opens, the PyCharm project will be empty and will have a default Python interpreter associated with it. What I do is I create a separate Conda environment for each of my projects, so there’s a clean separation between the packages used by each.

4. Create python environment specific to this project, by going to Preferences and selecting your current project. There, you can define your project’s (Python) interpreter. Clicking on it just shows the default Python 2.7 interpreter, which we would like to change.

As you can see, I have a separate Conda environment for each of my projects, so I manage packages and dependencies for each one.

Here I create a new environment for my new project.

5. Manage packages needed. There’s two ways for this: either through PyCharm or through Anaconda. Through PyCharm, you can use the same page to install, uninstall or update packages as needed.

Through Anaconda, you can use the Navigator, which also allows you to customize several other things about your environment, like which applications you’d like to work with.

6. Set up version control and use code on other computing resources. PyCharm has Git features integrated (overviewed already in this blog here and here) and creating a project the way I showed also ensures that PyCharm knows which repository you’re working with, without you having to set it manually. I use the built-in PyCharm functionality to commit my changes to my repository, but you can also do it through the Terminal or other means.

7. Set up project on computing resources. To do so, you need two main components. A clone of your repository in the cluster you’re working on and an environment .yml file (I explain what this is and how to generate it with one command here), listing all your environment’s dependencies. Create a virtual environment for the project in the cluster and pull any updates from your local machine.

This is more or less all I do. I have virtual environments for each of my projects both locally and on the clusters I am working on and use PyCharm and Git to manage all the dependencies and versions. I have been using this setup for the past 5-6 months and I have seen a lot of improvements in my organization and productivity, so hopefully others will find it helpful also.