Automate remote tasks with Paramiko

This is a short blogpost to demonstrate the Paramiko Python package. Paramiko allows you to establish SSH, SCP, or SFTP connections within Python scripts, which is handy when you’d like to automate repetitive tasks on a remote server or cluster from your local machine or from another cluster you’re running on.

It is often used for server management tasks, but for research applications you could consider situations where you have a large dataset stored at a remote location and are executing a script that needs to transfer some of that data depending on results or new information. Instead of manually establishing SSH or SFTP connections, those processes can be wrapped and automated within your existing Python script.

To begin a connection, all you need is a couple of lines:

import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
ssh_client.connect(hostname='remotehost', username='yourusername', password='yourpassword')

The first line creates a paramiko SSH client object. The second line tells paramiko what to do if the host is not a known host (i.e., whether this host should be trusted or not)—think of when you’re setting up an SSH connection for the first time and get the message:

The authenticity of host ‘name’ can’t be established. RSA key fingerprint is ‘gibberish’. Are you sure you want to continue connecting (yes/no)?

The third line is what makes the connection; the hostname, username, and password are usually the only things you need to define.

Once a connection is established, commands can be executed with exec_command(), which returns three objects:

stdin, stdout, stderr = ssh_client.exec_command("ls")

stdin is a write-only file that can be used for commands requiring input, stdout contains the output of the command, and stderr contains any errors produced by the command (it will be empty if there are no errors).

To print out what’s returned by the command, you can use stdout.readlines(). To pass input to stdin, use its write() function:

stdin, stdout, stderr = ssh_client.exec_command("sudo -S ls")  # -S makes sudo read the password from stdin
stdin.write('password\n')
stdin.flush()

Importantly: don’t forget to close your connection, especially if this is an automated script that opens many of them: ssh_client.close().
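
Putting these pieces together, here’s a minimal sketch of what an automated task might look like; the hostname, credentials, command, and remote path are placeholders you would replace with your own:

import paramiko

ssh_client = paramiko.SSHClient()
ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())

try:
    ssh_client.connect(hostname='remotehost', username='yourusername', password='yourpassword')
    # Run a command on the remote machine and print its output
    stdin, stdout, stderr = ssh_client.exec_command('ls /remote/path/to/data')
    for line in stdout.readlines():
        print(line.strip())
    # Report anything the command wrote to stderr
    errors = stderr.read().decode()
    if errors:
        print('Remote command returned errors:', errors)
finally:
    # Always close the connection, even if something above fails
    ssh_client.close()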

To transfer files, you need to establish an SFTP or an SCP connection, in much the same manner:

ftp_client = ssh_client.open_sftp()
ftp_client.get('/remote/path/to/file/filename','/local/path/to/file/filename')
ftp_client.close()

get() will transfer a file to a local directory; put(), used in the same way, will transfer a file to a remote directory.
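
As a quick sketch of the reverse direction (the paths are placeholders), sending a local file to the remote machine looks like this:

ftp_client = ssh_client.open_sftp()
# put() mirrors get(): local path first, remote path second
ftp_client.put('/local/path/to/file/filename', '/remote/path/to/file/filename')
ftp_client.close()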

Getting started with API requests in Python

This is an introductory blogpost on how to work with Application Programming Interfaces (APIs) in Python. Seeing as this is our blog’s first post on the topic, I’ll spend some time explaining some basic information I’ve had to learn in the process and why it might be useful in your research. There are several blogs and websites with tutorials online and there’s no point repeating them, but I’ll try to explain this for an audience like me, i.e., someone who (a) has no formal training in computer science/web servers/HTTP but is competent with scripting, and (b) is interested in using this for research purposes.

What is an API?

Many sites (e.g., Facebook, Twitter, many many more) make their data available through what’s called Application Programming Interfaces, or APIs. APIs basically allow software applications to interact and to send and receive data from servers, typically to provide additional services to businesses, mobile apps, and the like, or to allow additional analysis of the data for research or commercial purposes (e.g., collecting all trending Twitter hashtags per location to analyze how news and information propagate).

I am a civil/environmental engineer, what can APIs do for me?

APIs are particularly useful for data that changes often or that involves repeated computation (e.g., daily precipitation measurements used to forecast lake levels). There’s a lot of this kind of data relevant to water resources systems analysts that is easily accessible through APIs.

I’m interested, how do I do it?

I’ll demonstrate some basic scripts using Python and the Requests library in the section below. There are many different API requests one can perform, but the most common one is GET, which is a request to retrieve data (not to modify it in any way). To retrieve data from an API you need two things: a base URL and an endpoint. The base URL is basically the static address of the API you’re interested in, and an endpoint is appended to the end of it as the server route used to collect a specific set of data from the API. Collecting different kinds of data from an API is basically a process of manipulating that endpoint to retrieve exactly what is needed for a specific computation. I’ll demo this using the USGS’s API, but be aware that it varies for each one, so you’d need to figure out how to construct your URLs for the specific API you’re trying to access. APIs most often come with a documentation page and example URL generators that help you figure out how to construct them.

The base URL for the USGS API is http://waterservices.usgs.gov/nwis/iv/

I am, for instance, interested in collecting streamflow data from a gage near my house for last May. In Python I would set up the specific URL for these data like so:

import requests

response = requests.get("http://waterservices.usgs.gov/nwis/iv/?format=json&indent=on&sites=04234000&startDT=2020-05-01&endDT=2020-05-31&parameterCd=00060&siteType=ST&siteStatus=all")

For some interpretation of how this was constructed: every parameter option is separated by &, and they can be read like so: data in json format; indented so I can read it more easily; the site number is the USGS gage number; start and end dates; a USGS parameter code to indicate streamflow; site type ‘stream’; any status (active or inactive). This is obviously the format that works for this API; for a different API you’d have to figure out how to structure these arguments for the data you need. If you paste the URL into your browser you’ll see the same data I’m using Python to retrieve.

The response object now contains several attributes, the most useful of which are response.content and response.status_code. content contains the content of the URL, i.e., your data and other information, as a bytes object. status_code contains the status of your request, indicating, for example, whether the server couldn’t find what you asked for or whether you weren’t authorized to access that data. You can find what the codes mean here.
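
In practice, it’s worth checking the status code before trying to use the content; a minimal check might look like this:

# A status code of 200 means the request succeeded
if response.status_code != 200:
    print('Request failed with status code', response.status_code)

The Requests library also provides response.raise_for_status(), which raises an exception for unsuccessful status codes.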

To access the data contained in your response (most usually in json format) you can use the json() method of the response object, which returns a dictionary:

data = response.json()

The Python json library is also very useful here to help you manipulate the data you retrieve.
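
To give a sense of what that manipulation looks like, here is a sketch that pulls the timestamps and streamflow values out of the dictionary above. The nested keys reflect how the USGS JSON response was structured when I wrote this, so print out data and inspect it yourself if this doesn’t match what you retrieve:

# Navigate the nested USGS JSON structure to the list of observations
series = data['value']['timeSeries'][0]['values'][0]['value']
# Each observation is a dictionary with a timestamp and a value
# (parameter code 00060 is discharge in cubic feet per second)
flows = [(obs['dateTime'], float(obs['value'])) for obs in series]
print(flows[:5])  # first five (timestamp, streamflow) pairs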

How can I use this for multiple datasets with different arguments?

So obviously, the utility of APIs shows when one needs multiple datasets of something. Using a Python script we can iterate through the arguments needed and generate the respective endpoints to retrieve our data. An easier way of doing this than manipulating and appending strings is to set up a dictionary with the parameter values and pass it to the get() function:

parameters = {"format": 'json', "indent": 'on',
              "sites": '04234000', "startDT": '2020-05-01',
              "endDT": '2020-05-31', "parameterCd":'00060',
              "siteType": 'ST', "siteStatus":'all'}
response = requests.get("http://waterservices.usgs.gov/nwis/iv/", 
                        params=parameters)

This produces the same data, but we now have an easier way to manipulate the arguments in the dictionary. Using simple loops and lists to iterate through, one can retrieve and store multiple different datasets from this API.
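
For example, here is a sketch of such a loop that retrieves the same month of streamflow for several gages and stores each parsed response in a dictionary keyed by site number (the site numbers other than 04234000 are just placeholders you’d replace with gages of interest):

import requests

sites = ['04234000', '04235000', '04237500']  # example gage numbers
all_data = {}

for site in sites:
    parameters = {"format": 'json', "indent": 'on',
                  "sites": site, "startDT": '2020-05-01',
                  "endDT": '2020-05-31', "parameterCd": '00060',
                  "siteType": 'ST', "siteStatus": 'all'}
    response = requests.get("http://waterservices.usgs.gov/nwis/iv/",
                            params=parameters)
    if response.status_code == 200:
        all_data[site] = response.json()
    else:
        print('Request for site', site, 'failed:', response.status_code)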

Other things to be aware of

Multiple other people use these APIs, and some of them carry very large datasets, so querying them takes time. If the server needs to process something before producing it for you, that takes time too. As a result, some APIs limit the number of requests you can make, and it is generally good practice to be as efficient as possible with your requests. This means not requesting large datasets you only need parts of, but being specific about what you need. Depending on the database, this might mean adding several filters to your request URL to be as specific as possible about only the dates and attributes needed, or combining several attributes (e.g., multiple gages in the above example) in a single request, as sketched below.
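
For instance, the USGS API lets you request several sites at once by joining their numbers with commas, so the loop above could be collapsed into a single query (again, the extra site numbers are placeholders):

# One request for several gages at once, rather than one request per gage
parameters["sites"] = ','.join(['04234000', '04235000', '04237500'])
response = requests.get("http://waterservices.usgs.gov/nwis/iv/", params=parameters)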

Parallel File Compressing in Linux

If you are dealing with big data and need to move it to different directories or archive it for long-term storage, you have to think about how to do so efficiently. Without an efficient method, you will probably need to spend days organizing and finishing the work. There are several utilities that are popular for this task. I had more than 3 TB of data that I needed to move to free up some space on the disk, so I thought about a strategy for moving my files. My smallest subdirectory was about 175 GB, and it took about 2 hours to compress with normal tar and gzip. I then realized that gzip has options for the level and speed of compression, which I did not know before. This can be helpful. I did a simple test with a smaller data set (about 1.94 GB) by applying different speed options from 1 to 9; 1 indicates the fastest but worst compression, and 9 indicates the slowest but best compression:

GZIP=-Option tar cvzf OUTPUT_FILE.tar.gz ./Paths_to_Archive/

Here is the result: the default is 6, but you can play with the speed option and, based on your priorities, trade off time against compression ratio. If your file or folder is much bigger than the example I used here, different speed options can really save you time. Here is the graph that I created from this experiment:

You can further speed things up by using more than one core/processor. There is a parallel version of gzip, pigz, that compresses files/folders across multiple processors and cores.

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 20 > OUTPUT_FILE.tar.gz

In this case, I used “1” as the speed option, specified “20” processors for the task, and pointed to the path where I downloaded pigz. The good news is that you can write a simple bash script to run the same command on other nodes rather than on the login node. Therefore, you can use all the available processors on a node without slowing down the head node.

#!/bin/bash
#SBATCH -t 24:00:00
#SBATCH --job-name=test
#SBATCH --mail-type=end
#SBATCH -p normal
#SBATCH --export=ALL
#SBATCH --nodes=1
#SBATCH --output="test.txt"
#SBATCH --cpus-per-task=32

tar cf - ./Paths_to_Archive | ./pigz-2.4/pigz -1 -p 32 > OUTPUT_FILE.tar.gz

Save the lines above in a file called test.sh, and run it with sbatch test.sh.

I used a larger subset of my data (16.5 GB) to check the different speed options in parallel, and here is the result: the slowest option was almost threefold faster (168 vs. 478 seconds) than my previous test on one processor; however, my previous test folder was much smaller (1.96 vs. 16.5 GB).

Remote terminal environment using VS Code for Windows and Mac

On Windows machines, the application MobaXterm is a valuable tool for computing on virtual machines and working through SSH clients. David Gold’s blog post walks through the installation and use of this app, which works well in Windows environments.

Working remotely on my Mac laptop, I have been struggling to achieve the same workflow I had in the office with a Windows machine. Unfortunately, MobaXterm is not available for Mac OS. Looking for alternatives, I discovered that VS Code with the “Remote – SSH” extension is a great replacement, with significant advantages over MobaXterm, as it combines an SSH client interface and a code editor in one.

A screenshot from my VS Code remote interface, with the graphical file browser on the left panel, the SSH server terminal on the bottom-right, and the VS Code editor on the top-right.

Here’s how you can set up a remote session on Mac (and Windows) using VS Code: 

  1. Install the VS Code application here. For installation help and a brief overview of the app, check out this video.
  2. With VS Code opened, go to View -> Extensions, and search “Remote – SSH.” Click on the extension and press the green “Install” button. You should see the message “This extension is enabled globally” appear. Check out this extension’s description below (I’ll run through the basics in this post).
  3. On the bottom left of your screen, there should be a small green box with two opposite pointing arrow heads. Click this.
The green box is the Remote – SSH extension.
  4. Choose the first pop-up option “Remote-SSH: Connect to host…” and then select “Add New SSH Host…”.
Click the first box and then the “Add New SSH Host” button to connect to your SSH client.
  5. Here, enter your remote SSH username@serverid (here at Cornell, this would be yournetid@thecube.cac.cornell.edu to connect to our remote computing cluster, the Cube).
  6. In the same pop-up window, click the remote server that you just added. A new window will open and prompt you to enter your password for the server.
  7. Now you are in your remote SSH environment. Click “Open folder…” and select “OK” to see your remote directory on the left. You can navigate these files on your remote machine the same way as in MobaXterm. Click View -> Terminal to see your SSH command line at the bottom of the screen (this is where you can actually run programs on your cluster).

Now using VS Code, you can install other extensions to aid in code editing in different languages (here’s an article with a few good ones for various uses). This environment has the same functionality as MobaXterm, without having to switch applications for editing code. Run your cluster programs in the terminal window and edit the code in the main VS Code editor!

Establishing an Effective Data Backup Strategy for Your Workstation

When determining the data management strategy for your workflow, it is paramount to consider a range of backup options for your data beyond a single copy on your workstation or external hard drive. Creating a seamless workspace that transitions easily between workstations while maintaining durability and availability is achievable once you know what resources might be available and follow a few general guidelines.

General considerations for how you will manage and share data are crucial, especially for collaborative projects where files must often be accessible in real time.

Considering how long you might need to retain data and how often you might need to access it will drastically change your approach to your storage strategy.

3-2-1 Data Backup Rule

If you walk away from this with nothing else, remember the 3-2-1 rule. The key to ensuring the durability of your data and preventing loss due to hardware or software malfunction, fire, viruses, and institutional changes or upheavals is following the 3-2-1 rule: maintain three or more copies on two or more different media (e.g., cloud and HDD), with at least one off-site copy.

Source: https://cactus-it.co.uk/the-3-2-1-backup-rule/

An example of this would be to keep a primary copy of your data on your desktop, backed up continuously via Dropbox and nightly to an external hard drive. There are then three copies of your data: your local workstation, the external hard drive (HD), and Dropbox. By having your data saved on hard disk drives (HDDs) on your workstation and external HD in addition to ‘the cloud’ (Dropbox), you have spread your data across two different media. Lastly, since cloud storage is located on external servers connected via the internet, you have successfully maintained at least one off-site copy. Additionally, with a second external HD, you could create weekly/monthly/yearly backups and store that HD off-site.

Version Control Versus Data Backup

Maintaining a robust version control protocol does not ensure your data will be properly backed up and vice versa. Notably, you should not be relying on services such as GitHub to back up your data, only your code (and possibly very small datasets, i.e. <50 MB). However, you should still maintain an effective strategy for version control.

  • Code Version Control
  • Large File Version Control
    • GitHub is not the place to be storing and sharing large datasets, only the code to produce large datasets
    • Git Large File Storage (LFS) can be used for Git-based version control of large files

Data Storage: Compression

Compressing data reduces the amount of storage required (thereby reducing cost), but ensuring the data’s integrity is an extremely complex topic that is continuously evolving. While standard compression techniques (e.g., .ZIP and HDF5) are generally effective and trouble-free, accessing such files requires an additional step before the data are in a usable format (i.e., the files must be decompressed). It is common practice (and often a common courtesy) to compress files prior to sharing them, especially when they are emailed.

7-Zip is a great open-source tool for standard compression file types (.ZIP, .RAR) and has its own compression format. Additionally, a couple of guides on using HDF5/zlib for NetCDF files are located here and here.
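
If you already work in Python, a quick way to produce a .ZIP archive before sharing a folder is the standard library’s shutil module; this is just a sketch with placeholder paths:

import shutil

# Creates my_results.zip in the current directory from the folder's contents;
# replace both names with your own paths
shutil.make_archive('my_results', 'zip', root_dir='path/to/my_results')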

Creating Your Storage Strategy

To comply with the 3-2-1 strategy, you must actively choose where you wish to back up your files. In addition to pushing your code to GitHub, you need to choose how best to back up your other files. You must also consider any requirements you might have for your data handling.

My personal strategy costs approximately $120 per year. For my workstation on campus, I primarily use Dropbox with a now-outdated version history plugin that allows me to access files up to one year after deletion. Additionally, I instantaneously sync these files to Google Drive (guide to syncing). Beyond these cloud services, I use an external HDD that backs up select directories nightly (see my script below, which works with Windows 7).

It should be noted that Cornell could discontinue its contracts with Google so that unlimited storage on Google Drive is no longer available. Additionally, it is likely that Cornell students will lose access to Google Drive and Cornell Box upon graduation, rendering these options impractical for long-term or permanent storage.

  • Minimal Cost (Cornell Students)
    • Cornell Box
    • Google Drive
    • Local Storage
    • TheCube
  • Accessibility and Sharing
    • DropBox
    • Google Drive
    • Cornell Box (for sharing within Cornell, horrid for external sharing)
  • Minimal Local Computer Storage Availability (access via a web interface for cloud storage, or via File Explorer)
    • DropBox
    • Google Drive (using Google Stream)
    • Cornell Box
    • TheCube
    • External HDD
  • Reliable (accessibility through time)
    • Local Storage (especially an external HDD if you will be relocating)
    • Dropbox
    • TheCube
  • Always Locally Accessible
    • Local Storage (notably where you will be utilizing the data, e.g. keep data on TheCube if you plan to utilize it there)
    • DropBox (with all files saved locally)
    • Cornell Box (with all files saved locally)
  • Large Capacity (~2 TB total)
    • Use Cornell Box or Google Drive
  • Extremely Large Capacity (or unlimited file size)

Storage Option Details and Tradeoffs

Working with large datasets across workstations can be challenging, changing the problem from simply incorporating the files directly within your workflow to interacting with the files from afar (e.g., keeping and utilizing files on TheCube).

But at the personal-computer level, the most significant differentiator between storage types is whether you can (almost) instantaneously update and access files across computers (cloud-based storage with desktop file access) or whether manual/automated backups occur. I personally like to have the majority of my files easily accessible, so I use Dropbox and Google Drive to constantly sync between computers. I also back up all of my files from my workstation to an external hard drive to maintain an extra layer of data protection in case something goes awry.

  • Requirements for Data Storage
  • Local Storage: The Tried and True
    • Internal HDD
      • Installed on your desktop or laptop
      • Can most readily access data for constant use, making interactions with files the fastest
      • Likely the most at-risk version due to potential exposure to viruses in addition to nearly-constant uptime (and bumps for laptops)
      • Note that solid state drives (SSDs) do not have the same lifespan in terms of the number of read/write cycles as an HDD, which can lead to slowdowns or even failures if improperly managed. However, newer SSDs are less prone to these issues due to a combination of firmware and hardware advances.
      • A separate data drive (a secondary HDD that stores data and not the primary operating system) is useful for expanding easily-accessible space. However, it is not nearly as isolated as data contained within a user’s account on a computer and must be properly configured to ensure privacy of files
    • External Hard Drive Disk (HDD)
      • One-time cost ($50-200), depending on portability/size/speed
      • Allows an offline version of data to be stored, preventing newly introduced viruses (e.g., ransomware) from blocking access to or corrupting older versions; requires isolation from your workflow
      • May back up data instantaneously or as often as desired: general practice is to back up nightly or weekly
      • Software provided with external hard drives is generally less effective than self-generated scripts (e.g. Robocopy in Windows)
      • Unless properly encrypted, can be easily accessed by anyone with physical access
      • May be used without internet access, only requiring physical access
      • High quality (and priced) HDDs generally increase capacity and/or write/read speeds
    • Alternative Media Storage
      • Flash Thumb Drive
        • Don’t use these for data storage, only temporary transfer of files (e.g. for a presentation)
        • Likely to be lost
        • Likely to malfunction/break
      • Outdated Methods
        • DVD/Blu-Ray
        • Floppy Disks
        • Magnetic Tapes
      • M-Discs
        • Requires a Blu-Ray or DVD reader/writer
        • Supposedly lasts multiple lifetimes
        • 375 GB for $67.50
  •  Dropbox
    • My experience is that Dropbox is the easiest cloud-storage solution to use
    • Free Version includes 2 GB of space without bells and whistles
    • 1 TB storage for $99.00/year
    • Maximum file size of 20 GB
    • Effective (and standard) for filesharing
    • 30-day version history (extended version history for one year can be purchased for an additional $39.00/year)
    • Professional, larger plans with additional features (e.g. collaborative document management) also available
    • Can easily create collaborative folders, but storage counts against all individuals added (an issue if individuals are sharing large datasets)
    • Can be accessed through both a web interface and desktop clients across operating systems
    • Fast upload/download speeds
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • Google Drive
    • My experience is that Google Drive is relatively straightforward
    • Unlimited data/email storage for Cornell students, staff, and faculty
    • Costs $9.99/mo for 1 TB
    • Maximum file size of 5 GB
    • Easy access to G Suite, which allows for real-time collaboration on browser-based documents
    • Likely to lose access to storage capabilities upon graduation
    • Google Drive is migrating to Google File Stream, which stores less commonly used files online rather than on your hard drive
    • Google File Stream (used to sync files with desktop) requires a constant internet connection except for recently-used files
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup
  • Cornell Box
    • My experience is that Cornell Box is not easy to use relative to the other options
    • Unlimited storage space, 15 GB file-size limit
    • Free for Cornell students, staff, and faculty, but alumni lose access once graduating
    • Can only be used for university-related activities (e.g. classwork, research)
    • Sharable links are available for internal Cornell users; however, accessing files is very cumbersome for external users (it requires making an account)
    • Version history retains the 100 most recent versions for each file
    • Can connect with Google Docs
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • TheCube

Long-Term (5+ Years) Data Storage

It should be noted that most local media types degrade over time. The 3-2-1 strategy is most important for long-term storage (with an emphasis on multiple media types and off-site storage). Notably, even if stored offline and never used, external HDDs, CDs, and Blu-Ray discs can only be expected to last around five years at most. Other media, such as magnetic tapes (10 years) or floppy disks (10-20 years), may last longer, but there is no truly permanent storage strategy (source of lifespans).

M-Discs are a write-once (i.e., read-only, cannot be modified) storage medium projected to last many lifetimes, up to 1,000 years. If you’re able to dust off an old Blu-Ray disc reader/writer, M-Discs are probably the long-term strategy most likely to survive the test of time, and making two copies stored in two locations is definitely worthwhile. However, the biggest drawback is that M-Discs are relatively difficult to access compared to plugging in an external HD.

Because of the range of lifespans and how cheap storage has become, I would recommend maintaining your old (and likely relatively small) data archives within your regular storage strategy, which is likely to migrate between services over time.

For larger datasets that you are required to retain and would like to access easily, I would maintain them on at least two offline external hard drives stored in separate locations (e.g., at home and in your office), occasionally (i.e., every six months) checking the health of the hard drives in perpetuity and replacing them as required.

Relying only on cloud storage for long-term storage is not recommended due to the possibility of companies closing their doors or simply deactivating your account. However, it can be used as an additional layer of protection on top of physical copies (e.g., external HD, M-Discs).

Windows 7 Robocopy Code

The script I use for backing up specific directories from my workstation (Windows 7) to an external HD is shown below. To set up my hard drive, I first formatted it to a file system compatible with multiple operating systems using this guide. Note that your maximum file size and operating system requirements determine which format to use. Following this, I used the following guide to implement a nightly backup of all of my data while keeping a log on my C: drive. Note that only new files and new versions of files are copied over, ensuring that the backup does not take ages.

@echo off
robocopy C:\Users\pqs4\Desktop F:\Backups\Desktop /E /XA:H /W:0 /R:3 /REG > C:\externalbackup.log
robocopy E:\Dropbox F:\Backups\Dropbox /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Downloads F:\Backups\Downloads /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Documents F:\Backups\Documents /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4 F:\Backups\UserProfile /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy "E:\Program Files\Zotero" F:\Backups\Zotero /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log

Globus Connect for Transferring Files Between Clusters and Your Computer

I recently learned about a service called Globus Online that allows you to easily transfer files to and from a cluster.  It’s similar to WinSCP or SSH-based transfers, but the transfers can happen in the background and are resumed if they are interrupted.  It is supported by the University of Colorado: https://www.rc.colorado.edu/filetransfer and the NSF XSEDE machines: https://www.xsede.org/globus-online. Also, courtesy of Jon Herman, a note about Blue Waters: there is an endpoint called ncsa#NearLine where you can push your data for long-term storage (to avoid scratch purges). However, NearLine has a per-user quota of 5 TB, so if you find yourself mysteriously unable to transfer any more files, you’ll know why.

To get started, first create a Globus Online account.  Then you’ll need to create “endpoints” on your account.  The obvious endpoint is, say, the cluster.  The University of Colorado, for example, has instructions on how to add their cluster to your account.  Then, you need to make your own computers endpoints!  To do this, click Manage Endpoints and then click “Add Globus Connect.”  Give your computer a name, and it will generate a unique key that you can then use in the desktop application for the service.  Download the program for Mac, Unix, or Windows.  The cool thing is you can do this on all your computers.  For example, I have a computer called MacBookAir running OSX and another one called MyWindows8, or something like that, that runs Windows.

File transfers are then initiated as usual, only you’re using a web interface instead of a standalone program.

As usual, feel free to share your thoughts in the comments below.

Software to Install on Personal Computers

I’ve been working lately with some visiting students, people outside the research group, and new students within the group who need to install some software on a personal laptop to get started.  Here is a guide to what you need to install. (Last updated January 20, 2012.)

  • (Required, Available to ANGEL Group Members Only)  If you haven’t already done so, contact Josh Kollat to become part of the AeroVis user group on Penn State’s ANGEL course management system, http://cms.psu.edu/.  There, you can download AeroVis, which is the visualization software that we use.  Also grab a copy of the sample data files and documentation which are really helpful.
  • (Required, Available to Everyone) Cygwin is used to connect to the cluster and see visual programs such as Matlab that are running on the cluster.  It can be found at http://www.cygwin.com/.  Download setup.exe and save it in a place (such as your Documents folder) where you will remember where to find it, since you may need to run it again.  Double click on setup.exe to get started.  You can either select the “default” packages when it asks which packages to install, or go ahead and select “all” to install all packages.  Either way, you may have to go back and select additional packages in subsequent runs of setup.exe if everything doesn’t install right.  The main package we’ll be using is “X-Window”, but according to their website, you install X-Window by following the same process of installing the cygwin package.  This is probably the hardest of any of the software packages to install, and it takes a long time.  Let us know if you need any help, and you can leave a comment if you have any tips on how to make the install process easier.  Afterwards, change your windows environment variables to add ;C:\cygwin\bin to your windows path. Note: Ryan has some additional ideas about the best way to access remote computers. Stay tuned for his post on the subject, and he can edit this post too with more details.
  • (Required, Available to Everyone) You’ll need a text editor, such as Notepad++ or PSPad.  Try both and see which one you like better, they are both free.  While you can use notepad or wordpad to do a lot of the text editing, these programs are a lot more comfortable for working with data and programming.
  • (Required, Available to Everyone) Use WinSCP for file transfer to the cluster. Ryan’s suggestions will probably pertain to this advice too. I know Matt and Ryan use different software packages, so I’d love their input here. See the comments on this post for additional discussion.
  • (Required Only for Visiting Students who Need College of Engineering Wireless) A group member can contact Bob White to get access for you.  A Penn State student must download software at the college’s site and load it on the visitor’s laptop.
  • (Required for Most Students Publishing Papers and Writing Theses Internally in the Group) You’ll need a LaTeX system. LyX is an open-source “document processor” that is compatible with LaTeX. Please add additional suggestions for LaTeX environments and editors, preferably ones with syntax highlighting and some graphical features (such as adding equations using symbols). We have a license for WinEdt, but it’s not free for personal use.
  • (Optional, Available to Everyone) Open-source tips: when in a Penn State computer lab or at a computer in our office, you can use Adobe Photoshop and Illustrator for figure editing and Microsoft Office products.  If you want some nice free programs for your personal computer, try Inkscape (for vector images), GIMP (for raster images and photos), and Open Office (an alternative to Microsoft Office).
  • (Optional, Available to Penn State Students Only) Some good software is available at http://downloads.its.psu.edu/.  Secure Shell Client, under “File Transfer” at that site, is a file transfer/terminal program used to connect to the cluster.  Different people have varying preferences for file transfer and cluster stuff.  I personally recommend WinSCP and running terminal commands on Cygwin, so Secure Shell is not really required.

As far as software that costs money or for computers in the office, we would generally need Microsoft Office, Microsoft Visual Studio, Matlab, and the Adobe Suite.  Students shouldn’t have to worry about installing those programs on their personal computers.  If you get Cygwin working correctly, your cluster access will allow you to use Matlab, Mathematica, programming compilers, and other software, so even if you don’t have access to your own copy of Matlab, you can use it interactively on the cluster.

Let me know if you have any questions by emailing jrk301 at psu.edu.

Software to Install on Personal Computers – For Mac Users!

It turns out there are some pros and cons to doing these activities on Macs.  Here are some updates on how to work efficiently on a Mac:

  • You don’t need Cygwin at all since X11/XWindow is included in the operating system already!
  • You don’t need something like WinSCP, since you can use SFTP to transfer files from a local computer to the remote computer.  Here’s how:
    1. Open a terminal window on your local computer.
    2. Use cd and ls to get to the local directory on your hard drive where you have files you want to send to the remote computer.
    3. Type sftp user@host to connect to the remote computer.  Enter your password.
    4. Use cd and ls to get to the directory on the remote system where you want to put the files from your local system.
    5. Type put filename to transfer a file to the remote system. You can also use mput *abc to transfer multiple files (in this example, everything ending in abc). The asterisk is a wildcard; it matches any character, any number of times.
  • If you want to transfer in the other direction, i.e. from the remote machine to the local machine, use the get and mget commands, which work just like put and mput.
  • Summary of useful sftp commands:
    • get filename Copy a file from the remote computer to the local computer.
    • mget filenames Copy several files from the remote computer to the local computer. Can use wildcards.
    • put filename Copy a file from the local computer to the remote computer.  Use -r if you want to upload a whole directory.  Note that the command can’t create a directory that doesn’t already exist on the remote side, so use mkdir on the remote computer first to make a directory matching the one you want to copy.
    • mput filenames Copy several files from the local computer to the remote computer. Can use wildcards.
    • cd path Change directories on the remote computer.
    • ls List the files in the current directory on the remote computer.
    • pwd Display the path of the current directory on the remote computer.
    • lpwd Display the path of the current directory on the local computer.
    • lcd Change directories on the local computer.