From MATLAB to Julia: Insights from Translating an Open-Source Kirsch-Nowak Streamflow Generator to Julia

A quick look into translating code: speed comparisons, practicality, and comments

As I become more and more familiar with Julia, an open-source programming language, I have been drawn to translating code to it, not only to run it in a free, open-source language but also to test its performance. Since Julia was designed to handle matrix operations efficiently (compared to other high-level open-source languages), finding a problem that exercises these performance advantages only makes sense.

As with any new language, understanding how well it performs relative to the other potential tools in your toolbox is vital. As such, I decided to use a problem that is easily scalable and can directly compare the performance of MATLAB and Julia: the Kirsch-Nowak synthetic stationary streamflow generator.

So, in an effort to sharpen my understanding of the Kirsch-Nowak synthetic stationary streamflow generator created by Matteo Giuliani, Jon Herman, and Julianne Quinn, I decided to take on the project of converting this generator from MATLAB to Julia. This specific generator takes in historical streamflow data from multiple sites (while assuming stationarity) and returns a synthetically generated daily timeseries of streamflow. For a great background on synthetic streamflow generation, please refer to this post by Jon Lamontagne.

Model Description

The example, borrowed from Julie's code, utilizes data on Susquehanna River flows (cfs) at both Marietta (USGS station 01576000) and Muddy Run, along with lateral inflows (cfs) between Marietta and Conowingo Dam (1932-2001). Additionally, evaporation rates (in/day) over the Conowingo and Muddy Run Dams (from an OASIS model simulation) are utilized. The generator developed by Kirsch et al. (2013) uses a Cholesky decomposition to create a monthly synthetic record which preserves the autocorrelation structure of the historical data. The method proposed by Nowak et al. (2010) is then used to disaggregate to daily flows (using a historical month +/- 7 days). A full description of the methods can be found at this link.

Comparing Julia and MATLAB

Comparison of Performance between Julia and MATLAB

To compare the speeds of each language, I adapted the MATLAB code into Julia (shown here) on as nearly equal a basis as possible. I attempted to keep the loops, data structures, and function formulation as similar as possible, even calling similar libraries for any given function.

[Figure: runtime comparison of the Julia and MATLAB implementations (julia_matlab_streamflow)]

When examining the performance of Julia (solid lines) and MATLAB (dashed lines), there is only one instance where MATLAB (x) outperformed Julia (+): the 10-realization, 1000-year simulation shown in the yellow dots in the upper left. Needless to say, Julia easily outperformed MATLAB in all other situations and required only 53% of the time on average (all simulations weighted equally). Julia was proportionally much faster at lower numbers of simulation years, requiring only 17-35% of MATLAB's time. This is likely because I did not handle arrays optimally; the code could likely be sped up even more.

Considerations for Speeding Up Code

Row- Versus Column-Major Array Architecture

It is worth knowing how a specific language stores and traverses its arrays/matrices. MATLAB and Julia are both column-major languages, meaning elements are laid out sequentially in memory by descending down a column row by row before moving on to the next column. On the other hand, NumPy in Python uses row-major ordering by default. The Wikipedia article on this is brief but well worthwhile for understanding these quirks.

This is especially notable because ensuring that proper indexing and looping orders are followed can substantially speed up code. In fact, it is likely that Julia slowed down significantly on the 10-realization, 1000-year simulation (relative to both its other runs and to MATLAB) because of how the arrays were looped through. As a direct example, shown below, exponentiating a [20000, 20000] array row-by-row took approximately 47.7 seconds, while the same operation performed column-by-column took only 12.7 seconds.

[Figure: timing of row-by-row versus column-by-column exponentiation of a [20000, 20000] array]
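Below is a minimal sketch of this comparison in Julia (not the original benchmark; the array size here is smaller and purely illustrative), squaring every element with the outer loop running over rows versus over columns:

# minimal sketch: row-major vs. column-major traversal of a column-major Julia array
function square_rowwise!(A)
    for i in 1:size(A, 1), j in 1:size(A, 2)   # outer loop over rows: strided memory access
        A[i, j] = A[i, j]^2
    end
end

function square_colwise!(A)
    for j in 1:size(A, 2), i in 1:size(A, 1)   # outer loop over columns: contiguous memory access
        A[i, j] = A[i, j]^2
    end
end

A = rand(5_000, 5_000)
square_rowwise!(copy(A)); square_colwise!(copy(A))   # warm up so compilation is not timed
@time square_rowwise!(copy(A))
@time square_colwise!(copy(A))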

Dealing with Arrays

Simply put, arrays and matrices in Julia are a pain compared to MATLAB. As an example of the bad and the ugly: unlike in MATLAB, where you can directly declare an array of any size you wish to work with, in Julia you must first create an outer array and then fill it with individual arrays. This is shown below, where an array of arrays is initialized. However, once an array is established, Julia is extremely fast in loops, so filling a previously established array makes for a much faster experience.

# initialize output
qq = Array{Array}(undef, num_sites) #(4, 100, 1200)

for i = 1:num_sites
     qq[i] = Array{Float64}(undef, nR, nY * 12)
end

On the plus side, when creating arrays, Julia is extremely powerful in its ability to assign variable types to the components of a given array. This can drastically speed up your code. Shown below, it is easy to see the range of declarations and assignments being made to populate the arrays. There is an easy example of declaring an array with zeros, and another where we populate an array using slices of another. Note the indexing structure for Qd_cg in the second loop: it is not technically a 3-D array but rather a 2-D array nested within a 1-D array, illustrating the issues mentioned above.

delta = zeros(n_totals)
for i = 1:n_totals
     for j = 1:n_sites
          delta[i] += (Qtotals[month][j][i] - Z[j]) ^ 2
     end
end

q_ = Array{Float64, 2}(undef, num_realizations[k], 365 * num_years[k])
for i = 1:Nsites
     # put into array of [realizations, 365*num_yrs]
     for j = 1:num_realizations[k]
          q_[j, :] = Qd_cg[j][:, i]'
     end
end

Code Profiling: Order of Experiments

An interesting observation is that Julia's first run on a given block of code is substantially slower than every subsequent attempt, because the code is compiled just-in-time on its first call. Thus, it is likely worthwhile to run a smaller-scale array through your code to trigger compilation if you plan to move on to substantially more expensive operations (i.e. scaling up).

In the example below, we can see that the second run of the exact same code was over 10% faster than the first. However, when running the code without the function wrapper (in the original timed runs), the code was 10% faster still (177 seconds) than the second sequential run shown below. This points to the importance of profiling and experimenting with sections of your code.

[Figure: timing of the first and second runs of the same block of code]
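As a minimal sketch (the function and array below are illustrative, not from the generator), the effect is easy to reproduce with the @time macro:

f(x) = sum(sqrt.(x))   # illustrative function

x = rand(10^7)
@time f(x)   # first call: includes just-in-time compilation overhead
@time f(x)   # second call: measures only the actual execution time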

Basic profiling tools are built directly into Julia, as shown in the Julia profiling documentation. The results can be visualized easily using the ProfileView library. The Juno IDE (standard with JuliaPro) allegedly has a good built-in profiler as well. However, most any IDE should do the trick (links to IDEs can be found here).

Syntax and Library Deprecation

While Julia is very similar in its structure and syntax to MATLAB, much of the similar syntax has been deprecated as Julia has been rapidly upgraded. Notably, Julia released v1.0 in late 2018 and recently released v1.1, moving further away from MATLAB-like function names. This stands as a lesson for individuals wishing to translate all of their code between these languages. I found a useful website that assists in translating general syntax, but many of the functions listed there have been deprecated. However, as someone who didn't have any experience with MATLAB but was vaguely familiar with Julia, it was a godsend for learning differences in coding styles.

For example, creating an identity matrix in MATLAB uses the function eye(size(R)) to create an identity matrix the same size as R. While Julia initially supported the same function, eye was deprecated in v0.7. To get around this, either the UniformScaling object I (from the LinearAlgebra standard library) can be used as a scalable identity matrix, or Matrix{Float64}(I, size(R, 1), size(R, 2)) can be used to declare an explicit identity matrix the size of R for a more foolproof and faster operation.
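A minimal sketch of both approaches (R below is just a placeholder matrix):

using LinearAlgebra

R = rand(3, 3)                                    # placeholder matrix
A = R + I                                         # I scales automatically to match R
Id = Matrix{Float64}(I, size(R, 1), size(R, 2))   # explicit identity matrix the size of R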

When declaring functions, I have found Julia to be relatively straightforward and Pythonic. While I still try to insert colons at the ends of declarations and forget to add 'end' at the end of functions, loops, and more, the ease of creating, calling, and interacting with functions makes Julia very accessible. Furthermore, its ability to interact with matrices without special libraries (e.g. NumPy in Python) allows for more efficient coding without having to know specific library notation.

Debugging Drawbacks

One of the most significant drawbacks I run into when using Julia is the lack of clarity in the error messages generated for common mistakes, such as adding extra brackets. For example, the following error message is generated in Python when adding an extra parenthesis at the end of an expression.

[Figure: Python error message produced by an extra closing parenthesis]

However, Julia produces the following error for an identical mistake:

[Figure: Julia error message produced by the same extra parenthesis]

One simple solution is to upgrade my development environment from Jupyter Notebooks to a general IDE so issues can be rooted out more easily by running code line-by-line. Even so, I see the lack of clarity in showing where specific errors arise as a significant drawback to development within Julia. As shown in the example below, where an array has gone awry, an IDE (such as Atom, shown below) can make troubleshooting and debugging a relative breeze.

[Figure: debugging an array issue in the Atom IDE]

Furthermore, when editing auxiliary functions in another file or module that was loaded as a library, Julia is not kind enough to simply reload and recompile the module; to get it to properly work in Atom, I had to shut down the Julia kernel and then rerun the entirety of the code. Since Julia takes a while to initially load and compile libraries and code, this slows down the debugging process substantially. There is a specific package (Revise) that exists to take care of this issue, but it is not loaded by default and must be added to your code.
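A minimal sketch of that workaround, assuming Revise is already installed (the file name below is hypothetical):

using Revise                  # load Revise before the code you want to track
includet("my_functions.jl")   # includet() tracks the file so later edits are picked up automatically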

GitHub Repositories: Streamflow Generators

PyMFGM: A parallelized Python version of the code, written by Bernardo Trindade

Kirsch-Nowak Stationary Generator in MATLAB: Developed by Matteo Giuliani, Jon Herman, and Julianne Quinn

Kirsch-Nowak Stationary Generator in Julia: Please note that the results are not validated. However, you can easily access the Jupyter Notebook version to play around with the code in addition to running the code from your terminal using the main.jl script.

Full Kirsch-Nowak Streamflow Generator: Also developed by Matteo Giuliani, Jon Herman, and Julianne Quinn; it can handle rescaling flows for changes due to monsoons. I would highly suggest diving into this code alongside the relevant blog posts: Part 1 (explanation), Part 2 (validation).

Introducing Julia: A Fast and Modern Language

In this blog post, I am introducing Julia, a high-level open-source dynamic programming language. While it has taken root in finance, machine learning, life sciences, energy, optimization, and economic realms, it is certainly not common within the water programming realm. Through this post, I will give an overview of the features, pros, and cons of Julia before comparing its performance to Python. Following this, I give an overview of common Julia development environments as well as links to additional resources.

Julia: What’s Going On?

Julia is a high-level open-source dynamic programming language built for scientific computing and data processing that is Pythonic in syntax to ensure accessibility while boosting computational efficiency.  It is parallelizable on high performance computing resources utilizing CPUs and GPUs, allowing for its use in large-scale experiments.

With Version 1.1.0 recently released, Julia now stands amongst the established and stable programming languages. It can handle matrices with ease while performing at speeds nearly comparable to C and Fortran; it is comparable to Cython/Numba and faster than NumPy, pure Python, MATLAB, and R. Note that further benchmarking is required on a project-specific basis due to the speed of individual packages/libraries, but a simple case is shown in a following section.

The biggest case for implementing Julia over C/C++/Fortran is its simplicity in writing and implementation. Its syntax is similar to Python's, allowing for not only list comprehension but also 'sloppy' dynamic variable type declarations and use. Julia has the ability to move between dynamic and static variable types. Furthermore, Julia allows users to create their own variable types, allowing for maximum flexibility while maintaining efficiency.
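As a minimal sketch (the type and field names below are invented for illustration), a variable can be rebound dynamically while function arguments and user-defined composite types carry explicit type annotations:

x = 5          # dynamically typed: x is an Int here...
x = "five"     # ...and can be rebound to a String later

struct Reservoir              # a user-defined composite type (hypothetical example)
    name::String
    capacity::Float64
end

release(r::Reservoir, fraction::Float64) = fraction * r.capacity   # statically-typed arguments

release(Reservoir("Conowingo", 300_000.0), 0.1)   # illustrative numbers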

Installing and implementing packages from the start is nearly seamless, simply requiring the name of the package and an internet connection. Most packages and libraries are hosted on GitHub and are relatively straightforward to install with just a couple lines of code. By the same token, distributing and contributing to the open-source community is straightforward. Furthermore, Julia can access Python modules with wrappers and C/Fortran functions without them, making it a very versatile language. Likewise, it can easily be called from Python (PyJulia) to speed up otherwise cumbersome operations.
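For instance, adding a registered package takes only a couple of lines in the REPL (the package name below is just an example):

using Pkg
Pkg.add("DataFrames")   # downloads the package and its dependencies from the registry
Pkg.status()            # lists the packages installed in the active environment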

Julia has relatively straightforward profiling modules built in. Beyond being able to utilize Jupyter Notebook commands, Julia has a range of code diagnostic tools, such as the @time macro, which informs the user of the wall time and memory allocation for a function call.

One of the most obvious downsides to Julia when compared to Python is its relatively small community. There is not nearly the base of documentation that I am accustomed to; nearly every issue imaginable in the Python world has been explored on Stack Overflow, while Julia coverage is much more limited. However, because much of its functionality mirrors MATLAB, similar issues and their corresponding solutions bleed over.

Additional transition issues that may be encountered are that, unlike Python, Julia's indexing begins at 1. Furthermore, Julia uses column-major ordering (like MATLAB, Fortran, and R), unlike Python and C (row-major ordering), making iterating column-by-column substantially faster. Thus, special care must be taken when switching between languages, in addition to ensuring consistent strategies for speeding up code. Another substantial adjustment is that Julia is not directly object-oriented, since objects do not have embedded methods; instead, a given object must be passed to a function to be modified.

Overall, I see Julia as having the potential to improve computational efficiency for computationally-intensive models without the need to learn a more complex language (e.g. Fortran or C/C++). Between its ease of integration between other common languages/libraries to its ability to properly utilize HPC CPU/GPU resources, Julia may be a useful tool to utilize alongside Python to create more efficient large-scale models.

Julia Allows for Faster Matrix Operations

As a simple test to compare the scalability of Julia and Python, I computed the product of increasingly large matrices using the base Julia language and NumPy in Python (both running in Jupyter Notebooks on a Windows desktop). You can find the code at the following link on GitHub.
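A minimal sketch of the Julia side of this comparison (the matrix sizes below are illustrative, not the ones behind the figure) times a matrix product at increasing sizes after a warm-up call:

using LinearAlgebra

for n in (100, 500, 1000, 2000)
    A = rand(n, n)
    B = rand(n, n)
    A * B                        # warm-up call so compilation is not included in the timing
    t = @elapsed A * B
    println("n = $n: $(round(t, digits = 4)) s")
end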

[Figure: timing of matrix products in Julia versus Python (NumPy) as matrix size grows (julia_vs_python_timing.png)]

As you can see in the generated figure, Julia outperforms Python for large matrix applications, indicating that Julia is more efficient and may be worth using for larger and larger operations. If you have any suggestions on improving this code for comparison, please let me know; I am absolutely game for improvement.

Development Environments for Julia

The first and most obvious variable in choosing a development environment is you, the user. Whether you want all the bells and whistles or a simple text-editing environment, there is a range of products to fulfill your desires.

Juno—an extension to Atom—is the default IDE that is incorporated with JuliaPro, a pre-packaged version of Julia that includes a range of standard packages—Anaconda is the parallel in the Python world.

Jupyter Notebooks are my go-to development environment for experimenting with code in both Python and Julia, especially for testing small sections of code. Between being able to easily create visuals and examples inline with your code and being able to combine code with Markdown, they allow for a clean and interactive approach to sharing code or teaching. Note that installation generally requires IJulia, which allows for easy integration into already existing Jupyter workflows.

JetBrains IDEs (e.g. CLion, PyCharm) support a Julia plugin that supports a wide range of the same features JetBrains is known for. However, it is still in beta and is working to improve its formatter and debugging capabilities.

JuliaBox is an online web interface using Jupyter Notebooks for code development on a remote server, requiring no local installation of Julia. Whether you are in a classroom setting, wanting to write code on your phone, or are just wanting to experiment with parallel computing (or don’t have access to HPC resources), JuliaBox allows for a nearly seamless setup experience. However, note that this development environment limits each session to 90 minutes (up to 8 hours on a paid subscription) and requires a subscription to access any resources beyond 3 CPU cores. Note that you can access GPU instances with a paid subscription.

Julia Studio is another IDE used by a range of users. It is a no-frills IDE based on Qt Creator, resulting in a clean look.

For anyone looking to use Visual Studio, you can install a VS Code extension.

Vim, not surprisingly, is available for Julia. And if you're feeling up to the challenge, Emacs also has an interface.

Of course, you can just use a text editor of your choice (e.g. Sublime Text with an extension) and simply run Julia from the terminal.

Julia Resources

Acknowledgements

Thanks to Dave Gold for his assistance in giving guidance for benchmarking and reminding me of the KISS principle.

Interacting with Plotted Functions Using Jupyter Notebooks ipywidgets with matplotlib

When exploring the properties of an individual function or a system of equations, I often find myself quickly writing up code in a Jupyter Notebook (overview) to plot functions using matplotlib. While a previous blogpost by Jazmin has gone over creating interactive plots with matplotlib, being able to actively change variables to quickly explore their sensitivity or even teach others is crucial.

The primary reason I started playing with interactive widgets in Jupyter Notebook was to teach individuals the basic mechanics of a simple crop allocation model that utilizes Positive Mathematical Programming (PMP) (Howitt, 1995). Beyond showing the simple calibration steps via code, allowing students to interact with individual variables lets them not only explore the sensitivity of models but also find where such a model might break.

In this specific case, the marginal cost and revenues for a specific commodity (corn) versus the number of acres planted are graphed. A breaking case where the crop’s input data (the price per ton of corn) causes the model to produce unrealistic results (a negative marginal cost) is produced below.

Plotting Functions in matplotlib

One of the features that is often overlooked in matplotlib is that you can (indirectly) plot functions without much effort. The following commented code is used to demonstrate plotting marginal cost and revenue curves on the same graph.

import matplotlib.pyplot as plt
from numpy import *

#fixed inputs from calibrated model
#alpha and gamma for quadratic formulation from Howitt, 1995
input_params = [743.0, 2.57]

#crop revenue per acre, known
crop_revenues = 1500

#Creating the x-axis domain, controls range of x inputs
t = linspace(0, 350)

#t is the independent variable, corn_MC is dependent variable
corn_MC = input_params[0]  + input_params[1] * t 

#note that you have to multiply t by 0 to populate entire line
corn_MR = crop_revenues + 0 * t

plt.figure(dpi=400) #set resolution of graph, can change. 

#label axes
plt.ylabel('Marginal Cost ($/acre)')
plt.xlabel('Acres of Corn Planted')

#plot as many lines as you'd like, expand them here
plt.plot(t, corn_MC, 'r', label='Marginal Cost')
plt.plot(t, corn_MR, 'b', label='Marginal Revenue')

#change size of ticks on axes
plt.tick_params(labelsize=8)

legend = plt.legend(loc=4, shadow=True)
plt.show()

 

The resulting graph is generated inline in a Jupyter Notebook. Note that the resolution and size of the graphic are adjusted using plt.figure(dpi=400); the default is much smaller and makes for hard-to-read graphs!

[Figure: marginal cost and marginal revenue curves for corn (matplotlib_output.png)]

Interactive Inputs for Graphics in Jupyter Notebook

While graphing a function in Jupyter Notebook using matplotlib is rather straightforward, interactive plots of functions are rarely utilized. Using ipywidgets allows us to easily create these tools for any type of situation, from vertical and horizontal slider bars to progress bars, dropdowns, and boolean 'click me' buttons. You can find all of these features here. For the sake of simplicity, I have included three slider bars to demonstrate these features.

Notably, as an expansion of the example above, I have to run scipy.optimize.minimize multiple times whenever inputs are recalibrated (for questions on this end, I am more than happy to explain PMP if you contact me directly). To do this, I have to embed multiple functions into a larger function. However, I have not noticed any considerable slowdowns when running this much code.

To carry out the entirety of the example above, I have almost 300 lines of code to fully encapsulate a 3-crop allocation problem. For simplicity's sake, I defined the function 'pmp_example_sliders' to contain all of the code required to calibrate the model (the calibration calculates the input parameters) and produce the graphic above. Shadow prices are derived using scipy.optimize.minimize and used to create the parameters necessary for the marginal cost curve shown in this example. Note that the full model can be found in the GitHub file.

from ipywidgets import *
import scipy.optimize
%matplotlib inline
from numpy import *
import matplotlib.pyplot as plt

def pmp_example_sliders(corn_price=250, corn_initial_acres=200, x_axis_scale=350):
    ### Code for running the calibration and graphing the resulting figure ###
    # (calibration and plotting code omitted here for brevity; see the GitHub notebook)
    return plot

# create sliders for the three variables to be changed
# the function must take the changing parameters as arguments (shown above)
# note that the scale for corn_price runs from 10 to 500, stepping by 10
slider = interactive(pmp_example_sliders, corn_price=(10, 500, 10),
                     corn_initial_acres=(0.01, 400, 10), x_axis_scale=(30, 650, 10))

#display the resulting slider
display(slider)

The following graphic results with the aforementioned sliders:

[Screenshot: the interactive plot with the three sliders (interact_1.PNG)]

We can easily change the corn price input slider from 250 to 410 to produce the following results:

[Screenshot: results after changing the corn price slider to 410 (interact_2.PNG)]

Because of this interactive feature, we not only see how the MC and MR curves interact, but we can also observe that the Alpha Parameter (shown below the graph) becomes negative. While not important for this specific blog post, it shows where the crop allocation model produces negative marginal costs for low allocations of corn. Thus, utilizing interactive graphics can help highlight the dynamics and drawbacks/limitations of models, assisting in standalone tutorials.

If you would like to run this yourself, please feel free to download the Jupyter Notebooks from my GitHub. All code is written in Python 3. If you need assistance opening Jupyter Notebook, please refer to my blog post introducing Jupyter Notebook (link) or how to create a shortcut to launch Jupyter Notebook from your current working directory (link).

Google Dataset Search (beta): Revolutionize your Data Searches and Database Exposure

Google just released the beta version of its Dataset Search a couple of days ago (Dataset Search Link), allowing users to effectively and efficiently search publicly available databases. With the use of Google-fu (a quick graphic guide) or even regular Google skills, data sets that might be buried in a web page become much more readily discoverable.

Conversely, better standardization of your own datasets increases the likelihood that they will reach broader audiences in the near future. Google has set forth guidelines that you should consider following to ensure exposure.

Example Dataset Search

Say we're interested in finding potential databases that might contain information on crop allocation in California without being aware of any well-known sources (if you already have a database to access the data you need, just head there first). With a simple search, we can identify any potentially available dataset and explore its contents.

[Screenshot: an annotated example Dataset Search query (example_search_labeled)]

In this specific example, we were able to discover that the National Agricultural Statistics Service has a raster dataset called the Cropland Data Layer that shows annual crop-specific land cover for the entire continental United States, created using moderate-resolution satellite imagery and validated/corrected with agricultural ground truth. (Notably, CropScape has its own web-based GIS viewer, shown below.)

[Screenshot: the CropScape web-based GIS viewer (crop_example.PNG)]

Database Guidelines Overview

As previously mentioned, Google has established guidelines that will increase the likelihood that your dataset will be found via this new tool. Notably, they recommend that your site be marked up using JSON-LD (Microdata and RDFa are also supported). For help, use the standardized Schema.org vocabulary to ensure you have the proper setup for your website's markup (i.e. the HTML code describing the webpage that can be "read" by Google and web browsers). Dataset-specific vocabulary is shown here and also appears in examples in the Google Datasets Guide.

Google has created a Structured Data Markup Helper to help generate the necessary HTML code. They note that you can test your dataset markup using their testing tool, either by pasting the code snippet for your webpage or by fetching the URL itself (link to tool). If you need additional help, Google's forums are a valuable resource.

Establishing an Effective Data Backup Strategy for Your Workstation

When determining the data management strategy for your workflow, considering a range of backup options for your data beyond just a single copy on your workstation or your external hard drive is paramount. Creating a seamless workspace that easily transitions between workstations while maintaining durability and availability is achievable once you know what resources might be available and follow a general guideline.

General considerations for how you will be managing and sharing data are crucial, especially for collaborative projects where files must often be accessible in real time.

Considering how long you might need to retain data and how often you might need to access it will drastically change your approach to your storage strategy.

3-2-1 Data Backup Rule

If you walk away from this with nothing else, remember the 3-2-1 rule. The key to ensuring durability of your data (preventing loss due to hardware or software malfunction, fire, viruses, and institutional changes or uproars) is following the 3-2-1 rule: maintain three or more copies on two or more different media (e.g. cloud and HDD), with at least one off-site copy.

[Figure: the 3-2-1 backup rule. Source: https://cactus-it.co.uk/the-3-2-1-backup-rule/]

An example of this would be to have a primary copy of your data on your desktop that is backed up continuously via Dropbox and nightly via an external hard drive. There are three copies of your data between your local workstation, external hard drive (HD), and Dropbox. By having your media saved on hard drive disks (HDDs) on your workstation and external HD in addition to ‘the cloud’ (Dropbox), you have accomplished spreading your data across exactly two different mediums. Lastly, since cloud storage is located on external servers connected via the internet, you have successfully maintained at least one off-site copy. Additionally, with a second external HD, you could create weekly/monthly/yearly backups and store this HD offsite.

Version Control Versus Data Backup

Maintaining a robust version control protocol does not ensure your data will be properly backed up and vice versa. Notably, you should not be relying on services such as GitHub to back up your data, only your code (and possibly very small datasets, i.e. <50 MB). However, you should still maintain an effective strategy for version control.

  • Code Version Control
  • Large File Version Control
    • GitHub is not the place to be storing and sharing large datasets, only the code to produce large datasets
    • Git Large File Storage (LFS) can be used for a Git-based version-control on large files

Data Storage: Compression

Compressing data reduces the amount of storage required (thereby reducing cost), but ensuring the data’s integrity is an extremely complex topic that is continuously changing. While standard compression techniques (e.g. .ZIP and HDF5) are generally effective at compression without issues, accessing such files requires additional steps before having the data in a usable format (i.e. decompressing the files is required).  It is common practice (and often a common courtesy) to compress files prior to sharing them, especially when emailed.

7-Zip is a great open-source tool for standard compression file types (.ZIP, .RAR) and has its own compression file type. Additionally, a couple of guides looking into using HDF5/zlib for NetCDF files are located here and here.

Creating Your Storage Strategy

To comply with the 3-2-1 strategy, you must actively choose where you wish to back up your files. In addition to pushing your code to GitHub, choosing how to best push your files to be backed up is necessary. However, you must consider any requirements you might have for your data handling:

My personal strategy costs approximately $120 per year. For my workstation on campus, I primarily utilize DropBox with a now-outdated version control history plugin that allows for me to access files one year after deletion. Additionally, I instantaneously sync these files to GoogleDrive (guide to syncing). Beyond these cloud services, I utilize an external HDD that backs up select directories nightly (refer below to my script that works with Windows 7).

It should be noted that Cornell could discontinue its contracts with Google so that unlimited storage on Google Drive is no longer available. Additionally, it is likely that Cornell students will lose access to Google Drive and Cornell Box upon graduation, rendering these options impractical for long-term or permanent storage.

  • Minimal Cost (Cornell Students)
    • Cornell Box
    • Google Drive
    • Local Storage
    • TheCube
  • Accessibility and Sharing
    • DropBox
    • Google Drive
    • Cornell Box (for sharing within Cornell, horrid for external sharing)
  • Minimal Local Computer Storage Availability (access via web interface/cloud storage or File Explorer)
    • DropBox
    • Google Drive (using Google Stream)
    • Cornell Box
    • TheCube
    • External HDD
  • Reliable (accessibility through time)
    • Local Storage (especially an external HDD if you will be relocating)
    • Dropbox
    • TheCube
  • Always Locally Accessible
    • Local Storage (notably where you will be utilizing the data, e.g. keep data on TheCube if you plan to utilize it there)
    • DropBox (with all files saved locally)
    • Cornell Box (with all files saved locally)
  • Large Capacity (~2 TB total)
    • Use Cornell Box or Google Drive
  • Extremely Large Capacity (or unlimited file size)

Storage Option Details and Tradeoffs

Working with large datasets can be challenging to do between workstations, changing the problem from simply incorporating the files directly within your workflow to interacting with the files from afar (e.g. keeping and utilizing files on TheCube).

But on a personal computer level, the most significant differentiator between storage types is whether you can (almost) instantaneously update and access files across computers (cloud-based storage with desktop file access) or if manual/automated backups occur. I personally like to have a majority of my files easily accessible, so I utilize Dropbox and Google Drive to constantly update between computers. I also back up all of my files from my workstation to an external hard drive just to maintain an extra  layer of data protection in case something goes awry.

  • Requirements for Data Storage
  • Local Storage: The Tried and True
    • Internal HDD
      • Installed on your desktop or laptop
      • Can most readily access data for constant use, making interactions with files the fastest
      • Likely the most at-risk version due to potential exposure to viruses in addition to nearly-constant uptime (and bumps for laptops)
      • Note that Solid State Drives (SSDs) do not have the same lifespan for the number of read/write as an HDD, leading to slowdowns or even failures if improperly managed. However, newer SSDs are less prone to these issues due to a combination of firmware and hardware advances.
      • A separate data drive (a secondary HDD that stores data and not the primary operating system) is useful for expanding easily-accessible space. However, it is not nearly as isolated as data contained within a user’s account on a computer and must be properly configured to ensure privacy of files
    • External Hard Drive Disk (HDD)
      • One-time cost ($50-200), depending on portability/size/speed
      • Can allow for an offline version of your data to be stored, preventing newly introduced viruses (e.g. ransomware) from blocking access to or corrupting older versions; requires isolation from your workflow
      • May back up data instantaneously or as often as desired: general practice is to back up nightly or weekly
      • Software provided with external hard drives is generally less effective than self-generated scripts (e.g. Robocopy in Windows)
      • Unless properly encrypted, can be easily accessed by anyone with physical access
      • May be used without internet access, only requiring physical access
      • High quality (and priced) HDDs generally increase capacity and/or write/read speeds
    • Alternative Media Storage
      • Flash Thumb Drive
        • Don’t use these for data storage, only temporary transfer of files (e.g. for a presentation)
        • Likely to be lost
        • Likely to malfunction/break
      • Outdated Methods
        • DVD/Blu-Ray
        • Floppy Disks
        • Magnetic Tapes
      • M-Discs
        • Requires a Blu-Ray or DVD reader/writer
        • Supposedly lasts multiple lifetimes
        • 375 GB for $67.50
  •  Dropbox
    • My experience is that Dropbox is the easiest cloud-storage solution to use
    • Free Version includes 2 GB of space without bells and whistles
    • 1 TB storage for $99.00/year
    • Maximum file size of 20 GB
    • Effective (and standard) for filesharing
    • 30-day version history (extended version history for one year can be purchased for an additional $39.00/year)
    • Professional, larger plans with additional features (e.g. collaborative document management) also available
    • Can easily create collaborative folders, but storage counts against all individuals added (an issue if individuals are sharing large datasets)
    • Can be used through both a web interface and desktop clients across operating systems
    • Fast upload/download speeds
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • Google Drive
    • My experience is that Google Drive is relatively straightforward
    • Unlimited data/email storage for Cornell students, staff, and faculty
    • Costs $9.99/mo for 1 TB
    • Maximum file size of 5 GB
    • Easy access to G Suite, which allows for real-time collaboration on browser-based documents
    • Likely to lose access to storage capabilities upon graduation
    • Google Drive is migrating over to Google Stream which stores less commonly used files online as opposed to on your hard drive
    • Google File Stream (used to sync files with desktop) requires a constant internet connection except for recently-used files
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup
  • Cornell Box
    • My experiences are that Cornell Box is not easy to use relative to other options
    • Unlimited storage space, 15 GB file-size limit
    • Free for Cornell students, staff, and faculty, but alumni lose access once graduating
    • Can only be used for university-related activities (e.g. classwork, research)
    • Sharable links for internal Cornell users; however, it is very intrusive to access files for external users (requires making an account)
    • Version history retains the 100 most recent versions for each file
    • Can connect with Google Docs
    • Previous version control can allow access to previous versions if ransomware becomes an issue
    • Supports two-factor authentication
    • Requires internet access for online storage/backup, but has offline access
  • TheCube

Long-Term (5+ Years) Data Storage

It should be noted that most local media types degrade over time. Utilizing the 3-2-1 strategy is most important for long-term storage (with an emphasis on multiple media types and off-site storage). Notably, even if stored offline and never used, external HDDs, CDs, and Blu-Ray disks can only be expected to last around five years at most. Other media, such as magnetic tapes (10 years) or floppy disks (10-20 years), may last longer, but there is no truly permanent storage strategy (source of lifespans).

M-Discs are a write-once (i.e. read-only, cannot be modified) storage strategy that is projected to last many lifetimes and up to 1,000 years. If you're able to dust off an old Blu-Ray disk reader/writer, M-Discs are the long-term data strategy most likely to survive the test of time; making two copies stored in two locations is definitely worthwhile. However, the biggest drawback is that M-Discs are relatively difficult to access compared to plugging in an external HD.

Because of the range of lifespans and how cheap storage has become, I would recommend maintaining your old (and likely relatively small) data archives within your regular storage strategy which is likely to migrate between services through time.

For larger datasets that you are required to retain and would like to easily access, I would maintain them on at least two offline external hard drives stored in separate locations (e.g. at home and at your office), while occasionally (i.e. every six months) checking the health of the hard drives in perpetuity and replacing them as required.

Relying only on cloud storage for long-term storage is not recommended due to the possibility of companies closing their doors or simply deactivating your account. However, they can be used as an additional layer of protection in addition to having physical copies (i.e. external HD, M-Discs).

Windows 7 Robocopy Code

The script I use for backing up specific directories from my workstation (Windows 7) to an external HD is shown below. To set up my hard drive, I first formatted it to a format compatible with multiple operating systems using this guide. Note that maximum file size and operating system requirements dictate different formats. Following this, I used the following guide to implement a nightly backup of all of my data while keeping a log on my C: drive. Note that only new files and new versions of files are copied over, ensuring that the backup does not take ages.

@echo off
robocopy C:\Users\pqs4\Desktop F:\Backups\Desktop /E /XA:H /W:0 /R:3 /REG > C:\externalbackup.log
robocopy E:\Dropbox F:\Backups\Dropbox /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Downloads F:\Backups\Downloads /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4\Documents F:\Backups\Documents /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy C:\Users\pqs4 F:\Backups\UserProfile /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log
robocopy "E:\Program Files\Zotero" F:\Backups\Zotero /E /XA:SH /W:0 /R:3 /REG /XJ >> C:\externalbackup.log

Tips for Creating Animated Graphics (GIFs)

Creating a silent movie to tell a narrative is a powerful yet challenging endeavor. Perhaps with inspirations to create an eye-catching graphic for your social media followers or to better visualize intertemporal and interspatial changes, one can utilize GIFs to best meet their goals. However, with great (computational functionality and) power comes great responsibility. Creating a clean, minimalistic, and standalone graphic that quickly conveys information is key, so knowing what tips to follow along with a few linked tutorials will be a good starting point for creating your own effective animated visual.

I recently utilized a GIF in a presentation to show changes in perennial crop coverage in Kern County, CA alongside changes in drought intensity. While the GIF (shown at the end) wasn’t perfect, general tips for animation creation along with a quick critique to point out specific issues in the generated animation can help you when you plan out the creation of your own animation.

Considerations and Tips for Image Quality Control

Making an effective static graphic that easily conveys information without being overwhelming is tough, but creating a GIF can lead to disaster even faster than the blink of an eye. The most straightforward way to create your animation is to generate an incremental series of individual images that you will combine into a short movie.

The main considerations for creating an effective movie visual are ensuring it is readable, engaging, and informative. The first question you should ask is whether the story you want to tell can be fully and effectively explained with a static image. If it can be, stop and make it instead of wasting your time on creating an animation.

Assuming you’ve passed the first step, it’s time to storyboard your visual. Draw out what you’re wanting to show so you have a template to follow. (If you find that what you’re wanting to animate can be effectively told with a static image, please revisit the previous paragraph.)

Much like a paper, there should be an introduction or orienting feature that allows the consumer to easily understand the central idea of the graphic at a glance, in a standalone setting. Driving home this central theme means the visual can appear randomly in public without fear of it lacking the caption that would otherwise tell the full story.

Here is an example of a well-made, informative, and engaging GIF that shows how to make effective tables:

[GIF: an animated walkthrough of how to improve a data table]

 

It is clear and to the point, stepping the audience through a range of steps in a clear and entertaining manner. It is well labeled, showing the author’s website, and is very concise in its messaging. Furthermore, it has been very memorable for me after initially seeing it on Twitter years ago.

This website has plenty of examples of effective animations that you could use for inspiration. Additionally, for those Pinterest users out there, DataGIFs has plenty of good examples. On Reddit, plenty of beautiful (and ugly) animated and static graphics alike appear on /r/dataisbeautiful. If you have a narrative to tell, similar animations likely already exist that you can use as a base for conveying your tale.

With the big picture stuff out of the way, the next few points to consider can make or break your animation. While not every tip applies to every situation, covering them helps ensure your visual doesn’t have any glaring issues.

  • Looping and Timing
    • Make sure the timing of the gif isn’t too long or short.
      • 20 seconds is generally the maximum recommended length for audience engagement
      • Make sure there is enough time on each component of the GIF to be understood within two to three loops
    • Ensure the individual frames can be read in a timely fashion
    • Don’t have a jarring transition from the end of the GIF back to the start
      • Make sure the first and last frames match up in style
      • You can pull the user away from the original frame, but bring them back in since they’ll likely repeat the loop two to three times
      • Consider starting with a base slide
      • Individually add features so individuals are not overwhelmed
      • Much like a slide presentation, if the reader is overwhelmed with too much information from the start, they are likely not going to understand the progression of the story being conveyed
    • Slow is boring and may cause people to prematurely end their engagement with the animation
  • Content Layout
    • Consider the story you want to portray
    • Much like any figure, ensuring the animation can stand alone and be posted online while out of context without killing its meaning is essential
    • Select a “master frame” that is descriptive enough to be the “title slide” of the GIF—this is what will be seen when your slide deck is printed
  • Image Manipulation
  • Consistent sizing and coloring
    • Ensure labels are readable and are in the same location through all of the images. Having labels twitching between locations is distracting
    • Framing of the edges of objects should not drift. If making a transition between different sizing (e.g. zooming out), ensure it is incremental and not shocking
    • Customize components of a graphic to avoid bland experiences
  • Effective colors
    • Unless you're artistically talented (i.e. not me), stick to generated color palettes. ColorBrewer is a good resource
    • Consider what your image might look like when printed out—try to make sure it converts well to grayscale. Note that generally the first image of a GIF’s sequence is what will be printed
    • Watch out for colorblind individuals. They are people too and are worth catering to (use this website to ensure your images are compatible)
    • The fewer, the better: selection of individual colors to make certain messages “pop” is worthwhile
  • Fonts and Labels
    • Give credit to yourself within the graphic
    • Make sure that all words can be read at a glance
    • If you plan to post the gif online, make sure individuals can get a decent idea of what’s going on before they blow it up
    • Assume your image will be posted on twitter: consider making a dummy (private) account to make sure everything important is readable from the preview along with the actual gif being readable from a regular smartphone
    • If distributing online, ensure the animation can be linked to a central website for citations/narratives/methods
  • Resolution and Size
    • Bigger is rarely better. Try to limit your size to around 1 MB max
    • Size your GIF based on your platform. The size is not nearly as important if you’re using it in a presentation you keep on a jump drive compared to if you’re going to upload it online.
    • Test your visual on multiple platforms (i.e. PowerPoint on another computer after moving the .ppt file) prior to use.
    • If posting something that will be shared on Twitter, ensure the narrative of the GIF can be conveyed when consumed on a mobile device. (~80% of all Twitter interactions come from a mobile device)
  • Creating the GIF
    • Make sure your images aren’t too high of a resolution going into GIMP. It turns out that bigger isn’t always better
    • I have run into issues with Python-based libraries creating GIFs with artifacts.
      • imageio is commonly considered the best option for creating GIFs in Python due to its compatibility with PIL/Pillow (source)
    • I am a firm believer in spending a bit of extra time to follow this tutorial using GIMP to create your GIF
      • Photoshop is the standard (tutorial), but it is not open source
    • This post does a good job of showing how to make animated GIFs in R

Crop-Coverage Animation Critique

As mentioned above, I created a GIF to show changes in perennial crop coverage in Kern County, CA alongside changes in drought intensity. Each frame of the map was created in ArcGIS (which will eventually be migrated over to Python using Julie's tutorial) and exported as a PNG, while the graph along the bottom is a screenshot from the U.S. Drought Monitor. These individual images were combined, a black rectangle overlaying the graph was added, and the result was saved as a single image for each given year. I then used GIMP to combine, order, and create a GIF (tutorial link).

[GIF: perennial crop coverage in Kern County and drought intensity, 2000-2017 (2000_to_2017_export2.gif)]

The Good:

  1. The timing of the picture sequence (~650 ms between frames) makes for easy reading
  2. The box highlighting the given year’s drought conditions along the bottom graphic is effective at pointing out current conditions
  3. Coupling the moving box with a clear title indicating the current year shows a natural progression through time
  4. It is relatively straightforward to see where in California Kern County is located
    • The map includes a scale and north arrow to assist with orientation
    • The pop-out box for California isn’t very intrusive
  5. The title is large and easy to read, with the given year being clearly indicated
  6. The data for the drought monitor is labeled as required by the source

The Bad:

  1. The title shifts a few pixels in a couple of frames
  2. The graph does not stretch to match the size of the entire GIF, making the graphic feel unbalanced
  3. There is too much white space in random locations
    • The map itself could have been zoomed in a bit closer to leave out a lot of the white space in the eastern part of the county
    • Having too much white space on the right side of the map makes the graphic feel even further imbalanced
  4. There is not a website on the GIF to indicate the author/creator (me)

The Ugly:

  1. The choice in color schemes on the map jump out in a bad way
    • The districts drawn on the background map cannot be easily differentiated
    • Using red as a plotted color can be a bit jarring and is not colorblind friendly
  2. The labels in the legend hop around and are not consistent for three (3) of the years
    • The “master” or initial blank slide only has three categories while the rest of the slides have four
  3. The labeling for the graph is not clear (title, axes labels)
  4. The axis labels are blurry and small

Code showing how to make the drought monitor graphic (with updates) is posted at this GitHub link. Note that the images were compiled into a GIF using GIMP. The revised graphic is shown below.

[GIF: revised crop coverage and drought animation (kern_drought_upload_blog_post.gif)]

High definition images can be found on imgur here.

Setting Up and Customizing Python Environments using Conda

Typing 'python' into your command line launches the default global Python environment (which you can change by editing your path), which includes every package you've likely installed since the dawn of man (or since you adopted your machine).

But what happens when you are working between Python 2.7 and Python 3.x due to collaboration, using Python 3.4 because the last time you updated your script was four years ago, collaborating with others and want to ensure reproducibility and compatible environments, or banging your head against the wall because that one Python library installation is throwing up errors (shakes fist at PIL/Pillow)?

Creating Python environments is a straightforward solution to save you headaches down the road.

Python environments are a topic that many of us have feared through the years due to ambiguous definitions filled with waving hands. An environment is simply the domain in which users run software or scripts. By the same logic, a Python environment is the domain in which all of the Python packages are installed and where a user (you!) executes a script (usually through an IDE or Terminal/Command Prompt).

However, different scripts will work or fail in different environments. To avoid having to use all of these packages at once or having to completely reinstall Python, what we want to do is create new, independent Python environments. Applications of these environments include:

  • Have multiple versions of Python (e.g. 2.7 and 3.4 and 3.6) installed on your machine at once that you can easily switch between
  • Work with specific versions of packages and ensure they don’t update for the specific script you’re developing
  • Allow for individuals to install the same, reproducible environment between workstations
  • Create standardized environments for seamless collaboration
  • Use older versions of packages to utilize outdated code

Creating Your First Python Environment

One problem that recently arose in Ithaca was that someone crunching towards deadlines could only run PIL (Python Imaging Library) on their home machine and not on their desktop on campus due to package installation issues. This individual had the following packages they needed to install while using Python 2.7.5:

  • PIL
  • matplotlib
  • numpy
  • pandas
  • statsmodels
  • seaborn

To start, let's first create an environment! To do this, we will be using Conda (install Anaconda for new users, or Miniconda for anyone who doesn't want their default Python environment to be jeopardized; if you want to avoid using Conda, feel free to explore Pipenv). As a quick note on syntax, I will be running everything in Windows 7, and every command I am using can be found on the Conda Cheatsheet. Only slight variations are required for macOS/Linux.

First, with your Command Prompt open, type the following command to create the environment we will be working in:

conda create --name blog_pil_example python=2.7.5

[Screenshot: output of the conda create command (conda_create.PNG)]

At this point, a new environment titled blog_pil_example with Python 2.7.5 has been created. Congrats! Don’t forget to take screenshots to add to your new environment’s baby book (or just use the one above if it’s not your first environment).

From here, we need to activate the environment before interacting with it. To see which environments are available, use the following:

conda env list

Now, let’s go ahead and activate the environment that we want (blog_pil_example):

activate blog_pil_example

To leave the environment you’re in, simply use the following command:

deactivate

(For Linux and macOS, prefix these commands with 'source')

[Screenshot: list of environments, with the active one shown in parentheses (conda_env_list.PNG)]

We can see in the screenshot above that multiple other environments exist, but the selected/activated environment is shown in parentheses. Note that you're still navigating through the same directories as before; you're just selecting and running a different version of Python and set of installed packages when you're using this environment.

Building Your Python Environment

(Installing Packages)

Now onto the real meat and potatoes: installing the necessary packages. While you can use pip at this point, I've found that Conda has run into fewer issues over the past year. (Read into channel prioritization if you're interested in where package files are sourced from and how to change this.) As a quick back-to-basics, we're going to install one of the desired packages, matplotlib, using Conda (or pip). Using these tools ensures that the proper versions of the packages for your environment (i.e. your Python version and operating system) are retrieved; at the same time, all dependent packages will also be installed (e.g. numpy). Use the following command while in the environment and confirm that you want to install matplotlib:

conda install matplotlib

Note that you can specify a version much like how we specified the python version above for library compatibility issues:

conda install matplotlib=2.2.0

If you wish to remove matplotlib, use the following command:

conda remove matplotlib

If you wish to update a specific package, run:

conda update matplotlib

Or to update all packages:

conda update --all

Additionally, you can prevent specific packages from updating by creating a pinned file in the environment’s conda-meta directory. Be sure to do this prior to running the command to update all packages! 
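As a hedged example (the package and version below are purely illustrative), the pinned file is a plain-text file named 'pinned' placed inside the environment's conda-meta folder, with one package specification per line:

matplotlib 2.2.*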

After installing all of the packages that were required at the start of this tutorial, let’s look into which packages are actually installed in this environment:

conda list


Even though we explicitly installed only the required packages, Conda was kind enough to install all of their dependencies at the same time. Now you have a Python environment that you’ve created from scratch and developed into a hopefully productive part of your workflow.

Utilizing Your Python Environment

The simplest way to utilize your newly created Python environment is to run python directly in the Command Prompt shown above. You can run any script while this environment is activated (shown in the parentheses on the left of the command line) to utilize this setup!

If you want to use this environment in your IDE of choice, you can simply point the interpreter to this new environment. In PyCharm, you can easily create a new Conda Environment when creating a new project, or you can point the interpreter to a previously created environment (instructions here).
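
If you are ever unsure which interpreter a script is actually running under (whether launched from the activated Command Prompt or from your IDE), a quick sanity check from within Python helps; this is just a minimal sketch:

import sys
print(sys.executable)   # the path should point inside the blog_pil_example environment folder
print(sys.version)      # should report Python 2.7.5 for this example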

Additional Resources

For a good ground-up and more in-depth tutorial (with visualizations) on how Conda works, including directory structure and channel prioritization, please check out this blog post by Gergely Szerovay; it has been a major source of inspiration and knowledge for me.

If you’re looking for a great (and nearly exhaustive) source of Python packages (both current and previous versions), check out Gohlke’s webpage. To install these packages, download the file associated with your system (32/64 bit and your operating system), then use pip to install it (in Command Prompt, navigate to the folder where the .whl file is located, then type ‘pip install <file_name>’). I’ve found that installing packages this way sometimes allows me to step around errors I’ve encountered while using conda or pip directly.

You can also create environments for R. Check it out here.

If you understand most of the materials above, you can now claim to be environmentally conscious!

Launching Jupyter Notebook Using an Icon/Shortcut in the Current Working Directory Folder


A petty annoyance I’ve encountered when wanting to open Jupyter Notebook (overview) is that I couldn’t find a way to instantly open it in my current Windows Explorer window. One trick to open the Command Prompt in the current folder is to type ‘cmd’ in the navigation bar at the top of the Explorer window and press Enter/Return, but I wanted to create a shortcut or icon I could double-click in any given folder to open Jupyter Notebook in that same working directory.

This method allows you to drag-and-drop the icon you create into any folder and have it launch Jupyter Notebook from the new folder. It works for Windows 7, 8, and 10. Please feel free to let me know if you encounter any errors!

A great application for this shortcut may be to include this shortcut in GitHub folders where you wish to direct someone to launch Jupyter Notebook with minimal confusion. Just direct them to double-click on the icon and away they go!

Creating Your Own Jupyter Notebook Shortcut


To begin, you must have already installed Jupyter Notebook or JupyterLab. Next, navigate to the folder where you want to create your shortcut. Right-click, select ‘New’, then select ‘Shortcut’.


In the Windows Create Shortcut prompt, type the location of the item you want the shortcut icon to point to. In this case, we want to direct this shortcut to the Command Prompt and have it run the command that opens Jupyter Notebook. Copy/paste or type the following into the prompt:

cmd /k "jupyter notebook"

Note that cmd will change to the location of the Command Prompt executable file (e.g. C:\Windows\System32\cmd.exe), and ‘/k’ keeps the Command Prompt window open so that Jupyter Notebook does not immediately shut down. You can edit the command in the quotation marks to any command you want, but in this case ‘jupyter notebook’ launches an instance of Jupyter Notebook.

You can then save this shortcut with whatever name you wish!

At this point, double-clicking the shortcut will open Jupyter Notebook in a static default directory (e.g. ‘C:\Windows\system32’). To fix this, we need to ensure that this shortcut instead directs to the current working directory (the location of the shortcut).


Next, we need to edit the directory that the Command Prompt starts in. Right-click on your newly created icon and select ‘Properties’ at the bottom of the menu to open the shortcut’s properties window. One thing to note is that the ‘Target’ input is where our ‘location’ entry from above ended up.

At this point, change the ‘Start in:’ input (e.g. ‘C:\Windows\system32’) to the following:

%cd%

By changing this input, the Command Prompt no longer starts in a static default directory; it instead starts in the current working directory of the shortcut.

At this point, you’re finished! You can drag and drop this icon to any new folder and have Jupyter Notebook start in that new folder.

If you wish, you can download a copy of the shortcut from Dropbox. Note that for security reasons, most browsers, hosting services, and email services will rename the file from ‘jupyter_notebook_shortcut.lnk’ to ‘jupyter_notebook_shortcut.downloads’.

Many thanks to users on superuser for helping develop this solution!

Please let me know if you have any questions, comments, or additional suggestions on applications for this shortcut!


Jupyter Notebook: A “Hello World” Overview


Jupyter Notebook: Overview

When first learning Python, I was introduced to Jupyter Notebook as an extremely effective IDE for group-learning situations. I’ve since used this browser-based interactive shell for homework assignments, data exploration and visualization, and data processing. The functionality of Jupyter Notebook extends well past simple development and showcasing of code as it can be used with almost any Python library (except for animated figures right before a deadline). Jupyter Notebook is my go-to tool when I am writing code on the go.

As a Jupyter Notebook martyr, I must point out that Jupyter Notebooks can be used for almost anything imaginable. It is great for code-oriented presentations that allow for running live code, timing of lines of code and other magic functions, or even just sifting through data for processing and visualization. Furthermore, if documented properly, Jupyter Notebook can be used as an easy guide for stepping people through lessons. For example, check out the structure of this standalone tutorial for NumPy—download and open it in Jupyter Notebook for the full experience. In a classroom setting, Jupyter Notebook can utilize nbgrader to create quizzes and assignments that can be automatically graded. Alas, I am still trying to figure out how to make it iron my shirt.
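
As a small taste of those magic functions, timing a line of code in a notebook cell is as simple as prefixing it with %timeit; here is a minimal sketch using a throwaway computation:

%timeit sum(range(1000))    # reports timing statistics over many repeated runs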


A Sample Jupyter Notebook Presentation (credit: Matthew Speck)

One feature of Jupyter Notebook is that it can be run as a web application in a client-server structure, allowing users to interact remotely via ssh or http. In one example shown here, you can run Julia on a website even if it is not installed locally. Furthermore, you can use the Jupyter Notebook Viewer to share notebooks online. However, I have not yet delved into these areas.

For folks familiar with Python libraries through the years, Jupyter Notebook evolved from IPython and has overtaken its niche. Notably, it can be used with over 40 languages—the original intent was to create an interface for Julia, Python, and R, hence Ju-Pyt-R—including C++ and more. However, I have only used it for Python, and each notebook kernel runs a single language (although untested workarounds exist).

Installing and Opening the Jupyter Notebook Dashboard

While Jupyter Notebook comes standard with Anaconda, you can easily install it via pip or by checking out this link.
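
For reference, the pip route is typically a one-liner run from your terminal (a sketch; the jupyter metapackage pulls in the notebook server and its dependencies):

pip install jupyter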

As for opening and running Jupyter Notebook, navigate in your terminal (e.g. Command Prompt in Windows, Terminal in macOS) to the directory you want to work out of (in this case, I created a directory in my username folder titled ‘Example’) and run the command ‘jupyter notebook’.


Opening the Command Prompt

Once run, several lines of output appear in your terminal, but they are relatively unimportant. The most important part is being patient and waiting for the dashboard to open in your default web browser—all mainstream web browsers are supported, but I personally use Chrome.


If at any time you want to exit Jupyter Notebook, press Ctrl + C twice in your terminal to immediately shut down all running kernels (Windows and MacOS). Note that more than one instance of Jupyter Notebook can be running by utilizing multiple terminals.

Creating a Notebook

Once Jupyter Notebook opens in your browser, you will encounter the dashboard. All files and subdirectories will be visible on this page and can generally be opened or examined.


Initial Notebook Dashboard Without Any Files

If you want to create a shiny new Notebook to work in, click on ‘New’ and select a new Notebook in the language of your choice (shown below). In this case, only Python 3 has been installed and is the only option available. Find other language kernels here.


Opening a New Notebook

Basic Operations in Jupyter Notebook

Once opened, you will find a brand new notebook without a title or text. To edit the title, simply left-click on ‘Untitled’ and enter your name of choice.


Blank Jupyter Notebook

Writing code here is the same as writing a regular Python script in any given text editor, except that you can divide your code into separate cells that are run independently instead of re-running the entire script. However, you must run the cells that import libraries before running any cells that use those libraries.

To run code, simply press Shift + Enter while the caret—the blinking text cursor—is in the cell.
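
As a minimal sketch of how cells interact, consider a notebook with two cells; the first must be run before the second, since the second relies on the import:

# Cell 1: run this first (Shift + Enter) so the import is available to later cells
import numpy as np

# Cell 2: relies on Cell 1 having been run
data = [1, 2, 3, 4]
print("Hello, Jupyter!")
print(np.mean(data))   # prints 2.5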


Jupyter Notebook with Basic Operations

After running any code through a notebook, the file is automatically backed up in a hidden folder in your working directory. Note that you cannot directly open the notebook (IPYNB File) by double-clicking on the file. Rather, you must reopen Jupyter Notebook and access it through the dashboard.


Directory where Sample Jupyter Notebook Has Been Running

As shown below, you can easily generate and graph data inline. This is very useful for visualizing data and for tweaking a graphic (e.g. changing labels or colors). By default these graphics are not rendered at the same DPI as a saved image or GUI window, but this can be changed by modifying matplotlib’s rcParams.
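
As a rough sketch of what that looks like in practice (the data and styling here are only placeholders), a cell along these lines produces an inline histogram and bumps up the rendering resolution via rcParams:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

plt.rcParams['figure.dpi'] = 150      # raise the inline rendering resolution
data = np.random.randn(1000)          # placeholder data to histogram
plt.hist(data, bins=30)
plt.xlabel('Value')
plt.ylabel('Count')
plt.show()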


Example Histogram in Jupyter Notebook

Conclusion

At this point, there are plenty of directions you can proceed. I would highly suggest exploring some of the widgets available which include interesting interactive visualizations. I plan to explore further applications in future posts, so please feel free to give me a yell if you have any ideas.

Web Scraping: Constructing URLs, Downloading and Unpacking Zipped Files in Python and R

Introduction

Being able to automate data retrieval helps alleviate irritating, repetitive tasks. Oftentimes, these data files are separated by a range of times, site locations, or even type of measurement, making them cumbersome to download manually. This blog post outlines how to download multiple zipped csv files from a webpage using both R and Python.

We will specifically explore downloading historical hourly locational marginal pricing (LMP) data files from PJM, a regional transmission organization coordinating a wholesale electricity market across 13 Midwestern, Southeastern, and Northeastern states. The files in question are located here.

First, Using Python

A distinct advantage of using Python over R is being able to write fewer lines for the same results. However, the differences in the time required to execute the code are generally minimal between the two languages.

Constructing URLs

The URL for the most recent file on the webpage (August 2017) has the following format: “http://www.pjm.com/pub/account/lmpmonthly/201708-da.csv”
Notably, the URL is straightforward in its structure, which allows for a fairly simple approach to recreating it from scratch.

To begin, we can deconstruct the URL into three parts: the base URL, the date, and the file extension.
The base URL is fairly straightforward because it is constant: “http://www.pjm.com/pub/account/lmpmonthly/”
As is the file extension: “-da.csv” (although, as you will see below, we actually request the zipped version of each file, which simply swaps this for “-da.zip”)

However, the largest issue presented is how to recreate the date: “201708”
It is in the form “yyyymm”, requiring the reconstruction of a 4-digit year (luckily these records don’t go back to the first millennium) and a 2-digit month.

In Python, this is exceedingly easy to reconstruct using the .format() string method. All that is required is playing around with the format-specification mini-language within the squiggly brackets. Note that the squiggly brackets are inserted within the string in the following form: {index:modifier}
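
For example, zero-padding a single-digit month to two digits while leaving the year untouched looks like this:

"{0:}{1:02d}".format(2017, 8)    # returns '201708'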

In this example, we follow the format above and deconstruct the URL then reassemble it within one line in Python. Assigning this string to a variable allows for easy iterations through the code by simply changing the inputs. So let’s create a quick function named url_creator that allows for the input of a date in the format (yyyy, mm) and returns the constructed URL as a string:

def url_creator(date):
    """
    Input is list [yyyy, mm]
    """
    yyyy, mm = date

    return "http://www.pjm.com/pub/account/lmpmonthly/{0:}{1:02d}-da.zip".format(yyyy, mm)

To quickly generate a list of the dates we’re interested in, creating a list with entries in the format (yyyy, mm) for each of the months we’re interested in is easy enough using a couple of nested for loops. If you need alternative dates, you can easily alter this loop or manually create a list of lists to accommodate your needs:

#for loop method
dates = []
for i in range(17):
    for j in range(12):
        dates.append([2001 + i, j + 1])

#alternative manual method
dates = [[2011, 1], [2011, 2], [2011, 3]]

Retrieving and Unpacking Zipped Files

Now that we can create the URL for this specific file type by calling the url_creator function, and have all of the dates we may be interested in, we need to be able to access, download, unzip/extract, and subsequently save these files. Utilizing the urllib library, we can request and download each zipped file. Note that urllib.request.urlretrieve only retrieves the file and does not attempt to read it. While we could simply save the zipped file at this step, it is preferable to extract it now to prevent headaches down the line.

Utilizing the zipfile library, we can then extract the downloaded file into a specified folder. Notably, I use the os module’s getcwd method while extracting so that the resulting csv file lands in the directory where this script is running. Following this, the extracted file is closed.

import zipfile, os
from urllib.request import urlretrieve

for date in dates:
    # construct the URL for this [yyyy, mm] pair
    baseurl = url_creator(date)

    # download the zipped file to a temporary local file
    local_filename, headers = urlretrieve(url = baseurl)

    # extract its contents (the monthly csv) into the current working directory
    zip_ref = zipfile.ZipFile(file = local_filename, mode = 'r')
    zip_ref.extractall(path = os.getcwd())
    zip_ref.close()

At this point, the extracted csv files will be located in the directory where this script was running.

Alternatively, Using R

We will be using csv files to specify the date ranges instead of the for-loop date generation shown in the Python example above. The advantage of using csv files rather than indexing i=2001:2010 is that if you have a list of non-consecutive months or years, it is sometimes easier to just make a list of the years and cycle through all of the elements of that list rather than trying to figure out how to index the years directly. However, in this example, it would be just as easy to index through the years and months rather than the csv files.

This first block of code sets the working directory and reads in two csv files from the directory. The first csv file contains a list of the years 2001-2010. The second CSV file lists out the months January-December as numbers.

#Set working directory
#specifically specified for the directory where we want to "dump" the data
setwd("E:/Data/WebScraping/PJM/Markets_and_operations/Energy/Real_Time_Monthly/R")

#Create csv files with a list of the relevant years and months
years=read.csv("Years.csv")
months=read.csv("Months.csv")

The breakdown of the URL was previously shown. To recreate it in R, we will write a for-loop that cycles the link through all the years and months listed in the csv files. The indexes i and j iterate through the elements of the year and month csv files, respectively. (The if statement will be discussed at the end of this section.) We then use the paste function to assemble the entire link into a single string, stored as downloadurl. The parts of the link surrounded in quotes are unchanging. The as.character function is used to cycle in the elements from the csv files and turn them into characters. Finally, sep="" denotes that there should be no space separating the characters. The next two lines download the file into R as a temporary file.

#i indexes through the years and j indexes through the months
for (i in (1:16)){
  for( j in (1:12)){
      if (j>9){

    #dowload the url and save it as a temporary file

    downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),as.character(months[j,1]),"-da.zip",sep="")
    temp="tempfile"
    download.file(downloadurl,temp)

Next, we unzip the temporary file and find the csv inside it (which has a name of the form 200110-da.csv), reading it into the R environment as “x”. We then assign “x” a name built from the year and month (e.g. “200110”). Finally, it is written out as a csv file stored in the working directory, the temporary file is deleted, and the for-loop starts over again.

    #read in the csv into the global environment
    x=read.csv((unz(temp,paste(as.character(years[i,1]),as.character(months[j,1]),"-da.csv",sep=""))))
    #Assign a name to the csv "x"
    newname=paste(as.character(years[i,1]),as.character(months[j,1]),sep="")
    assign(newname,x)
    #create a csv that is stored in your working directory
    write.csv(x,file=paste(newname,".csv",sep=""))

    unlink(temp) #deletetempfile
      }

    #If j is between 1 and 9, the only difference is that a "0" has to be added
    #in front of the month number

Overcoming Formatting

The reason for the “if” statement is that it is not a trivial process to properly format the dates in Excel when writing the list of dates to a csv. When inputting “01”, the entry is simply changed to “1” when the file is saved as a csv. However, note that in the names of the files, the first nine months have a 0 preceding their number, so the URL construction must be modified for these months. The downloadurl has been changed so that a “0” is added before the month number if j < 10. Aside from the hard-coded zeroes, this block of code is the same as the one above.

else { 

        downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),"0",as.character(months[j,1]),"-da.zip",sep="")

        temp="tempfile"
        download.file(downloadurl,temp)

        #x=read.csv(url(paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1],as.character(months[j,1],sep=""))))
        x=read.csv((unz(temp,paste(as.character(years[i,1]),"0",as.character(months[j,1]),"-da.csv",sep=""))))
        newname=paste(as.character(years[i,1]),"0",as.character(months[j,1]),sep="")
        assign(newname,x)
        write.csv(x,file=paste(newname,".csv",sep=""))

        unlink(temp) #deletetempfile

}

Acknowledgements

Many thanks to Rohini Gupta for contributing to this post by defining the problem and explaining her approach in R.

Versions: R = 3.4.1 Python = 3.6.2
To see the code in its entirety, please visit the linked GitHub Repository.