Web Scraping: Constructing URLs, Downloading and Unpacking Zipped Files in Python and R

Introduction

Being able to automate data retrieval helps alleviate encountering irritating, repetitive tasks. Often times, these data files can be separated by a range of times, site locations, or even type of measurement, causing them to be cumbersome to download manually. This blog post outlines how to download multiple zipped csv files from a webpage using both R and Python.

We will specifically explore downloading historical hourly locational marginal pricing (LMP) data files from PJM, a regional transmission organization coordinating a wholesale electricity market across 13 Midwestern, Southeastern, and Northeastern states. The files in question are located here.

First, Using Python

A distinct advantage of using Python over R is being able to write fewer lines for the same results. However, the differences in the time required to execute the code are generally minimal between the two languages.

Constructing URLs

The URL for the most recent file from the webpage for August 2017 has the following format: “http://www.pjm.com/pub/account/lmpmonthly/201708-da.csv
Notably, the URL is straightforward in its structure and allows for a fairly simple approach in attempting to create this URL from scratch.

To begin, we can deconstruct the URL into three aspects: the base URL, the date, and the file extension.
The base URL is fairly straightforward because it is constant: “http://www.pjm.com/pub/account/lmpmonthly/
As is the file extension: “-da.csv”

However, the largest issue presented is how to approach recreating the date: “201708”
It is in the form “yyyymm”, requiring a reconstruction of a 4-digit year (luckily these records don’t go back to the first millennium) and a 2-digit month.

In Python, this is excessively easy to reconstruct using the .format() string modifying method. All that is required is playing around with the format-specification mini-language within the squiggly brackets. Note that the squiggly brackets are inserted within the string with the following form: {index:modifyer}

In this example, we follow the format above and deconstruct the URL then reassemble it within one line in Python. Assigning this string to a variable allows for easy iterations through the code by simply changing the inputs. So let’s create a quick function named url_creator that allows for the input of a date in the format (yyyy, mm) and returns the constructed URL as a string:

def url_creator(date):
    """
    Input is list [yyyy, mm]
    """
    yyyy, mm = date

    return "http://www.pjm.com/pub/account/lmpmonthly/{0:}{1:02d}-da.zip".format(yyyy, mm)

To quickly generate a list of the dates we’re interested in, creating a list of with entries in the format (yyyy, mm) for each of the month’s we’re interested in is easy enough using a couple of imbedded for loops. If you need alternative dates, you can easily alter this loop around or manually create a list of lists to accommodate your needs:

#for loop method
dates = []
for i in range(17):
    for j in range(12):
        dates.append([2001 + i, j + 1])

#alternative manual method
dates = [[2011, 1], [2011, 2], [2011, 3]]

Retrieving and Unpacking Zipped Files

Now that we can create the URL for this specific file type by calling the url_creator function and have all of the dates we may be interested in, we need to be able to access, download, unzip/extract, and subsequently save these files. Utilizing the urllib library, we can request and download the zipped file. Note that urllib.request.urlretrieve only retrieves the files and does not attempt to read it. While we could simply save the zipped file at this step, it is preferred to extract it to prevent headaches down the line.

Utilizing the urllib library, we can extract the downloaded files to the specified folder. Notably, I use the operating system interface method getcwd while extracting the file to save the resulting csv file into the directory where this script is running. Following this, the extracted file is closed.

import zipfile, urllib, os
from urllib.request import Request,urlopen, urlretrieve

for date in dates:
    baseurl = url_creator(date)

    local_filename, headers = urllib.request.urlretrieve(url = baseurl)
    zip_ref = zipfile.ZipFile(file = local_filename, mode = 'r')
    zip_ref.extractall(path = os.getcwd())     #os.getcwd() directs to current working directory
    zip_ref.close()

At this point, the extracted csv files will be located in the directory where this script was running.

Alternatively, Using R

We will be using an csv files to specify date ranges instead of for loops as shown in the Python example above. The advantage of using the csv files rather than indexing i=2001:2010 is that if you have a list of non-consecutive months or years, it is sometimes easier to just make a list of the years and cycle through all of the elements of the list rather than trying to figure out how to index the years directly. However, in this example, it would be just as easy to index through the years and month rather than the csv file.

This first block of code sets the working directory and reads in two csv files from the directory. The first csv file contains a list of the years 2001-2010. The second CSV file lists out the months January-December as numbers.

#Set working directory
#specifically specified for the directory where we want to "dump" the data
setwd("E:/Data/WebScraping/PJM/Markets_and_operations/Energy/Real_Time_Monthly/R")

#Create csv files with a list of the relevant years and months
years=read.csv("Years.csv")
months=read.csv("Months.csv")

The breakdown of the URL was previously shown. To create this in R, we will generate a for-loop that will change this link to cycle through all the years and months listed in the csv files. The index i and j iterate through the elements in the year and month csv file respectively. (The if statement will be discussed at the end of this section). We then use the “downloadurl” and “paste” functions to turn the entire link into a string. The parts of the link surrounded in quotes are unchanging and denoted in green. The “as.character” function is used to cycle in the elements from the csv files and turn them into characters. Finally, the sep=’’ is used to denote that there should be no space separating the characters. The next two lines download the file into R as a temporary file.

#i indexes through the years and j indexes through the months
for (i in (1:16)){
  for( j in (1:12)){
      if (j>9){

    #dowload the url and save it as a temporary file

    downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),as.character(months[j,1]),"-da.zip",sep="")
    temp="tempfile"
    download.file(downloadurl,temp)

Next, we unzip the temporary file and finds the csv (which is in the form: 200101-da.csv), reading in the csv file into the R environment as “x”. We assign the name “200101-da” to “x”. Finally, it is written as a csv file and stored in the working directory, the temporary file is deleted, and the for-loop starts over again.

    #read in the csv into the global environment
    x=read.csv((unz(temp,paste(as.character(years[i,1]),as.character(months[j,1]),"-da.csv",sep=""))))
    #Assign a name to the csv "x"
    newname=paste(as.character(years[i,1]),as.character(months[j,1]),sep="")
    assign(newname,x)
    #create a csv that is stored in your working directory
    write.csv(x,file=paste(newname,".csv",sep=""))

    unlink(temp) #deletetempfile
      }

    #If j is between 1 and 10, the only difference is that a "0" has to be added
    #in front of the month number

Overcoming Formatting

The reason for the “if” statement is that it is not a trivial process to properly format the dates in Excel when writing to a csv to create the list of dates. When inputing “01”, the entry is simply changed to “1” when the file is saved as a csv. However, note that in the name of the files, the first nine months have a 0 preceding their number. Therefore, the URL construct must be modified for these months. The downloadurl has been changed so that a 0 is added before the month number if j < 10. Aside from the hard-coded zeroes, the block of code is the same as the one above.

else { 

        downloadurl=paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1]),"0",as.character(months[j,1]),"-da.zip",sep="")

        temp="tempfile"
        download.file(downloadurl,temp)

        #x=read.csv(url(paste("http://www.pjm.com/pub/account/lmpmonthly/",as.character(years[i,1],as.character(months[j,1],sep=""))))
        x=read.csv((unz(temp,paste(as.character(years[i,1]),"0",as.character(months[j,1]),"-da.csv",sep=""))))
        newname=paste(as.character(years[i,1]),"0",as.character(months[j,1]),sep="")
        assign(newname,x)
        write.csv(x,file=paste(newname,".csv",sep=""))

        unlink(temp) #deletetempfile

}

Acknowledgements

Many thanks to Rohini Gupta for contributing to this post by defining the problem and explaining her approach in R.

Versions: R = 3.4.1 Python = 3.6.2
To see the code in its entirety, please visit the linked GitHub Repository.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s