Efficient hydroclimatic data accessing with HyRiver for Python

This tutorial highlights the HyRiver software stack for Python, which is a very powerful tool for acquiring large sets of data from various web services.

I have uploaded a Jupyter-Notebook version of this post here if you would like to execute it yourself.

HyRiver Introduction

The HyRiver software suite was developed by Taher Chegini who, in their own words, describes HyRiver as:

“… a software stack consisting of seven Python libraries that are designed to aid in hydroclimate analysis through web services.”

This description does not do justice to the capability of this software. Through my research I have spent significant amounts of time wrangling various datasets – making sure that dates align, or accounting for spatial misalignment of available data. The HyRiver suite streamlines this process, and makes acquisition of different data from various sources much more efficient.

Here, I am going walk through a demonstration of how to easily access large amounts of data (streamflow, geophysical, and meteorological) for a basin of interest.

Before going through the code, I will highlight the three libraries from the HyRiver stack which I have found most useful: PyGeoHydro, PyNHD, and PyDaymet.

PyGeohydro

PyGeoHydro allows for interaction with eight different online datasets, including:

In this tutorial, I will only be interacting with the USGS NWIS, which provides daily streamflow data.

PyNHD

The PyNHD library is designed to interact with the National Hydrography Dataset (NHD)and the Hydro Network-Linked Data Index (NLDI).

NHDPlus (National Hydrography Dataset)

The NHD defines a high-resolutioon network of stream linkages, each with a unique idenfier (ComID).

NLDI (Network-Linked Data Index)

The NLDI aids in the discovery of indexed information along some NHD-specified geometry (ComIDs). The NLDI essentially tranverses the linkages specified by the NHD geometry and generates data either local or basin-aggregated data relative to a specific linkage (ComID).

As will be seen later in the tutorial, the NLDI is able to retrieve at least 126 different types of data for a given basin…

PyDaymet

The PyDaymet GirHub repository summarizes the package as:

“[providing] access to climate data from Daymet V4 database using NetCDF Subset Service (NCSS). Both single pixel (using get_bycoords function) and gridded data (using get_bygeom) are supported which are returned as pandas.DataFrame and xarray.Dataset, respectively.”

Tutorial outline:

  1. Installation
  2. Retrieving USGS Water Data
  3. Retrieving Geophysical (NLDI) Data
  4. Retrieving Daymet Data

The HyRiver repository contains various examples demonstrating the use of the various libraries. I would definitely recommend digging in deeper to these, and other HyRiver documentation if this post piques your interest.

Step 0: Installation

In this tutorial, I only only interact with the PyNHD, PyGeoHydro, and PyDaymet libraries, so I do not need to install all of the HyRiver suite.

If you operate through pip, you can install these libraries using:

pip install pynhd pygeohydro pydaymet

If you use Anaconda package manager, you can install these packages using:

conda install -c conda-forge pynhd pygeohydro pydaymet

For more information on installation, refer to the HyRiver GitHub repository and related documentation.

Now, onto the fun part!

Step 1: Retreiving USGS Water Data

I am beginning here because streamflow data is typically the first point of interest for most hydrologic engineers or modelers.

Personally, I have gone through the process of trying to download data manually from the USGS NWIS website… My appreciation for the USGS prevents me from saying anything too negative, but let’s just say it was not a pleasant experience.

Pygeohydro allows for direct requests from the USGS National Water Information System (NWIS), which provides daily streamflow data from all USGS gages. The data is conveniently output as a Pandas DataFrame.

With this functionality alone, the PyGeoHydro library is worth learning.

1.1 Initialize PyGeoHydro NWIS tool

# Import common libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import the PyGeohydro libaray tools
import pygeohydro as gh
from pygeohydro import NWIS, plot

# Use the national water info system (NWIS)
nwis = NWIS()

1.2 Requesting USGS Streamflow Data

The get_streamflow() function does exactly as the name entails and will retrieve daily streamflow timeseries, however USGS gage station IDs must be provided. If you are only interested in a single location, then you can enter 8-digit gage ID number along with a specified date range to generate the data:

get_streamflow('########', dates = ('Y-M-D', 'Y-M-D'))

However, I am want to explore larger sets of data over an entire region. Thus, I am going to use PyGeoHydro's get_info() function to identify all gages within some region of interest.

First, I specify a region via (latitude, longitude) bounds, then I send a query which retrieves meta-data information on all the gages in the specified region. In this case, I am exploring the data available near Ithaca, NY.

# Query specifications
region = (-76.7, 42.3, -76, 42.6) # Ithaca, NY

# Send a query for all gage info in the region
query = {"bBox": ",".join(f"{b:.06f}" for b in region),
         "hasDataTypeCd": "dv",
         "outputDataTypeCd": "dv"}

info_box = nwis.get_info(query)

print(f'PyGeoHydro found {len(set(info_box.site_no))} unique gages in this region.')

# [Out]: PyGeoHydro found #N unique gages in this region.

Although, this info_box identify many gages in the region which have very old or very brief data records. Knowing this, I want to filter out data which does not have a suitable record length.

For the sake of this tutorial, I am considering data between January 1st, 2020 and December 31st, 2020.

# Specify date range of interest
dates = ("2020-01-01", "2020-12-31") 

# Filter stations to have only those with proper dates
stations = info_box[(info_box.begin_date <= dates[0]) & (info_box.end_date >= dates[1])].site_no.tolist()

# Remove duplicates by converting to a set
stations = set(stations)

Now, I am ready to use the gage IDs contained in stations to request the streamflow data!

# Retrieve the flow data
flow_data = nwis.get_streamflow(stations, dates, mmd=False)

# Remove gages with nans
flow_data = flow_data.dropna(axis = 1, how = 'any')

After removing duplicates and gages with nans, I have data from five unique gages in this region.

Additionally, PyGeoHydro has a convenient plotting feature to help quickly visualize the streamflow data.

from pygeohydro import plot

# Plot flow data summary
plot.signatures(flow_data)
Summary of flow data for the 5 gages found with PyGeoHydro.

There is a lot more to be explored in the PyGeoHydro library, but I will leave that up to the curious reader.

Step 2: Retrieving Geophysical (NLDI) Data

So, you’ve got some streamflow data but you don’t know anything about the physical watershed…

This is where the PyNHD library comes in. Using this library, I can identify entire upstream network from a gage, then extract the NLDI data associated with the watershed linkages.

# Import the PyNHD library
import pynhd as pynhd
from pynhd import NHD
from pynhd import NLDI, WaterData

First, we can take a look at all possible local basin characteristic data that are available:

# Get list of local data types (AKA characteristics, or attributes)
possible_attributes = NLDI().get_validchars("local").index.to_list()

There are 126 characteristics available from the NLDI! These characteristics range from elevation, to reservoir capacity, to bedrock depth. Many if these are not of immediate interest to me, so I will specify a subset of select_attributes to retrieve (basin area, max elevation, and stream slope).

I then loop through all of my USGS stations for which I have data in flow_data, identifying the upstream basin linkages using NLDI().navigate_byid(). Once the basin is identified, I extract the ComID numbers for each linkage and use that number to retrieve the NLDI data of interest. I then store the data in nldi_data. This process is done by the following:

# Specify characteristics of interest
select_attributes = ['CAT_BASIN_AREA', 'CAT_ELEV_MAX', 'CAT_STREAM_SLOPE']

# Initialize a storage matrix
nldi_data = np.zeros((len(flow_data.columns), len(select_attributes)))

# Loop through all gages, and request NLDI data near each gage
for i, st in enumerate(flow_data.columns):

    # Navigate up all flowlines from gage
    flowlines = NLDI().navigate_byid(fsource = 'nwissite',
                                    fid = f'{st}',
                                    navigation="upstreamTributaries",
                                    source = 'flowlines',
                                    distance = 10)

    # Get the nearest comid
    station_comid = flowlines.nhdplus_comid.to_list()[0]

    # Source NLDI local data
    nldi_data[i,:] = NLDI().getcharacteristic_byid(station_comid, "local", char_ids = select_attributes)

So far, I have timeseries streamflow data for five locations in the Ithaca, NY area, along with the basin area, max basin elevation, and stream slope for each stream. If I can access hydro-climate data, maybe I could begin studying the relationships between streamflow and physical basin features after some rain event.

Step 3: Meteorological data

The PyDaymet library allows for direct requests of meteorological data across an entire basin.

The available data includes:

  • Minimum and maximum temperature (tmin, tmax)
  • Precipitation (prcp)
  • Vapor pressure (vp)
  • Snow-Water Equivalent (swe)
  • Shortwave radiation (srad)

All data are reported daily at a 1km x 1km resolution. Additionally, the PyDaymet library has the ability to estimate potential evapotranspiration, using various approximation methods.

Here, I choose to only request precipitation (prcp) and max temperature (tmax).

NOTE:
So far, the Daymet data retrieval process has been the slowest aspect of my HyRiver workflow. Due to the high-resolution, and potential for large basins, this may be computationally over-intensive if you try to request data for many gages with long time ranges.

# Import the  PyDayment library
import pydaymet as daymet

## Specify which data to request
met_vars = ["prcp", "tmax"]
met_data_names = np.array(['mean_prcp','sd_prcp','mean_tmax','sd_tmax'])

## Initialize storage
daymet_data = np.zeros((len(flow_data.columns), len(met_data_names)))

Similar to the NLDI() process, I loop through each gage (flow_data.columns) and (1) identify the up-gage basin, (2) source the Daymet data within the basin, (3) aggregate and store the data in daymet_data.

## Loop through stations from above
for i, st in enumerate(flow_data.columns):

    # Get the up-station basin geometry
    geometry = NLDI().get_basins(st).geometry[0]

    # Source Daymet data within basin
    basin_met_data = daymet.get_bygeom(geometry, dates, variables= met_vars)

    ## Pull values, aggregate, and store
    # Mean and std dev precipitation
    daymet_data[i, 0] = np.nan_to_num(basin_met_data.prcp.values).mean()
    daymet_data[i, 1] = np.nan_to_num(basin_met_data.prcp.values).std()

    # Mean and std dev of max temperature
    daymet_data[i, 2] = np.nan_to_num(basin_met_data.tmax.values).mean()
    daymet_data[i, 3] = np.nan_to_num(basin_met_data.tmax.values).std()

daymet_data.shape

# [Out]: (5, 4)

Without having used a web-browsers, I have been able to get access to a set of physical basin characteristics, various climate data, and observed streamflow relevant to my region of interest!

Now this data can be exported to a CSV, and used on any other project.

Conclusion

I hope this introduction to HyRiver has encouraged you to go bigger with your hydroclimate data ambitions.

If you are curious to learn more, I’d recommend you see the HyRiver Examples which have various in-depth Jupyter Notebook tutorials.

Citations

Chegini, Taher, et al. “HyRiver: Hydroclimate Data Retriever.” Journal of Open Source Software, vol. 6, no. 66, 27 Oct. 2021, p. 3175, 10.21105/joss.03175. Accessed 15 June 2022.

Getting started with API requests in Python

This is an introductory blogpost on how to work with Application Programming Interfaces (APIs) in Python. Seeing as this is our blog’s first post on this topic I’ll spend some time explaining some basic information I’ve had to learn in the process of doing this and why it might be useful in your research. There are several blogs and websites with tutorials online and there’s no point repeating, but I’ll try to explain this for an audience like me, i.e., (a) has no formal training in computer science/web servers/HTTP but competent with scripting, and (b) interested in using this for research purposes.

What is an API?

Many sites (e.g., Facebook, Twitter, many many more) make their data available through what’s called Application Programming Interfaces or APIs. APIs basically allow software to interact, and send and receive data from servers so as to typically provide additional services to businesses, mobile apps and the like, or allow for additional analysis of the data for research or commercial purposes (e.g., collecting all trending Twitter hashtags per location to analyze how news and information propagates).

I am a civil/environmental engineer, what can APIs do for me?

APIs are particularly useful for data that changes often or that involves repeated computation (e.g., daily precipitation measurements used to forecast lake levels). There’s a lot of this kind of data relevant for water resources systems analysts, easily accessible through APIs:

I’m interested, how do I do it?

I’ll demonstrate some basic information scripts using Python and the Requests library in the section below. There are many different API requests one can perform, but the most common one is GET, which is a request to retrieve data (not to modify it in any way). To retrieve data from an API you need two things: a base URL and an endpoint. The base URL is basically the static address to the API you’re interested in and an endpoint is appended to the end of it as the server route used to collect a specific set of data from the API. Collecting different kinds of data from an API is basically a process of manipulating that endpoint to retrieve exactly what is needed for a specific computation. I’ll demo this using the USGS’s API, but be aware that it varies for each one, so you’d need to figure out how to construct your URLs for the specific API you’re trying to access. They most often come with a documentation page and example URL generators that help you figure out how to construct them.

The base URL for the USGS API is http://waterservices.usgs.gov/nwis/iv/

I am, for instance, interested in collecting streamflow data from a gage near my house for last May. In Python I would set up the specific URL for these data like so:

response = requests.get("http://waterservices.usgs.gov/nwis/iv/?format=json&indent=on&sites=04234000&startDT=2020-05-01&endDT=2020-05-31¶meterCd=00060&siteType=ST&siteStatus=all")

For some interpretation of how this was constructed, every parameter option is separated by &, and they can be read like so: data in json format; indented so I can read them more easily; site number is the USGS gage number; start and end dates; a USGS parameter code to indicate streamflow; site type ‘stream’; any status (active or inactive). This is obviously the format that works for this API, for a different API you’d have to figure out how to structure these arguments for the data you need. If you paste the URL in your browser you’ll see the same data I’m using Python to retrieve.

Object response now contains several attributes, the most useful of which are response.content and response.status_code. content contains the content of the URL, i.e. your data and other stuff, as a bytes object. status_code contains the status of your request, with regards to whether the server couldn’t find what you asked for or you weren’t authenticated to access that data, etc. You can find what the codes mean here.

To access the data contained in your request (most usually in json format) you can use the json function contained in the library to return a dictionary object:

data = response.json()

The Python json library is also very useful here to help you manipulate the data you retrieve.

How can I use this for multiple datasets with different arguments?

So obviously, the utility of APIs comes when one needs multiple datasets of something. Using a Python script we can iterate through the arguments needed and generate the respective endpoints to retrieve our data. An easier way of doing this without manipulating and appending strings is to set up a dictionary with the parameter values and pass that to the get function:

parameters = {"format": 'json', "indent": 'on',
              "sites": '04234000', "startDT": '2020-05-01',
              "endDT": '2020-05-31', "parameterCd":'00060',
              "siteType": 'ST', "siteStatus":'all'}
response = requests.get("http://waterservices.usgs.gov/nwis/iv/", 
                        params=parameters)

This produces the same data, but we now have an easier way to manipulate the arguments in the dictionary. Using simple loops and lists to iterate through one can retrieve and store multiple different datasets from this API.

Other things to be aware of

Multiple other people use these APIs and some of them carry datasets that are very large so querying them takes time. If the server needs to process something before producing it for you, that takes time too. Some APIs limit the number of requests you can make as a result and it is general good practice to try to be as efficient as possible with your requests. This means not requesting large datasets you only need parts of, but being specific to what you need. Depending on the database, this might mean adding several filters to your request URL to be as specific as possible to only the dates and attributes needed, or combining several attributes (e.g., multiple gages in the above example) in a single request.