Use the Python `cf` and `esgf-python-client` packages to interact with ESGF data portals

Working on climate change impact assessment often requires downloading a number of outputs from General Circulation Models (GCMs) to characterize plausible projected future conditions, as well as to use them in conventional top-down scenario analysis. These scenario data are usually publicly available for download and are stored in NetCDF format on data portals built on the Earth System Grid Federation (ESGF) architecture. Yet the procedure for downloading the data, which typically span a lengthy time horizon, can be very cumbersome and requires the following steps:

  • create a username and credentials on the ESGF data portal;
  • define the search entries to narrow down the results;
  • generate the *.wget file for batch downloading using the secured OpenDAP protocol;
  • run the *.wget file to download all the raw data onto the local disk.

The downloaded climate projection data usually come in NetCDF format on irregular coordinate systems (e.g., a rotated pole), segmented into short time periods, and are available only at a large continental scale, which is far more than a regional case study needs.

In this post, I would like to share some of my experience and tools for dealing with the ESGF data portal and for facilitating the process. The post builds on the concepts of NetCDF and the Python language, both of which have already been well introduced on our Water Programming blog. I therefore strongly suggest revisiting those posts in case of any doubt, also because I am an expert on neither NetCDF nor Python.

The first tool is the cf-python package, which implements comprehensive tools for reading, writing and processing NetCDF data. Personally, I have found it extremely useful for reading/writing NetCDF data, regridding, and extracting a sub-domain from the original dataset. More importantly, given an OpenDAP URL, all of the aforementioned operations can be done on the server side without explicitly downloading the original data to the local disk! So instead of downloading 10 GB of projection data for the whole of Europe, which takes an hour, you can download just the part specific to your study area within a couple of seconds, at a cost of a few MB.
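
To give a flavor of this server-side subsetting, here is a minimal sketch. The OpenDAP URL is a made-up placeholder, and the field index and dimension order will depend on your particular dataset:

import cf

# Hypothetical OpenDAP URL -- replace with one returned by your ESGF search
url = 'http://esgf-node.example.org/thredds/dodsC/some_dataset/tas_day.nc'

fields = cf.read(url)  # reads only the metadata; no bulk data transfer yet
tas = fields[0]        # pick the field holding the variable of interest
print(tas)             # inspect its shape and coordinate metadata

# Index-based subsetting: only this slice is fetched from the server
subset = tas.subspace[:, 40:45, 49:54]  # [time, lat, lon]
cf.write(subset, 'tas_subset.nc')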

The second tool is esgf-python-client. This package contains the API code for calling the ESGF Search API (whose documentation, by the way, has not been updated for quite a while). With the combined use of cf-python and this client, you are spared the username/password login, the tedious web interface of the ESGF search page (try here), and the downloading itself.
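
Before running the full example below, it can help to probe a node interactively. The following sketch (using the LLNL index node as an arbitrary choice) shows how to check how many datasets match a query and which facet values are available:

from pyesgf.search import SearchConnection

conn = SearchConnection('https://esgf-node.llnl.gov/esg-search', distrib=True)

ctx = conn.new_context(project='CORDEX', experiment='rcp45', variable='tas')
print(ctx.hit_count)               # total number of matching datasets
print(ctx.facet_counts['domain'])  # e.g., which CORDEX domains are available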

Below I explain in more detail, using the example of downloading European regional downscaling data for the RCP4.5 scenario from one of the ESGF data portals. The example is essentially a few lines of code with inline comments.

from pyesgf.logon import LogonManager 
from pyesgf.search import SearchConnection
import cf
import numpy as np

def main():
    ## ================== To be modified by user ================================================
    # Define the search keywords for the facet fields on the ESGF node.
    # Note: the supported facet fields can be identified using:
    # "http://<base_search_URL>/search?facets=*&distrib=false&limit=0", where
    # <base_search_URL> is the domain of the esgf-node, e.g., https://pcmdi.llnl.gov/esg-search

    esgf_node = 'https://esgf-node.ipsl.upmc.fr/esg-search'  # you may try a different data portal, as many of them seem to be under maintenance longer than they are in actual service

    Project = 'CORDEX'
    Time_frequency = 'day'  # 'month', '6h', '3h'
    Experiment = 'rcp45'    # 'rcp26', 'rcp85', 'historical'
    Variable = 'tas'        # 'pr' for precipitation
    Ensemble = 'r1i1p1'     # do not change
    Domain = 'EUR-44'       # EUR-44i, EUR-22, EUR-22i, EUR-11i

    # Define the indices of the study region within the original domain. Say the original
    # domain is 100 (lon) by 60 (lat) grid cells, and your study area falls between
    # cells 50 - 54 in longitude and 41 - 45 in latitude.
    idx_lon = np.arange(49, 54)  # 49 instead of 50, as numpy uses 0-based indexing
    idx_lat = np.arange(40, 45)

    ## =========================================================================================

    # Connect to the ESGF data portal using your OpenID and password. You may need a
    # single run of a *.wget file first to retrieve your personal credentials.
    lm = LogonManager()
    lm.logon_with_openid(your_openID, your_password)  # replace with your own OpenID and password

    conn = SearchConnection(esgf_node, distrib=True)

    # Make a query based on the facet fields.
    ctx = conn.new_context(project=Project,
                           time_frequency=Time_frequency,
                           experiment=Experiment,
                           variable=Variable,
                           ensemble=Ensemble,
                           domain=Domain)

    # ctx.hit_count  # show the total number of results found

    # Loop over the search results
    for ds in ctx.search():

        files = ds.file_context().search()  # use "len(files)" to see the total number of files found

        # Sort the URL list to get ascending timestamps, e.g., from 1/1/2005 to 12/31/2099
        url_list_sorted = sorted(f.opendap_url for f in files)
        filename_out = files[0].filename[0:-21]  # strip the 21-character date-range suffix, e.g., '_20060101-20101231.nc'

        data_list = []
        for each_url in url_list_sorted:
            meta_data = cf.read(each_url)  # read the metadata from the OpenDAP NetCDF URL
            # In some cases the returned field list contains two fields, and only
            # the second field holds the variable of interest
            if len(meta_data) > 1:
                data_list.append(meta_data[1])
            else:
                data_list.append(meta_data[0])

        # Aggregate over time: concatenate the files covering short time periods into a
        # single field with the complete time series (cf.aggregate returns a field list,
        # which should hold exactly one field after aggregation)
        dataset_agg = cf.aggregate(data_list)[0]

        # Slice the dataset: the .subspace attribute extracts the subset of the data
        if len(dataset_agg.shape) == 4:    # 4-D, e.g., [height, time, lat, lon]
            dataset_sliced = dataset_agg.subspace[0, :, idx_lat, idx_lon]
        elif len(dataset_agg.shape) == 3:  # 3-D, e.g., [time, lat, lon]
            dataset_sliced = dataset_agg.subspace[:, idx_lat, idx_lon]

        # Save the data in the desired format (e.g., using savemat from the 'scipy' package)
        cf.write(dataset_sliced, filename_out + '.nc', mode='w')

    lm.logoff()  # log off

if __name__ == "__main__":
    main()
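
As the last comment in the script hints, you may prefer a format other than NetCDF for further analysis. As one possible variation (not part of the script above; the file and variable names here are arbitrary), the sliced field could be dumped to a MATLAB file instead:

import cf
from scipy.io import savemat

# Read back the sliced file written above (in the script this would be filename_out + '.nc')
field = cf.read('tas_subset.nc')[0]

# field.array converts the field's data to a plain numpy array,
# which savemat stores under a variable name of your choosing
savemat('tas_subset.mat', {'tas': field.array})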

Sorry for the lengthy and poorly organized post, but I hope you will find it useful. If you have any doubts or suggestions, please feel free to contact me (likymice@gmail.com).
