Working with climate change impact assessment may often require us to download a number of outputs from Global Circulation Models (GCMs) to depict the plausible changes of projected future conditions, as well as to use them for conventional top-down scenario analysis. Usually those kinds of scenario data are public available for download and are stored as NetCDF format on the data portal using Earth System Grid Federation (ESGF) architecture. Yet the procedure of downloading those data, which typically span over a lengthy time horizon, may be very cumbersome and requires following steps:
- create the username and credential on ESGF data portal;
- define the search entries to narrow down the results;
- generate the
*.wgetfile for batch downloading using secured OpenDAP protocal ;
- run the
*.wgetfile to download all the raw data onto the local disk;
For the climatic projection data, usually the downloaded data are stored in NetCDF format with irregular coordinate systems (e.g., rotated pole), segmented in pieces of small time periods, and available only for large continental scale, which is, however, unnecessary for regional case study.
In this post, I would like to share some of my experience and tools to tackle with the ESGF data portal, and facilitate the process. The context will encompass the concepts of NetCDF and Python language, both of which were already well introduced in our Water Programming blog. Therefore, I strongly suggest to revisit those posts in case of any doubt, and also because I am not an expert on neither NetCDF nor Python.
The first tool is called cf-python package, which implements comprehensive tools for reading, writing and processing NetCDf data. Personally I found it extremely useful for reading/writing NetCDF data, regridding and extracting the sub-domain from original dataset. More importantly, given the OpenDAP url all those aforementioned operations can be done on the server side without explicitly downloading the orignal data onto local disk! So instead of downloading a 10Gb projection data for entire Europe that takes 1 hour to finish, you can just download the part specific to your study area within a couple of seconds at a cost of a few Mb.
The second tool is named esgf-python-client. This package contains the API code for calling the ESGF Search API, whose documentation by the way has not been updated for quite a while. With the combined use of
cf-python and this client, you are saved from user/password login and dealing with tedious web interface of ESGF search page (try here) and downloading.
Below I explain more in details by using an example of downloading European regional downscaling data for RCP4.5 scenario from one of the ESGF data portal. The example is essentially a few lines of codes with inline comments (notice that some indents of lines may not show correctly in the post, so just copy paste and reformat them in your editor).
from pyesgf.logon import LogonManager from pyesgf.search import SearchConnection import cf import numpy as np def main(): ## ================== To be modified by user ================================================ # Define the search key words for facet fields in ESGF node. # Note: the supported facet fields can be identified using: # " http://<base_search_URL>/search?facets=*&distrib=false&limit=0 ", where # <base_search_URL> is the the domain of esgf-node, e.g., https://pcmdi.llnl.gov/esg-search esgf_node='https://esgf-node.ipsl.upmc.fr/esg-search' # you may try different data portal, as many of them are frequently under maintenance longer than their actual service time Project='CORDEX' Time_frequency='day' # 'month', '6h', '3h' Experiment='rcp45' # 'rcp25', 'rcp85', 'history' Variable='tas' # 'pr' for precipitation Ensemble='r1i1p1' # not change Domain='EUR-44' # EUR-44i, EUR-22, EUR-22i, EUR-11i # Define the index of domain for study region. Say original domain is 100 (lon) by 60 (lat) grids, and your study area falls between 50 - 54 in longitude and 41 to 45 in latitude. idx_lon = np.array( range(49,54) ) #49 instead of 50 as numpy array is 0-based indexing idx_lat = np.array( range(40,45) ) ## ========================================================================================= # Connect to ESGF data portal using your openid and password. May need a single run of *.wget file first to # retrieve personal credientials lm = LogonManager() lm.logon_with_openid(your_openID, your_password) conn = SearchConnection(esgf_node, distrib=True) # Make query based on the facet fields. ctx = conn.new_context( project=Project, time_frequency=Time_frequency, experiment=Experiment, variable=Variable, ensemble=Ensemble, domain=Domain ) # ctx.hit_count # show total number of results found # loop over the hit_count results for each_hit in range(0, ctx.hit_count): ds = ctx.search()[each_hit] files = ds.file_context().search() # use "len(files)" to see total files found url_list_sorted = sorted( [i.opendap_url for i in files] ) # url list sorted to get assending timestamp, e.g., from 1/1/2005 to 12/31/2099 filename_out = i.filename[0:-21] # get file name, the [0:21] means a subset of filename string counter = 0 data_list =  for each_url in url_list_sorted: counter += 1 meta_data = cf.read(each_url) # read the meta-data information from OpenDAP NetCDF url if len(meta_data) > 1: # in some cases, the returned meta_data contains two fields, i.e., len(meta_data)>1, and only the second field contains the variable of interest data_list.append(meta_data) else: data_list.append(meta_data) # Aggregating the time dataset_agg = cf.aggregate(data_list) # this will concatenate all files with small time period into a single file with complete time series # Slicing dataset if len(dataset_agg.shape) == 4: dataset_sliced = dataset_agg.subspace[0,:,idx_lat,idx_lon] # for 4-D , e.g., [height, time, lat, lon], .subspace() function is used to extract the subset of the data elif len(dataset_agg.shape) == 3: dataset_sliced = dataset_agg.subspace[:,idx_lat,idx_lon] # for 3-D , e.g., [time, lat, lon], .subspace() function is used to extract the subset of the data # Save data into desired format (e.g., using savemat from 'scipy' package) cf.write(dataset_sliced, filename_out+'.nc', mode='w') lm.logoff() # logoff if __name__ == "__main__": main()
Sorry for the lengthy and poor organized post, but I hope you will find it useful. For any doubts or suggestions, please feel free to contact me (email@example.com).