This post is dedicated to Chris Vernon and Travis Thurber from PNNL who taught me how to use these tools!
Lately, I’ve been generating synthetic weather scenarios for basins in California. The scenarios are created at a daily time step and are informed by tree-ring products that span 600 years. We want to develop multiple ensembles of these scenarios that are representative of plausible future climate changes the region could experience. It quickly became clear that I would need to store the data efficiently and be able to parse through it quickly to generate plots and compute metrics.
Below, I wanted to discuss two points that have really helped my workflow: (1) focusing on file structure and compression and (2) changing the way that I explore the resulting scenarios.
File choice and compression
To illustrate the importance of considering how you choose to save your data, I grab the historical trace of daily precipitation over the last 50 years for the Tuolumne River Basin and load it into R as a dataframe. Below I show the code that takes this dataframe and saves it in various formats.
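For context, here is a minimal sketch of what that dataframe and the accompanying date vector might look like; the date range and values are placeholders rather than the actual Tuolumne record (the dates vector is reused later in the netCDF example).

# Hypothetical stand-in for the historical record: a daily date vector and a
# one-column dataframe of basin-averaged precipitation in mm (placeholder values);
# precipitation sits in column 1 so that prcp_tuolumne[ , 1] below picks it up
dates <- seq(as.Date("1970-10-01"), as.Date("2020-09-30"), by = "day")
prcp_tuolumne <- data.frame(precipitation = rgamma(length(dates), shape = 0.2, scale = 10))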
# Traditional RDS
system.time(saveRDS(prcp_tuolumne, "E:/blog/prcp_tuolumne.rds"))

# Compressed RDS
system.time(saveRDS(prcp_tuolumne, "E:/blog/prcp_tuolumne_compressed.rds", compress = 'xz'))

# Text
system.time(write.table(prcp_tuolumne, "E:/blog/prcp_tuolumne.txt"))

# CSV
system.time(write.csv(prcp_tuolumne, "E:/blog/prcp_tuolumne.csv"))

# Parquet
library(arrow)
system.time(write_parquet(prcp_tuolumne, "E:/blog/prcp_tuolumne.parquet", compression = "gzip"))

# HDF5
library(rhdf5)
system.time(h5write(prcp_tuolumne, "E:/blog/prcp_tuolumne.h5", "df"))

# NetCDF
library(ncdf4)

# path and file name
ncpath <- "E:/blog/"
ncname <- "prcp_tuolumne"
ncfname <- paste(ncpath, ncname, ".nc", sep = "")

# define the time dimension and the precipitation variable
timedim <- ncdim_def("time", "days since 1950-11-1", as.numeric(dates))
fillvalue <- 1e32
dlname <- "observed precipitation"
prcp_def <- ncvar_def("prcp", "mm", list(timedim), fillvalue, dlname, prec = "double")

# create the netCDF file, write the precipitation array, and close the file
ncout <- nc_create(ncfname, list(prcp_def), force_v4 = TRUE)
system.time(ncvar_put(ncout, prcp_def, as.matrix(prcp_tuolumne[ , 1])))
nc_close(ncout)
Here, I summarize the resulting size and write times for each format.
| File Type | Size (KB) | Writing Time (seconds) |
Note that the original file size was on the smaller side, so this exercise may seem trivial, but when you are creating many ensembles of both precipitation and temperature, across hundreds of years, many basins, and many climate change scenarios, small files add up and can limit the scope of the experiment you can conduct if you run out of storage. Of these file types, one of my recent favorites has been Apache Parquet. Parquet is an open-source, column-oriented format that is specifically designed for efficient storage and accessibility of data. Although the compressed RDS and HDF5 files beat Parquet on size, they take much longer to write and subsequently read back in. Another advantage of Parquet is that it is supported in common languages (R, MATLAB, Python), which allows for a more seamless workflow between models written in different languages. If you have a little more storage to work with, Parquet is a good choice for balancing the size and writing-time tradeoffs.
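To check the read side of that tradeoff, a quick spot check might look like the following; the file names come from the write examples above, and the exact timings will depend on your machine and data.

# Read the compressed RDS and the Parquet file back in and compare timings
library(arrow)
system.time(prcp_from_rds <- readRDS("E:/blog/prcp_tuolumne_compressed.rds"))
system.time(prcp_from_parquet <- read_parquet("E:/blog/prcp_tuolumne.parquet"))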
Querying data with SQL and DuckDB
Once your data are in an efficient format like Parquet, the next order of business is to make sure that you can easily sort through them and, as Chris would say, “talk to your data and let it talk back”. One way to communicate with your data and ask questions in a more semantically meaningful way is to use Structured Query Language (SQL). There are many ways to query data in R, but a data management tool that I have enjoyed is DuckDB, a relational database management system. DuckDB has an R API and lets you use SQL syntax to query data efficiently. You can perform similar aggregation and subsetting with base R functions, but DuckDB lets you do these tasks faster and across multiple files that you may have in a directory.
Let’s use a case where you have developed 30 traces of daily basin-averaged synthetic precipitation for the Tuolumne based on the historical period, and each trace is stored in its own Parquet file (“ensemble1.parquetgzip”, “ensemble2.parquetgzip”, …, “ensemble30.parquetgzip”). Each ensemble looks something like this:
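For a quick look at one trace, you can read the first few rows directly; the column names (year, month, day, precipitation, and sample) are assumptions inferred from the queries later in the post, so treat this as a sketch.

# Peek at the first few rows of one ensemble file
library(arrow)
head(read_parquet("E:/NHMM/blog/ensemble1.parquet"))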
Now let’s state a question that I want to ask of the data and see how to structure it using SQL and DuckDB.
“Hey DuckDB, can you create a dataframe for me that provides an annual average across all my synthetic traces? Let’s plot that with respect to the observed annual average and show the bounds of the synthetic traces.”
First we need to open a connection to DuckDB. Then we are going to have to create annual averages from the daily historical and synthetic traces and then create max/min bounds to show the additional variability we are getting from the generator.
In order to answer this question, DuckDB will query all our parquet files simultaneously (!) and return a dataframe with annual averages, and we can do some clever sub-querying to get the max/min of those averages. I really enjoy that I don’t have to read the files into my R workspace. R holds objects in memory, so those of you who have worked with larger datasets may have found that you are limited in how many datasets you can have open at once, which can be frustrating! This is not an issue with DuckDB, which queries the files where they sit on disk.
# Historical trace: aggregate the daily record to annual averages
historical <- data.frame(dates.year, dates.month, dates.day, rowMeans(prcp_sacsma))
historical_yearly <- aggregate(historical, by = list(dates.year), FUN = mean)

library(duckdb)
library(DBI)

# open a connection to DuckDB
con <- dbConnect(duckdb::duckdb())

# query to find the annual average across all synthetic traces
df <- dbGetQuery(con, "SELECT year, AVG(precipitation) AS yearly_average
                       FROM 'E:/NHMM/blog/*.parquet'
                       GROUP BY year ORDER BY year")

# For the max/min, we need the per-trace annual averages first and then the
# max/min of those averages, in something like a nested approach.
# First create a dataframe with all files
all_files <- dbGetQuery(con, "SELECT * FROM 'E:/NHMM/blog/*.parquet'")

# register the dataframe as a DuckDB table and give it a name
duckdb::duckdb_register(con, "all_files_table", all_files)

annual_average <- dbGetQuery(con, "SELECT year, AVG(precipitation) AS yearly_average
                                   FROM all_files_table
                                   GROUP BY year, sample ORDER BY year")

# register the per-trace annual averages as a table
duckdb::duckdb_register(con, "annual_table", annual_average)

# query to find the max across traces
max_trace <- dbGetQuery(con, "SELECT year, MAX(yearly_average) AS max_yearly_average
                              FROM annual_table
                              GROUP BY year ORDER BY year")

# query to find the min across traces
min_trace <- dbGetQuery(con, "SELECT year, MIN(yearly_average) AS min_yearly_average
                              FROM annual_table
                              GROUP BY year ORDER BY year")

# Plot the results!
library(ggplot2)
library(ggridges)  # for theme_ridges()
ggplot(df, aes(x = year, y = yearly_average)) +
  geom_ribbon(aes(ymin = min_trace$min_yearly_average, ymax = max_trace$max_yearly_average),
              alpha = 0.2, fill = "blue") +
  geom_line(color = "blue", size = 2) +
  geom_line(data = basin.wide.yearly[3:64, ], aes(x = year, y = precipitation),
            color = "black", size = 2) +  # observed annual record
  ggtitle("Annual Average Precipitation") +
  xlab("Year") + ylab("Annual Average Precipitation (mm)") +
  theme_ridges()
Here’s the resulting figure:
“Hey DuckDB, from all of the synthetic traces that I have generated, how many instances are there where we are producing daily precipitation that is above 10 mm for specifically the years 1980 or 2000?”
In order to answer this question, DuckDB will once again query all our parquet files simultaneously and return a dataframe with all the instances of >10 mm and only for the years 1980 or 2000. This is how you would write this question in “code”.
library(arrow)
library(duckdb)
library(fs)
library(tidyverse)
library(DBI)
library(glue)

# open a connection to DuckDB
con <- dbConnect(duckdb::duckdb())

# return every daily value above 10 mm in the years 1980 or 2000,
# across all parquet files in the directory
df <- dbGetQuery(con, "SELECT * FROM 'E:/blog/*.parquet'
                       WHERE precipitation > 10 AND (year = 1980 OR year = 2000)
                       ORDER BY year")
If we look at the size of the resulting dataframe, it looks like we have generated 1335 instances of daily precipitation greater than 10 mm specifically in the years 1980 or 2000:
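If you want to check that count programmatically, the number of rows in the returned dataframe gives it directly:

# number of daily values above 10 mm in 1980 or 2000, across all traces
nrow(df)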
“Hey DuckDB, I want to think about meteorological drought. In how many instances am I half a standard deviation below the long-term average monthly precipitation?”
To try this out, let’s just look at one synthetic trace. First we need to find the long-term average across the trace, and then we can plot what a drought might look like.
# query to compute monthly averages for the first synthetic trace
scenario1 <- dbGetQuery(con, "SELECT month, year, AVG(precipitation) AS monthly_avg
                              FROM 'E:/NHMM/blog/ensemble1.parquet'
                              GROUP BY year, month")

# query to find the long-term mean
long_term_average <- dbGetQuery(con, "SELECT AVG(precipitation) AS monthly_avg
                                      FROM 'E:/NHMM/blog/ensemble1.parquet'")

# query to find the standard deviation
stdev <- dbGetQuery(con, "SELECT STDDEV(precipitation) AS monthly_stdev
                          FROM 'E:/NHMM/blog/ensemble1.parquet'")
We can then plot the synthetic trace (blue), mean (black), and a 1/2 standard deviation (dashed line) below the mean.
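A rough sketch of that plot, assuming the scenario1, long_term_average, and stdev objects from the queries above (the threshold is simply the mean minus half a standard deviation), might look like this:

# Build a plotting date from year/month and draw the trace, mean, and threshold
library(ggplot2)
scenario1$date <- as.Date(paste(scenario1$year, scenario1$month, 1, sep = "-"))
threshold <- long_term_average$monthly_avg - 0.5 * stdev$monthly_stdev

ggplot(scenario1, aes(x = date, y = monthly_avg)) +
  geom_line(color = "blue") +
  geom_hline(yintercept = long_term_average$monthly_avg, color = "black") +
  geom_hline(yintercept = threshold, linetype = "dashed") +
  xlab("Date") + ylab("Monthly Average Precipitation (mm)")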
Now we can count the number of months that are classified as drought across the synthetic trace. Based on the statistics calculated in the last block, I hard-coded the drought threshold, but in theory you can nest many functions within more complicated sub-queries.
# register the monthly averages as a DuckDB table and give it a name
duckdb::duckdb_register(con, "scenario_1", scenario1)

# count the months below the drought threshold (mean minus half a standard deviation)
instances <- dbGetQuery(con, "SELECT month, year FROM scenario_1
                              WHERE monthly_avg < 3.125291
                              GROUP BY year, month
                              ORDER BY year, month")
It looks like 115 of 325 months, or about 35% of the time, this particular synthetic trace is in drought. Yikes! But it makes for a good way to assess the vulnerability of the system to plausible future meteorological drought conditions.
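For the record, that fraction can be computed directly from the two query results above:

# share of months in this trace that fall below the drought threshold
nrow(instances) / nrow(scenario1)
# 115 months out of 325, or roughly 35%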
I hope that these approaches can be as useful to you as they have been for me!