# Using arcpy to calculate area-weighted averages of gridded spatial data over political units (Part 1)

Note: In order to run arcpy, you need an installation of ArcGIS

This is a two-part blog on using arcpy, the Python package for performing ArcGIS spatial analysis. Specifically, I will detail two methods of calculating area-weighted averages of gridded spatial data over political units.

Many of you know I’ve been working with several Cornell development economists and the Ethiopian Agricultural Transformation Agency (ATA) to develop a fertilizer profitability tool for targeting agricultural interventions (see more info here). For this project, we built a statistical crop yield model using publicly available soil and weather data, as well as household survey data from the Ethiopian Central Statistical Agency (CSA) Agricultural Sample Survey.  Because the Ag Sample Survey only reports the political unit in which a surveyed household is located, not its GPS coordinates, we needed to calculate area-weighted soil and weather data over these units.

The smallest political unit in Ethiopia is a “kebele,” and the next smallest, a “woreda,” (see maps below; the kebele map does not include the Somali region, which is too arid for agriculture). I allowed the user to choose either of these to process the data over, to see which resulted in a more predictive statistical crop yield model.  For this discussion, I’ve assumed averages are being calculated at the woreda level.

We obtained soil data from the African Soil Information Service (AfSIS) at 250 m resolution on more than 30 different soil characteristics, many of them at several different depths (see more info here). I retrieved the daily rainfall data at 0.1° resolution from NOAA here (with some help from Jon Herman!), and monthly temperature data at 0.5° resolution from NOAA here.

Because the soil and weather datasets were in different formats, they needed to be processed differently. In this part, I’ll discuss the soil data, which came in the form of tiffs covering all of Africa. These were the easiest to process since they came in an Arc GIS-readable format.

Soil Data Processing

The first step in processing the soil data is to import arcpy as ap and then turn on the Spatial Analyst Toolbox with the command ap.CheckOutExtension(“Spatial”). The data can be clipped to Ethiopia using the function ap.Clip_Management. The average soil data in each political unit can then be calculated with the function ap.sa.ZonalStatisticsAsTable. The “.sa” means it is using the Spatial Analyst Extension, which is why this needs to be turned on. More info on the Clip_Management and ZonalStatisticsAsTable arcpy methods can be found on ESRI’s ArcGIS Resources site, as well as through their application in my code below.

It should be noted that technically, Zonal Statistics as Table does not calculate an area-weighted average; it converts the polygon shapefile to a raster and then finds the average cell value of all raster cells in each polygon. This is very close to the weighted average if the cell size of the data being averaged is much smaller than the size of the political unit it is being averaged over. Since the soil data are at such a high resolution (250 m), this method was deemed sufficient. In part because the weather data was a much lower resolution, a different method was used to calculate area-weighted averages in that case.

All of my codes can be found on github, but the bulk of the processing is done in the function processSoilData.py, shown below. As you can see, I imported arcpy on line 2, turned on Spatial Analyst on line 17, clipped the data to Ethiopia on line 53, and found the average value in each political unit in line 58, but clearly there’s more to the code than that.

```import os
import arcpy as ap
from convertDBFtoCSV import convertDBFtoCSV
from joinSoilCSVs import joinSoilCSVs

def processSoilData(AggLevel):
'''Calculates average soil characteristics at AggLevel = "Woreda" or "Kebele" and outputs them to WoredaSoilData.csv or KebeleSoilData.csv'''

#set the working directory
workingDir = os.getcwd()

#Turn on Spatial Statistics package and define field over which ZonalStatisticsAsTable will be calculated (Woreda or Kebele ID)
ap.CheckOutExtension("Spatial")
if AggLevel == 'Kebele':
in_zone_data = os.path.dirname(workingDir) + "\\Shapefiles\\Ethiopia Kebeles without Somali region.shp"
in_template_dataset = in_zone_data
zone_field = "KebeleID"
elif AggLevel == 'Woreda':
in_zone_data = os.path.dirname(workingDir) + "\\Shapefiles\\WoredasWGS1984.shp"
in_template_dataset = in_zone_data
zone_field = "WOREDANO_"

#Define the projection and change the working directory to the directory with all of the soil data folders
latLongRef = "Coordinate Systems\Geographic Coordinate Systems\World\WGS 1984.prj"
os.chdir(workingDir)
directories = [f for f in os.listdir(os.getcwd()) if os.path.isfile(f) == False]

for i in range(len(directories)):
#Find all the tiffs with soil data in each soil characteristic folder
os.chdir(workingDir + "\\" + directories[i])
filelist = os.listdir(os.getcwd())
tiffs = []
clipTiffs = []
for j in range(len(filelist)):
name = filelist[j]
if name[-8::] == '250m.tif':
tiffs.append(name)
elif name[-9::] == '_Clip.tif':
clipTiffs.append(name)

for j in range(len(clipTiffs)):
clipTiffs[j] = os.getcwd() + "\\" + clipTiffs[j]

for j in range(len(tiffs)):
in_raster = os.getcwd() + "\\" + tiffs[j]
out_raster = os.getcwd() + "\\" + tiffs[j][0:-4] + "_Clip.tif"
#Clip the tiffs to Ethiopia if they haven't been already
if out_raster not in clipTiffs:
ap.Clip_management(in_raster, "#", out_raster, in_template_dataset, "#", 1)

#Calculate Zonal Statistics of soil data at AggLevel
in_value_raster = out_raster
out_table = os.getcwd() + "\\" + tiffs[j][0:-4] + AggLevel + ".dbf"
ap.sa.ZonalStatisticsAsTable(in_zone_data, zone_field, in_value_raster, out_table)

#Convert the DBFs with all the AggLevel soil data to CSVs
#Change the working directory to the directory with all the soil data folders
os.chdir(workingDir)
directories = [f for f in os.listdir(os.getcwd()) if os.path.isfile(f) == False]

for i in range(len(directories)):
#Find all the DBFs with soil data in each soil characteristic folder
os.chdir(workingDir + "\\" + directories[i])
filelist = os.listdir(os.getcwd())
DBFs = []
for j in range(len(filelist)):
name = filelist[j]
if name[-10::] == AggLevel + '.dbf':
DBFs.append(name)

#Convert the DBF to a CSV
for j in range(len(DBFs)):
convertDBFtoCSV(os.getcwd() + "\\" + DBFs[j])

#Join the CSVs with all the AggLevel soil data into one CSV titled "WoredaSoilData.csv" or "KebeleSoilData.csv" depending on the AggLevel
joinSoilCSVs(AggLevel, workingDir)

return None
```

First, the data is downloaded from the ftp site using the function downloadSoilData.py, called on line 14. This function also unzips the data and stores the tiffs for each soil characteristic in a different directory (see code here for more details). Each directory can have multiple tiffs, since some of the soil characteristics are measured at several different depths. The tiffs are clipped to Ethiopia in lines 49-53 of processSoilData.py. The loop beginning on line 32 simply goes through all of the different directories, and the loop beginning on line 38 finds all of the clipped and unclipped tiffs in those directories.

The zonal statistics are calculated as described above and shown in lines 56-58 of processSoilData.py. Unfortunately, while the data is completely processed at this point, it’s stored in database files (.dbf) which can only be read by humans through ArcGIS or a database like Access. Each characteristic and layer is also stored in a separate .dbf, which is not very convenient. I converted all of these database files to comma separated files using the function convertDBFtoCSV.py (which still requires arcpy) on line 77 and then joined all of the csv’s of different soil characteristics and layers into one file using the function joinSoilCSVs.py on line 80 (which doesn’t require arcpy!). The loop beginning on line 65 simply goes through all of the soil directories, and the loop on line 70 finds all of the dbf’s in those directories.

The final result is a table where the first column stores the woreda or kebele ID, and the remaining columns store the average values of each soil characteristic at each measured depth. Here is a snippet: