Create ensemble average of CMIP5 data

The CMIP5 data for the next IPCC AR5 report are now available online and I wanted to use these data for a new project. I needed to download sea surface temperature data for the next 100 years and analyze the data. To start off I searched for the appropriate data I needed on the data portal “”. You need to make a user account to get access to download data. However, once that was done it was easy to find the data I was interested in. I started by downloading SST data (variable name = tos) for the RCP8.5 projections as well as the historical data using teh handy WGET functionality offered at the data portal.

Allowing the data to download over night I could start getting my head around them the next day. The first step in organizing the downloaded data is to collect all files from same model and model run into one file. Ususally the model results are available as one file per 5 or 10 years which gives you 10-20 files for 100 years.

I used the excellent NCO (NetCDF Operator) command line utilities to gather all of the files into one file with the ncrcat command. The following script will gather all of the seperate files for the models defined in the ‘models’ array into one file.

#!/bin/bash ## shell type
drc_in=( '/Users/trond/Projects/ESM-PP/SST/Historical/' )
var=( 'tos_Omon' ) ## variables
models=( 'GFDL-ESM2G' 'GFDL-ESM2M' 'HadGEM2-ES' )
rlm=( 'ocean' )
xpt=( 'esmHistorical_r1i1p1' ) 
for mdl in ${models[@]}; do ## loop over models
fl_nbr=$( ls ${drc_in}${var}_${mdl}_${xpt}*.nc | wc -w ) ## number of files in this ensemble member
echo "" $mdl

if [ ${fl_nbr} -le 1 ] ## if there is only 1 file, continue to next loop
echo "There is only 1 file in" ${nsm}.

echo "There are" ${fl_nbr} "files in" ${drc_in}

## starting date of data (sed [the name of the first file includes the starting date])
yyyymm_str=$( ls ${drc_in}${var}_${mdl}_${xpt}*.nc | sed -n '1p' | cut -d '_' -f 6 | cut -d '-' -f 1 )
## ending date of data (sed [the name of the last file includes the ending date])
yyyymm_end=$( ls ${drc_in}${var}_${mdl}_${xpt}*.nc | sed -n "${fl_nbr}p" | cut -d '_' -f 6 | cut -d '-' -f 2 )
echo $yyyymm_str, $yyyymm_end
## concatenate the files of one ensemble member into one along the record dimension (now is time)
ncrcat -O ${drc_in}${var}_${mdl}_${xpt}*.nc ${drc_in}${var}_${xpt}_${mdl}_${yyyymm_str}-${yyyymm_end}

All of the IPCC data are in NetCDF format which is great. However, all of the model results are stored in their native grid coordinates which can be a headache to understand. Since I wanted to calculate an ensemble of all of the models I needed to re-grid all of the data to a generic rectangular grid that made it easy to compare files and perform array operations. The solution to this problem was to download and install the fantastic Climate Data Operators I have previously described <a href="">here</a>. To re-grid the model data to a rectangular grid where latitude runs from -90 to 90 and longitude runs from 0 to 360, run these CDO commands:

<code lang="bash">
cdo genbil,r360x180
cdo remap,r360x180,

This code first calculates the weights for each grid cell, before it uses the weights to perform the cubic interpolation from the native grid to the new rectangular grid. I followed the same steps for all my files and gathered the commands into a script that will loop over all files in the directory defined:

drc_in=( '/Users/trond/Projects/ESM-PP/SST/Historical/MODELS/' )
modelNumbers=$( ls ${drc_in}*.nc | wc -w )
echo "Found " modelNumbers " in directory"

models=$( ls *.nc )

for model in ${models[@]}; do
echo "Regridding model: " $model

name=$( ls $model | cut -d '.' -f 1 )

cdo genbil,r360x180 $model ${name}
cdo remap,r360x180,${name} ${name}.nc ${name}
cdo sinfo ${name}
mv ${name} REGRID/.
rm ${name}

Next, I wanted to calculate the yearly averages of SST in all of the files as this makes it somewhat easier to use some of the CDO commands later. I therefore looped over all of my re-gridded files and calculated the yearly mean. I also discovered that since all of the native grids are different, the re-gridding to a rectangular grid may lead to some models having values close to land while others have fill_value or masked values. This generates a problem when you want to calculate the ensemble average as the program just loops over whatever values is present in each grid cell and averages. Therefore, I had to define the range of acceptable values in the file which is from freezing point of sea surface water to a ridiculous high value. The ‘setvrange’ CDO command takes care of this for us before calculating the yearly averages.

modelNumbers=$( ls *.nc | wc -w )
echo "Found " ${modelNumbers}
rm -f YRLYAVG/*.nc
models=$( ls *.nc )

for model in ${models[@]}; do
echo "Calculate yearly averages: " $model

name=$( ls $model | cut -d '.' -f 1 )

cdo -setvrange,271.45,320 $model $model.tmp
cdo yearmean -selyear,1950/2005 -selvar,tos $model.tmp ${name}
rm $model.tmp

cdo sinfo ${name}
mv ${name} YRLYAVG/.



cdo ensmean *.nc
cdo sinfo
cdo ensstd *.nc

The simple command

 cdo ensmean *.nc

calculates the ensemble mean of all of the yearly averaged files each representing one model scenario.

Next, I repeated all of these commands on both the future RCP8.5 projections and for the historical files. As a final preparation of these files prior to analysis I wanted to estimated the long term mean of the ensemble average. First, I needed to calculate the climatology and I decided to use 1961-1990 as reference climatology.

cdo timselavg,30 -seldate,19610101,19900101
cdo sinfov

This calculates the climatology from the historical model simulations. Next I repeated these commands for the periods 2006-2025 and 2025-2050 to get long term future projections.

2 thoughts on “Create ensemble average of CMIP5 data”

  1. Dear Kristiansen,

    Thank you very much for share the script to process the CMIP5 data.

    However, I found a small problem when using the ncrcat command, for example, when I connect the HadGEM2-CC, the input data is “” and “”. But the output file is “”.

    I think the problem should be as following:
    ## starting date of data (sed [the name of the first file includes the starting date])
    yyyymm_str=$( ls ${drc_in}${var}_${mdl}_${xpt}*.nc | sed -n ‘1p’ | cut -d ‘_’ -f 6 | cut -d ‘-‘ -f 1 )
    ## ending date of data (sed [the name of the last file includes the ending date])
    yyyymm_end=$( ls ${drc_in}${var}_${mdl}_${xpt}*.nc | sed -n “${fl_nbr}p” | cut -d ‘_’ -f 6 | cut -d ‘-‘ -f 2 )

    I am not familiar with shell, could you give some suggestions on how to give the output file correct starting and ending year?

    Best regards,
    Dachao Jin

Leave a Reply

Your email address will not be published. Required fields are marked *