python4oceanographers

Turning ripples into waves

Blending pandas and iris

If you have not checked Nikolay Koldunov's Earthpy blog by now stop reading this and click on my blog roll at the bottom right of the page. This post is basically "re-post" from his time-series part 2. I will to try reproduce his results using my two favorite tools pandas and iris.

Pandas was created by and for quant analysis, it is a mature tool built to do virtually anything you can imagine with time-series. However, pandas may disapoint you when used for oceanographic or meteorological data analysis. Iris, on the other hand, was born at the UK Met Office, it is tailor made for meteorology and oceanography. In this post I will try to get the best of both.

Luckily iris.pandas module already implements functions for that like as_cube, as_series and as_data_frame to convert to and from iris cubes and pandas Series/DataFrame objects. It is important to note that iris does rely on pandas for 1D time-series.

In [2]:
from pandas import DataFrame, read_csv

First let's download the data for the AO and NOA used in the original post. Iris can read OpenDAP, netCDF, PP and grib easily, but it is complicated to load a custom format, even simple text formats like the ones below. Thankfully, reading various text formats with pandas is a piece of cake.

In [3]:
url = 'http://www.cpc.ncep.noaa.gov/products/precip/CWlink/'

ao_file = url + 'daily_ao_index/monthly.ao.index.b50.current.ascii'
nao_file = url + 'pna/norm.nao.monthly.b5001.current.ascii'

kw = dict(sep='\s*', parse_dates={'dates': [0, 1]},
          header=None, index_col=0, squeeze=True, engine='python')

AO = read_csv(ao_file, **kw)
NAO = read_csv(nao_file, **kw)

read_csv downloads the file and load it into a pandas Series indexing by its date. To better understand all the kw used check the original post.

I want to represent those series as iris cubes. Below I show some boiler plate to get a nicer cube, with calendar, units and etc. That is not necessary if you wish just a default cube.

In [4]:
import iris
from iris.pandas import as_cube, as_series, as_data_frame

def boiler_plate(df, name=None, calendars=None,
                 unit=None, index_name=None):
    cube = as_cube(df, calendars=calendars)
    cube.rename(name)
    cube.coord("index").rename("time")
    cube.units = unit
    return cube

kw = dict(index_name='time',
          unit=iris.unit.Unit('no_unit'),
          calendars={0: iris.unit.CALENDAR_GREGORIAN})

AO = boiler_plate(AO, name="AO", **kw)
NAO = boiler_plate(NAO, name="NAO", **kw)

cubes = iris.cube.CubeList([AO, NAO])

print(cubes)
0: AO / (no_unit)                      (time: 775)
1: NAO / (no_unit)                     (time: 775)

Here is where iris shines, instead of hacking our way with pandas resample, like in the original post, we can actually aggregate_by seasons. All we need to do is to add a coord_categorisation as an auxiliary coordinate. I will add two categories: months and seasons.

In [5]:
import iris.coord_categorisation as coord_cat

[coord_cat.add_season(cube, 'time', name='clim_season')
 for cube in cubes]

[coord_cat.add_month(cube, 'time', name='months')
 for cube in cubes]

print(cubes[0])
AO / (no_unit)                      (time: 775)
     Dimension coordinates:
          time                           x
     Auxiliary coordinates:
          clim_season                    x
          months                         x

Let's extract the boreal winter.

In [6]:
constraint = iris.Constraint(clim_season='djf')
winter = cubes.extract(constraint)
print(winter)
0: AO / (no_unit)                      (time: 194)
1: NAO / (no_unit)                     (time: 194)

Try compare this with the results obtained by pandas resample in the original notebook.

In [7]:
import matplotlib.pyplot as plt
import iris.quickplot as qplt

fig, ax = plt.subplots(figsize=(7, 3.5))
qplt.plot(winter[0], label=winter[0].name())
qplt.plot(winter[1], label=winter[1].name(), alpha=0.75)
ax.set_title("Winter")
ax.set_ylabel("AO/NAO Index")
leg = ax.legend()

And here is where iris starts to come short. The bar plot with pandas is just one line of code. Iris does not have such a handy method, and we have to deal with raw matplotlib.

In [8]:
fig, ax = plt.subplots(figsize=(7, 3.5))
width = 0.5
months = np.arange(1, 13)

monthly_mean = cubes[0].aggregated_by('months', iris.analysis.MEAN)
ax.bar(months, monthly_mean.data,
       label=monthly_mean.name(), color='blue',
       alpha=0.75, align='edge', width=width)

monthly_mean = cubes[1].aggregated_by('months', iris.analysis.MEAN)
ax.bar(months+width, monthly_mean.data,
       label=monthly_mean.name(), color='green',
       alpha=0.75, align='edge', width=width)
ax.locator_params(axis='x', tight=True, nbins=12)

So let's ditch iris and go from cube to DataFrame to enjoy pandas plotting library.

In [9]:
df = as_series(cube)
df.head()
Out[9]:
1950-01-06    0.92
1950-02-06    0.40
1950-03-06   -0.36
1950-04-06    0.73
1950-05-06   -0.59
dtype: float32

That is a problem for me as_series uses the main coordinate as the index and drops the rest. I want to be able to choose which coordinate to pass as index. So let's try as_data_frame instead.

In [10]:
from iris.exceptions import CoordinateNotFoundError

try:
    df = as_data_frame(cube)
except CoordinateNotFoundError as e:
    print(e)
u'Expected to find exactly 1 coordinate, but found 3. They were: time, clim_season, months.'

That is worse. We cannot get a DataFrame back because of the auxiliary coordinates we added.

We can circumvent those issues by hacking as_series and implementing an index keyword to gain control over which index to use. Also, we can attach the Auxiliary coordinates to the data as extra columns in as_data_frame.

In [11]:
import pandas
from iris.pandas import _as_pandas_coord

def as_series(cube, index=None, copy=True):
    data = cube.data
    if isinstance(data, np.ma.MaskedArray):
        if not copy:
            raise ValueError("Masked arrays must always be copied.")
        data = data.astype('f').filled(np.nan)
    elif copy:
        data = data.copy()

    if not index:
        if cube.dim_coords:
            index = _as_pandas_coord(cube.dim_coords[0])
    else:
        index = _as_pandas_coord(index)

    series = pandas.Series(data, index)
    if not copy:
        _assert_shared(data, series)

    return series

def as_data_frame(cube, copy=True):
    data = cube.data
    if isinstance(data, np.ma.MaskedArray):
        if not copy:
            raise ValueError("Masked arrays must always be copied.")
        data = data.astype('f').filled(np.nan)
    elif copy:
        data = data.copy()

    index = columns = None
    coords0 = cube.coords(dimensions=[0])
    if coords0:
        index = _as_pandas_coord(coords0.pop(0))
        if cube.ndim == 1:
            columns = [cube.name()]+[coord.name() for coord in coords0]
            extra_cols = np.asanyarray([coord.points for coord in coords0]).T
            data = np.c_[data, extra_cols]
    coords1 = cube.coords(dimensions=[1])
    if coords1:
        columns = _as_pandas_coord(coords1)
        columns.extend([coord.name() for coord in coords0])
        data = np.c_[data, [coord.points for coord in coords0]]
    data_frame = DataFrame(data, index, columns)
    if not copy:
        _assert_shared(data, data_frame)

    return data_frame
In [12]:
df = as_series(AO, AO.coord('clim_season'))
print(df.head())
print('\n')
df = as_series(AO, AO.coord('months'))
print(df.head())
djf   -0.060310
djf    0.626810
mam   -0.008128
mam    0.555100
mam    0.071577
dtype: float32


Jan   -0.060310
Feb    0.626810
Mar   -0.008128
Apr    0.555100
May    0.071577
dtype: float32

In [13]:
df_ao = as_data_frame(AO)
df_nao = as_data_frame(NAO)
print(df_ao.head())
                           AO clim_season months
1950-01-06   -0.0603099986911         djf    Jan
1950-02-06     0.626810014248         djf    Feb
1950-03-06  -0.00812750030309         mam    Mar
1950-04-06     0.555100023746         mam    Apr
1950-05-06    0.0715769976377         mam    May

That's much better! One side effect is that all the dtypes are objects now.

In [14]:
print(df_ao.dtypes)
AO             object
clim_season    object
months         object
dtype: object

That is easy to solve.

In [15]:
df_ao[['AO']] = df_ao[['AO']].astype(float)
df_nao[['NAO']] = df_nao[['NAO']].astype(float)
In [16]:
print(df_ao.dtypes)
AO             float64
clim_season     object
months          object
dtype: object

We survived the barplot without pandas, but there was now way I was going to try the boxplots without it.

In [17]:
ax = df_ao.boxplot(column=['AO'], by='months')
ax = df_nao.boxplot(column=['NAO'], by='months')

I will stop here for now. Don't forget to check the original post at Earthpy.

In [18]:
HTML(html)
Out[18]:

This post was written as an IPython notebook. It is available for download or as a static html.

Creative Commons License
python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://ocefpaf.github.io/.

Comments