If you have not checked Nikolay Koldunov's Earthpy blog by now, stop reading this and click on my blog roll at the bottom right of the page. This post is basically a "re-post" of his time-series part 2. I will try to reproduce his results using my two favorite tools: pandas and iris.
Pandas was created by and for quant analysts; it is a mature tool built to do virtually anything you can imagine with time series. However, pandas may disappoint you when used for oceanographic or meteorological data analysis. Iris, on the other hand, was born at the UK Met Office and is tailor-made for meteorology and oceanography. In this post I will try to get the best of both.
Luckily, the iris.pandas module already implements functions for that: as_cube, as_series and as_data_frame convert to and from iris cubes and pandas Series/DataFrame objects. It is important to note that iris itself relies on pandas for 1D time series.
import numpy as np
from pandas import DataFrame, read_csv
First let's download the data for the AO and NAO used in the original post.
Iris can read OPeNDAP, netCDF, PP and GRIB easily, but it is complicated to load a custom format, even simple text formats like the ones below. Thankfully, reading various text formats with pandas is a piece of cake.
url = 'http://www.cpc.ncep.noaa.gov/products/precip/CWlink/'
ao_file = url + 'daily_ao_index/monthly.ao.index.b50.current.ascii'
nao_file = url + 'pna/norm.nao.monthly.b5001.current.ascii'
kw = dict(sep='\s*', parse_dates={'dates': [0, 1]},
          header=None, index_col=0, squeeze=True, engine='python')
AO = read_csv(ao_file, **kw)
NAO = read_csv(nao_file, **kw)
read_csv downloads the file and loads it into a pandas Series indexed by its dates. To better understand all the kw used, check the original post.
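As an aside, newer pandas versions deprecate the dict form of parse_dates used above, so here is a self-contained sketch of the same parsing idea that builds the date index by hand (the sample values are made up, not real AO data):

```python
from io import StringIO

import pandas as pd

# A tiny sample in the same layout as the AO file: year, month, value.
text = """\
1950  1  -0.060
1950  2   0.627
1950  3  -0.008
"""

df = pd.read_csv(StringIO(text), sep=r'\s+', header=None,
                 names=['year', 'month', 'value'])
# Combine the year/month columns into a proper DatetimeIndex.
dates = pd.to_datetime(df['year'].astype(str) + '-' +
                       df['month'].astype(str).str.zfill(2))
series = pd.Series(df['value'].values, index=dates, name='AO')
print(series)
```

The end result is the same shape as the Series read_csv produces above: one value per month, indexed by date.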
I want to represent those series as iris cubes. Below is some boilerplate to get a nicer cube, with calendar, units, etc. That is not necessary if you just want a default cube.
import iris
from iris.pandas import as_cube, as_series, as_data_frame
def boiler_plate(df, name=None, calendars=None,
                 unit=None, index_name=None):
    # Wrap as_cube and attach a proper name, time coordinate and units.
    cube = as_cube(df, calendars=calendars)
    cube.rename(name)
    cube.coord("index").rename(index_name)
    cube.units = unit
    return cube
kw = dict(index_name='time',
          unit=iris.unit.Unit('no_unit'),
          calendars={0: iris.unit.CALENDAR_GREGORIAN})
AO = boiler_plate(AO, name="AO", **kw)
NAO = boiler_plate(NAO, name="NAO", **kw)
cubes = iris.cube.CubeList([AO, NAO])
print(cubes)
Here is where iris shines: instead of hacking our way with pandas resample, like in the original post, we can actually aggregate by seasons using aggregated_by. All we need to do is add a coord_categorisation as an auxiliary coordinate. I will add two categories: months and seasons.
import iris.coord_categorisation as coord_cat
[coord_cat.add_season(cube, 'time', name='clim_season')
 for cube in cubes]
[coord_cat.add_month(cube, 'time', name='months')
 for cube in cubes]
print(cubes[0])
Let's extract the boreal winter.
constraint = iris.Constraint(clim_season='djf')
winter = cubes.extract(constraint)
print(winter)
Try comparing this with the results obtained by pandas resample in the original notebook.
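For reference, the pure-pandas counterpart of the clim_season='djf' constraint is a boolean selection on the index months; a minimal sketch with a made-up series:

```python
import numpy as np
import pandas as pd

# Hypothetical two-year monthly series, just to illustrate the selection.
idx = pd.date_range('1950-01-01', periods=24, freq='MS')
s = pd.Series(np.arange(24.0), index=idx)

# Keep only December, January and February (boreal winter).
winter = s[s.index.month.isin([12, 1, 2])]
print(winter)
```

Note that, unlike iris' season categorisation, this simple mask does not shift December into the following year's winter.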
import matplotlib.pyplot as plt
import iris.quickplot as qplt
fig, ax = plt.subplots(figsize=(7, 3.5))
qplt.plot(winter[0], label=winter[0].name())
qplt.plot(winter[1], label=winter[1].name(), alpha=0.75)
ax.set_title("Winter")
ax.set_ylabel("AO/NAO Index")
leg = ax.legend()
And here is where iris starts to fall short. The bar plot with pandas is just one line of code. Iris does not have such a handy method, so we have to deal with raw matplotlib.
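To show how short the pandas version is, here is a sketch with a hypothetical series standing in for the AO index (random values, for illustration only):

```python
import numpy as np
import pandas as pd

# Two years of made-up monthly values.
rng = np.random.default_rng(42)
idx = pd.date_range('1950-01-01', periods=24, freq='MS')
s = pd.Series(rng.standard_normal(24), index=idx, name='AO')

# Group by calendar month and average; in a notebook, one extra call,
# monthly_mean.plot(kind='bar'), draws the whole bar plot.
monthly_mean = s.groupby(s.index.month).mean()
print(monthly_mean)
```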
fig, ax = plt.subplots(figsize=(7, 3.5))
width = 0.5
months = np.arange(1, 13)
monthly_mean = cubes[0].aggregated_by('months', iris.analysis.MEAN)
ax.bar(months, monthly_mean.data,
       label=monthly_mean.name(), color='blue',
       alpha=0.75, align='edge', width=width)
monthly_mean = cubes[1].aggregated_by('months', iris.analysis.MEAN)
ax.bar(months + width, monthly_mean.data,
       label=monthly_mean.name(), color='green',
       alpha=0.75, align='edge', width=width)
ax.locator_params(axis='x', tight=True, nbins=12)
So let's ditch iris and go from cube to DataFrame to enjoy the pandas plotting library.
df = as_series(cubes[0])
df.head()
That is a problem for me: as_series uses the main coordinate as the index and drops the rest. I want to be able to choose which coordinate to pass as the index. So let's try as_data_frame instead.
from iris.exceptions import CoordinateNotFoundError
try:
    df = as_data_frame(cubes[0])
except CoordinateNotFoundError as e:
    print(e)
That is worse. We cannot get a DataFrame back because of the auxiliary coordinates we added.
We can circumvent those issues by hacking as_series and implementing an index keyword to gain control over which index to use. Also, we can attach the auxiliary coordinates to the data as extra columns in as_data_frame.
import pandas
from iris.pandas import _as_pandas_coord, _assert_shared
def as_series(cube, index=None, copy=True):
    data = cube.data
    if isinstance(data, np.ma.MaskedArray):
        if not copy:
            raise ValueError("Masked arrays must always be copied.")
        data = data.astype('f').filled(np.nan)
    elif copy:
        data = data.copy()
    if index is None:
        # Default to the cube's first dimension coordinate.
        if cube.dim_coords:
            index = _as_pandas_coord(cube.dim_coords[0])
    else:
        # New: let the caller choose which coordinate becomes the index.
        index = _as_pandas_coord(index)
    series = pandas.Series(data, index)
    if not copy:
        _assert_shared(data, series)
    return series
def as_data_frame(cube, copy=True):
    data = cube.data
    if isinstance(data, np.ma.MaskedArray):
        if not copy:
            raise ValueError("Masked arrays must always be copied.")
        data = data.astype('f').filled(np.nan)
    elif copy:
        data = data.copy()
    index = columns = None
    coords0 = cube.coords(dimensions=[0])
    if coords0:
        index = _as_pandas_coord(coords0.pop(0))
    if cube.ndim == 1:
        # New: attach the remaining dim-0 (auxiliary) coords as extra columns.
        columns = [cube.name()] + [coord.name() for coord in coords0]
        extra_cols = np.asanyarray([coord.points for coord in coords0]).T
        data = np.c_[data, extra_cols]
    else:
        coords1 = cube.coords(dimensions=[1])
        if coords1:
            columns = list(_as_pandas_coord(coords1.pop(0)))
            columns.extend([coord.name() for coord in coords0])
            data = np.c_[data, [coord.points for coord in coords0]]
    data_frame = DataFrame(data, index, columns)
    if not copy:
        _assert_shared(data, data_frame)
    return data_frame
df = as_series(AO, AO.coord('clim_season'))
print(df.head())
print('\n')
df = as_series(AO, AO.coord('months'))
print(df.head())
df_ao = as_data_frame(AO)
df_nao = as_data_frame(NAO)
print(df_ao.head())
That's much better! One side effect is that all the dtypes are objects now.
print(df_ao.dtypes)
That is easy to solve.
df_ao[['AO']] = df_ao[['AO']].astype(float)
df_nao[['NAO']] = df_nao[['NAO']].astype(float)
print(df_ao.dtypes)
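An equivalent, slightly more defensive fix is pd.to_numeric; a standalone sketch with made-up values mimicking the object-dtype situation above:

```python
import numpy as np
import pandas as pd

# A numeric column that came back as object dtype.
df = pd.DataFrame({'AO': np.array([0.1, -0.2, 0.3], dtype=object),
                   'months': ['Jan', 'Feb', 'Mar']})
assert df['AO'].dtype == object

# pd.to_numeric is an alternative to .astype(float); with
# errors='coerce' it would also turn unparseable entries into NaN.
df['AO'] = pd.to_numeric(df['AO'])
print(df.dtypes)
```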
We survived the bar plot without pandas, but there was no way I was going to try the box plots without it.
ax = df_ao.boxplot(column=['AO'], by='months')
ax = df_nao.boxplot(column=['NAO'], by='months')
I will stop here for now. Don't forget to check the original post at Earthpy.