Testing xray for ocean model output

xray is an interesting package that has been on my radar for a while.

According to the docs xray is:

... an open source project and Python package that aims to bring the
labeled data power of pandas to the physical sciences, by providing
N-dimensional variants of the core pandas data structures, Series and
DataFrame: the xray DataArray and Dataset.

Our goal is to provide a pandas-like and pandas-compatible toolkit for
analytics on multi-dimensional arrays, rather than the tabular data for
which pandas excels. Our approach adopts the Common Data Model for
self-describing scientific data in widespread use in the Earth sciences
(e.g., netCDF and OPeNDAP): xray.Dataset is an in-memory representation of
a netCDF file.

This post is my first attempt at using it with ocean model output. Because xray is under heavy development, I will be using the latest git checkout.

In [2]:
import xray

print(xray.__version__)
0.3.1-47-ga9f63ff

xray uses python-netCDF4 under the hood, so it "speaks" OPeNDAP.

In [3]:
url = ('http://tds.marine.rutgers.edu/thredds/dodsC/roms/'
       'espresso/2013_da/his_Best/'
       'ESPRESSO_Real-Time_v2_History_Best_Available_best.ncd')

nc = xray.open_dataset(url)

It has a nice and meaningful __repr__ for the DataArray object.

In [4]:
salt = nc['salt']
salt
Out[4]:
<xray.DataArray 'salt' (time: 13932, s_rho: 36, eta_rho: 82, xi_rho: 130)>
[5346544320 values with dtype=float64]
Coordinates:
    lon_rho   (eta_rho, xi_rho) float64 ...
    lat_rho   (eta_rho, xi_rho) float64 ...
  * eta_rho   (eta_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...
    time_run  (time) datetime64[ns] ...
  * xi_rho    (xi_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...
  * s_rho     (s_rho) float64 -0.9861 -0.9583 -0.9306 -0.9028 -0.875 -0.8472 -0.8194 ...
  * time      (time) datetime64[ns] 2013-05-19T01:00:00 2013-05-19T02:00:00 ...
Attributes:
    long_name: salinity
    time: ocean_time
    field: salinity, scalar, series
    _ChunkSize: [  1  36  82 130]
    standard_name: sea_water_salinity

My first impression is that, compared to the raw netCDF4 interface, the xray.DataArray object is richer (more metadata). However, compared to its older sibling, iris.cube.Cube, it falls short. For example, xray does not attach the 2D coordinates to the object. (As of the latest git checkout it does attach them.)

In [5]:
print(salt.coords['xi_rho'].coords)
print(salt.coords['eta_rho'].coords)
Coordinates:
  * xi_rho   (xi_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ...
Coordinates:
  * eta_rho  (eta_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...

In [6]:
print(salt.coords['lon_rho'].coords)
print(salt.coords['lat_rho'].coords)
Coordinates:
  * xi_rho   (xi_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ...
    lat_rho  (eta_rho, xi_rho) float64 ...
  * eta_rho  (eta_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...
Coordinates:
    lon_rho  (eta_rho, xi_rho) float64 ...
  * xi_rho   (xi_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 ...
  * eta_rho  (eta_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 ...

The Dataset is another important xray object. Here is a small snippet showing how to construct one:

In [7]:
ds = xray.Dataset({'temperature': nc['temp']},
                  coords={'lon': (['eta_rho', 'xi_rho'], nc['lon_rho']),
                          'lat': (['eta_rho', 'xi_rho'], nc['lat_rho']),
                          's': nc['s_rho'],
                          'time': nc['time']})
ds
Out[7]:
<xray.Dataset>
Dimensions:      (eta_rho: 82, s_rho: 36, time: 13932, xi_rho: 130)
Coordinates:
    lon_rho      (eta_rho, xi_rho) float64 -75.84 -75.78 -75.71 -75.65 -75.59 -75.52 ...
    lat_rho      (eta_rho, xi_rho) float64 33.74 33.79 33.84 33.89 33.94 33.99 34.04 ...
  * eta_rho      (eta_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ...
    time_run     (time) datetime64[ns] ...
  * xi_rho       (xi_rho) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 ...
  * s_rho        (s_rho) float64 -0.9861 -0.9583 -0.9306 -0.9028 -0.875 -0.8472 -0.8194 ...
  * time         (time) datetime64[ns] 2013-05-19T01:00:00 ...
    lat          (eta_rho, xi_rho) float64 33.74 33.79 33.84 33.89 33.94 33.99 34.04 ...
    s            (s_rho) float64 -0.9861 -0.9583 -0.9306 -0.9028 -0.875 -0.8472 -0.8194 ...
    lon          (eta_rho, xi_rho) float64 -75.84 -75.78 -75.71 -75.65 -75.59 -75.52 ...
Variables:
    temperature  (time, s_rho, eta_rho, xi_rho) float64 ...

The best feature, IMHO, is the use of pandas' "fuzziness" when slicing with date strings. Iris uses the cumbersome lambda approach; slicing with strings is much easier.
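
For contrast, here is a rough sketch of the lambda-based selection iris requires for the same day (it assumes an iris version whose time cells expose datetime-like points; the load_cube call and variable name are illustrative guesses, not code from this post):

import iris

day_constraint = iris.Constraint(
    time=lambda cell: (cell.point.year, cell.point.month, cell.point.day) == (2014, 12, 2))
# cube = iris.load_cube(url, 'sea_water_salinity')
# one_day = cube.extract(day_constraint)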

In [8]:
s = salt.loc['2014-12-02']
s.coords['time']
Out[8]:
<xray.DataArray 'time' (time: 24)>
array(['2014-12-01T21:00:00.000000000-0300',
       '2014-12-01T22:00:00.000000000-0300',
       '2014-12-01T23:00:00.000000000-0300',
       '2014-12-02T00:00:00.000000000-0300',
       '2014-12-02T01:00:00.000000000-0300',
       '2014-12-02T02:00:00.000000000-0300',
       '2014-12-02T03:00:00.000000000-0300',
       '2014-12-02T04:00:00.000000000-0300',
       '2014-12-02T05:00:00.000000000-0300',
       '2014-12-02T06:00:00.000000000-0300',
       '2014-12-02T07:00:00.000000000-0300',
       '2014-12-02T08:00:00.000000000-0300',
       '2014-12-02T09:00:00.000000000-0300',
       '2014-12-02T10:00:00.000000000-0300',
       '2014-12-02T11:00:00.000000000-0300',
       '2014-12-02T12:00:00.000000000-0300',
       '2014-12-02T13:00:00.000000000-0300',
       '2014-12-02T14:00:00.000000000-0300',
       '2014-12-02T15:00:00.000000000-0300',
       '2014-12-02T16:00:00.000000000-0300',
       '2014-12-02T17:00:00.000000000-0300',
       '2014-12-02T18:00:00.000000000-0300',
       '2014-12-02T19:00:00.000000000-0300',
       '2014-12-02T20:00:00.000000000-0300'], dtype='datetime64[ns]')
Coordinates:
  * time      (time) datetime64[ns] 2014-12-02 2014-12-02T01:00:00 ...
    time_run  (time) datetime64[ns] 2014-12-01T01:00:00 2014-12-02T01:00:00 ...
Attributes:
    long_name: Forecast time for ForecastModelRunCollection
    standard_name: time
    _CoordinateAxisType: Time
In [9]:
import cartopy.crs as ccrs
import matplotlib.pyplot as plt
import cartopy.feature as cfeature
from cartopy.mpl.gridliner import LONGITUDE_FORMATTER, LATITUDE_FORMATTER

LAND = cfeature.NaturalEarthFeature('physical', 'land', '50m',
                                    edgecolor='face',
                                    facecolor=cfeature.COLORS['land'])


def make_map(bbox, projection=ccrs.PlateCarree()):
    fig, ax = plt.subplots(figsize=(8, 6),
                           subplot_kw=dict(projection=projection))
    ax.set_extent(bbox)
    ax.add_feature(LAND, facecolor='0.75')
    ax.coastlines(resolution='50m')
    gl = ax.gridlines(draw_labels=True)
    gl.xlabels_top = gl.ylabels_right = False
    gl.xformatter = LONGITUDE_FORMATTER
    gl.yformatter = LATITUDE_FORMATTER
    return fig, ax
In [10]:
import numpy.ma as ma
import matplotlib.pyplot as plt
from oceans.colormaps import cm

sal = s[0, -1, ...].values
lon, lat = nc['lon_rho'], nc['lat_rho']

fig, ax = make_map([lon.min(), lon.max(), lat.min(), lat.max()])
cs = ax.pcolormesh(lon.values, lat.values,
                   ma.masked_invalid(sal), cmap=cm.odv)
cbar = fig.colorbar(cs, extend='both')

Note that when plotting I had to use the .values attribute. The reason is that the DataArray object does not behave entirely like a numpy array: the .ravel() method, needed by pcolormesh, is missing. Also note that xray does not support masked arrays (yet).
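
If that conversion gets repetitive, it can be wrapped in a tiny helper (just a sketch; it relies on the fact that xray stores missing values as NaN, so masking invalid values recovers a masked array):

import numpy.ma as ma

def as_masked(da):
    """Return a DataArray's values as a masked array with NaNs masked."""
    return ma.masked_invalid(da.values)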

One might ask: why use labeled arrays? The main argument is that they let you write idiomatic expressions like

s.mean(dim='time')

where there is no doubt about which dimension is being averaged over.
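
For comparison, here is a minimal sketch of the same reduction done positionally on the underlying numpy array; the axis number has to be remembered (and NaN handling may differ between the two):

mean_np = s.values.mean(axis=0)  # axis 0 happens to be time for this array
mean_xr = s.mean(dim='time')     # self-documenting, no axis bookkeeping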

Because xray is designed to behave like pandas, when dealing with time series one can take advantage of all of pandas' features. The same may be accomplished with iris.pandas.as_series, but with xray there are no awkward conversions.
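
Here is a minimal sketch of that hand-off, assuming a 1D DataArray like the ts time series extracted in the next cell:

series = ts.to_series()                              # pandas.Series indexed by time
monthly = series.groupby(series.index.month).mean()  # any pandas machinery now applies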

Now let's try an example straight from xray's documentation. First I will aggregate the hourly data into daily averages, then compute weekly anomalies.

In [11]:
def daily_aggregator(data_array):
    """Floor each time stamp to the start of its day, for use as a groupby key."""
    time = data_array.coords['time']
    values = time.to_index().to_period('D').to_timestamp()
    return xray.DataArray(values, [time], name='daily')


temp = nc['temp'].loc['2013-06-01':'2013-07-01']  # I get a weird error if extending this range.
ts = temp.isel(eta_rho=16, xi_rho=40, s_rho=-1)  # Cannot use lon, lat: xray does not index on 2D coordinates (see the lookup sketch after this cell).

daily_avg = ts.groupby(daily_aggregator(ts)).mean()
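
As a workaround for the 2D-coordinate limitation noted in the comment above, the nearest-index lookup can be done by hand with numpy and the result fed to isel() (a sketch only; the target position below is made up):

import numpy as np

target_lon, target_lat = -73.0, 39.5  # hypothetical point of interest
dist2 = (nc['lon_rho'].values - target_lon)**2 + (nc['lat_rho'].values - target_lat)**2
j, i = np.unravel_index(dist2.argmin(), dist2.shape)
nearest_ts = temp.isel(eta_rho=j, xi_rho=i, s_rho=-1)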
In [12]:
week_avg = daily_avg.groupby('daily.week').mean('daily')
anomalies = daily_avg.groupby('daily.week') - week_avg

ax = anomalies.reset_coords(drop=True).to_series().plot()

Those anomalies could very well be the Gulf Stream front meandering around that point.

In [13]:
surface = temp.isel(s_rho=-1)

fig, ax = make_map([lon.min(), lon.max(), lat.min(), lat.max()])
cs = ax.pcolormesh(lon.values, lat.values,
                   ma.masked_invalid(surface[0].values),
                   cmap=cm.avhrr, alpha=0.5, zorder=1)
p, = ax.plot(lon.isel(eta_rho=16, xi_rho=40),
             lat.isel(eta_rho=16, xi_rho=40),
             'ko', zorder=2)

This post was written as an IPython notebook. It is available for download or as a static HTML page.

python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://ocefpaf.github.io/.
