There has been a lot of buzz around the Python ggplot module recently. I must confess that the original (ggplot2 for R) is not a tool in my utility belt. However, every now and then I find myself teaching it to biologists/ecologists who are stuck with R. I do appreciate the concept of the Grammar of Graphics, though; it is just not my everyday plotting tool.
This post is just to give the Python version of ggplot a try and see what all the fuss is about.
Yhat's Python version of ggplot is extremely un-pythonic (it says so in the README file!), so be aware: if you have never used ggplot before and/or you are coming from matplotlib, you might get a little bit confused.
We'll try the module out by comparing CTD temperature profile plots with:
- pure matplotlib
- my own wrapper for plotting ctd profiles
- ggplot.
The first issue I faced with Python ggplot was that I could not reverse the axis of a plot. Hopefully this PR will change this situation. If you want to reproduce the plot at the end of the post you'll need to install ggplot from my branch.
Update: the PR was merged, so you can install ggplot straight from master:
pip install https://github.com/yhat/ggplot/tarball/master
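Then you'll need to download the matplotlibrc for ggplot's layout, unzip it, and copy it into your local matplotlib configuration folder ($HOME/.config/matplotlib in the commands below; older setups use $HOME/.matplotlib).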
wget https://github.com/yhat/ggplot/raw/master/matplotlibrcs/matplotlibrc-windows.zip
unzip -p matplotlibrc-windows.zip home/stefan/.matplotlib/matplotlibrc > matplotlibrc
cp matplotlibrc $HOME/.config/matplotlib/matplotlibrc
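If you are unsure where your matplotlib installation looks for its configuration, you can ask it directly. The sketch below assumes a reasonably recent matplotlib. The bundled `ggplot` style sheet it checks for ships with matplotlib 1.4 and later, and is mentioned only as a possible alternative to copying the rc file; it is not what this post used.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

# Directory where matplotlib looks for a user-level matplotlibrc.
config_dir = matplotlib.get_configdir()
# Path of the matplotlibrc file that is actually in effect.
active_rc = matplotlib.matplotlib_fname()
print(config_dir)
print(active_rc)

# Assumption: matplotlib >= 1.4, which bundles a "ggplot" style sheet.
has_ggplot_style = "ggplot" in plt.style.available
if has_ggplot_style:
    plt.style.use("ggplot")
```

If the style sheet is available, it gives a similar look without touching any configuration file.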
Before plotting we need to load the data (and perform some simple pre-processing).
It is worth mentioning that I explicitly named the DataFrame index; that way the ctd module can automagically label the plots.
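import gsw
from ctd import DataFrame, Series
cast = DataFrame.from_cnv('./data/CTD_001.cnv.gz', compression='gzip')
# Keep only temperature and conductivity; drop every other column.
# (An explicit loop is used because map() is lazy on Python 3.)
keep = set(['t090C', 'c0S/m'])
for name in keep.symmetric_difference(cast.columns):
    cast.pop(name)
# Keep the first part returned by split(), then bin the data with delta=1.
cast, _ = cast.split()
cast = cast.apply(Series.bindata, delta=1.)
# gsw.SP_from_C expects conductivity in mS/cm, hence the factor of 10 on S/m.
cast['SP'] = gsw.SP_from_C(cast['c0S/m'].values * 10.,
                           cast['t090C'].values,
                           cast.index.values.astype(float))
# Naming the index lets the ctd module label the pressure axis.
cast.index.name = 'Pressure [dbar]'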
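To make the binning step less of a black box, here is a rough sketch of what averaging a profile into 1 dbar bins can look like in plain pandas. It is not the ctd module's implementation: the helper `bin_average` and the synthetic data are made up for illustration, and the real `bindata` may treat bin edges differently.

```python
import numpy as np
import pandas as pd

# Synthetic profile: pressure sampled every 0.25 dbar (made-up values).
pressure = np.arange(0.25, 5.0, 0.25)
temperature = 25.0 - 0.5 * pressure
raw = pd.Series(temperature, index=pd.Index(pressure, name="Pressure [dbar]"))


def bin_average(series, delta=1.0):
    """Average samples that fall nearest to each multiple of `delta`.

    Hypothetical helper; NumPy rounds exact halves to the even neighbour.
    """
    centres = np.round(series.index.values / delta) * delta
    binned = series.groupby(centres).mean()
    binned.index.name = series.index.name  # keep the axis label
    return binned


binned = bin_average(raw, delta=1.0)
print(binned)
```

The result has one value per bin and keeps the index name, which is what the plotting code later relies on for the axis label.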
First let's plot the profile "the matplotlib way" (11 LOC).
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(3, 4))
ax.plot(cast['t090C'], cast.index)
ax.set_ylabel(cast.index.name)
ax.invert_yaxis()
offset = 0.01
x1, x2 = ax.get_xlim()[0] - offset, ax.get_xlim()[1] + offset
ax.set_xlim(x1, x2)
ax.set_title("Matplotlib")
ax.set_xlabel("Temperature")
ax.set_ylabel("Pressure [dbar]")
Now let's make the same plot with the ctd module (3 LOC).
(Note: The next version will accept title and figsize as kw options, making this a one-liner.)
fig, ax = cast['t090C'].plot()
ax.set_title('python-ctd "wrapper"')
fig.set_size_inches(3, 4)
Finally, we will plot the profile with ggplot. You'll notice that, before plotting, I copied the index into a new column. The reason is that I could not figure out how to pass the index as the y-axis.
from ggplot import *
cast['pressure'] = cast.index.values
p = ggplot(cast, aes(x='t090C', y='pressure')) + geom_line() + scale_y_reverse()
p
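A plain pandas way to turn the index into a regular column is `reset_index`. The sketch below uses a toy DataFrame with made-up values in place of the real cast, and the name `cast_demo` is only for illustration.

```python
import pandas as pd

# Toy stand-in for the cast (made-up values); the index holds pressure.
cast_demo = pd.DataFrame(
    {"t090C": [25.0, 24.5, 24.0]},
    index=pd.Index([1.0, 2.0, 3.0], name="Pressure [dbar]"),
)

# rename_axis gives the index a plot-friendly name; reset_index then
# moves it into a regular column and leaves a default integer index.
flat = cast_demo.rename_axis("pressure").reset_index()
print(flat.columns.tolist())
```

Unlike assigning `cast['pressure']` in place, this returns a new DataFrame and leaves the original index untouched.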
It is kind of a one-liner if you exclude the import and the print line.
I understand that ggplot has lots of fans from the R world, but it feels a little alien to me. For example, I'm still figuring out how to adjust the figure size/aspect ratio to give it a "depth-profile look."
My final take is that, if you have a very specific type of plot that requires some tweaking, and/or that you'll have to re-plot several times, you might be better off writing your own wrapper around matplotlib.
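As a rough idea of what such a wrapper might look like, here is a minimal sketch. The function name `plot_profile` and its parameters are hypothetical (this is not the ctd module's API), and the data are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_profile(series, ax=None, figsize=(3, 4), title=None, xlabel=None):
    """Plot a variable against its index, with the y-axis increasing downwards."""
    if ax is None:
        fig, ax = plt.subplots(figsize=figsize)
    else:
        fig = ax.figure
    ax.plot(series.values, series.index.values)
    # Reuse the index and series names as axis labels when they exist.
    ax.set_ylabel(series.index.name or "")
    ax.set_xlabel(xlabel if xlabel is not None else (series.name or ""))
    if title is not None:
        ax.set_title(title)
    if not ax.yaxis_inverted():
        ax.invert_yaxis()
    return fig, ax


# Made-up profile standing in for a real cast.
demo = pd.Series(
    [25.0, 24.5, 24.0],
    index=pd.Index([1.0, 2.0, 3.0], name="Pressure [dbar]"),
    name="t090C",
)
fig, ax = plot_profile(demo, title="Temperature profile")
```

All the tweaks (figure size, inverted axis, labels) live in one place, so re-plotting another cast is a single call.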
Still, I can see the potential of ggplot when teaching students how to make simple, yet powerful, plots.
I'll leave it with a TS-diagram. (Note that we don't need to pass raw strings to use LaTeX.)
p = ggplot(cast, aes(x='SP', y='t090C')) + \
    geom_point(color='black') + \
    xlab("Salinity [g kg$^{-1}$]") + \
    ylab("Temperature [$^\circ$C]")
p
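A word of caution on the raw-string remark: the label above works because `\c` is not an escape sequence Python recognizes, so the backslash is left alone (recent Python versions warn about such unrecognized escapes). LaTeX commands that start with a real escape, such as `\theta`, `\beta`, `\nu` or `\rho`, would be mangled in a normal string. A small illustration in plain Python:

```python
# "\t" IS an escape sequence: in a normal string "\theta" begins with a tab.
plain = "$\theta$"
# A raw string keeps the backslash, which is what LaTeX needs.
raw = r"$\theta$"

print("\t" in plain)        # the tab character sneaked in
print("\\" in plain)        # and the backslash is gone
print(raw == "$\\theta$")   # raw string equals the explicitly escaped form
```

So raw strings remain the safer habit for labels, even though this particular one gets by without them.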