python4oceanographers

Turning ripples into waves

Exploratory analysis using seaborn

This week I was helping a friend explore her dataset with some simple statistics and plots, so I decided to try seaborn out.

It is a really nice library that, together with pandas, becomes a powerful tool for taking the first steps in exploring your data.

Here is a simple example of what we did.

In [2]:
import seaborn
import numpy as np
import matplotlib.pyplot as plt

from io import BytesIO
from pandas import read_csv
In [3]:
kw = dict(na_values='NaN', sep=',', encoding='utf-8',
          skipinitialspace=True, index_col=False)

df = read_csv("./data/fish.csv", **kw)
In [4]:
df.head()
Out[4]:
   Days  ID  Recovery  Extract weight   Lipid %  Weight (g)  Size (cm)  Liver weight (g)       LSI        CF  BDE 47 (ng/g)  BDE 99 (ng/g)
0     0   A     73.21            0.10  3.600000       20.09       12.8              0.14  0.696864  0.957966              0              0
1     0   B     98.24            0.22  2.272727       36.52       15.5              0.33  0.903614  0.980699              0              0
2     0   C     89.71            0.18  3.500000       28.74       14.7              0.25  0.869868  0.904763              0              0
3     1   A     78.40            0.13  1.330769       23.70       14.0              0.15  0.632911  0.863703              0              0
4     1   B     66.24            0.13  2.838462       32.80       15.0              0.20  0.609756  0.971852              0              0

Seaborn makes it easy to control the figure aesthetics with set_style and axes_style.

In [5]:
kw = {'axes.edgecolor': '0', 'text.color': '0', 'ytick.color': '0', 'xtick.color': '0',
      'ytick.major.size': 5, 'xtick.major.size': 5, 'axes.labelcolor': '0'}

seaborn.set_style("whitegrid", kw)

The first plot will be a simple and naive correlation matrix. It is just one line with seaborn.

In [6]:
ax = seaborn.corrplot(df, annot=False, diag_names=False)
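
As far as I know, corrplot has been removed from recent seaborn releases, so here is a small sketch of what that plot summarizes, using only pandas. It is an illustration and not part of the original notebook: the `sample` frame is an assumption that rebuilds three size-related columns from the five rows printed by df.head() above, and `corr` is the Pearson correlation matrix for them.

```python
import pandas as pd

# The five rows shown by df.head(), restricted to three numeric columns.
sample = pd.DataFrame({
    "Weight (g)": [20.09, 36.52, 28.74, 23.70, 32.80],
    "Size (cm)": [12.8, 15.5, 14.7, 14.0, 15.0],
    "Liver weight (g)": [0.14, 0.33, 0.25, 0.15, 0.20],
})

# Pearson correlation between every pair of columns.
corr = sample.corr()
print(corr.round(2))

# With seaborn installed, seaborn.heatmap(corr, annot=True) should draw
# this matrix as a colored grid.
```

Even with only five fish, weight and size come out strongly correlated, which is the "bigger fish are heavier" pattern in the plot.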

Easy conclusion: the bigger the fish, the heavier it is ;). But seriously now, BDE 47 is positively correlated with Days and BDE 99, which is worth exploring. BDE 99 was part of the experiment. BDE 47, however, was not in the fish at the beginning; it is a by-product of BDE 99 that appears as the fish metabolize it.

We can explore this a little further. Note that we use pandas groupby to aggregate the data by the variable "Days".

In [7]:
g = df.groupby('Days')
mean_df = g.mean()
g.describe().head()
Out[7]:
             BDE 47 (ng/g)  BDE 99 (ng/g)        CF  Extract weight       LSI   Lipid %  Liver weight (g)   Recovery  Size (cm)  Weight (g)
Days
0     count              3              3  3.000000        3.000000  3.000000  3.000000          3.000000   3.000000   3.000000    3.000000
      mean               0              0  0.947809        0.166667  0.823449  3.124242          0.240000  87.053333  14.333333   28.450000
      std                0              0  0.038974        0.061101  0.110916  0.739127          0.095394  12.724725   1.386843    8.218838
      min                0              0  0.904763        0.100000  0.696864  2.272727          0.140000  73.210000  12.800000   20.090000
      25%                0              0  0.931364        0.140000  0.783366  2.886364          0.195000  81.460000  13.750000   24.415000
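
To make the groupby step concrete, here is a minimal sketch on the five rows printed by df.head(). The `sample` and `mean_sample` names are assumptions for this illustration only. The numeric columns are selected explicitly because, as far as I can tell, newer pandas versions refuse to average a text column such as "ID".

```python
import pandas as pd

# The five rows shown by df.head(), with only a few of the columns.
sample = pd.DataFrame({
    "Days": [0, 0, 0, 1, 1],
    "ID": ["A", "B", "C", "A", "B"],
    "Weight (g)": [20.09, 36.52, 28.74, 23.70, 32.80],
    "Size (cm)": [12.8, 15.5, 14.7, 14.0, 15.0],
})

# One row per sampling day, holding the mean of each selected column.
mean_sample = sample.groupby("Days")[["Weight (g)", "Size (cm)"]].mean()
print(mean_sample)
```

The day 0 means agree with the "mean" row of the describe output: 28.45 g for weight and about 14.33 cm for size.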
In [8]:
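# Keyword arguments keep the call working on seaborn versions that no
# longer accept x, y and data positionally; jointplot returns a JointGrid.
grid = seaborn.jointplot(x="Days", y="BDE 99 (ng/g)", data=df, kind="reg")
In [9]:
grid = seaborn.jointplot(x="Days", y="BDE 47 (ng/g)", data=df, kind="reg")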

The increase in BDE 47 is clear. BDE 99 does not decrease at the rate at which BDE 47 increases because it was part of the fish diet.

Inspecting the residuals is also a one-liner.

In [10]:
# Keyword arguments work on both older and newer seaborn versions.
ax = seaborn.residplot(x="Days", y="BDE 99 (ng/g)", data=df)
In [11]:
ax = seaborn.residplot(x="Days", y="BDE 47 (ng/g)", data=df)
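
For anyone curious about what a residual plot shows, the sketch below fits a straight line with numpy and subtracts it from the data, which is the quantity residplot draws for its default linear fit. The `linear_residuals` helper is a hypothetical name, not a seaborn function. Because the BDE columns are all zero in the rows printed by df.head(), fish weight stands in for the response here purely as an illustration.

```python
import numpy as np


def linear_residuals(x, y):
    """Return (slope, intercept, residuals) of a least-squares line through (x, y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Degree-1 polynomial fit: coefficients come back highest power first.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    return slope, intercept, residuals


# "Days" and "Weight (g)" from the five rows shown by df.head().
days = [0, 0, 0, 1, 1]
weight = [20.09, 36.52, 28.74, 23.70, 32.80]

slope, intercept, residuals = linear_residuals(days, weight)
print(slope, intercept)
print(residuals)
```

Residuals scattered evenly around zero suggest a straight line is a reasonable description; a visible trend or curve in them suggests it is not.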

Hopefully this is useful for others. Do not forget to check the seaborn docs.

This post was written as an IPython notebook. It is available for download or as a static html.

Creative Commons License
python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://ocefpaf.github.io/.
