The first step in cleaning a dataset is to remove outliers (or spikes). Some spikes are easy to spot with a simple histogram of the data. Let's take a look at a velocity time series with some bad data.
import numpy as np
import matplotlib.pyplot as plt
from numpy import ma
from datetime import datetime
from pandas import read_table
from matplotlib.dates import DateFormatter
formatter = DateFormatter('%d')
cols = ['j', 'u', 'v', 'temp', 'sal', 'y', 'mn', 'd', 'h', 'mi']
fname = './data/spikey_v.dat'
df = read_table(fname, sep=r'\s+', names=cols)
df.index = [datetime(*x) for x in zip(df['y'], df['mn'], df['d'], df['h'], df['mi'])]
df = df.drop(['y', 'mn', 'd', 'h', 'mi'], axis=1)
kw = dict(histtype='stepfilled', alpha=0.5, density=True)
fig, (ax0, ax1) = plt.subplots(ncols=2, sharex=True, sharey=True)
ax0.hist(df['u'], **kw)
ax1.hist(df['v'], **kw)
_ = ax1.set_title('v')
_ = ax0.set_title('u')
Can you see the bins that separate themselves from the rest? Those are probably spikes. A common practice is to label as outliers any data that deviate more than $3\times \sigma$ from the mean. However, some spikes might be "hidden" within the data limits, and in highly variable (or episodic) records one can end up removing good data together with the bad. For a better discussion on this topic check out Chapter 2 of Emery and Thomson -- Data Analysis Methods in Physical Oceanography.
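For reference, here is a minimal sketch of that single-pass criterion (the helper name flag_3sigma is just for illustration): it flags values that deviate more than $n\times \sigma$ from the mean, and the flagged values could then be set to NaN.
def flag_3sigma(series, n=3):
    # Boolean mask: True where a value deviates more than n*sigma from the mean.
    mu, sigma = series.mean(), series.std()
    return np.abs(series - mu) > n * sigma
# e.g. df.loc[flag_3sigma(df['u']), 'u'] = np.nan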
One solution is a "two (or more)-pass" criterion. Also, one might wish to evaluate small chunks of the data separately. Here is an example:
def rolling_window(data, block):
    # Work on the underlying ndarray so that ``strides`` is available.
    data = np.asarray(data)
    shape = data.shape[:-1] + (data.shape[-1] - block + 1, block)
    strides = data.strides + (data.strides[-1],)
    return np.lib.stride_tricks.as_strided(data, shape=shape, strides=strides)

def despike(arr, n1=2, n2=20, block=10):
    # Work on a copy so the caller's series is left untouched.
    arr = arr.copy()
    offset = arr.values.min()
    arr -= offset
    data = arr.copy()
    roll = rolling_window(data, block)
    roll = ma.masked_invalid(roll)
    std = n1 * roll.std(axis=1)
    mean = roll.mean(axis=1)
    # Use the last value to fill up the shortened rolling statistics.
    std = np.r_[std, np.tile(std[-1], block - 1)]
    mean = np.r_[mean, np.tile(mean[-1], block - 1)]
    mask = (np.abs(data - mean.filled(fill_value=np.nan)) >
            std.filled(fill_value=np.nan))
    data[mask] = np.nan
    # Pass two: recompute the statistics without the values flagged in pass one.
    roll = rolling_window(data, block)
    roll = ma.masked_invalid(roll)
    std = n2 * roll.std(axis=1)
    mean = roll.mean(axis=1)
    # Use the last value to fill up the shortened rolling statistics.
    std = np.r_[std, np.tile(std[-1], block - 1)]
    mean = np.r_[mean, np.tile(mean[-1], block - 1)]
    mask = (np.abs(arr - mean.filled(fill_value=np.nan)) >
            std.filled(fill_value=np.nan))
    arr[mask] = np.nan
    return arr + offset
The first pass computes the statistics of the data ($\mu$ and $\sigma$) and marks (but does not yet exclude) data that deviate more than $n_1 \times \sigma$ from the mean. The second pass recomputes the statistics without the values flagged in the first pass and removes the data that deviate more than $n_2 \times \sigma$ from the new mean.
The code above makes use of NumPy stride tricks to avoid loops while computing $\mu$ and $\sigma$ for each block.
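To see what the stride trick does, here is a quick illustrative check (the small array x below is made up for the example): each row of rolling_window's output is a view onto one overlapping block, so the per-block $\mu$ and $\sigma$ come from a single vectorized call.
x = np.arange(8, dtype=float)
blocks = rolling_window(x, block=3)
print(blocks)                # rows are [0, 1, 2], [1, 2, 3], ..., [5, 6, 7]
print(blocks.mean(axis=1))   # one mean per block, no Python loop: [1. 2. 3. 4. 5. 6.]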
kw = dict(n1=2, n2=20, block=6)
df['ud'] = despike(df['u'], **kw)
df['vd'] = despike(df['v'], **kw)
fig, (ax0, ax1) = plt.subplots(nrows=2, sharex=True, sharey=True)
ax0.plot(df.index, df['u'], 'b')
ax1.plot(df.index, df['v'], 'b')
mask = np.isnan(df['ud'])
ax0.plot(df.index[mask], df['u'][mask], 'ro', alpha=0.3, label='Two-pass')
mask = np.isnan(df['vd'])
ax1.plot(df.index[mask], df['v'][mask], 'ro', alpha=0.3, label='Two-pass')
ax1.xaxis.set_major_formatter(formatter)
mask = np.abs(df['u']) > 3 * np.abs(df['u']).std()
ax0.plot(df.index[mask], df['u'][mask], 'g.', label=r'$3\times \sigma$')
mask = np.abs(df['v']) > 3 * np.abs(df['v']).std()
ax1.plot(df.index[mask], df['v'][mask], 'g.', label=r'$3\times \sigma$')
ax0.legend(numpoints=1, bbox_to_anchor=(1.35, 0.15))
_ = ax0.text(df.index[150], 40, 'u')
_ = ax1.text(df.index[150], 40, 'v')
The "two-pass" method did well for u while the $3\times \sigma$ criteria failed not only to identify all the spikes, but also removed good data from the time series. On the other had, the $3\times \sigma$ criteria removed two bad values from v while the two pass could one find one. Still both methods left behind two small spikes on v.
There are probably better ways of doing this, and there are methods that are more statistically robust and more efficient at catching bad values. Still, the message of this post is not really an algorithm to remove bad data, but:
"To identify errors, it is necessary to examine all of the data in visual form and to get a "feel" for the data." -- directly from Emery and Thomson book.
In other words, there is no silver bullet!
http://www.johndcook.com/blog/2013/05/30/there-are-no-outliers/