Turning ripples into waves

Binstar/conda package stats

Conda and binstar are changing the packaging world of Python. Conda made it easy to install relocatable Python binaries that were hard to build, while binstar provides a "Linux repository-like system" (or, if you are younger than me, an AppStore-like system) to host custom binaries.

Taking advantage of that, IOOS created a binstar channel with Met-ocean themed packages for Windows, Linux, and MacOS. Note that if you are using Red Hat Enterprise Linux or CentOS, you should use the rhel6 channel to avoid the GLIBC problem.

All the conda-recipes are open and kept in a GitHub repository. (And accepting PRs ;-)

In this post I will not show how to install and configure conda with this channel. It has been done already here and here. In this post I will scrape the binstar channel stats to evaluate how the channel is doing.

First, some handy functions to parse the dates and the package names, and to save all the data into a pandas DataFrame.

In [2]:
import requests
import numpy as np
from datetime import date
from pandas import DataFrame
from bs4 import BeautifulSoup
from dateutil.relativedelta import relativedelta

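def todatetime(string):
    # Expects three whitespace-separated tokens: a number, a period,
    # and a trailing word.
    number, period, _ = string.split()
    if 'day' in period:
        delta = relativedelta(days=int(number))
    elif 'month' in period:
        delta = relativedelta(months=int(number))
    elif 'hour' in period:
        delta = relativedelta(hours=int(number))
    else:
        raise ValueError("Unexpected period {!r}".format(period))
    # Subtract the age from today to get the upload date.
    return date.today() - delta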

def parse_name(cell):
    name = cell.text.strip().split('/')
    if len(name) != 2:
        name = cell.text.strip().split('\\')
    arch = '{}'.format(name[0].split()[1])
    name = '{}'.format(name[1].split('.tar.bz2')[0])
    return arch, name

def get_df(package):
    url = "{}/files".format
    r = requests.get(url(package))
    r.raise_for_status()  # Raise HTTPError when the page is missing.
    soup = BeautifulSoup(r.text, 'html.parser')
    table = soup.find("table", {"class": "table table-condensed table-striped"})

    downloads, uploaded, platforms, names = [], [], [], []
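    for row in table.findAll('tr'):
        col = row.findAll('td')
        if len(col) == 7:
            platform, name = parse_name(col[3])
            platforms.append(platform)
            names.append(name)
            # Assumed layout: column 4 holds the upload age and
            # column 6 the download count.
            uploaded.append(todatetime(col[4].text.strip()))
            downloads.append(int(col[6].text.strip()))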
    df = DataFrame(data=np.c_[platforms, names, uploaded, downloads],
                   columns=['platform', 'name', 'uploaded', 'downloads'])
    df.set_index('uploaded', inplace=True, drop=True)
    df['downloads'] = df['downloads'].astype(int)
    return df
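
To make the name parsing concrete, here is a minimal sketch that runs parse_name on a stand-in cell instead of a real page. The `Cell` stand-in and the sample cell texts are hypothetical; I am assuming the cell reads as a label, the platform, a slash (or a backslash for some Windows entries), and then the file name.

```python
from collections import namedtuple

# Hypothetical stand-in for a BeautifulSoup <td> tag:
# parse_name only reads the `.text` attribute.
Cell = namedtuple('Cell', 'text')


def parse_name(cell):
    # Same logic as the helper above, repeated so this block runs alone.
    name = cell.text.strip().split('/')
    if len(name) != 2:
        name = cell.text.strip().split('\\')
    arch = name[0].split()[1]
    name = name[1].split('.tar.bz2')[0]
    return arch, name


# Made-up cell texts, one with a slash and one with a backslash.
unix_cell = Cell(text='  conda linux-64/iris-1.7.2-np19py27_0.tar.bz2  ')
win_cell = Cell(text='conda win-64\\iris-1.7.2-np19py27_0.tar.bz2')

unix_result = parse_name(unix_cell)
win_result = parse_name(win_cell)
print(unix_result)  # ('linux-64', 'iris-1.7.2-np19py27_0')
print(win_result)   # ('win-64', 'iris-1.7.2-np19py27_0')
```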

All the data we need is in the repodata.json file. There isn't an API to access that via the command line (yet), which is why we need to scrape it.

In [ ]:
from requests import HTTPError
from pandas import read_json

json = ""
df = read_json(json)

packages = sorted(set(['-'.join(pac.split('-')[:-2]) for pac in df.index]))

dfs = dict()
for pac in packages:
    try:
        dfs.update({pac: get_df(pac)})
    except HTTPError:
        # Skip packages whose page cannot be fetched.
        continue
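
The package list comes from trimming the version and build string off each file name. A small sketch of that step, assuming the index holds names in conda's name-version-build layout (the file names here are made up, and `package_name` is a hypothetical helper wrapping the one-liner used above):

```python
def package_name(filename):
    # Drop the last two dash-separated fields (version and build string),
    # keeping any dashes that belong to the package name itself.
    return '-'.join(filename.split('-')[:-2])


# Hypothetical file names; two builds of one package and a dashed name.
files = [
    'iris-1.7.2-np19py27_0.tar.bz2',
    'iris-1.8.0-np19py27_0.tar.bz2',
    'python-dateutil-2.4.0-py27_0.tar.bz2',
]

# A set removes the duplicates from multiple versions of one package.
packages = sorted({package_name(f) for f in files})
print(packages)  # ['iris', 'python-dateutil']
```

Note that this relies on neither the version nor the build string containing a dash.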

Now let's split the various platforms and compute the total number of downloads for each package.

In [4]:
def get_plat_total(df):
    package = dict()
    for plat in ['linux-64', 'osx-64', 'win-32', 'win-64']:
        total = df.query('platform == "{}"'.format(plat)).sum()
        package.update({plat: total['downloads']})
    return package

packages = dict()
for pac in dfs.keys():
    df = dfs[pac]
    packages.update({pac: get_plat_total(df)})
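
As a cross-check, the per-platform totals for a single package can also be sketched with a groupby instead of one query per platform. This is an alternative to get_plat_total, not what the notebook runs, and the rows are made-up values in the shape I expect get_df to return.

```python
import pandas as pd

# Hypothetical rows for one package (values are made up).
df = pd.DataFrame({
    'platform': ['linux-64', 'linux-64', 'osx-64', 'win-32', 'win-64'],
    'name': ['foo-1.0-py27_0', 'foo-1.1-py27_0', 'foo-1.0-py27_0',
             'foo-1.0-py27_0', 'foo-1.0-py27_0'],
    'downloads': [10, 5, 7, 2, 3],
})

platforms = ['linux-64', 'osx-64', 'win-32', 'win-64']

# Sum downloads per platform; reindex so that a platform with no
# files still shows up, with a total of 0.
totals = (df.groupby('platform')['downloads']
            .sum()
            .reindex(platforms, fill_value=0))
print(totals.to_dict())
```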

In [5]:
df = DataFrame.from_dict(packages).T
df['sum'] = df.T.sum()
df.sort_values('sum', ascending=False, inplace=True)
df.drop('sum', axis=1, inplace=True)

And here is the result:

In [6]:
%matplotlib inline
import seaborn
import matplotlib.pyplot as plt

stride = 19  # 19 x 5 = 95
kw = dict(kind='bar', stacked=True)

# One figure per block of `stride` packages, five figures in total.
for k in range(5):
    fig, ax = plt.subplots(figsize=(11, 3))
    ax = df.iloc[stride*k:stride*(k+1)].plot(ax=ax, **kw)

In [7]:
df['win'] = df['win-32'] + df['win-64']

total = df[['linux-64', 'osx-64', 'win']].sum()

fig, ax = plt.subplots(figsize=(7, 3))
ax = total.plot(ax=ax, kind='bar')

Right now it is hard to make sense of the data. That is because a download might be direct or indirect, via a package dependency. Also, our own build system downloads the dependencies when building new packages or updating existing ones in the channel. One conclusion that we may draw from this is that the Windows packages are as popular as the Linux packages!

This post was written as an IPython notebook. It is available for download or as a static HTML page.

Creative Commons License
python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at