Turning ripples into waves

Web scraping 101 (or how to get ready for a long trip)

In this post I will show how to write a simple script to scrape a webpage with a list of podcast links. I did that while preparing to go to Austin. It is a nice way to use my extra "airport time" to study a little bit.

The first step is to list all the links in the podcast webpage,

In [3]:
import requests
from bs4 import BeautifulSoup, SoupStrainer

def urllister(url):
    r = requests.get(url)
    soup = r.content
    urls = []
    # Parse only the `<a>` tags to save time and memory.
    for link in BeautifulSoup(soup, "html.parser",
                              parse_only=SoupStrainer('a')):
        try:
            if link.has_attr('href'):
                urls.append(link['href'])
        except AttributeError:
            pass  # Skip NavigableStrings, which have no attributes.
    return urls
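The `SoupStrainer` tells BeautifulSoup to parse nothing but the `<a>` tags. A quick check on a tiny made-up HTML snippet shows what survives the strainer and the `href` filter:

```python
from bs4 import BeautifulSoup, SoupStrainer

html = '<a href="ep01.mp3">Ep. 1</a><p>no link</p><a>no href</a>'
only_a = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [link["href"] for link in only_a.find_all("a") if link.has_attr("href")]
print(hrefs)  # ['ep01.mp3']
```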

and filter it by the file extension you want to download:

In [4]:
import fnmatch

def filter_url(urls, filetype="*.mp3"):
    return (fname for fname in fnmatch.filter(urls, filetype))
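To see the filter in action, here is a hand-made list of file names (all made up for the example):

```python
import fnmatch

links = ["ep01.mp3", "cover.jpg", "ep02.mp3", "feed.xml"]
mp3s = fnmatch.filter(links, "*.mp3")
print(mp3s)  # ['ep01.mp3', 'ep02.mp3']
```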

Now we need to create a download function. I do not remember where I got the function below. It is probably a mixture of StackOverflow and some customizations. The beauty of this function is that it can resume a partial download and displays a nice progress bar.

In [5]:
import os
import sys

try:  # Python 3.
    from urllib.error import HTTPError
    from urllib.request import FancyURLopener
except ImportError:  # Python 2.
    from urllib2 import HTTPError
    from urllib import FancyURLopener

from progressbar import ProgressBar

class URLOpener(FancyURLopener):
    """Subclass to override error 206 (partial file being sent)."""
    def http_error_206(self, url, fp, errcode, errmsg, headers, data=None):
        pass  # Ignore the expected "non-error" code.

def download(fname, url, verbose=False):
    """Resume download."""
    current_size = 0
    url_obj = URLOpener()
    if os.path.exists(fname):
        output = open(fname, "ab")
        current_size = os.path.getsize(fname)
        # If the file exists, then download only the remainder.
        url_obj.addheader("Range", "bytes=%s-" % (current_size))
    else:
        output = open(fname, "wb")

    web_page = url_obj.open(url)

    if verbose:
        for key, value in web_page.headers.items():
            sys.stdout.write("{} = {}\n".format(key, value))

    # If we already have the whole file, there is no need to download it again.
    num_bytes = 0
    full_size = int(web_page.headers['Content-Length'])
    if full_size == current_size:
        msg = "File ({}) was already downloaded from URL ({})".format
        sys.stdout.write(msg(fname, url))
    elif full_size == 0:
        sys.stdout.write("Full file size equals zero! "
                         "Try again later or check the file.")
    else:
        if verbose:
            msg = "Downloading {:d} more bytes".format
            sys.stdout.write(msg(full_size - current_size))
        pbar = ProgressBar(maxval=full_size).start()
        while True:
            try:
                data = web_page.read(8192)
            except ValueError:
                break
            if not data:
                break
            output.write(data)
            num_bytes = num_bytes + len(data)
            pbar.update(num_bytes)
        pbar.finish()
    web_page.close()
    output.close()

    if verbose:
        msg = "Downloaded {} bytes from {}".format
        sys.stdout.write(msg(num_bytes, web_page.url))
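On Python 3 the same resume trick can be sketched more compactly with the standard library alone (the function names below are my own, not from the code above):

```python
import os
from urllib.request import Request, urlopen

def range_header(current_size):
    """Ask the server to send only the bytes we are still missing."""
    return {"Range": "bytes={}-".format(current_size)} if current_size else {}

def resume_download(fname, url, chunk_size=8192):
    """Sketch: resume a partial download via an HTTP Range request."""
    current_size = os.path.getsize(fname) if os.path.exists(fname) else 0
    req = Request(url, headers=range_header(current_size))
    with urlopen(req) as resp:
        # 206 Partial Content means the server honored the Range header;
        # anything else means we must rewrite the file from scratch.
        mode = "ab" if resp.status == 206 else "wb"
        with open(fname, mode) as output:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                output.write(chunk)
```

Note that not every server honors `Range` requests, hence the check on the status code before deciding whether to append or overwrite.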

Now find a URL with the podcasts you want and start scraping. Be nice and sleep a little bit before each download!

from time import sleep

podcasts = range(0, 101)
uri = ("http://some-url-with-podcasts/podcast-{}.mp3".format)

for podcast in podcasts:
    url = uri(podcast)
    print(url + '\n')
    try:
        fname = url.split('/')[-1]
        download(fname, url, verbose=True)
    except HTTPError:
        print('Cannot download {}\n'.format(url))
    sleep(2)  # Be nice to the server between downloads.

Be sure to read the page's terms of use. Some podcast providers do not like scraping!

I will be listening to some Spanish classes. Nope, just lost my phone at the airport... I won't be listening to anything :-(

This post was written as an IPython notebook. It is available for download or as a static html.

Creative Commons License
python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at