python4oceanographers

Turning ripples into waves

Downloading several hdf files from a server

This post is a quick example of how to download several HDF4 files by "scraping" NASA's server.

Manually downloading several hdf files is, most of the time, impractical. Some time ago I helped a friend solve a similar problem with a simple Python script. This post is just a review of that script so others can modify/re-use it for similar cases.

All we need is a lister function to extract the URLs from the web page, and a hook function to show the download progress.

In [2]:
import sys
import urllib.request
import fnmatch
import lxml.html

def url_lister(url):
    """Return the href of every <a> element found at *url*."""
    # Close the connection as soon as the page has been read.
    with urllib.request.urlopen(url) as connection:
        dom = lxml.html.fromstring(connection.read())
    return list(dom.xpath('//a/@href'))


def progress_hook(out):
    """Return a progress hook function, suitable for passing to
    urllib.request.urlretrieve, that writes to the file object *out*."""
    def it(n, bs, ts):
        # n: blocks transferred so far, bs: block size in bytes,
        # ts: total size in bytes (not positive when unknown).
        got = n * bs
        if ts <= 0:
            outof = ''
        else:
            # On the last block n*bs can exceed ts, so we clamp it
            # to avoid awkward questions.
            got = min(got, ts)
            outof = '/%d [%d%%]' % (ts, 100 * got // ts)
        out.write("\r  %d%s" % (got, outof))
        out.flush()
    return it

Here we will download the Mapped Monthly mean 4km CZCS data, but this script can be extended to any web-page that has a list of urls.

In [3]:
url = "http://oceandata.sci.gsfc.nasa.gov/CZCS/Mapped/Monthly/4km/chlor/"
urls = url_lister(url)
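
One thing to watch for when adapting this to other pages: `href` values are often relative paths rather than full URLs. Below is a minimal sketch of resolving them against the page address with `urllib.parse.urljoin`. The helper name `absolute_urls` and the example.com addresses are hypothetical, used only for illustration, and are not part of the original script.

```python
from urllib.parse import urljoin


def absolute_urls(base, hrefs):
    """Resolve each href against *base* (hypothetical helper).

    Links that are already absolute are returned unchanged.
    """
    return [urljoin(base, href) for href in hrefs]


# Made-up links: one relative, one root-relative, one already absolute.
base = "http://example.com/data/"
hrefs = ["a.bz2", "/b.bz2", "http://other.example.org/c.bz2"]
resolved = absolute_urls(base, hrefs)
for link in resolved:
    print(link)
```

If the page already lists full URLs, this step changes nothing and can be skipped.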

Before downloading let's filter by the filename extension (bz2), so we download just what we really want.

In [4]:
filetype = "*.bz2"
file_list = fnmatch.filter(urls, filetype)
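
To see what the filter keeps, here is a small sketch on a made-up list of links. The entries are hypothetical stand-ins for what a directory listing might contain; only the first `.bz2` file name is taken from this post. Note that the `*` in `fnmatch` also matches `/`, so the same pattern works on full URLs and on bare file names.

```python
import fnmatch

# Made-up links of the kind a directory listing might contain.
links = [
    "?C=N;O=D",
    "http://example.com/data/README.txt",
    "http://example.com/data/C19782741978304.L3m_MO_CHL_chlor_a_4km.bz2",
    "C19783051978334.L3m_MO_CHL_chlor_a_4km.bz2",  # hypothetical file name
]

# Keep only the entries ending in .bz2; sorting links and text files drop out.
matches = fnmatch.filter(links, "*.bz2")
for m in matches:
    print(m)
```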

Now we can download the whole list in a for loop, but here I'll get only one file.

In [5]:
url = file_list[0]
hdf_file = url.split('/')[-1]
sys.stdout.write(hdf_file + '\n')
urllib.request.urlretrieve(url, filename=hdf_file,
                           reporthook=progress_hook(sys.stdout))
sys.stdout.write('\n')
sys.stdout.flush()
C19782741978304.L3m_MO_CHL_chlor_a_4km.bz2
  64861/64861 [100%]
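
The for loop mentioned above could look something like the sketch below. The function name `download_all` and its `dest_dir` and `retrieve` parameters are my own hypothetical choices for illustration. It also assumes that a file already on disk is a finished download, which may not hold if an earlier run was interrupted.

```python
import os
import urllib.request


def download_all(urls, dest_dir=".", reporthook=None,
                 retrieve=urllib.request.urlretrieve):
    """Download every URL in *urls* into *dest_dir* (hypothetical helper).

    Returns the list of files written in this call. *retrieve* defaults
    to urllib.request.urlretrieve and is only a parameter so that it can
    be swapped out, e.g. when trying the loop without a network.
    """
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for url in urls:
        fname = os.path.join(dest_dir, url.split('/')[-1])
        if os.path.exists(fname):
            # Assumption: an existing file is a complete download.
            continue
        retrieve(url, filename=fname, reporthook=reporthook)
        saved.append(fname)
    return saved
```

With the names defined earlier in this post, a call might be `download_all(file_list, "czcs", reporthook=progress_hook(sys.stdout))`, where the `czcs` directory name is just an example.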

This post was written as an IPython notebook. It is available for download or as a static html.

Creative Commons License
python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://ocefpaf.github.io/.
