This post is a quick example of how to download several HDF4 files by "scraping" NASA's server. Manually downloading several HDF files is, most of the time, impractical. Some time ago I helped a friend with a similar problem using a simple Python script. This post is just a review of that script so others can modify/re-use it for similar cases.
All we need is a lister function to extract the URLs from the web page, and a hook function to show the download progress.
import sys
import fnmatch
import urllib.request

import lxml.html


def url_lister(url):
    """Return a list with every href found on the page at *url*."""
    urls = []
    connection = urllib.request.urlopen(url)
    dom = lxml.html.fromstring(connection.read())
    for link in dom.xpath('//a/@href'):
        urls.append(link)
    return urls


def progress_hook(out):
    """Return a progress hook function, suitable for passing to
    urllib.request.urlretrieve, that writes to the file object *out*."""
    def it(n, bs, ts):
        got = n * bs
        if ts < 0:
            outof = ''
        else:
            # On the last block n*bs can exceed ts, so we clamp it
            # to avoid awkward questions.
            got = min(got, ts)
            outof = '/%d [%d%%]' % (ts, 100 * got // ts)
        out.write("\r %d%s" % (got, outof))
        out.flush()
    return it
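urllib.request.urlretrieve calls the hook with the block count, the block size, and the total size reported by the server. As a quick sanity check we can call it by hand (the numbers below are made up, just for illustration):

hook = progress_hook(sys.stdout)
# 10 blocks of 8192 bytes out of a 1,000,000-byte file -> " 81920/1000000 [8%]".
hook(10, 8192, 1000000)
sys.stdout.write('\n')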
Here we will download the Mapped Monthly mean 4km CZCS data, but this script can be extended to any web page that has a list of URLs.
url = "http://oceandata.sci.gsfc.nasa.gov/CZCS/Mapped/Monthly/4km/chlor/"
urls = url_lister(url)
Before downloading, let's filter by the filename extension (bz2) so we download just what we really want.
filetype = "*.bz2"
file_list = fnmatch.filter(urls, filetype)
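Note that some servers list relative hrefs instead of full URLs. If that is the case, one possible fix (a small sketch using urllib.parse.urljoin and the base url defined above) is to join each filtered link with the base address before downloading:

from urllib.parse import urljoin

# If the links are relative, turn them into absolute URLs; urljoin leaves
# links that are already absolute untouched.
file_list = [urljoin(url, filename) for filename in file_list]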
Now we could download the whole list in a for loop (a sketch of that loop is shown at the end), but here I'll get only one file.
url = file_list[0]
hdf_file = url.split('/')[-1]
sys.stdout.write(hdf_file + '\n')
urllib.request.urlretrieve(url, filename=hdf_file,
                           reporthook=progress_hook(sys.stdout))
sys.stdout.write('\n')
sys.stdout.flush()
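For completeness, here is a sketch of the full loop over file_list, using the same calls as above for every file:

for url in file_list:
    hdf_file = url.split('/')[-1]
    sys.stdout.write(hdf_file + '\n')
    urllib.request.urlretrieve(url, filename=hdf_file,
                               reporthook=progress_hook(sys.stdout))
    sys.stdout.write('\n')
    sys.stdout.flush()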