Catalog based search for the IOOS Regional Associations using UUID

In the previous example we investigated whether it is possible to query the NGDC CSW Catalog to extract records matching an IOOS RA acronym. However, we could not trust the results: some RAs returned just a few records, or none at all, like AOOS and PacIOOS respectively.

We can make a more robust search using the UUID rather than the acronym, because every record is associated with a UUID. The disadvantage is that we need to keep track of a long and unintelligible identifier.

As usual, let's start by instantiating the CSW catalog object.

In [3]:
from owslib.csw import CatalogueServiceWeb

endpoint = 'http://www.ngdc.noaa.gov/geoportal/csw'
csw = CatalogueServiceWeb(endpoint, timeout=30)

We will use the same list of all the Regional Associations as before, but now we will match them with the corresponding UUID from the IOOS registry.

In [4]:
import pandas as pd

ioos_ras = ['AOOS',      # Alaska
            'CaRA',      # Caribbean
            'CeNCOOS',   # Central and Northern California
            'GCOOS',     # Gulf of Mexico
            'GLOS',      # Great Lakes
            'MARACOOS',  # Mid-Atlantic
            'NANOOS',    # Pacific Northwest 
            'NERACOOS',  # Northeast Atlantic 
            'PacIOOS',   # Pacific Islands 
            'SCCOOS',    # Southern California
            'SECOORA']   # Southeast Atlantic

url = 'https://raw.githubusercontent.com/ioos/registry/master/uuid.csv'

df = pd.read_csv(url, index_col=0, header=0, names=['UUID'])
df['UUID'] = df['UUID'].str.strip()

The function below is similar to the one we used before. Note the same matching (PropertyIsEqualTo), but a different property name (sys.siteuuid rather than apiso:Keywords).

That is the key difference for the robustness of the search: keywords are not always defined and might return bogus matches, while a UUID always maps to exactly one RA.

In [5]:
from owslib.fes import PropertyIsEqualTo

def query_ra(csw, uuid):
    q = PropertyIsEqualTo(propertyname='sys.siteuuid', literal='%s' % uuid)
    csw.getrecords2(constraints=[q], maxrecords=2000, esn='full')
    return csw
Here is what we got:
In [6]:
for ra in ioos_ras:
    try:
        uuid = df.ix[ra]['UUID']
        csw = query_ra(csw, uuid)
        ret = csw.results['returned']
        word = 'records' if ret > 1 else 'record'
        print("{0:>8} has {1:>4} {2}".format(ra, ret, word))
        csw.records.clear()
    except KeyError:
        pass
    AOOS has   74 records
   GCOOS has    8 records
    GLOS has   20 records
MARACOOS has  468 records
  NANOOS has    8 records
NERACOOS has 1109 records
 PacIOOS has  192 records
  SCCOOS has   23 records
 SECOORA has  100 records

Compare the results above with cell [6] from before. Note that now we get 192 records for PacIOOS and 74 for AOOS!
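
If we want that comparison in a more convenient form, we can collect the counts into a pandas Series while looping (a small sketch that re-uses query_ra, df, and ioos_ras from the cells above):

counts = {}
for ra in ioos_ras:
    try:
        uuid = df.ix[ra]['UUID']
        counts[ra] = query_ra(csw, uuid).results['returned']
    except KeyError:
        counts[ra] = 0  # RA not found in the registry table.

uuid_counts = pd.Series(counts, name='UUID search')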

You can see the original notebook here.


This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.

Catalog based search for the IOOS Regional Associations acronyms

The goal of this post is to investigate if it is possible to query the NGDC CSW Catalog to extract records matching an IOOS RA acronym, like SECOORA for example.

In the cell below we do the usual: instantiate a Catalogue Service Web (csw) object using the NGDC catalog endpoint.

In [3]:
from owslib.csw import CatalogueServiceWeb

endpoint = 'http://www.ngdc.noaa.gov/geoportal/csw'
csw = CatalogueServiceWeb(endpoint, timeout=30)

We need a list of all the Regional Associations we know.

In [4]:
ioos_ras = ['AOOS',      # Alaska
            'CaRA',      # Caribbean
            'CeNCOOS',   # Central and Northern California
            'GCOOS',     # Gulf of Mexico
            'GLOS',      # Great Lakes
            'MARACOOS',  # Mid-Atlantic
            'NANOOS',    # Pacific Northwest 
            'NERACOOS',  # Northeast Atlantic 
            'PacIOOS',   # Pacific Islands 
            'SCCOOS',    # Southern California
            'SECOORA']   # Southeast Atlantic

To streamline the query we can create a function that instantiates the FES filter and returns the records.

In [5]:
from owslib.fes import PropertyIsEqualTo

def query_ra(csw, ra='SECOORA'):
    q = PropertyIsEqualTo(propertyname='apiso:Keywords', literal=ra)
    csw.getrecords2(constraints=[q], maxrecords=100, esn='full')
    return csw
Here is what we got:
In [6]:
for ra in ioos_ras:
    csw = query_ra(csw, ra)
    ret = csw.results['returned']
    word = 'records' if ret > 1 else 'record'
    print("{0:>8} has {1:>3} {2}".format(ra, ret, word))
    csw.records.clear()
    AOOS has   1 record
    CaRA has   0 record
 CeNCOOS has   7 records
   GCOOS has   5 records
    GLOS has  15 records
MARACOOS has 100 records
  NANOOS has   1 record
NERACOOS has 100 records
 PacIOOS has   0 record
  SCCOOS has  23 records
 SECOORA has  71 records

I would not trust those numbers completely. Surely some of the RAs listed above have more than 0/1 record.

Note that we have more information in csw.records. Let's inspect one of SECOORA's stations, for example.

In [7]:
csw = query_ra(csw, 'SECOORA')
key = csw.records.keys()[0]

print(key)
id_usf.tas.ngwlms

We can verify the station type, title, and last date of modification.

In [8]:
station = csw.records[key]

station.type, station.title, station.modified
Out[8]:
('downloadableData', 'usf.tas.ngwlms', '2015-11-25T01:32:42-07:00')

The subjects field contains the variables and some useful keywords.

In [9]:
station.subjects
Out[9]:
['air_pressure',
 'air_temperature',
 'water_surface_height_above_reference_datum',
 'wind_from_direction',
 'wind_speed_of_gust',
 'wind_speed',
 'SECOORA',
 'air_pressure',
 'air_temperature',
 'water_surface_height_above_reference_datum',
 'wind_from_direction',
 'wind_speed_of_gust',
 'wind_speed',
 'latitude',
 'longitude',
 'time',
 'climatologyMeteorologyAtmosphere']
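
Note the duplicated entries above. As a quick aside (not in the original notebook), the unique subjects can be obtained with:

sorted(set(station.subjects))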

And we can access the full XML description for the station.

In [10]:
print(station.xml)
<csw:Record xmlns:csw="http://www.opengis.net/cat/csw/2.0.2" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcmiBox="http://dublincore.org/documents/2000/07/11/dcmi-box/" xmlns:dct="http://purl.org/dc/terms/" xmlns:gml="http://www.opengis.net/gml" xmlns:ows="http://www.opengis.net/ows" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<dc:identifier scheme="urn:x-esri:specification:ServiceType:ArcIMS:Metadata:FileID">id_usf.tas.ngwlms</dc:identifier>
<dc:identifier scheme="urn:x-esri:specification:ServiceType:ArcIMS:Metadata:DocID">{9DDE8E32-EB36-4E72-B2CC-A47D51151271}</dc:identifier>
<dc:title>usf.tas.ngwlms</dc:title>
<dc:type scheme="urn:x-esri:specification:ServiceType:ArcIMS:Metadata:ContentType">downloadableData</dc:type>
<dc:type scheme="urn:x-esri:specification:ServiceType:ArcIMS:Metadata:ContentType">liveData</dc:type>
<dc:subject>air_pressure</dc:subject>
<dc:subject>air_temperature</dc:subject>
<dc:subject>water_surface_height_above_reference_datum</dc:subject>
<dc:subject>wind_from_direction</dc:subject>
<dc:subject>wind_speed_of_gust</dc:subject>
<dc:subject>wind_speed</dc:subject>
<dc:subject>SECOORA</dc:subject>
<dc:subject>air_pressure</dc:subject>
<dc:subject>air_temperature</dc:subject>
<dc:subject>water_surface_height_above_reference_datum</dc:subject>
<dc:subject>wind_from_direction</dc:subject>
<dc:subject>wind_speed_of_gust</dc:subject>
<dc:subject>wind_speed</dc:subject>
<dc:subject>latitude</dc:subject>
<dc:subject>longitude</dc:subject>
<dc:subject>time</dc:subject>
<dc:subject>climatologyMeteorologyAtmosphere</dc:subject>
<dct:modified>2015-11-25T01:32:42-07:00</dct:modified>
<dct:references scheme="urn:x-esri:specification:ServiceType:distribution:url">http://tds.secoora.org/thredds/dodsC/usf.tas.ngwlms.nc.html</dct:references>
<dct:references scheme="urn:x-esri:specification:ServiceType:distribution:url">http://www.ncdc.noaa.gov/oa/wct/wct-jnlp-beta.php?singlefile=http://tds.secoora.org/thredds/dodsC/usf.tas.ngwlms.nc</dct:references>
<dct:references scheme="urn:x-esri:specification:ServiceType:sos:url">http://tds.secoora.org/thredds/sos/usf.tas.ngwlms.nc?service=SOS&amp;version=1.0.0&amp;request=GetCapabilities</dct:references>
<dct:references scheme="urn:x-esri:specification:ServiceType:odp:url">http://tds.secoora.org/thredds/dodsC/usf.tas.ngwlms.nc</dct:references>
<dct:references scheme="urn:x-esri:specification:ServiceType:download:url">http://tds.secoora.org/thredds/dodsC/usf.tas.ngwlms.nc.html</dct:references>
<ows:WGS84BoundingBox>
<ows:LowerCorner>-82.75800323486328 28.1560001373291</ows:LowerCorner>
<ows:UpperCorner>-82.75800323486328 28.1560001373291</ows:UpperCorner>
</ows:WGS84BoundingBox>
<ows:BoundingBox>
<ows:LowerCorner>-82.75800323486328 28.1560001373291</ows:LowerCorner>
<ows:UpperCorner>-82.75800323486328 28.1560001373291</ows:UpperCorner>
</ows:BoundingBox>
<dc:source>{B3EA8869-B726-4E39-898A-299E53ABBC98}</dc:source>
</csw:Record>


This query is very simple, but also very powerful. We can quickly assess the data available for a certain Regional Association with just a few lines of code.
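
If we find ourselves doing this often, the whole workflow fits in a tiny helper (a sketch built from the cells above, not part of the original notebook):

def records_for_keyword(csw, keyword):
    # Run the keyword query and return the titles of every matching record.
    csw = query_ra(csw, keyword)
    return [record.title for record in csw.records.values()]

# Usage: records_for_keyword(csw, 'SECOORA')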

You can see the original notebook here.


This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.

What information is available?

A common task is to find out what information is available for further research later on.

We can programmatically build a list of strings to query common data catalogs and find out what services are available. This post will show how to perform a query for numerical model strings and try to answer the question: how many services are available in each catalog?

To answer that question we will start by building a list of known catalog services.

(This post is part of Theme 1 - Scenario A.)

In [3]:
known_csw_servers = ['http://data.nodc.noaa.gov/geoportal/csw',
                     'http://cwic.csiss.gmu.edu/cwicv1/discovery',
                     'http://geoport.whoi.edu/geoportal/csw',
                     'https://edg.epa.gov/metadata/csw',
                     'http://www.ngdc.noaa.gov/geoportal/csw',
                     'http://cmgds.marine.usgs.gov/geonetwork/srv/en/csw',
                     'http://www.nodc.noaa.gov/geoportal/csw',
                     'http://cida.usgs.gov/gdp/geonetwork/srv/en/csw',
                     'http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw',
                     'http://geoport.whoi.edu/gi-cat/services/cswiso',
                     'https://data.noaa.gov/csw']

And a list of known model strings to query.

In [4]:
known_model_strings = ['roms', 'selfe', 'adcirc', 'ncom',
                       'hycom', 'fvcom', 'pom', 'wrams', 'wrf']
In [5]:
from owslib import fes

model_name_filters = []
for model in known_model_strings:
    kw = dict(literal='*%s*' % model, wildCard='*')
    title_filter = fes.PropertyIsLike(propertyname='apiso:Title', **kw)
    subject_filter = fes.PropertyIsLike(propertyname='apiso:Subject', **kw)
    model_name_filters.append(fes.Or([title_filter, subject_filter]))

The FES filter we built above is simpler than what we did before. We are only looking for matches in Title or Subject that contain the model strings.

In [6]:
from owslib.csw import CatalogueServiceWeb

model_results = []

for x in range(len(model_name_filters)):
    model_name = known_model_strings[x]
    single_model_filter = model_name_filters[x]
    for url in known_csw_servers:
        try:
            csw = CatalogueServiceWeb(url, timeout=20)
            csw.getrecords2(constraints=[single_model_filter],
                            maxrecords=1000, esn='full')
            for record, item in csw.records.items():
                for d in item.references:
                    result = dict(model=model_name,
                                  scheme=d['scheme'],
                                  url=d['url'],
                                  server=url)
                    model_results.append(result)
        except BaseException as e:
            print("- FAILED: {} - {}".format(url, e))
- FAILED: http://cwic.csiss.gmu.edu/cwicv1/discovery - ('Connection aborted.', error(111, 'Connection refused'))
- FAILED: https://edg.epa.gov/metadata/csw - ('Connection aborted.', error(104, 'Connection reset by peer'))
- FAILED: http://cida.usgs.gov/gdp/geonetwork/srv/en/csw - Space required after the Public Identifier, line 1, column 50
- FAILED: http://geodiscover.cgdi.ca/wes/serviceManagerCSW/csw - 404 Client Error: Not Found
- FAILED: http://geoport.whoi.edu/gi-cat/services/cswiso - Space required after the Public Identifier, line 1, column 50
- FAILED: http://data.nodc.noaa.gov/geoportal/csw - ('Connection aborted.', gaierror(-2, 'Name or service not known'))
(The same servers fail for each of the model strings queried.)

Note that some servers have a maximum number of records you can retrieve at once and are failing our query here. (See https://github.com/ioos/system-test/issues/126.)
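
One way to stay under a per-request record limit (a sketch, not used in this notebook) is to page through the results with the startposition argument of getrecords2:

def get_all_records(csw, filter_list, page_size=100):
    # Page through a CSW endpoint until every matching record is fetched.
    records = {}
    startposition = 0
    while True:
        csw.getrecords2(constraints=filter_list, maxrecords=page_size,
                        startposition=startposition, esn='full')
        records.update(csw.records)
        nextrecord = csw.results['nextrecord']
        if nextrecord == 0 or nextrecord > csw.results['matches']:
            break  # No more pages.
        startposition = nextrecord
    return records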

Let's get the data as a pandas.DataFrame.

In [7]:
from pandas import DataFrame

df = DataFrame(model_results)
df = df.drop_duplicates()

And now that we have the results, what do they mean?

First let's plot the total number of services available.

In [8]:
total_services = DataFrame(df.groupby("scheme").size(),
                           columns=(["Number of services"]))

ax = total_services.sort('Number of services',
                         ascending=False).plot(kind="barh", figsize=(10, 8))

Note that identical service types are being identified by different URNs!

There should be a consistent way of representing each service, or a mapping needs to be made available.
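
For example, one possible explicit mapping (illustrative only, using ESRI geoportal scheme URNs like the ones returned in the records) could look like:

urn_to_service = {
    'urn:x-esri:specification:ServiceType:odp:url': 'opendap',
    'urn:x-esri:specification:ServiceType:sos:url': 'sos',
    'urn:x-esri:specification:ServiceType:download:url': 'download',
}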

We can try to get around the issue of the same services being identified differently by relying on the "Scheme" metadata field.

In [9]:
def normalize_service_urn(urn):
    urns = urn.split(':')
    if urns[-1].lower() == "url":
        del urns[-1]
    return urns[-1].lower()


urns = df.copy(deep=True)
urns["urn"] = urns["scheme"].map(normalize_service_urn)
In [10]:
urns_summary = DataFrame(urns.groupby("urn").size(),
                         columns=(["Number of services"]))

ax = urns_summary.sort('Number of services',
                       ascending=False).plot(kind="barh", figsize=(10, 6))

A little better, but still not ideal.

Let's move forward and plot the number of services available for the list of model strings we requested.

Models per CSW server:

In [11]:
records_per_csw = DataFrame(urns.groupby(["model", "server"]).size(),
                            columns=(["Number of services"]))

model_csw_plotter = records_per_csw.unstack("model")

ax = model_csw_plotter['Number of services'].plot(kind='barh', figsize=(10, 8))

Services available per CSW server:

In [12]:
records_per_csw = DataFrame(urns.groupby(["scheme", "server"]).size(),
                            columns=(["Number of services"]))

model_csw_plotter = records_per_csw.unstack("server")
ax = model_csw_plotter.plot(kind='barh', subplots=True,
                            figsize=(12, 30), sharey=True)

Querying several catalogs like we did in this notebook is very slow. This approach should be used only to help determine which catalog we can use, once we know what type of data and service we need.

You can see the original IOOS System Test notebook here.


This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.

Near real-time current data

The System Integration Test is divided into three main themes:

  • Baseline Assessment,
  • Extreme Events, and
  • Species Protection and Marine Habitat Classification.

The second theme, extreme events, serves to test discovery, access, and usefulness of meteorological and oceanographic (metocean) data commonly needed to plan for or respond to extreme events such as coastal storms, oil spills, or search and rescue. The tests in this theme are focused around four questions:

  1. Is it possible to discover long enough records of metocean data to successfully perform extreme values analysis and predict extreme conditions for a given location?
  2. Is it possible to discover and compare modeled and observed metocean parameters for a given geographic location?
  3. What is the extent of relevant meteorological and oceanographic data available in near real-time that could be used to support rapid deployment of a common operational picture to support response?
  4. Is it possible to discover and use information about the consequences of an event (e.g. determining whether extreme water levels cause coastal flooding and to what extent)?

We already know that (2) is possible for sea surface height. We also know that we can query very long time series over OPeNDAP, while SOS is limited to ~1 week of data. Therefore, (1) is possible depending on the endpoint available for the data in question. Finally, to properly answer question (4) we first need to be able to do (3) correctly.

The notebook shown below will try to answer (3) for ocean currents measured by HF-Radar.

We are interested in looking at near real-time data to analyze extreme events for a certain area of interest. Let's start by defining a small bounding box for San Francisco Bay, and query data from the past week up to now.

In [3]:
from datetime import datetime, timedelta


bounding_box_type = "box"
bounding_box = [-123, 36, -121, 40]

# Temporal range.
jd_now = datetime.utcnow()
jd_start, jd_stop = jd_now - timedelta(days=7), jd_now

start_date = jd_start.strftime('%Y-%m-%d %H:00')
stop_date = jd_stop.strftime('%Y-%m-%d %H:00')

jd_start = datetime.strptime(start_date, '%Y-%m-%d %H:%M')
jd_stop = datetime.strptime(stop_date, '%Y-%m-%d %H:%M')


print('{} to {}'.format(start_date, stop_date))
2015-10-19 14:00 to 2015-10-26 14:00

The cell below defines the variable of choice: currents.

In [4]:
data_dict = dict()
sos_name = 'Currents'
data_dict['currents'] = {"names": ['currents',
                                   'surface_eastward_sea_water_velocity',
                                   '*surface_eastward_sea_water_velocity*'],
                         "sos_name": [sos_name]}

The next 3 cells are very similar to what we did in the fetching data example:

  1. Instantiate a csw object using the NGDC catalog.
  2. Create the owslib.fes filter.
  3. Apply the filter!
In [5]:
from owslib.csw import CatalogueServiceWeb


endpoint = 'http://www.ngdc.noaa.gov/geoportal/csw'

csw = CatalogueServiceWeb(endpoint, timeout=60)
In [6]:
from owslib import fes
from utilities import fes_date_filter
                       
start, stop = fes_date_filter(jd_start, jd_stop)
bbox = fes.BBox(bounding_box)

or_filt = fes.Or([fes.PropertyIsLike(propertyname='apiso:AnyText',
                                     literal=('*%s*' % val),
                                     escapeChar='\\', wildCard='*',
                                     singleChar='?') for val in
                  data_dict['currents']['names']])

val = 'Averages'
not_filt = fes.Not([fes.PropertyIsLike(propertyname='apiso:AnyText',
                                       literal=('*%s*' % val),
                                       escapeChar='\\',
                                       wildCard='*',
                                       singleChar='?')])

filter_list = [fes.And([bbox, start, stop, or_filt, not_filt])]
In [7]:
csw.getrecords2(constraints=filter_list, maxrecords=1000, esn='full')

The OPeNDAP URLs found were:

In [8]:
from utilities import service_urls

dap_urls = service_urls(csw.records, service='odp:url')

print("Total DAP: {}".format(len(dap_urls)))
print("\n".join(dap_urls))
Total DAP: 12
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/1km/hourly/GNOME
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/1km/hourly/RTV
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/2km/hourly/GNOME
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/2km/hourly/RTV
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/500m/hourly/GNOME
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/500m/hourly/RTV
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/6km/hourly/GNOME
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/6km/hourly/RTV
http://oos.soest.hawaii.edu/thredds/dodsC/hioos/tide_pac
http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global
http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/scud/pac
http://thredds.axiomdatascience.com/thredds/dodsC/CA_DAS.nc

We found 12 endpoints for currents data. Note that the last four look like model data, while all others are HF-Radar endpoints.
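
A quick way to separate the two groups, using nothing more than the URLs above (a rough heuristic), would be:

# Split the endpoints: HF-Radar aggregations vs. everything else (likely models).
hfradar_urls = [url for url in dap_urls if 'HFRNet' in url]
other_urls = [url for url in dap_urls if 'HFRNet' not in url]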

Can we find more near real-time data?

Let's try the Center for Operational Oceanographic Products and Services (CO-OPS) and the National Data Buoy Center (NDBC).

Both services can be queried using the pyoos library.

In [9]:
from pyoos.collectors.coops.coops_sos import CoopsSos

start_time = datetime.strptime(start_date, '%Y-%m-%d %H:%M')
end_time = datetime.strptime(stop_date, '%Y-%m-%d %H:%M')

coops_collector = CoopsSos()
coops_collector.start_time = start_time
coops_collector.end_time = end_time
coops_collector.variables = data_dict["currents"]["sos_name"]
In [10]:
from pandas import read_csv

box_str = ','.join(str(e) for e in bounding_box)

url = (('http://opendap.co-ops.nos.noaa.gov/ioos-dif-sos/SOS?'
        'service=SOS&request=GetObservation&version=1.0.0&'
        'observedProperty=%s&bin=1&'
        'offering=urn:ioos:network:NOAA.NOS.CO-OPS:CurrentsActive&'
        'featureOfInterest=BBOX:%s&responseFormat=text/csv') % (sos_name,
                                                                box_str))

obs_loc_df = read_csv(url)
In [11]:
from utilities.ioos import processStationInfo

st_list = processStationInfo(obs_loc_df, "coops")

print(st_list)
urn:ioos:station:NOAA.NOS.CO-OPS:s06010
urn:ioos:station:NOAA.NOS.CO-OPS:s08010
urn:ioos:station:NOAA.NOS.CO-OPS:s09010
Number of stations in bbox 3
{'urn:ioos:station:NOAA.NOS.CO-OPS:s06010': {'lat': 38.034599999999998, 'source': 'coops', 'lon': -122.12520000000001}, 'urn:ioos:station:NOAA.NOS.CO-OPS:s09010': {'lat': 37.808199999999999, 'source': 'coops', 'lon': -122.3443}, 'urn:ioos:station:NOAA.NOS.CO-OPS:s08010': {'lat': 37.916200000000003, 'source': 'coops', 'lon': -122.42230000000001}}

We got a few CO-OPS stations inside the time span and bounding box queried. How about NDBC buoys?

In [12]:
from pyoos.collectors.ndbc.ndbc_sos import NdbcSos

ndbc_collector = NdbcSos()
ndbc_collector.start_time = start_time
ndbc_collector.end_time = end_time
ndbc_collector.variables = data_dict["currents"]["sos_name"]
In [13]:
box_str = ','.join(str(e) for e in bounding_box)

url = (('http://sdf.ndbc.noaa.gov/sos/server.php?'
        'request=GetObservation&service=SOS&'
        'version=1.0.0&'
        'offering=urn:ioos:network:noaa.nws.ndbc:all&'
        'featureofinterest=BBOX:%s&'
        'observedproperty=%s&'
        'responseformat=text/csv&') % (box_str, sos_name))
In [14]:
obs_loc_df = read_csv(url)

len(obs_loc_df)
Out[14]:
0

No recent observations were found for NDBC buoys!

We need to get the data from the stations identified above.

(There is some boilerplate code in the cell below to show a progress bar. That can be useful when downloading from many stations, but it is not necessary.)

In [15]:
import uuid
from IPython.display import HTML, Javascript, display
from utilities.ioos import coopsCurrentRequest, ndbcSOSRequest

iso_start = start_time.strftime('%Y-%m-%dT%H:%M:%SZ')
iso_end = end_time.strftime('%Y-%m-%dT%H:%M:%SZ')

date_range_string = iso_start + "/" + iso_end


divid = str(uuid.uuid4())
pb = HTML("""
<div style="border: 1px solid black; width:500px">
  <div id="%s" style="background-color:blue; width:0%%">&nbsp;</div>
</div>
""" % divid)

display(pb)

count = 0
for station_index in st_list.keys():
    st = station_index.split(":")[-1]
    tides_dt_start = jd_start.strftime('%Y%m%d %H:%M')
    tides_dt_end = jd_stop.strftime('%Y%m%d %H:%M')

    if st_list[station_index]['source'] == "coops":
        df = coopsCurrentRequest(st, tides_dt_start, tides_dt_end)
    elif st_list[station_index]['source'] == "ndbc":
        df = ndbcSOSRequest(station_index, date_range_string)

    if (df is not None) and (len(df) > 0):
        st_list[station_index]['hasObsData'] = True
    else:
        st_list[station_index]['hasObsData'] = False
    st_list[station_index]['obsData'] = df

    print('{}'.format(station_index))

    count += 1
    percent_complete = (float(count) / float(len(st_list.keys()))) * 100
    display(Javascript("$('div#%s').width('%i%%')" %
                       (divid, int(percent_complete))))
 
urn:ioos:station:NOAA.NOS.CO-OPS:s06010

urn:ioos:station:NOAA.NOS.CO-OPS:s09010

urn:ioos:station:NOAA.NOS.CO-OPS:s08010

We found two CO-OPS stations and downloaded the data for both. Now let's get the data from the HF-Radar endpoints we found. The function get_hr_radar_dap_data filters the OPeNDAP endpoints, returning data only for the HF-Radar points that are near the observations.

(We are downloading only the 6 km HF-Radar from GNOME. The function can be adapted to get any other resolution and source.)

In [16]:
from utilities.ioos import get_hr_radar_dap_data

df_list = get_hr_radar_dap_data(dap_urls, st_list, jd_start, jd_stop)
http://hfrnet.ucsd.edu/thredds/dodsC/HFRNet/USWC/6km/hourly/GNOME
lat, lon, dist = 145 132
lat, lon, dist = 140 128
lat, lon, dist = 142 127

The model endpoints found in the CSW search are all global. Global model runs are not very useful when looking at such a small scale as the San Francisco Bay.

Since we do know that the SFO ports operational model exists, we can hard-code a data search for SFO and data points near the observations.

In [17]:
from shapely.geometry import Point  # Needed for the distance computations below.


def findSFOIndexs(lats, lons, lat_lon_list):
    # For each (lat, lon) pair find the index of the closest model point.
    index_list, dist_list = [], []
    for lat_lon in lat_lon_list:
        point1 = Point(lat_lon[1], lat_lon[0])
        dist = 999999999
        index = -1
        for k in range(0, len(lats)):
            point2 = Point(lons[k], lats[k])
            d = point1.distance(point2)
            if d < dist:
                index = k
                dist = d
        index_list.append(index)
        dist_list.append(dist)
    print(index_list, dist_list)
    return index_list, dist_list


def buildSFOUrls(jd_start, jd_stop):
    """
    Build the SFBOFS nowcast file URLs covering jd_start to jd_stop.

    There are multiple files per day; we are only looking at nowcast (past)
    values. The available times are 3z, 9z, 15z, and 21z.
    """
    url_list = []
    time_list = ['03z', '09z', '15z', '21z']
    delta = jd_stop-jd_start
    for i in range((delta.days)+1):
        model_file_date = jd_start + timedelta(days=i)
        base_url = ('http://opendap.co-ops.nos.noaa.gov/'
                    'thredds/dodsC/NOAA/SFBOFS/MODELS/')
        val_month, val_year, val_day = '', '', ''
        # Month.
        if model_file_date.month < 10:
            val_month = "0" + str(model_file_date.month)
        else:
            val_month = str(model_file_date.month)
        # Year.
        val_year = str(model_file_date.year)
        # Day.
        if model_file_date.day < 10:
            val_day = "0" + str(model_file_date.day)
        else:
            val_day = str(model_file_date.day)
        file_name = '/nos.sfbofs.stations.nowcast.'
        file_name += val_year + val_month + val_day
        for t in time_list:
            t_val = '.t' + t + '.nc'
            url_list.append(base_url + val_year + val_month +
                            file_name + t_val)
    return url_list
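
As an aside, the zero-padding logic above could be written more compactly with strftime (same behavior, shown here on an example date):

from datetime import datetime

model_file_date = datetime(2015, 10, 26)        # Example date.
stamp = model_file_date.strftime('%Y%m%d')      # '20151026' -> used in the file name.
year_month = model_file_date.strftime('%Y%m')   # '201510'   -> used in the base path.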
In [18]:
from utilities.ioos import extractSFOModelData

st_known = st_list.keys()

name_list, lat_lon_list = [], []
for st in st_list:
    if st in st_known:
        lat = st_list[st]['lat']
        lon = st_list[st]['lon']
        name_list.append(st)
        lat_lon_list.append([lat, lon])

model_data = extractSFOModelData(lat_lon_list, name_list, jd_start, jd_stop)
NetCDF: file not found

Let's visualize everything we got so far.

In [19]:
import matplotlib.pyplot as plt


for station_index in st_list.keys():
    df = st_list[station_index]['obsData']
    if st_list[station_index]['hasObsData']:
        fig = plt.figure(figsize=(16, 3))
        plt.plot(df.index, df['sea_water_speed (cm/s)'])
        fig.suptitle('Station:'+station_index, fontsize=14)
        plt.xlabel('Date', fontsize=14)
        plt.ylabel('sea_water_speed (cm/s)', fontsize=14)

        if station_index in model_data:
            df_model = model_data[station_index]['data']
            plt.plot(df_model.index, df_model['sea_water_speed (cm/s)'])
        for ent in df_list:
            if ent['ws_pts'] > 4:
                if station_index == ent['name']:
                    df = ent['data']
                    plt.plot(df.index, df['sea_water_speed (cm/s)'])
                    ent['valid'] = True
    l = plt.legend(('Station Obs', 'Model', 'HF Radar'), loc='upper left')

Any comparison will only make sense if we put it in a geographic context. So let's create a map and place the observations, model grid points, and HF-Radar grid points on it.

In [20]:
import folium
from utilities import get_coordinates

station = st_list[list(st_list.keys())[0]]
mapa = folium.Map(location=[station["lat"], station["lon"]], zoom_start=10)
mapa.line(get_coordinates(bounding_box),
          line_color='#FF0000',
          line_weight=5)

for st in st_list:
    lat = st_list[st]['lat']
    lon = st_list[st]['lon']

    popupString = ('<b>Obs Location:</b><br>' + st + '<br><b>Source:</b><br>' +
                   st_list[st]['source'])

    if 'hasObsData' in st_list[st] and st_list[st]['hasObsData'] == False:
        mapa.circle_marker([lat, lon], popup=popupString, radius=1000,
                           line_color='#FF0000', fill_color='#FF0000',
                           fill_opacity=0.2)

    elif st_list[st]['source'] == "coops":
        mapa.simple_marker([lat, lon], popup=popupString,
                           marker_color="darkblue", marker_icon="star")
    elif st_list[st]['source'] == "ndbc":
        mapa.simple_marker([lat, lon], popup=popupString,
                           marker_color="darkblue", marker_icon="star")

try:
    for ent in df_list:
        lat = ent['lat']
        lon = ent['lon']
        popupstring = ("HF Radar: [" + str(lat) + ":"+str(lon) + "]" +
                       "<br>for<br>" + ent['name'])
        mapa.circle_marker([lat, lon], popup=popupstring, radius=500,
                           line_color='#FF0000', fill_color='#FF0000',
                           fill_opacity=0.5)
except:
    pass

try:
    for st in model_data:
        lat = model_data[st]['lat']
        lon = model_data[st]['lon']
        popupstring = ("HF Radar: [" + str(lat) + ":" + str(lon) + "]" +
                       "<br>for<br>" + ent['name'])
        mapa.circle_marker([lat, lon], popup=popupstring, radius=500,
                           line_color='#66FF33', fill_color='#66FF33',
                           fill_opacity=0.5)
except:
    pass

Bonus: we can add a WMS tile layer with today's HF-Radar data to the map. To enable the layer, click on the layer menu at the top right of the interactive map.

In [21]:
jd_now = datetime.utcnow()
mapa.add_tile_layer(tile_name='hfradar 2km',
                    tile_url=('http://hfradar.ndbc.noaa.gov/tilesavg.php?'
                              's=10&e=100&x={x}&y={y}&z={z}&t=' +
                              str(jd_now.year) + '-' + str(jd_now.month) +
                              '-' + str(jd_now.day) + ' ' +
                              str(jd_now.hour-2) + ':00:00&rez=2'))

mapa.add_layers_to_map()

mapa
Out[21]:

Map legend:

  • Blue: observed station
  • Green: model grid point
  • Red: HF Radar grid point

In the beginning we raised the question:

What is the extent of relevant meteorological and oceanographic data available in near real-time that could be used to support rapid deployment of a common operational picture to support response?

In order to answer that we tested how to get currents data for an area of interest, like the San Francisco Bay.

The main downside was the fact that we had to augment the catalog search by hard-coding a known model for the area, the SFBOFS Model. Still, the catalog was successful in finding near real-time observations from surrounding buoys/moorings, and HF-radar.

You can see the original IOOS System Test notebook here.


This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.

Fetching data from a CSW catalog

This notebook shows a typical workflow to query a Catalog Service for the Web (CSW) and creates a request for data endpoints that are suitable for download.

The catalog of choice is the NGDC geoportal (http://www.ngdc.noaa.gov/geoportal/csw) and we want to query it using a geographical bounding box, a time range, and a variable of interest.

The example below will fetch Sea Surface Temperature (SST) data from all available observations and models in the Boston Harbor region. The goal is to assess the water temperature for the Boston Light Swim event.

We will search for data $\pm$ 4 days centered at the event date.

In [3]:
from datetime import datetime, timedelta

event_date = datetime(2015, 8, 15)

start = event_date - timedelta(days=4)
stop = event_date + timedelta(days=4)

The bounding box is slightly larger than the Boston harbor to ensure we get some data.

In [4]:
spacing = 0.25

bbox = [-71.05-spacing, 42.28-spacing,
        -70.82+spacing, 42.38+spacing]

The CF_names object is just a Python dictionary whose keys are SOS names and whose values contain all possible combinations of temperature variable names in the CF conventions. Note that we also define a units object, which we use to coerce all the data to Celsius.
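
A minimal sketch of what such a mapping might look like (the actual utilities.CF_names is more complete; the entries below are only illustrative):

CF_names_sketch = {
    'sea_water_temperature': [
        'sea_water_temperature',
        'sea_surface_temperature',
        'sea_water_potential_temperature',
    ],
}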

In [5]:
import iris
from utilities import CF_names

sos_name = 'sea_water_temperature'
name_list = CF_names[sos_name]

units = iris.unit.Unit('celsius')

Now it is time to stitch all that together. For that we will use OWSLib*.

Constructing the filter is probably the most complex part. We start with a list comprehension using the fes.Or to create the variables filter. The next step is to exclude some unwanted results (ROMS Average files) using fes.Not. To select the desired dates we wrote a wrapper function that takes the start and end dates of the event. Finally, we apply the fes.And to join all the conditions above in one filter list.

* OWSLib is a Python package for client programming with Open Geospatial Consortium (OGC) web service (hence OWS) interface standards, and their related content models.

In [6]:
from owslib import fes
from utilities import fes_date_filter

kw = dict(wildCard='*',
          escapeChar='\\',
          singleChar='?',
          propertyname='apiso:AnyText')

or_filt = fes.Or([fes.PropertyIsLike(literal=('*%s*' % val), **kw)
                  for val in name_list])

# Exclude ROMS Averages files.
not_filt = fes.Not([fes.PropertyIsLike(literal='*Averages*', **kw)])

begin, end = fes_date_filter(start, stop)
filter_list = [fes.And([fes.BBox(bbox), begin, end, or_filt, not_filt])]

Now we are ready to load a csw object and feed it with the filter we created.

In [7]:
from owslib.csw import CatalogueServiceWeb

csw = CatalogueServiceWeb('http://www.ngdc.noaa.gov/geoportal/csw',
                          timeout=60)

csw.getrecords2(constraints=filter_list, maxrecords=1000, esn='full')

fmt = '{:*^64}'.format
print(fmt(' Catalog information '))
print("CSW version: {}".format(csw.version))
print("Number of datasets available: {}".format(len(csw.records.keys())))
********************* Catalog information **********************
CSW version: 2.0.2
Number of datasets available: 13

We found 13 datasets! Not bad for such a narrow search area and time-span.

What do we have there? Let's use the custom service_urls function to split the datasets into OPeNDAP and SOS endpoints.

In [8]:
from utilities import service_urls

dap_urls = service_urls(csw.records, service='odp:url')
sos_urls = service_urls(csw.records, service='sos:url')

print(fmt(' SOS '))
for url in sos_urls:
    print('{}'.format(url))

print(fmt(' DAP '))
for url in dap_urls:
    print('{}.html'.format(url))
***************************** SOS ******************************
http://sdf.ndbc.noaa.gov/sos/server.php?service=SOS&request=GetCapabilities&acceptVersions=1.0.0
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/036p1/036p1_d41.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/067p1/067p1_historic.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/094p1/094p1_historic.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/100p1/100p1_historic.nc?service=SOS&version=1.0.0&request=GetCapabilities
http://thredds.cdip.ucsd.edu/thredds/sos/cdip/archive/162p1/162p1_d06.nc?service=SOS&version=1.0.0&request=GetCapabilities
***************************** DAP ******************************
http://geoport.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd.html
http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw.html
http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global.html
http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc.html
http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best.html
http://thredds.axiomdatascience.com/thredds/dodsC/G1_SST_GLOBAL.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/036p1/036p1_d41.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/067p1/067p1_historic.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/094p1/094p1_historic.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/100p1/100p1_historic.nc.html
http://thredds.cdip.ucsd.edu/thredds/dodsC/cdip/archive/162p1/162p1_d06.nc.html

We will ignore the SOS endpoints for now and use only the DAP endpoints. But note that some of those SOS and DAP endpoints look suspicious. The Scripps Institution of Oceanography (SIO/UCSD) data should not appear in a search for the Boston Harbor. That is a known issue and we are working to sort it out. Meanwhile, we have to filter out all the observation endpoints from the DAP URLs with the is_station function.

However, that filter still leaves behind URLs like http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc.html. That is probably satellite data and not a model.

In an ideal world all datasets would have the metadata coverage_content_type defined. With coverage_content_type we could tell models apart automatically. Until then we will have to make do with the heuristic function is_model from the utilities module. The is_model function works by comparing the metadata (and sometimes the data itself) against a series of criteria, like grid conventions, to figure out if a dataset is model data or not. Because the function operates on the data, we will call it later on when we start downloading the data.
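
For illustration, if datasets did advertise coverage_content_type, a check could be as simple as the hypothetical sketch below (the attribute name and the 'modelResult' value come from ISO 19115/ACDD; the helper itself is not part of the utilities module):

from netCDF4 import Dataset

def looks_like_model(url):
    # Collect coverage_content_type from the global and variable attributes.
    with Dataset(url) as nc:
        types = {getattr(nc, 'coverage_content_type', None)}
        types.update(getattr(var, 'coverage_content_type', None)
                     for var in nc.variables.values())
    return 'modelResult' in types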

In [9]:
from utilities import is_station

non_stations = []
for url in dap_urls:
    try:
        if not is_station(url):
            non_stations.append(url)
    except RuntimeError as e:
        print("Could not access URL {}. {!r}".format(url, e))

dap_urls = non_stations

print(fmt(' Filtered DAP '))
for url in dap_urls:
    print('{}.html'.format(url))
************************* Filtered DAP *************************
http://geoport.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd.html
http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw.html
http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global.html
http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc.html
http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best.html

We still need to find endpoints for the observations. For that we'll use pyoos' NdbcSos and CoopsSos collectors. The pyoos API is different from OWSLib's, but note that we are re-using the same query variables we created for the catalog search (bbox, start, stop, and sos_name).

In [10]:
from pyoos.collectors.ndbc.ndbc_sos import NdbcSos

collector_ndbc = NdbcSos()

collector_ndbc.set_bbox(bbox)
collector_ndbc.end_time = stop
collector_ndbc.start_time = start
collector_ndbc.variables = [sos_name]

ofrs = collector_ndbc.server.offerings
title = collector_ndbc.server.identification.title
print(fmt(' NDBC Collector offerings '))
print('{}: {} offerings'.format(title, len(ofrs)))
******************* NDBC Collector offerings *******************
National Data Buoy Center SOS: 955 offerings

That number is misleading! Do we have 955 buoys available there? What exactly are the offerings? There is only one way to find out. Let's get the data!

In [11]:
from utilities import collector2table, get_ndbc_longname

ndbc = collector2table(collector=collector_ndbc)

names = []
for s in ndbc['station']:
    try:
        name = get_ndbc_longname(s)
    except ValueError:
        name = s
    names.append(name)

ndbc['name'] = names

ndbc.set_index('name', inplace=True)
ndbc.head()
Out[11]:
station sensor lon lat
name
Boston 16 Nm East Of Boston 44013 watertemp1 -70.69 42.35
Buoy A01 44029 ct1 -70.57 42.52

That makes more sense. Two buoys were found in the bounding box, and the name of at least one of them makes sense.

Now the same thing for CoopsSos.

In [12]:
from pyoos.collectors.coops.coops_sos import CoopsSos

collector_coops = CoopsSos()

collector_coops.set_bbox(bbox)
collector_coops.end_time = stop
collector_coops.start_time = start
collector_coops.variables = [sos_name]

ofrs = collector_coops.server.offerings
title = collector_coops.server.identification.title
print(fmt(' Collector offerings '))
print('{}: {} offerings'.format(title, len(ofrs)))
********************* Collector offerings **********************
NOAA.NOS.CO-OPS SOS: 1054 offerings

In [13]:
from utilities import get_coops_metadata

coops = collector2table(collector=collector_coops)

names = []
for s in coops['station']:
    try:
        name = get_coops_metadata(s)[0]
    except ValueError:
        name = s
    names.append(name)

coops['name'] = names

coops.set_index('name', inplace=True)
coops.head()
Out[13]:
station sensor lon lat
name
Boston, MA 8443970 E1 -71.0534 42.3548

We found one more. Now we can merge both into one table and start downloading the data.

In [14]:
from pandas import concat

all_obs = concat([coops, ndbc])

all_obs.head()
Out[14]:
station sensor lon lat
name
Boston, MA 8443970 E1 -71.0534 42.3548
Boston 16 Nm East Of Boston 44013 watertemp1 -70.69 42.35
Buoy A01 44029 ct1 -70.57 42.52
In [15]:
from pandas import DataFrame
from owslib.ows import ExceptionReport
from utilities import pyoos2df, save_timeseries

iris.FUTURE.netcdf_promote = True

data = dict()
col = 'sea_water_temperature (C)'
for station in all_obs.index:
    try:
        idx = all_obs['station'][station]
        df = pyoos2df(collector_ndbc, idx, df_name=station)
        if df.empty:
            df = pyoos2df(collector_coops, idx, df_name=station)
        data.update({idx: df[col]})
    except ExceptionReport as e:
        print("[{}] {}:\n{}".format(idx, station, e))

The cell below resamples each time series to an hourly frequency, reducing or interpolating depending on the original frequency of the data.

In [16]:
from pandas import date_range

index = date_range(start=start, end=stop, freq='1H')
for k, v in data.iteritems():
    data[k] = v.reindex(index=index, limit=1, method='nearest')

obs_data = DataFrame.from_dict(data)

obs_data.head()
Out[16]:
44013 44029 8443970
2015-08-11 00:00:00 19 17.700001 17.5
2015-08-11 01:00:00 19 17.600000 17.7
2015-08-11 02:00:00 19 17.500000 17.8
2015-08-11 03:00:00 19 17.299999 17.8
2015-08-11 04:00:00 19 17.100000 17.9

And now the same for the models. Note that now we use the is_model function to filter out the non-model endpoints.

In [17]:
import warnings
from iris.exceptions import (CoordinateNotFoundError, ConstraintMismatchError,
                             MergeError)
from utilities import (quick_load_cubes, proc_cube, is_model,
                       get_model_name, get_surface)

cubes = dict()
for k, url in enumerate(dap_urls):
    print('\n[Reading url {}/{}]: {}'.format(k+1, len(dap_urls), url))
    try:
        cube = quick_load_cubes(url, name_list,
                                callback=None, strict=True)
        if is_model(cube):
            cube = proc_cube(cube, bbox=bbox,
                             time=(start, stop), units=units)
        else:
            print("[Not model data]: {}".format(url))
            continue
        cube = get_surface(cube)
        mod_name, model_full_name = get_model_name(cube, url)
        cubes.update({mod_name: cube})
    except (RuntimeError, ValueError,
            ConstraintMismatchError, CoordinateNotFoundError,
            IndexError) as e:
        print('Cannot get cube for: {}\n{}'.format(url, e))

[Reading url 1/5]: http://geoport.whoi.edu/thredds/dodsC/coawst_4/use/fmrc/coawst_4_use_best.ncd

[Reading url 2/5]: http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw
[Not model data]: http://oos.soest.hawaii.edu/thredds/dodsC/hioos/satellite/dhw

[Reading url 3/5]: http://oos.soest.hawaii.edu/thredds/dodsC/pacioos/hycom/global

[Reading url 4/5]: http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc
[Not model data]: http://tds.maracoos.org/thredds/dodsC/SST-Three-Agg.nc

[Reading url 5/5]: http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/his/ESPRESSO_Real-Time_v2_History_Best

And now we can use the iris cube objects we collected to download model data near the buoys we found above. We will need get_nearest_water to search the 10 nearest model points within 0.08 degrees of each buoy.

(This step is still a little bit clunky and needs some improvements!)
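
For reference, the nearest-point search behind make_tree/get_nearest_water can be sketched roughly as below (assumed behavior only; the real helpers also check that the nearest points actually hold valid water data):

import numpy as np
from scipy.spatial import cKDTree

def nearest_model_points(lons, lats, obs_lon, obs_lat, k=10, max_dist=0.08):
    # Build a KDTree from the (possibly 2D) model grid coordinates.
    points = np.column_stack([np.ravel(lons), np.ravel(lats)])
    tree = cKDTree(points)
    dist, idx = tree.query([obs_lon, obs_lat], k=k)
    # Keep only the neighbors within the maximum allowed distance (degrees).
    return [(d, i) for d, i in zip(np.atleast_1d(dist), np.atleast_1d(idx))
            if d <= max_dist]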

In [18]:
from iris.pandas import as_series
from utilities import (make_tree, get_nearest_water,
                       add_station, ensure_timeseries, remove_ssh)

model_data = dict()
for mod_name, cube in cubes.items():
    print(fmt(mod_name))
    try:
        tree, lon, lat = make_tree(cube)
    except CoordinateNotFoundError as e:
        print('Cannot make KDTree for: {}'.format(mod_name))
        continue
    # Get model series at observed locations.
    raw_series = dict()
    for station, obs in all_obs.iterrows():
        try:
            kw = dict(k=10, max_dist=0.08, min_var=0.01)
            args = cube, tree, obs.lon, obs.lat
            series, dist, idx = get_nearest_water(*args, **kw)
        except ValueError as e:
            status = "No Data"
            print('[{}] {}'.format(status, obs.name))
            continue
        if not series:
            status = "Land   "
        else:
            series = as_series(series)
            raw_series.update({obs['station']: series})
            status = "Water  "
        print('[{}] {}'.format(status, obs.name))
    if raw_series:  # Save that model series.
        model_data.update({mod_name: raw_series})
        del cube
****************************COAWST_4****************************
[Water  ] Boston, MA
[Water  ] Boston 16 Nm East Of Boston
[Water  ] Buoy A01
*****************************HYCOM******************************
[Land   ] Boston, MA
[Water  ] Boston 16 Nm East Of Boston
[Water  ] Buoy A01
*************************ROMS_ESPRESSO**************************
[Land   ] Boston, MA
[Land   ] Boston 16 Nm East Of Boston
[No Data] Buoy A01

To end this post let's plot the 3 buoys we found together with the nearest model grid point.

In [19]:
import matplotlib.pyplot as plt

buoy = '44013'

fig, ax = plt.subplots(figsize=(11, 2.75))

obs_data[buoy].plot(ax=ax, label='Buoy')

for model in model_data.keys():
    try:
        model_data[model][buoy].plot(ax=ax, label=model)
    except KeyError:
        pass  # Could not find a model at this location.

leg = ax.legend()
In [20]:
buoy = '44029'

fig, ax = plt.subplots(figsize=(11, 2.75))

obs_data[buoy].plot(ax=ax, label='Buoy')

for model in model_data.keys():
    try:
        model_data[model][buoy].plot(ax=ax, label=model)
    except KeyError:
        pass  # Could not find a model at this location.

leg = ax.legend()
In [21]:
buoy = '8443970'

fig, ax = plt.subplots(figsize=(11, 2.75))

obs_data[buoy].plot(ax=ax, label='Buoy')

for model in model_data.keys():
    try:
        model_data[model][buoy].plot(ax=ax, label=model)
    except KeyError:
        pass  # Could not find a model at this location.

leg = ax.legend()

That is it! We fetched data based only on a bounding box, a time range, and a variable name. The workflow is not as smooth as we would like: we had to mix OWSLib catalog searches with two different pyoos collectors to download the observed and modeled data. Another hiccup is all the workarounds used to go from iris cubes to pandas series/dataframes. There is a clear need for a better way to represent CF feature types in a single Python object.

To end this post, check out the full version of the Boston Light Swim notebook. (Especially the interactive map at the end.)


This post was written as an IPython notebook. It is available for download. You can also try an interactive version on binder.

Opening post

This blog is a "notebook digest" for, but not limited to, the notebooks created during the IOOS DMAC System Integration Test exercise.

The goal is to post readable, bite-sized, and executable(!) notebooks.

Wait... Before we dive into the notebooks:

What is IOOS?

The US Integrated Ocean Observing System (IOOS) is a collaboration between Federal, State, Local, Academic and Commercial partners to manage and/or provide access to a wide range of ocean observing assets and data feeds, including in-situ buoys, drifters, gliders, radar, satellite data, and numerical models, and to meet the needs of the ocean data community.

What is DMAC?

The Data Management and Communications (DMAC) is a subsystem of IOOS that provides the procedures, protocols and technology solutions to allow integration of the disparate observational data feeds within and amongst the regional associations and other IOOS data providers. Much of the data delivery and access technologies implemented by the DMAC subsystem are leveraged from the atmospheric community, where these technologies had become community practice.

What is System Integration Test?

The System Integration Test (SIT) is an effort to stress-test DMAC by evaluating how it scales across geographies, data types, very large datasets, and long term archives.

The SIT project has been organized into three themes:

  • Baseline Assessment,
  • Extreme Events, and
  • Species Protection and Marine Habitat Classification.

Click on the binder badge below to launch the notebooks on binder. You can browse, open, modify, and run any of the notebooks available there.

Binder