python4oceanographers

Turning ripples into waves

xml is OK (if you don't have to write or read it)

DISCLAIMER: most of the code below was stolen from the old, but still awesome and very useful cfchecker.

I am writing a library that needs to "understand" the CF-conventions standard names table. More specifically I need to know the variable standard_name and its corresponding units. The format of choice for the standard names table is the Extensible Markup Language (XML).

I don't like XML and avoid it as much as I can, but in a case like this there is no way around it. Or is there?

This post is just a few notes for my future self on how to parse the CF standard names table using Python.

Let's start with a boilerplate class to parse the table and construct a dictionary with the standard_names as keys and units as values.

(PS: I am sure there are better ways to do this. So if you are an XML expert, feel free to post a better way in the comments section.)

In [3]:
from xml.sax import ContentHandler

def normalize_whitespace(text):
    """
    Remove redundant whitespace from a string.
    
    """
    return ' '.join(text.split())


class ConstructDict(ContentHandler):
    """
    Parse the xml standard_name table, reading all entries
    into a dictionary and storing the `standard_name` and the `units`.
    
    """
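    def __init__(self):
        ContentHandler.__init__(self)
        # Flags marking which element's text `characters` is collecting.
        self.inUnitsContent = 0
        self.inEntryIdContent = 0
        self.inVersionNoContent = 0
        self.inLastModifiedContent = 0
        # Maps each standard_name (or alias) to its canonical units.
        self.dict = {}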
        
    def startElement(self, name, attrs):
        # If it's an entry element, save the id
        if name == 'entry':
            id = normalize_whitespace(attrs.get('id', ""))
            self.this_id = id

        # If it's the start of a canonical_units element
        elif name == 'canonical_units':
            self.inUnitsContent = 1
            self.units = ""

        elif name == 'alias':
            id = normalize_whitespace(attrs.get('id', ""))
            self.this_id = id

        elif name == 'entry_id':
            self.inEntryIdContent = 1
            self.entry_id = ""

        elif name == 'version_number':
            self.inVersionNoContent = 1
            self.version_number = ""

        elif name == 'last_modified':
            self.inLastModifiedContent = 1
            self.last_modified = ""

    def characters(self, ch):
        if self.inUnitsContent:
            self.units = self.units + ch

        elif self.inEntryIdContent:
            self.entry_id = self.entry_id + ch

        elif self.inVersionNoContent:
            self.version_number = self.version_number + ch

        elif self.inLastModifiedContent:
            self.last_modified = self.last_modified + ch

    def endElement(self, name):
        # If it's the end of the canonical_units element, save the units
        if name == 'canonical_units':
            self.inUnitsContent = 0
            self.units = normalize_whitespace(self.units)
            self.dict[self.this_id] = self.units
            
        # If it's the end of the entry_id element, find the units for the self.alias
        elif name == 'entry_id':
            self.inEntryIdContent = 0
            self.entry_id = normalize_whitespace(self.entry_id)
            self.dict[self.this_id] = self.dict[self.entry_id]

        # If it's the end of the version_number element, save it
        elif name == 'version_number':
            self.inVersionNoContent = 0
            self.version_number = normalize_whitespace(self.version_number)

        # If it's the end of the last_modified element, save the last modified date
        elif name == 'last_modified':
            self.inLastModifiedContent = 0
            self.last_modified = normalize_whitespace(self.last_modified)

In the cell below we create the parser and hand it a ConstructDict instance as its content handler.

In [4]:
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces

parser = make_parser()
parser.setFeature(feature_namespaces, 0)

std_name_dict = ConstructDict()
parser.setContentHandler(std_name_dict)

standard_name = './data/cf-standard-name-table.xml'
parser.parse(standard_name)

Now we can access the table as a Python dictionary.

In [5]:
std_name_dict.dict.get("sea_water_potential_temperature")
Out[5]:
u'K'
In [6]:
std_name_dict.dict.get("sea_water_salinity")
Out[6]:
u'1e-3'

Cool, we have a {standard_name: units} dictionary!
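
For comparison, here is a minimal sketch of the same idea using xml.etree.ElementTree from the standard library instead of SAX. It is not taken from cfchecker. The inline XML is a hand-written stand-in for the real table, and the alias `sea_ice_age` and the helper name `units_table` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, hand-written stand-in for cf-standard-name-table.xml.
# The alias `sea_ice_age` is hypothetical, for illustration only.
SAMPLE = """<standard_name_table>
   <version_number>27</version_number>
   <entry id="age_of_sea_ice">
      <canonical_units>year</canonical_units>
   </entry>
   <entry id="sea_water_potential_temperature">
      <canonical_units>K</canonical_units>
   </entry>
   <alias id="sea_ice_age">
      <entry_id>age_of_sea_ice</entry_id>
   </alias>
</standard_name_table>
"""


def units_table(root):
    """Return a {standard_name: canonical_units} dict from the table root."""
    units = {}
    for entry in root.iter("entry"):
        text = entry.findtext("canonical_units", default="")
        units[entry.get("id")] = " ".join(text.split())
    # Aliases point at an existing entry; skip any that point nowhere.
    for alias in root.iter("alias"):
        target = " ".join(alias.findtext("entry_id", default="").split())
        if target in units:
            units[alias.get("id")] = units[target]
    return units


root = ET.fromstring(SAMPLE)
table = units_table(root)
print(table["sea_ice_age"])
```

For the real file, `ET.parse(path).getroot()` would replace the `ET.fromstring` call. The whole tree is loaded into memory, which should be fine for a table of this size.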

I also need some way to parse and validate the CF formula_terms. To do that I need to steal a little bit more from cfchecker.

In the cells below there are some snippets to:

  • identify a coordinate variable,
  • check for valid formula_terms,
  • and check if all formula_terms variables are present.
In [7]:
from netCDF4 import Dataset

url = ('http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/avg/'
       'ESPRESSO_Real-Time_v2_Averages_Best')

nc = Dataset(url)
In [8]:
has_formula_terms = lambda v: v is not None

var = nc.get_variables_by_attributes(formula_terms=has_formula_terms)[0]

formula_terms = var.formula_terms
formula_terms
Out[8]:
u's: s_rho C: Cs_r eta: zeta depth: h depth_c: hc'
In [9]:
if nc.dimensions.get(var.name):
    print("Hurray I am a coordinate variable!")
Hurray I am a coordinate variable!

In [10]:
import re

if re.search("^([a-zA-Z0-9_]+: +[a-zA-Z0-9_]+( +)?)*$", formula_terms):
    print("And I have a valid `formula_terms`!")
And I have a valid `formula_terms`!

In [11]:
for x in formula_terms.split():
    # Tokens ending in a colon are term names; the rest are variable names.
    if not re.search("^[a-zA-Z0-9_]+:$", x):
        if x in nc.variables:
            print("{} is present in the file!".format(x))
s_rho is present in the file!
Cs_r is present in the file!
zeta is present in the file!
h is present in the file!
hc is present in the file!

Now I have all the pieces I need to build a CF-convention formula_terms parser.
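
As a rough sketch of how those pieces might fit together, the function below validates a formula_terms string, splits it into a {term: variable} dictionary, and reports which variables are missing. The name `parse_formula_terms` and the plain set of variable names are assumptions made for this example. With a real dataset, `nc.variables` could be passed instead.

```python
import re

# Same pattern as the validity check above: "term: variable" pairs
# separated by spaces.
VALID = re.compile(r"^([a-zA-Z0-9_]+: +[a-zA-Z0-9_]+( +)?)*$")
PAIR = re.compile(r"([a-zA-Z0-9_]+): +([a-zA-Z0-9_]+)")


def parse_formula_terms(formula_terms, variables):
    """Return ({term: variable_name}, [missing variable names])."""
    if not VALID.match(formula_terms):
        raise ValueError("Invalid formula_terms: {!r}".format(formula_terms))
    terms = dict(PAIR.findall(formula_terms))
    missing = [name for name in terms.values() if name not in variables]
    return terms, missing


terms, missing = parse_formula_terms(
    "s: s_rho C: Cs_r eta: zeta depth: h depth_c: hc",
    variables={"s_rho", "Cs_r", "zeta", "h"},  # `hc` left out on purpose
)
print(terms["eta"], missing)
```

Returning the missing names rather than raising leaves it to the caller to decide whether an incomplete file is an error.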

With that information I can improve odvc to automatically find which formula should be used and correctly determine the derived z variable units. All while avoiding dealing with this:

In [12]:
!head -20 ./data/cf-standard-name-table.xml
<?xml version="1.0"?>
<standard_name_table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="cf-standard-name-table-1.1.xsd">
   <version_number>27</version_number>
   <last_modified>2013-11-28T05:25:32Z</last_modified>
   <institution>Program for Climate Model Diagnosis and Intercomparison</institution>
   <contact>webmaster@pcmdi.llnl.gov</contact>

  
   <entry id="age_of_sea_ice">
      <canonical_units>year</canonical_units>
      <grib></grib>
      <amip></amip>
      <description>&quot;Age of sea ice&quot; means the length of time elapsed since the ice formed.</description>
   </entry>
  
   <entry id="age_of_stratospheric_air">
      <canonical_units>s</canonical_units>
      <grib></grib>
      <amip></amip>
      <description>&quot;Age of stratospheric air&quot; means an estimate of the time since a parcel of stratospheric air was last in contact with the troposphere.</description>


This post was written as an IPython notebook. It is available for download or as a static html.

Creative Commons License
python4oceanographers by Filipe Fernandes is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://ocefpaf.github.io/.
