# xml is OK (if you don't have to write or read it)

DISCLAIMER: most of the code below was stolen from the old, but still awesome and very useful cfchecker.

I am writing a library that needs to "understand" the CF-conventions standard names table. More specifically I need to know the variable standard_name and its corresponding units. The format of choice for the standard names table is the Extensible Markup Language (XML).

I don't like XML and avoid it as much as I can, but in a case like this there is no way around it. Or is there?

This post is just a few notes for my future self on how to parse the CF standard names table using Python.

Let's start with a boilerplate class to parse the table and construct a dictionary with the standard_names as keys and units as values.

(PS: I am sure there are better ways to do this. So if you are an XML expert, feel free to post a better way in the comments section.)

In [3]:
from xml.sax import ContentHandler

def normalize_whitespace(text):
    """
    Remove redundant whitespace from a string.

    """
    return ' '.join(text.split())

class ConstructDict(ContentHandler):
    """
    Parse the xml standard_name table, reading all entries
    into a dictionary and storing the standard_name and the units.

    """
    def __init__(self):
        self.inUnitsContent = 0
        self.inEntryIdContent = 0
        self.inVersionNoContent = 0
        self.inLastModifiedContent = 0
        self.dict = {}

    def startElement(self, name, attrs):
        # If it's an entry element, save the id
        if name == 'entry':
            id = normalize_whitespace(attrs.get('id', ""))
            self.this_id = id

        # If it's the start of a canonical_units element
        elif name == 'canonical_units':
            self.inUnitsContent = 1
            self.units = ""

        elif name == 'alias':
            id = normalize_whitespace(attrs.get('id', ""))
            self.this_id = id

        elif name == 'entry_id':
            self.inEntryIdContent = 1
            self.entry_id = ""

        elif name == 'version_number':
            self.inVersionNoContent = 1
            self.version_number = ""

        elif name == 'last_modified':
            self.inLastModifiedContent = 1
            self.last_modified = ""

    def characters(self, ch):
        if self.inUnitsContent:
            self.units = self.units + ch

        elif self.inEntryIdContent:
            self.entry_id = self.entry_id + ch

        elif self.inVersionNoContent:
            self.version_number = self.version_number + ch

        elif self.inLastModifiedContent:
            self.last_modified = self.last_modified + ch

    def endElement(self, name):
        # If it's the end of the canonical_units element, save the units
        if name == 'canonical_units':
            self.inUnitsContent = 0
            self.units = normalize_whitespace(self.units)
            self.dict[self.this_id] = self.units

        # If it's the end of the entry_id element, find the units for the self.alias
        elif name == 'entry_id':
            self.inEntryIdContent = 0
            self.entry_id = normalize_whitespace(self.entry_id)
            self.dict[self.this_id] = self.dict[self.entry_id]

        # If it's the end of the version_number element, save it
        elif name == 'version_number':
            self.inVersionNoContent = 0
            self.version_number = normalize_whitespace(self.version_number)

        # If it's the end of the last_modified element, save the last modified date
        elif name == 'last_modified':
            self.inLastModifiedContent = 0
            self.last_modified = normalize_whitespace(self.last_modified)


In the cell below we create the parser and hand it a ConstructDict instance as its content handler.

In [4]:
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces

parser = make_parser()
parser.setFeature(feature_namespaces, 0)

std_name_dict = ConstructDict()
parser.setContentHandler(std_name_dict)

standard_name = './data/cf-standard-name-table.xml'
parser.parse(standard_name)
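
A side note on why the handler keeps flags and appends text in characters(): SAX is allowed to deliver the text of a single element in several chunks, so assigning instead of appending could lose part of the value. The sketch below shows the same pattern on a tiny in-memory document. VersionHandler and the XML snippet are made up for illustration and are not part of cfchecker.

```python
from xml.sax import ContentHandler, parseString


class VersionHandler(ContentHandler):
    """Hypothetical minimal handler: collect only <version_number>."""

    def __init__(self):
        super().__init__()
        self.in_version = False
        self.version_number = ""

    def startElement(self, name, attrs):
        if name == "version_number":
            self.in_version = True

    def characters(self, content):
        # The text of one element may arrive in several chunks,
        # so append to the buffer rather than overwrite it.
        if self.in_version:
            self.version_number += content

    def endElement(self, name):
        if name == "version_number":
            self.in_version = False


# Made-up snippet that mimics the top of the standard names table.
xml_bytes = (b"<standard_name_table>"
             b"<version_number> 27 </version_number>"
             b"</standard_name_table>")

handler = VersionHandler()
parseString(xml_bytes, handler)

# Same whitespace clean-up as normalize_whitespace.
version = " ".join(handler.version_number.split())
print(version)
```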


Now we can access the table as a Python dictionary.

In [5]:
std_name_dict.dict.get("sea_water_potential_temperature")

Out[5]:
u'K'

In [6]:
std_name_dict.dict.get("sea_water_salinity")

Out[6]:
u'1e-3'


Cool, we have a {standard_name: units} dictionary!
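
For comparison, here is a minimal sketch of the same lookup built with xml.etree.ElementTree from the standard library instead of SAX. This is not how cfchecker does it. The SAMPLE string, the units_table helper, and the alias name are all made up for illustration, and the sketch assumes aliases are laid out as an alias element with an id attribute and an entry_id child, which is what the SAX handler above expects.

```python
import xml.etree.ElementTree as ET

# Tiny made-up sample that mimics the table layout (not the real file).
SAMPLE = """<?xml version="1.0"?>
<standard_name_table>
  <version_number>27</version_number>
  <entry id="sea_water_potential_temperature">
    <canonical_units>K</canonical_units>
  </entry>
  <entry id="sea_water_salinity">
    <canonical_units>1e-3</canonical_units>
  </entry>
  <alias id="hypothetical_old_temperature_name">
    <entry_id>sea_water_potential_temperature</entry_id>
  </alias>
</standard_name_table>
"""


def units_table(root):
    """Return a {standard_name: canonical_units} dict from a parsed table."""
    units = {}
    for entry in root.iter("entry"):
        text = entry.findtext("canonical_units", default="")
        units[entry.get("id")] = " ".join(text.split())
    # Aliases point at an existing entry and inherit its units.
    for alias in root.iter("alias"):
        target = " ".join(alias.findtext("entry_id", default="").split())
        if target in units:
            units[alias.get("id")] = units[target]
    return units


root = ET.fromstring(SAMPLE)
table = units_table(root)
print(table["sea_water_salinity"])
```

For the real file, ET.parse('./data/cf-standard-name-table.xml').getroot() would replace the ET.fromstring call. Unlike the SAX version, this loads the whole tree into memory, which should be fine for a table of this size.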

I also need some way to parse and validate the CF formula_terms. To do that I need to steal a little bit more from cfchecker.

The cells below contain snippets to

• identify a coordinate variable,
• check for valid formula_terms,
• and check that all formula_terms variables are present.

In [7]:
from netCDF4 import Dataset

url = ('http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/avg/'
       'ESPRESSO_Real-Time_v2_Averages_Best')

nc = Dataset(url)

In [8]:
formula_terms = lambda v: v is not None

var = nc.get_variables_by_attributes(formula_terms=formula_terms)[0]

formula_terms = var.formula_terms
formula_terms

Out[8]:
u's: s_rho C: Cs_r eta: zeta depth: h depth_c: hc'

In [9]:
if nc.dimensions.get(var.name):
    print("Hurray I am a coordinate variable!")

Hurray I am a coordinate variable!


In [10]:
import re

if re.search("^([a-zA-Z0-9_]+: +[a-zA-Z0-9_]+( +)?)*$", formula_terms):
    print("And I have a valid formula_terms!")

And I have a valid formula_terms!


In [11]:
for x in formula_terms.split():
    if not re.search("^[a-zA-Z0-9_]+:$", x):
        if x in nc.variables.keys():
            print("{} is present in the file!".format(x))

s_rho is present in the file!
Cs_r is present in the file!
zeta is present in the file!
h is present in the file!
hc is present in the file!



Now I have all the pieces I need to build a CF-convention formula_terms parser.
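
As a rough sketch of how those pieces might fit together, the snippet below validates a formula_terms string with the same regular expression used above, splits it into a {term: variable} dictionary, and lists the variables that are not available. The function names parse_formula_terms and missing_variables are hypothetical, and the available set stands in for nc.variables so the example runs without a dataset.

```python
import re

# Same pattern as in the validity check above.
FORMULA_TERMS_RE = re.compile(r"^([a-zA-Z0-9_]+: +[a-zA-Z0-9_]+( +)?)*$")


def parse_formula_terms(formula_terms):
    """Return {term: variable_name} for a valid formula_terms string."""
    if not FORMULA_TERMS_RE.match(formula_terms):
        raise ValueError("Invalid formula_terms: {!r}".format(formula_terms))
    tokens = formula_terms.split()
    # Tokens alternate between "term:" and "variable_name".
    return {term.rstrip(":"): name
            for term, name in zip(tokens[::2], tokens[1::2])}


def missing_variables(terms, available):
    """Return the referenced variable names that are not in `available`."""
    return [name for name in terms.values() if name not in available]


terms = parse_formula_terms("s: s_rho C: Cs_r eta: zeta depth: h depth_c: hc")

# Stand-in for nc.variables; "hc" is left out on purpose.
available = {"s_rho", "Cs_r", "zeta", "h"}
missing = missing_variables(terms, available)
print(terms)
print(missing)
```

With a real dataset, passing nc.variables as available should work the same way, since membership tests on it check the variable names.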

With that information I can improve odvc to automatically find which formula should be used and correctly determine the derived z variable units. All while avoiding dealing with this:

In [12]:
!head -20 ./data/cf-standard-name-table.xml

<?xml version="1.0"?>
<standard_name_table xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="cf-standard-name-table-1.1.xsd">
<version_number>27</version_number>
<last_modified>2013-11-28T05:25:32Z</last_modified>
<institution>Program for Climate Model Diagnosis and Intercomparison</institution>
<contact>webmaster@pcmdi.llnl.gov</contact>

<entry id="age_of_sea_ice">
<canonical_units>year</canonical_units>
<grib></grib>
<amip></amip>
<description>&quot;Age of sea ice&quot; means the length of time elapsed since the ice formed.</description>
</entry>

<entry id="age_of_stratospheric_air">
<canonical_units>s</canonical_units>
<grib></grib>
<amip></amip>
<description>&quot;Age of stratospheric air&quot; means an estimate of the time since a parcel of stratospheric air was last in contact with the troposphere.</description>



This post was written as an IPython notebook. It is available for download or as a static html.