DISCLAIMER: most of the code below was stolen from the old, but still awesome and very useful cfchecker.
I am writing a library that needs to "understand" the CF-conventions standard names table. More specifically, I need to know each variable's standard_name and its corresponding units.
The format of choice for the standard names table is the Extensible Markup Language (XML).
I don't like XML and avoid it as much as I can, but in a case like this there is no way around it. Or is there?
This post is just a few notes for my future self on how to parse the CF standard names table using Python.
Let's start with a boilerplate class to parse the table and construct a dictionary with the standard_names as keys and the units as values.
(PS: I am sure there are better ways to do this. So if you are an XML expert, feel free to post a better way in the comments section.)
from xml.sax import ContentHandler


def normalize_whitespace(text):
    """
    Remove redundant whitespace from a string.
    """
    return ' '.join(text.split())
class ConstructDict(ContentHandler):
    """
    Parse the xml standard_name table, reading all entries
    into a dictionary and storing the `standard_name` and the `units`.
    """
    def __init__(self):
        self.inUnitsContent = 0
        self.inEntryIdContent = 0
        self.inVersionNoContent = 0
        self.inLastModifiedContent = 0
        self.dict = {}

    def startElement(self, name, attrs):
        # If it's an entry element, save the id
        if name == 'entry':
            id = normalize_whitespace(attrs.get('id', ""))
            self.this_id = id
        # If it's the start of a canonical_units element
        elif name == 'canonical_units':
            self.inUnitsContent = 1
            self.units = ""
        elif name == 'alias':
            id = normalize_whitespace(attrs.get('id', ""))
            self.this_id = id
        elif name == 'entry_id':
            self.inEntryIdContent = 1
            self.entry_id = ""
        elif name == 'version_number':
            self.inVersionNoContent = 1
            self.version_number = ""
        elif name == 'last_modified':
            self.inLastModifiedContent = 1
            self.last_modified = ""

    def characters(self, ch):
        if self.inUnitsContent:
            self.units = self.units + ch
        elif self.inEntryIdContent:
            self.entry_id = self.entry_id + ch
        elif self.inVersionNoContent:
            self.version_number = self.version_number + ch
        elif self.inLastModifiedContent:
            self.last_modified = self.last_modified + ch

    def endElement(self, name):
        # If it's the end of the canonical_units element, save the units
        if name == 'canonical_units':
            self.inUnitsContent = 0
            self.units = normalize_whitespace(self.units)
            self.dict[self.this_id] = self.units
        # If it's the end of the entry_id element, find the units for the self.alias
        elif name == 'entry_id':
            self.inEntryIdContent = 0
            self.entry_id = normalize_whitespace(self.entry_id)
            self.dict[self.this_id] = self.dict[self.entry_id]
        # If it's the end of the version_number element, save it
        elif name == 'version_number':
            self.inVersionNoContent = 0
            self.version_number = normalize_whitespace(self.version_number)
        # If it's the end of the last_modified element, save the last modified date
        elif name == 'last_modified':
            self.inLastModifiedContent = 0
            self.last_modified = normalize_whitespace(self.last_modified)
In the cell below we instantiate the parser and feed it a ConstructDict instance.
from xml.sax import make_parser
from xml.sax.handler import feature_namespaces
parser = make_parser()
parser.setFeature(feature_namespaces, 0)
std_name_dict = ConstructDict()
parser.setContentHandler(std_name_dict)
standard_name = './data/cf-standard-name-table.xml'
parser.parse(standard_name)
Now we can access the table as a Python dictionary.
std_name_dict.dict.get("sea_water_potential_temperature")
std_name_dict.dict.get("sea_water_salinity")
Cool, we have a {standard_name: units} dictionary!
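As one possible answer to the "better way" question, here is a minimal sketch that builds the same kind of dictionary with xml.etree.ElementTree from the standard library. It assumes the layout the SAX handler relies on (entry elements with an id attribute and a canonical_units child, alias elements with an entry_id child). The SAMPLE string, the alias name in it, and the function name units_by_standard_name are made up for illustration, and I have not run this against the full table.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample that mimics the layout the SAX handler expects.
# The alias id is hypothetical and the units are only illustrative.
SAMPLE = """\
<standard_name_table>
  <entry id="sea_water_salinity">
    <canonical_units>1e-3</canonical_units>
  </entry>
  <entry id="sea_water_potential_temperature">
    <canonical_units>  K  </canonical_units>
  </entry>
  <alias id="made_up_salinity_alias">
    <entry_id>sea_water_salinity</entry_id>
  </alias>
</standard_name_table>
"""


def units_by_standard_name(root):
    """Return {standard_name: canonical_units}, aliases included."""
    units = {}
    # First pass: every <entry> maps its id to its canonical units.
    for entry in root.iter("entry"):
        text = entry.findtext("canonical_units", default="")
        units[entry.get("id")] = " ".join(text.split())
    # Second pass: every <alias> inherits the units of the entry it points to.
    for alias in root.iter("alias"):
        target = " ".join(alias.findtext("entry_id", default="").split())
        if target in units:
            units[alias.get("id")] = units[target]
    return units


table = units_by_standard_name(ET.fromstring(SAMPLE))
print(table["sea_water_potential_temperature"])
print(table["made_up_salinity_alias"])
```

For the real file, ET.parse('./data/cf-standard-name-table.xml').getroot() should give a root element that can be passed to the same function.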
I also need some way to parse and validate the CF formula_terms.
To do that I need to steal a little bit more from cfchecker.
In the cells below there are some snippets to:
- identify a coordinate variable,
- check for valid formula_terms,
- and check if all formula_terms variables are present.
from netCDF4 import Dataset
url = ('http://tds.marine.rutgers.edu/thredds/dodsC/roms/espresso/2013_da/avg/'
'ESPRESSO_Real-Time_v2_Averages_Best')
nc = Dataset(url)
formula_terms = lambda v: v is not None
var = nc.get_variables_by_attributes(formula_terms=formula_terms)[0]
formula_terms = var.formula_terms
formula_terms
if nc.dimensions.get(var.name):
    print("Hurray I am a coordinate variable!")
import re
if re.search("^([a-zA-Z0-9_]+: +[a-zA-Z0-9_]+( +)?)*$", formula_terms):
    print("And I have a valid `formula_terms`!")
for x in formula_terms.split():
    if not re.search("^[a-zA-Z0-9_]+:$", x):
        if x in nc.variables.keys():
            print("{} is present in the file!".format(x))
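The validity and presence checks above can be folded into one small helper. This is only a sketch: the function name parse_formula_terms and the example string are my own (the string is a ROMS-like illustration, not read from the dataset), and the validation regex is the one used in the snippet above.

```python
import re

# Same validation pattern as in the snippet above.
VALID = re.compile(r"^([a-zA-Z0-9_]+: +[a-zA-Z0-9_]+( +)?)*$")
# Captures each "term: variable" pair.
PAIR = re.compile(r"([a-zA-Z0-9_]+): +([a-zA-Z0-9_]+)")


def parse_formula_terms(formula_terms, available):
    """Return ({term: variable}, [variables not found in `available`])."""
    if not VALID.search(formula_terms):
        raise ValueError("invalid formula_terms: {!r}".format(formula_terms))
    terms = dict(PAIR.findall(formula_terms))
    missing = [name for name in terms.values() if name not in available]
    return terms, missing


# Illustrative input; "hc" is left out of `available` on purpose.
example = "s: s_rho C: Cs_r eta: zeta depth: h depth_c: hc"
terms, missing = parse_formula_terms(
    example, available={"s_rho", "Cs_r", "zeta", "h"}
)
print(terms)
print(missing)
```

With the dataset opened earlier, parse_formula_terms(var.formula_terms, nc.variables) should work the same way, since membership tests on nc.variables check the variable names.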
Now I have all the pieces I need to build a CF-convention formula_terms parser.
With that information I can improve odvc to automatically find which formula should be used and correctly determine the units of the derived z variable. All while avoiding dealing with this:
!head -20 ./data/cf-standard-name-table.xml