2016: A Geometadata Odyssey
This is a series of blog posts about my experience taking metadata from the Minnesota Geospatial Commons and transforming it as part of the Big Ten Academic Alliance’s (formerly the CIC) Geospatial Data Discovery Project.
Post #1 - Background and FTP
The first post will begin with a little background about the Minnesota Geospatial Commons. It concludes with how I went about pulling XML files from a heavily nested FTP site.
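As a rough illustration of what walking a nested FTP site can look like, here is a minimal sketch using Python's standard `ftplib`. It is not the script from the post: the function names, the `ftp.example.org` host, and the reliance on the server supporting the MLSD command are all assumptions made for the example.

```python
import posixpath
from ftplib import FTP  # only needed for the commented-out usage below


def find_xml_files(ftp, root="/"):
    """Recursively yield remote paths of .xml files under `root`.

    Assumes the server supports MLSD, which reports whether each
    entry is a file or a directory.
    """
    for name, facts in ftp.mlsd(root):
        if name in (".", ".."):
            continue
        path = posixpath.join(root, name)
        kind = facts.get("type")
        if kind == "dir":
            # Descend into subdirectories, however deep they go.
            yield from find_xml_files(ftp, path)
        elif kind == "file" and name.lower().endswith(".xml"):
            yield path


def download(ftp, remote_path, local_path):
    """Save one remote file to disk in binary mode."""
    with open(local_path, "wb") as fh:
        ftp.retrbinary("RETR " + remote_path, fh.write)


# Hypothetical usage (host name is a placeholder, not the real server):
# ftp = FTP("ftp.example.org")
# ftp.login()
# for remote in find_xml_files(ftp, "/"):
#     download(ftp, remote, posixpath.basename(remote))
```

Servers that lack MLSD would need a different listing strategy, such as trying to change into each entry to see whether it is a directory.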
Post #2 - XSLT and GeoNetwork
The next post will introduce GeoNetwork, a web-based metadata editor designed for geospatial metadata, the ISO 19139 standard in particular. I’ll describe how I added an existing external XSLT (a stylesheet for transforming XML from one form to another) to GeoNetwork, and then how I used that XSLT to transform 500 records from the Minnesota Geospatial Commons into ISO 19139 records.
Post #3 - Editing Geometadata on Spreadsheets
Although the initial transformation is almost completely automated, the results required a significant amount of editing in order to meet the requirements of the CIC GDDP. With over 500 records, editing each record individually would have taken far more time than I had (and definitely more than I would want to spend). To save editing time and make edits more consistent across large groups of records, I wrote a Python script called csw-update, which uses the OGC CSW spec to request XML files from GeoNetwork, adjusts them with the lxml Python module, then uploads them back to GeoNetwork. A major advantage of the csw-update script is that it allows editing via spreadsheets, meaning that nearly anyone can be trained to do the work, as opposed to the comparatively complicated process of editing XML files.
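To make the spreadsheet idea concrete, here is a minimal sketch of the middle step only: applying edits from CSV rows to ISO 19139 XML. It is not the csw-update code. The `uuid` and `title` column names and the `apply_title_edits` function are hypothetical, the CSW request and upload steps are left out, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example has no dependencies.

```python
import csv
import io
import xml.etree.ElementTree as ET

# ISO 19139 namespaces used by the title element.
NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)  # keep readable prefixes on output

TITLE_PATH = ".//gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString"


def apply_title_edits(records, csv_text):
    """Apply title edits from a CSV to a dict of {uuid: ISO XML string}.

    Assumes the CSV has 'uuid' and 'title' columns (hypothetical layout).
    Returns only the records whose title actually changed.
    """
    updated = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        xml_text = records.get(row["uuid"])
        if xml_text is None:
            continue  # spreadsheet row with no matching record
        root = ET.fromstring(xml_text)
        node = root.find(TITLE_PATH, NS)
        if node is not None and node.text != row["title"]:
            node.text = row["title"]
            updated[row["uuid"]] = ET.tostring(root, encoding="unicode")
    return updated
```

Returning only the changed records keeps the eventual upload step small, since untouched records never need to be sent back to the catalog.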
Post #4 - Using OpenRefine for a Geometadata deep dive
After the initial spreadsheet editing, I talk about using OpenRefine for more thorough improvement of the metadata. In particular, I will discuss the use of a controlled vocabulary to make keywords more useful, as well as the use of faceting to make harmonizing values across the records more consistent and, I daresay, fun.
Post #5 - GeoNetwork Editor Customization
No matter how much I tried to avoid it, there always seems to be a certain amount of one-by-one editing that has to be done. To expedite this process, I created a distinct metadata view within GeoNetwork that displays only the elements we were focused on for the CIC GDDP. This saves on scrolling, clicks, and confusion for me and the other metadata editors in the CIC GDDP. We also decided on some other customizations of GeoNetwork, which I describe briefly in this post.
Post #6 - Publishing to GeoBlacklight from GeoNetwork
In the grand finale post, I describe another Python script, csw-to-geoblacklight, which pulls ISO XML from the GeoNetwork CSW, transforms it into GeoBlacklight JSON, and finally pushes it into the Solr index that backs the GeoBlacklight discovery interface.
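The transformation step might look roughly like the sketch below, which maps a few ISO 19139 elements onto a GeoBlacklight-style JSON document. This is not the csw-to-geoblacklight code. The function name is hypothetical, the field names follow my understanding of the GeoBlacklight 1.0 schema and may not match what the script used in 2016, and fetching from CSW and posting to Solr are omitted.

```python
import json
import xml.etree.ElementTree as ET

NS = {
    "gmd": "http://www.isotc211.org/2005/gmd",
    "gco": "http://www.isotc211.org/2005/gco",
}


def _text(root, path):
    """Return stripped text at `path`, or None if the element is absent."""
    node = root.find(path, NS)
    return node.text.strip() if node is not None and node.text else None


def iso_to_geoblacklight(xml_text):
    """Build a minimal GeoBlacklight-style dict from one ISO 19139 record."""
    root = ET.fromstring(xml_text)
    doc = {
        "dc_identifier_s": _text(root, ".//gmd:fileIdentifier/gco:CharacterString"),
        "dc_title_s": _text(
            root, ".//gmd:citation/gmd:CI_Citation/gmd:title/gco:CharacterString"
        ),
    }
    bbox = ".//gmd:EX_GeographicBoundingBox/gmd:{}/gco:Decimal"
    coords = [
        _text(root, bbox.format(tag))
        for tag in (
            "westBoundLongitude",
            "eastBoundLongitude",
            "northBoundLatitude",
            "southBoundLatitude",
        )
    ]
    if all(coords):
        west, east, north, south = map(float, coords)
        # Solr's ENVELOPE syntax orders values as minX, maxX, maxY, minY.
        doc["solr_geom"] = "ENVELOPE({}, {}, {}, {})".format(west, east, north, south)
    return doc


# The resulting dict can be serialized with json.dumps(doc) before indexing.
```

Records without a complete bounding box simply get no `solr_geom` field here; whether to skip or flag such records instead is a choice the real workflow would have to make.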
All of these scripts, along with the metadata in question, are located in our GitHub organization. I’d love to hear any questions or other feedback. You can reach me on Twitter @kr_dyke or via email at krdyke ( at ) gmail.com.