The City of Chicago has a small team that works to liberate city data on the data portal. We use a variety of tools to clean, transform, and publish data on the portal. Often, datasets are automatically updated every day, but a number of datasets are one-time or infrequent updates.
In a series of posts, we will review the tools and techniques we use to prepare datasets. Whenever possible, we use open source tools to extract, transform, and load datasets on the portal. In this first post, we will give a detailed description of how we use OpenRefine (formerly Google Refine) to clean and transform one-time or infrequently updated datasets.
The first step is to obtain the dataset. The city is a large organization with over 30,000 employees, so finding new data and the person who knows about it can be more complicated than it seems. Fortunately, we have been posting this dataset for a number of years, so we have a great partnership with the CPS staff who deliver this information.
In this case, we actually receive the information via email, which is pretty typical for one-time uploads.
Cleaning & Organizing Data
There is a significant amount of work to get the data ready for primetime. Our primary goal is to organize the data so it is usable by the civic developer community, researchers, and members of the public who want to use the information. This means transforming the data to be machine-readable, easy to analyze statistically, and ready for data visualization. This process can be messy, but fortunately, we have a great tool, OpenRefine, to help with the process.
OpenRefine is an open-source tool designed around wrangling dirty data. Often, data is cleaned in a spreadsheet, but OpenRefine is better. It allows you to quickly investigate your data for extreme values, blanks, or errant text fields mixed with numbers. It also has a powerful scripting language that allows you to systematically transform data.
We load the data we received into OpenRefine. The application opens typical spreadsheet-like files, such as comma-separated values (CSV), tab-separated values (TSV), and Excel documents. It also supports file types such as XML and JSON, favorites for web and application developers.
We use OpenRefine’s facet tool to quickly explore the data in each column, looking for extreme values and typos, and to get a general idea of what each column contains.
Upon opening, it’s clear that “No Data” is repeated throughout to represent missing values. This creates additional work for any civic developer or researcher. NWEA percentiles were also tagged with suffixes like 1st, 2nd, 3rd, and 50th, which would require additional effort to analyze or graph. We want to remove these elements before publishing the dataset.
OpenRefine supports several scripting languages (GREL, Clojure, and Jython) for powerful transformations. We use GREL to eliminate “No Data” and to strip the “st”, “nd”, “rd”, and “th” suffixes. In addition, we also need to transform the values into proper numbers, instead of text that merely looks like numbers.
We used the replace() function to search and replace values. By default, OpenRefine refers to the contents of the existing column as "value". For instance, to replace any instance of "No Data" in a column, we used value.replace("No Data", ""). Like other languages, you can "daisy chain" function calls, which we use to make several replacements. In addition to removing "No Data", the "st", "nd", "rd", and "th" suffixes are also replaced with empty strings. Thus, the full expression is value.replace("No Data", "").replace("st", "").replace("nd", "").replace("rd", "").replace("th", "").
Finally, we convert the values to numbers, as OpenRefine would otherwise treat the results as text. We add .toNumber() to the end of the expression to force the results to be stored as numbers. It’s also important to choose "set to blank" on errors, because toNumber() fails on the blank cells left behind after removing "No Data". There are over a dozen columns of NWEA scores, so we easily repeated this operation by copying and pasting the expression into the other columns.
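For readers who want to follow the logic outside OpenRefine, the same cleanup can be sketched in Python. This is only an illustrative stand-in for the GREL expression above, with a hypothetical helper name; the regex is slightly stricter than the bare replace() chain, stripping the ordinal suffix only when it follows a digit.

```python
import re

def clean_percentile(value):
    """Mirror the GREL chain: drop 'No Data', strip ordinal
    suffixes (1st, 2nd, 3rd, 50th), and convert to a number."""
    value = value.replace("No Data", "")
    # Strip st/nd/rd/th only when they trail digits, e.g. "50th" -> "50"
    value = re.sub(r"(?<=\d)(st|nd|rd|th)$", "", value)
    # Return None for empty cells, like "set to blank" on errors
    return int(value) if value else None

print(clean_percentile("50th"))     # 50
print(clean_percentile("No Data"))  # None
```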
Website addresses weren’t normalized, so we performed a couple of operations to normalize the data. It’s easy to edit single entries within the faceting menu.
However, that solution does not scale well if there are many corrections to be made. In that situation, it’s best to use the transformation tool. OpenRefine supports regular expressions, which we used to add “http://” in front of websites that began with only “www”.
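As a rough illustration of that transform (a Python sketch, not the exact expression we used in OpenRefine; the function name and sample addresses are hypothetical):

```python
import re

def normalize_url(url):
    """Prepend 'http://' when an address starts with 'www.',
    leaving already-complete addresses untouched."""
    # The lookahead matches the empty string at the start of the
    # address only when it is immediately followed by "www."
    return re.sub(r"^(?=www\.)", "http://", url.strip())

print(normalize_url("www.cps.edu"))         # http://www.cps.edu
print(normalize_url("http://www.cps.edu"))  # unchanged
```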
The “CPS Performance Policy Status” needed a number of changes to make it more usable for developers and researchers. The values were ‘NOT ON PROBATION’, ‘NOT APPLICABLE’, or ‘ON PROBATION (x YEARS)’ depending on the probation length. We wanted to separate the status and have another column that just has a single number for any probation length.
First, we isolate the number of years a school is on probation by adding a column based on “CPS Performance Policy Status”. We use a series of replace() functions to eliminate the 'NOT APPLICABLE', 'NOT ON PROBATION', and 'ON PROBATION (' text. Next, we use a regular expression to replace " years)", leaving just the number of years of probation.
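The chain of replacements can be sketched in Python (an illustrative equivalent of the GREL expression, with hypothetical sample values; note that 'NOT ON PROBATION' must be removed before 'ON PROBATION (' so the leading "NOT " is not left behind):

```python
def probation_years(status):
    """Strip the status text and the ' years)' wrapper,
    leaving just the number of probation years."""
    s = (status.replace("NOT APPLICABLE", "")
               .replace("NOT ON PROBATION", "")
               .replace("ON PROBATION (", "")
               .replace(" years)", ""))
    # Return None for rows with no probation length
    return int(s) if s else None

print(probation_years("ON PROBATION (5 years)"))  # 5
print(probation_years("NOT ON PROBATION"))        # None
```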
Second, we discard everything but the probation status using the split() function. We used value.split(" (") to separate the string into an array, using " (" as the separator. For instance, 'ON PROBATION (5 years)' is transformed into ['ON PROBATION', '5 years)']. We want to keep just 'ON PROBATION', so the entire expression is written as value.split(" (")[0].
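In Python, the same split-and-keep-the-first-element step looks like this (illustrative only; the function name is hypothetical):

```python
def probation_status(status):
    """Keep only the text before ' (' -- the Python analogue of
    the GREL expression value.split(" (")[0]."""
    return status.split(" (")[0]

print(probation_status("ON PROBATION (5 years)"))  # ON PROBATION
print(probation_status("NOT ON PROBATION"))        # NOT ON PROBATION
```

Values without a " (" separator, such as 'NOT ON PROBATION', pass through unchanged, since split() returns the whole string as the first element.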
OpenRefine stores any changes in a JSON file, which can be exported at the end. If needed, we can use the JSON file to repeat any transformations needed from the original file. You can see the JSON for this file on this gist.
OpenRefine is a wonderful tool, but it has limitations. It is not a full-fledged ETL tool and does not easily allow us to repeat the same transformations on a frequent, automated basis. That limitation leads us to use OpenRefine exclusively for one-time or infrequently updated datasets.
A well-known limitation of OpenRefine is its inability to handle large datasets of several million rows. It usually will not allow users to import datasets that large, regardless of the machine’s performance or specifications.
To fill these gaps, we rely on Kettle, an open-source ETL tool, for automated updates; Kettle will be the subject of the next post in this series.