We are very excited that, this summer, the City of Chicago teamed up with volunteer data scientists to develop new statistical models that improve the accuracy of beach advisories issued for the presence of E. coli. Sean Thornton, a Program Advisor at Harvard University’s Ash Center, has a terrific article on the project:
For Chicagoans, few things are as enjoyable as a day at the beach. That joy, however, is contingent on clean waters that are free of contaminants such as E. coli bacteria. With the arrival of this year’s beach season, Chicago has built an analytical pilot model that will enhance its Park District’s regular beach water quality inspection process. The model specifically aims to guide which beaches may need to close based on predicted E. coli readings, which helps protect the public with advisories or closures as soon as possible.
This is not Chicago’s first municipal predictive analytics project; the city has had success with predictive models for rodent baiting operations and food establishment inspections, among others. This model differs, however, because it wasn’t built by the City of Chicago at all, but by a team of volunteers from the city’s civic tech community.
Delivering a better model was both a fascinating and a difficult problem. In fact, the team put together an ensemble of models, ranging from logit regressions to gradient-boosting models, to deliver improved performance. Each week, between 4 and 12 data scientists, statisticians, and researchers worked on a model that can reduce the chance of Chicagoans, including children, getting sick after an enjoyable day at the beach.
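To make the idea of an ensemble concrete, here is a minimal sketch that averages the predicted probabilities of two member models (often called soft voting). It is only an illustration: the article does not say how the team combined its models, and the two members, their coefficients, and the feature names `prior_day_level` and `rain_inches` are all made up for this example.

```python
import math


def logit_member(features):
    """Stand-in for a logistic regression, with made-up coefficients."""
    z = -2.0 + 0.004 * features["prior_day_level"] + 0.5 * features["rain_inches"]
    return 1.0 / (1.0 + math.exp(-z))


def tree_member(features):
    """Stand-in for a tree-based model: one hand-written rule."""
    if features["prior_day_level"] >= 200.0 and features["rain_inches"] >= 0.5:
        return 0.9
    return 0.1


def ensemble_probability(features, members):
    """Average the members' predicted probabilities (soft voting)."""
    return sum(member(features) for member in members) / len(members)


# Hypothetical inputs for one beach on one day
features = {"prior_day_level": 500.0, "rain_inches": 0.0}
probability = ensemble_probability(features, [logit_member, tree_member])
print(f"Estimated probability of an elevated reading: {probability:.2f}")
```

Averaging is only one way to combine models. Its appeal is that a member that is badly wrong on a given day gets pulled back toward the others.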
Because the project is open source, you can find the entire source code on GitHub and see if you can improve on the work done by this group of civic technologists. Notes about the project can be found on the project’s wiki page, including detailed notes on the lab testing process and weekly updates on the team’s findings.
Data from the summer of 2016 is needed to compare performance. As with any statistical model, the team was concerned with “over-fitting”, where so much effort goes into predicting past events that the model suffers when predicting what actually happens in the future. This summer provides an excellent testing ground for the new model, providing feedback on the predictive power of these new techniques.
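The over-fitting concern described above can be sketched with a small example: hold out the most recent season, fit on the earlier ones, and compare the error on each. The "memorizer" below is a deliberately over-fit stand-in, not the team's model, and the readings, beach name, and field names are invented for illustration.

```python
from statistics import mean


def split_by_season(records, holdout_year):
    """Train on seasons before holdout_year; test on holdout_year itself."""
    train = [r for r in records if r["year"] < holdout_year]
    test = [r for r in records if r["year"] == holdout_year]
    return train, test


def fit_memorizer(train):
    """A deliberately over-fit 'model' that memorizes every training reading."""
    lookup = {(r["year"], r["beach"], r["day"]): r["level"] for r in train}
    fallback = mean(r["level"] for r in train)

    def predict(record):
        # Unseen days fall back to the overall training average.
        key = (record["year"], record["beach"], record["day"])
        return lookup.get(key, fallback)

    return predict


def mean_absolute_error(records, predict):
    return mean(abs(r["level"] - predict(r)) for r in records)


# Made-up readings, not real lab results
records = [
    {"year": 2014, "beach": "Beach A", "day": 1, "level": 100.0},
    {"year": 2014, "beach": "Beach A", "day": 2, "level": 300.0},
    {"year": 2015, "beach": "Beach A", "day": 1, "level": 200.0},
    {"year": 2015, "beach": "Beach A", "day": 2, "level": 400.0},
    {"year": 2016, "beach": "Beach A", "day": 1, "level": 150.0},
    {"year": 2016, "beach": "Beach A", "day": 2, "level": 350.0},
]

train, test = split_by_season(records, holdout_year=2016)
predict = fit_memorizer(train)
train_error = mean_absolute_error(train, predict)
holdout_error = mean_absolute_error(test, predict)
print(f"Error on past seasons: {train_error}, error on held-out season: {holdout_error}")
```

The memorizer looks perfect on the seasons it has already seen and much worse on the season it has not, which is exactly the gap a fresh summer of data is meant to expose.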
Open data played a key role in enabling this project. For the past two years, the Chicago Park District has published real-time information on the forecasts from the statistical model developed by the United States Geological Survey (USGS). Beginning this week, these forecasts are also being stored on the city’s open data portal.
To support this project, the Chicago Park District and City of Chicago also released the actual results of the lab tests on the portal. Over the course of this summer, the team of volunteer civic technologists, the City of Chicago, and the Chicago Park District will compare the results of the existing USGS model with those of the new models. Both the forecast data and the actual lab results will be used to show the performance of each model.
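One simple way to frame the comparison described above is to check how often each model's forecast would have triggered an advisory on days when the lab result called for one. The sketch below assumes an advisory threshold of 235 CFU per 100 mL, which is not stated in this post, and uses made-up readings and hypothetical field names rather than the actual portal schema.

```python
from dataclasses import dataclass

# Assumed advisory threshold in CFU per 100 mL; not stated in this post.
ADVISORY_THRESHOLD = 235.0


@dataclass
class BeachDay:
    beach: str
    lab_result: float     # measured E. coli level
    usgs_forecast: float  # forecast from the existing model
    new_forecast: float   # forecast from the volunteer-built model


def score(days, forecast_field, threshold=ADVISORY_THRESHOLD):
    """Return (precision, recall) for the advisories implied by one forecast."""
    true_pos = false_pos = false_neg = 0
    for day in days:
        predicted = getattr(day, forecast_field) >= threshold
        actual = day.lab_result >= threshold
        if predicted and actual:
            true_pos += 1
        elif predicted:
            false_pos += 1
        elif actual:
            false_neg += 1
    flagged = true_pos + false_pos
    elevated = true_pos + false_neg
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / elevated if elevated else 0.0
    return precision, recall


# Made-up values for illustration only
days = [
    BeachDay("Beach A", 400.0, 100.0, 300.0),
    BeachDay("Beach B", 50.0, 40.0, 60.0),
    BeachDay("Beach C", 300.0, 250.0, 260.0),
    BeachDay("Beach D", 90.0, 240.0, 80.0),
]

usgs_precision, usgs_recall = score(days, "usgs_forecast")
new_precision, new_recall = score(days, "new_forecast")
print(f"USGS model: precision={usgs_precision}, recall={usgs_recall}")
print(f"New model:  precision={new_precision}, recall={new_recall}")
```

Precision answers "when an advisory was issued, was it warranted?" and recall answers "when the water was actually bad, was an advisory issued?". Reporting both keeps a model from looking good simply by issuing advisories every day or never.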
The above statistical models depend on hourly weather forecasts. For this, the team used the excellent weather models from Forecast.io (which powers the popular Dark Sky apps). Updated weather information for each beach is available here.
Of course, this project would not have been possible without the team that volunteered their time: Matt, Rebecca, Kevin, Melissa, Scott, David, Daniel, Nick, Scott, Chris, and Forest. Well over 150 hours were dedicated to this project by a team who worked hard to make our summers a little better.