Monday, January 4, 2016

Bad data versus big data (or big bad data!!)

There's currently quite a buzz about "big data" and how water utilities might dig into all the data they collect in order to be "smarter." Several of my colleagues are investigating ways to do this under the banner of Smart Integrated Infrastructure (SII) and Smart Water Analytics. Pretty cool stuff. In a couple of conversations on the topic I half-jokingly said that wastewater doesn't have big data, it has crap data!  To avoid misunderstanding, I should clarify that by "crap" I mean bad data, not just data describing the fecal material we treat!

Over the years I've been involved with various projects and discussions on generating and handling data in wastewater treatment. A few years ago I was involved in a couple of WERF projects focused on developing Decision Support Systems (DSS) to prevent plant upsets, along with Dr Nancy Love and Advanced Data Mining (ADMi). The folks at ADMi did some nice data analytics to pick out anomalies that might indicate toxins in the plant influent, but one of the major hurdles we ran into was distinguishing anomalies due to toxins from anomalies due to measurement problems. This reminded me of what my ex-boss and mentor, Dr John Watts, used to drill into me: you need to focus on good primary measurements in order to have confidence in your data. Wastewater is a tough place to try to do that! As I said, a lot of our data is bad.

So, here is my brain dump on some of the keys to making big data work in wastewater, and avoiding the pitfalls of bad big data (there's a tongue-twister there somewhere...)!

5 keys to making big data work


1. Focus on data quality rather than quantity

Starting from Dr Watts's sage advice to me years ago, written up in one of his rare papers here, no amount of fancy analytics can overcome measurement errors, whether that's noise, drift or interferences.  You need to have confidence in your primary sensors and analyzers, otherwise your big data analytics will be crunching meaningless numbers and any results you get will be useless.  Crap data = crap analytics!

In order to gain confidence in your data, you need to do 3 things with your sensors/analyzers:
  1. Clean them - wastewater is an extremely fouling environment and not the best place to put scientific equipment.  My experience has been that everyone underestimates how quickly sensors become fouled.  Go for auto-cleaning whenever possible and avoid installing anything in raw sewage or primary effluent unless you really need the measurement (see Key #2!) as these areas are particularly prone to fouling. Mixed liquor is actually an easier place to take measurements and final effluent the easiest of all!
  2. Calibrate them - this is generally understood, though calibration, particularly for sensors that tend to drift, generally happens less often than it should.
  3. Validate them - this is the piece that's overlooked by most instrumentation suppliers, I think. Analytics to validate the measurements, particularly during calibration, is an area that needs much more attention (see the sketch below).
Much of the work that Dr Watts did at Minworth Systems was focused on automating these 3 things, and I've seen very few instruments come close to what he did 20 years ago!
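
To make the third item - validation - concrete, here's a minimal sketch in Python (pandas/numpy) of the kind of basic sanity checks I have in mind. The DO signal is simulated and the thresholds and window lengths are made-up assumptions; in practice you'd tune them per instrument:

    import numpy as np
    import pandas as pd

    # Hypothetical example: one day of 5-minute DO readings (mg/L),
    # with a simulated stuck sensor and an out-of-range spike.
    rng = np.random.default_rng(42)
    t = pd.date_range("2016-01-04", periods=288, freq="5min")
    do = pd.Series(2.0 + 0.3 * rng.standard_normal(288), index=t)
    do.iloc[100:130] = do.iloc[99]   # fouled probes often "flatline"
    do.iloc[200] = 35.0              # a physically implausible reading

    def validate(series, lo=0.0, hi=15.0, flat_window=6, max_step=5.0):
        """Flag readings that fail basic sanity checks (limits are assumptions)."""
        out_of_range = (series < lo) | (series > hi)
        flatlined = series.rolling(flat_window).std() == 0   # stuck at one value
        big_step = series.diff().abs() > max_step            # implausible jump
        return out_of_range | flatlined | big_step

    flags = validate(do)
    print(f"{flags.sum()} of {len(do)} readings flagged as suspect")

None of this is sophisticated, which is rather the point: even crude checks like these catch the failures that would otherwise quietly poison any downstream analytics.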

2. Measure what matters most

I could probably make this blog an ode to John Watts and fill it with his anecdotes.  One of my favorites is about a customer who asked him to install a dissolved oxygen (DO) probe in an anoxic zone. He suggested it would be cheaper to install a wooden probe and write 0 mg/L on a fake display!  Maybe that's a little harsh, but the point is that we should only measure things that are useful to help us run the plant and that we're actually going to use to make some decision. Generally we're lacking many important and basic measurements in our treatment plants (e.g. dissolved oxygen in the aerated basins, airflow to each aeration zone and electricity use by blowers), but we need to be careful in our enthusiasm not to swing to the other extreme and start measuring stuff that's interesting but not useful. You can spend some serious money measuring ammonia and nitrate all over a treatment plant, but unless you're actually using it for control, the measurements will eventually be ignored and the instruments neglected.  It's much better to have a handful of good instruments, positioned in locations where you're actually measuring something you can control; then there's motivation to keep those sensors running well (see Key #1!)

3. Think dynamics, not steady state

A lot of the design and operational guidance in textbooks and training materials has simple equations into which you plug a single number to get your answer (e.g. sludge age calculation or removal efficiency). Similarly, influent and effluent samples are usually flow-weighted or time-averaged composites (worse still, grab samples!).  All this means that we're used to thinking and talking about average daily conditions.
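
To see what I mean by plug-in-a-single-number thinking, here's the classic steady-state sludge age (SRT) calculation as a tiny Python sketch; every figure is made up for illustration:

    # Hypothetical numbers for a conventional activated sludge plant.
    mlss = 3000.0        # mixed liquor suspended solids, mg/L
    v_aeration = 8000.0  # aeration basin volume, m3
    was_tss = 8000.0     # waste activated sludge concentration, mg/L
    was_flow = 400.0     # waste activated sludge flow, m3/d

    # Steady-state sludge age: solids held in the system / solids wasted per day.
    srt_days = (mlss * v_aeration) / (was_tss * was_flow)
    print(f"SRT = {srt_days:.1f} days")   # -> SRT = 7.5 days

One number in, one number out, and not a hint of the hour-to-hour swings the plant actually sees.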

[Graphic: difference between a composite sample and a continuous measurement. Courtesy Dr. Leiv Rieger/WEF, taken from the WEF Modeling 101 Webcast]
However, the reality is that our treatment plants see significant daily variations in flows and concentrations, and therefore we need to look at them as dynamic systems. This was first brought home to me in the late 1990s, when I was working on a plant in the UK doing biological phosphorus removal. We had an online phosphate analyzer taking measurements at the end of the aeration basin just prior to the clarifiers, and we would see daily phosphate peaks of 1 or 2 mg/L every afternoon for just an hour or so, yet the effluent composite sample measurements would be pretty consistently below 0.2 mg/L. To understand our wastewater treatment systems we need to measure their dynamics and then analyze that good data (having adhered to Keys #1 and #2, of course!!)
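
Here's a minimal Python sketch of that effect, with made-up numbers mirroring the anecdote: a 24-hour record that sits low all day except for a one-hour peak, and the composite average that hides it:

    import pandas as pd

    # Hypothetical example: phosphate (mg/L) measured every 15 minutes for a day.
    t = pd.date_range("2016-01-04", periods=96, freq="15min")
    po4 = pd.Series(0.1, index=t)
    po4.loc["2016-01-04 15:00":"2016-01-04 16:00"] = 1.5   # one-hour afternoon peak

    composite = po4.mean()   # what a time-averaged composite sample reports
    peak = po4.max()         # what the continuous analyzer reveals
    print(f"composite: {composite:.2f} mg/L, peak: {peak:.1f} mg/L")
    # -> composite: 0.17 mg/L, peak: 1.5 mg/L

The composite sits comfortably below 0.2 mg/L even though the process swings to nearly ten times that every afternoon.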

4. Recognize different timescales

Hand-in-hand with dynamics is the need to think about different timescales:

  • Diurnal (daily) variations
  • Weekly trends (especially weekend versus weekday differences)
  • Seasonal shifts
For each of these, the data analytics needs are quite different and have to be thought through properly. For diurnal variations, it's useful to compare one day to the next, maybe by overlaying the dynamic data; for weekly trends we can do something similar over a 7-day horizon; for seasonal shifts we need to plot out long-term trends and compare them to temperature and maybe rainfall shifts. The sketch below shows one way to slice the same record at all three timescales.
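
A short pandas sketch, using an invented hourly flow series; groupby and resample are just one convenient mechanism for this, not the only one:

    import numpy as np
    import pandas as pd

    # Hypothetical data: three weeks of hourly influent flow with a diurnal shape.
    t = pd.date_range("2016-01-04", periods=21 * 24, freq="h")
    flow = pd.Series(100 + 30 * np.sin(2 * np.pi * t.hour / 24), index=t)

    # Diurnal: overlay days by averaging on hour-of-day.
    diurnal_profile = flow.groupby(flow.index.hour).mean()

    # Weekly: compare weekday and weekend behaviour.
    weekday_vs_weekend = flow.groupby(flow.index.dayofweek >= 5).mean()

    # Seasonal: long-term trend from weekly means (a longer record would show seasons).
    weekly_trend = flow.resample("W").mean()

Same underlying data, three very different questions.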

5. Consider how to handle outliers and extraordinary events

This blog is getting long, so I'll try to wrap up this 5th key quickly!  In data analytics it's common practice to identify and eliminate outliers, on the assumption that they're either "bad" measurements or atypical and can therefore be ignored.  However, thinking back to my involvement in the WERF projects on DSS, a lot of what is done at wastewater treatment plants is trying to keep the process stable in response to abnormal events such as upsets from shock loads or toxins, or more typically responding to wet weather.  This means we need to identify these "outliers", but rather than throw them away, we need to decide how to respond. Maybe this is a topic for another blog?!!
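
As a closing sketch, here's a minimal Python take on flag-don't-delete: score each reading against a rolling median using the median absolute deviation (MAD), and keep the flagged points for someone (or some DSS) to classify. The window and threshold are assumptions to tune:

    import numpy as np
    import pandas as pd

    def flag_outliers(series, window=48, n_mads=5.0):
        """Flag (not drop!) points far from a rolling median, scored in MADs."""
        med = series.rolling(window, center=True, min_periods=1).median()
        mad = (series - med).abs().rolling(window, center=True, min_periods=1).median()
        score = (series - med).abs() / mad.replace(0, np.nan)
        return score > n_mads

    # Usage: keep the flagged readings and ask what they mean -
    # shock load? toxin? wet weather? or just a fouled sensor (Key #1)?
    # flags = flag_outliers(ammonia)
    # events = ammonia[flags]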