Over the years I've been involved with various projects and discussions on generating and handling data in wastewater treatment. A few years ago I was involved in a couple of WERF projects focused on developing Decision Support Systems (DSS) to prevent plant upsets, along with Dr Nancy Love and Advanced Data Mining (ADMi). The folks at ADMi did some nice data analytics to pick out anomalies that might indicate toxins in the plant influent, but one of the major hurdles we ran into was distinguishing anomalies caused by toxins from anomalies caused by measurement problems. This reminded me of what my ex-boss and mentor, Dr John Watts, used to drill into me: you need to focus on good primary measurements in order to have confidence in your data. Wastewater is a tough place to try to do that! As I said, a lot of our data is bad.
So, here is my brain dump on some of the keys to making big data work in wastewater, and avoiding the pitfalls of bad big data (there's a tongue-twister there somewhere...)!
5 keys to making big data work
1. Focus on data quality rather than quantity
Starting from Dr Watts' sage advice to me years ago, written up in one of his rare papers here, no amount of fancy analytics can overcome measurement errors, whether that's noise, drift or interferences. You need to have confidence in your primary sensors and analyzers; otherwise your big data analytics will be crunching meaningless numbers and any results you get will be useless. Crap data = crap analytics!
In order to gain confidence in your data, you need to do 3 things with your sensors/analyzers:
- Clean them - wastewater is an extremely fouling environment and not the best place to put scientific equipment. My experience has been that everyone underestimates how quickly sensors become fouled. Go for auto-cleaning whenever possible and avoid installing anything in raw sewage or primary effluent unless you really need the measurement (see Key #2!) as these areas are particularly prone to fouling. Mixed liquor is actually an easier place to take measurements, and final effluent the easiest of all!
- Calibrate them - this is generally understood, though sensors are usually calibrated less often than they should be, particularly those that tend to drift.
- Validate them - this is the piece that's overlooked by most instrumentation suppliers, I think. Analytics to validate the measurements, particularly during calibration, is an area that needs much more attention (a simple sketch of what I mean follows this list).
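To make that concrete, here's a minimal sketch in Python of the kind of automated validation checks I have in mind, applied to a raw dissolved oxygen trace. The function and every threshold in it are made up for illustration; sensible limits depend on the sensor, the location and the sampling rate.

```python
import numpy as np

def validate_do_signal(values, lo=0.0, hi=15.0, max_step=2.0, flat_run=30):
    """Basic validity checks for a raw DO trace (mg/L, e.g. 1-min samples).
    All thresholds here are illustrative, not recommendations."""
    values = np.asarray(values, dtype=float)
    flags = {}
    # Range check: readings outside a physically plausible span
    flags["out_of_range"] = (values < lo) | (values > hi)
    # Spike check: jumps between consecutive samples that are too large
    step = np.abs(np.diff(values, prepend=values[0]))
    flags["spike"] = step > max_step
    # Stuck-value check: a fouled or failed probe often flatlines
    stuck = np.zeros(len(values), dtype=bool)
    run = 1
    for i in range(1, len(values)):
        run = run + 1 if values[i] == values[i - 1] else 1
        if run >= flat_run:
            stuck[i - flat_run + 1 : i + 1] = True
    flags["stuck"] = stuck
    return flags
```

Even checks this crude will catch a surprising share of fouling and failure problems before they pollute your analytics.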
Much of the work that Dr Watts did at Minworth Systems was focused on automating these 3 things and I've seen very few instruments come close to what he did 20 years ago!
2. Measure what matters most
I could probably make this blog an ode to John Watts and fill it with his anecdotes. One of my favorites is the time a customer asked him to install a dissolved oxygen (DO) probe in an anoxic zone. He suggested it would be cheaper to install a wooden probe and write 0 mg/L on a fake display! Maybe that's a little harsh, but the point is that we should only measure things that are useful in helping us run the plant and that we're actually going to use to make some decision. Generally we're lacking many important and basic measurements in our treatment plants (e.g. dissolved oxygen in the aerated basins, airflow to each aeration zone and electricity use by blowers), but we need to be careful in our enthusiasm not to swing to the other extreme and start measuring stuff that's interesting but not useful. You can spend some serious money measuring ammonia and nitrate all over a treatment plant, but unless you're actually using it for control, the measurements will eventually be ignored and the instrument neglected. It's much better to have a handful of good instruments, positioned in locations where you're actually measuring something you can control; that way there's motivation to keep those sensors running well (see Key #1!)
3. Think dynamics, not steady state
A lot of the design and operational guidance in textbooks and training materials has simple equations into which you plug a single number to get your answer (e.g. a sludge age calculation or removal efficiency). Similarly, influent and effluent samples are usually flow-weighted or time-averaged composites (or, worse still, grab samples!). All this means we're used to thinking and talking about average daily conditions. But plants don't experience averages: flows and loads swing continuously through the day, and those dynamics are exactly what continuous measurements reveal and composite samples hide.
[Graphic showing the difference between a composite sample and a continuous measurement (courtesy Dr. Leiv Rieger/WEF, taken from the WEF Modeling 101 Webcast)]
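To put some numbers on what that graphic shows, here's a toy example in Python with an invented diurnal pattern, comparing the single value a flow-weighted composite reports with the swing a continuous analyzer would actually see:

```python
import numpy as np

# Synthetic 24-h influent pattern, hourly values (purely illustrative numbers)
hours = np.arange(24)
flow = 100 + 40 * np.sin((hours - 10) * np.pi / 12)    # m3/h, peaks mid-afternoon
ammonia = 30 + 15 * np.sin((hours - 11) * np.pi / 12)  # mg N/L, roughly in phase

# Flow-weighted composite: the one number the daily lab sample reports
composite = np.sum(flow * ammonia) / np.sum(flow)

print(f"Composite sample:  {composite:.1f} mg N/L")
print(f"Continuous range:  {ammonia.min():.1f} - {ammonia.max():.1f} mg N/L")
# The single composite value hides a roughly threefold swing over the day --
# exactly the dynamics the plant actually has to cope with.
```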
4. Recognize different timescales
Hand-in-hand with dynamics is the need to think about different timescales (see the sketch after this list):
- Diurnal (daily) variations
- Weekly trends (especially weekend versus weekday differences)
- Seasonal shifts
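If you're logging a signal continuously, separating these timescales is straightforward. Here's a minimal sketch using pandas; the file name and column name are hypothetical stand-ins for whatever your historian exports:

```python
import pandas as pd

# Assumes a CSV of timestamped flow data -- file and column names are hypothetical
df = pd.read_csv("influent_flow.csv", parse_dates=["timestamp"], index_col="timestamp")

diurnal = df["flow"].groupby(df.index.hour).mean()       # average daily pattern
weekly = df["flow"].groupby(df.index.dayofweek).mean()   # Mon=0 ... Sun=6
seasonal = df["flow"].groupby(df.index.month).mean()     # month-to-month shifts

print(diurnal, weekly, seasonal, sep="\n\n")
```

Looking at each of these profiles separately tells you far more than a single long-term average ever will.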
5. Consider how to handle outliers and extraordinary events
This blog is getting long, so I'll try to wrap up this 5th key quickly! In data analytics it's common practice to identify and eliminate outliers, on the assumption that they're either "bad" measurements or atypical and can therefore be ignored. However, thinking back to my involvement in the WERF projects on DSS, a lot of what we do at wastewater treatment plants is try to keep the process stable in response to abnormal events such as upsets from shock loads or toxins, or, more typically, wet weather. This means we need to identify these "outliers", but rather than throw them away, we need to decide how to respond. Maybe this is a topic for another blog?!!
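In the meantime, here's a minimal sketch in Python of the flag-but-keep idea, using a rolling median and MAD; the window length and threshold are made up for illustration:

```python
import pandas as pd

def flag_events(series, window=48, k=5.0):
    """Flag points that sit far from a rolling median, but keep them
    for review instead of deleting them. Window and threshold are
    illustrative, not tuned values."""
    med = series.rolling(window, center=True, min_periods=1).median()
    mad = (series - med).abs().rolling(window, center=True, min_periods=1).median()
    # 1.4826 * MAD approximates a standard deviation for normal-ish data
    score = (series - med).abs() / (1.4826 * mad + 1e-9)
    return pd.DataFrame({"value": series, "event_flag": score > k})
```

However you flag them, the point stands: hang on to the extraordinary events, because they're often the most informative data you have.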