Written by Noémie Prévost, M.Sc | 2023
Identifying times when water is unsafe for recreation, for drinking, or for aquatic life is a major challenge. Traditionally, sampling has been the preferred means of determining whether water is safe, but, as we discussed in a previous blog post, a drawback of this approach is that the time lag between taking the sample and receiving the results is so long that the situation has often changed in the interim. To overcome this problem, predictive modeling based on artificial intelligence (AI) is becoming increasingly popular, and it is the approach we used to design our InteliSwim product.
The Safety of Forecasting Models
As stated in the Montréal Declaration for a Responsible Development of Artificial Intelligence, it is essential that any AI algorithm be reliable. The reliability of a water quality prediction model must therefore be assessed before it is put to use. Since each application of a model is different, it is necessary to compare different methods and evaluate their accuracy as objectively as possible.
In Quebec, under the Environnement-Plage program, beach water is considered safe for direct contact aquatic activities when the E. coli concentration is less than 200 CFU/100 ml. When the safety status of the water is determined using such a threshold, the result is a binary classification: an open or closed beach. In these cases, the AI model can be evaluated using a confusion matrix. The confusion matrix compares what is predicted by the model to a reference value, i.e. the concentrations observed during the sampling tests in the laboratory.
Confusion Matrix
Both the observation (sampling) and the prediction (AI model) can be either below or above the threshold, producing a two-by-two array. Cases are generally defined as positive or negative, a terminology borrowed from the field of diagnostic testing. A positive case means that the point is above the threshold and a negative case means that it is below the threshold.
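As a minimal illustration, the sketch below builds such a matrix in Python with scikit-learn. The concentration values and variable names are hypothetical, chosen only for the example; the 200 CFU/100 ml threshold is the one mentioned above. The four cells of the matrix are the true positives (TP), false negatives (FN), false positives (FP) and true negatives (TN).

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical observed (laboratory) and predicted E. coli concentrations,
# in CFU/100 ml, for a handful of sampling days.
observed = np.array([50, 320, 180, 75, 410, 120, 260, 90])
predicted = np.array([60, 290, 210, 70, 150, 130, 240, 85])

THRESHOLD = 200  # CFU/100 ml, the Environnement-Plage threshold

# Binary classes: 1 = positive (threshold exceeded), 0 = negative.
y_true = (observed >= THRESHOLD).astype(int)
y_pred = (predicted >= THRESHOLD).astype(int)

# Rows of the matrix are the observed classes, columns the predicted ones.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(f"TP={tp}  FN={fn}  FP={fp}  TN={tn}")
```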
Now, based on these four counts, the following metrics are constructed (a short sketch computing them follows the list):
- The sensitivity, or true positive rate (TPR), is the frequency at which the model detects a positive case when it occurs: TPR = TP / P
- The specificity, or true negative rate (TNR), is the frequency at which the model detects a negative case when it occurs: TNR = TN / N
- The miss rate, or false negative rate (FNR), is the frequency at which the model does not detect a positive case when it occurs: FNR = FN / P
- The fall-out, or false positive rate (FPR), is the frequency at which the model does not detect a negative case when it occurs: FPR = FP / N
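Continuing the sketch above (same hypothetical counts), these four rates can be computed directly:

```python
# P and N are the total numbers of observed positive and negative cases.
P = tp + fn
N = tn + fp

tpr = tp / P  # sensitivity: positive cases correctly detected
tnr = tn / N  # specificity: negative cases correctly detected
fnr = fn / P  # miss rate: positive cases the model failed to detect
fpr = fp / N  # fall-out: negative cases wrongly flagged as positive
print(f"TPR={tpr:.2f}  TNR={tnr:.2f}  FNR={fnr:.2f}  FPR={fpr:.2f}")
```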
Metrics for Model Performance Evaluation
Two metrics that are often considered when evaluating the performance of classification models are sensitivity and specificity. The objective is to maximize these values, i.e. we choose the model for which they are as large as possible. Equivalently, we can minimize the miss rate and the fall-out, since each is the complement of one of the two:
TP + FN = P and FP + TN = N, i.e., TPR + FNR = FPR + TNR = 1
That being said, many machine learning algorithms require that a single metric be optimized during model selection. A popular single metric in this situation is accuracy (ACC), the measure of how often the model gives the correct answer over the whole data set:
ACC = (TP + TN) / (P + N)
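In the same hypothetical sketch, accuracy is simply the fraction of correct predictions over all cases:

```python
acc = (tp + tn) / (P + N)  # correct predictions over the whole data set
print(f"ACC={acc:.2f}")
```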
The Limits of Accuracy
Accuracy can be misleading, especially if the classes are unbalanced, or if the impact of a false positive and a false negative is very different. Let’s examine what this means.
Class Imbalance
Class imbalance occurs when the observation rate of the event under study is not around 50% — in other words, when the event under study is either very frequent or very rare. In the confusion matrix, this means that the number of negative and positive cases is very different.
In these situations, when interpreting the indicator, it is important to take into account the rate of positive cases, i.e. the rate of threshold exceedances in the water quality assessment. Indeed, if positive cases are rare, a model that never predicts a single positive case can still achieve good accuracy.
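For illustration, with hypothetical numbers: if the threshold is exceeded on only 5 of 100 sampled days, a model that systematically predicts "below threshold" is correct on the other 95 days, so ACC = 95/100 = 0.95, even though its sensitivity is TPR = 0/5 = 0 and it would never close the beach when it should.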
The Relative Impact of False Positives and Negatives
As for the relative impact of the two types of errors, let's see what this means in our context. For the user, a missed day at the beach represents a loss of quality of life that can be seen as a forgone ecosystem service: enjoying water-based recreational activities is good for physical and mental health, reduces stress, and so on. On the other hand, engaging in aquatic activities under unsafe conditions represents a health risk that can lead to discomfort, loss of productivity, or sometimes even more serious consequences. From the manager's point of view, a closed beach represents an economic loss, since some direct or indirect revenues will not be collected.
The Manager’s Role
The manager must therefore ask themselves the question: is it better to open a beach when it is not safe, or to close a beach when it is safe? And, in each case, what is the real relative impact on the organization's reputation, on the economy, on public health, and so on? In other words, how often would you be willing to close the beach when it is safe in order to avoid keeping it open on a day when it is not? Let's look at how we can try to answer these questions rationally.
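One way to frame the trade-off rationally is to assign a relative cost to each type of error and to select the model, or the decision threshold, that minimizes the total expected cost. The sketch below is a minimal illustration of that idea; the cost weights and the error counts are assumptions chosen for the example, not values recommended by any regulation or used in InteliSwim.

```python
def total_cost(fp, fn, cost_fp=1.0, cost_fn=5.0):
    """Cost of a model's errors, given the relative weight of each error type.

    cost_fn > cost_fp encodes the judgment that keeping a beach open on an
    unsafe day (false negative) is worse than closing it on a safe day
    (false positive). Both weights are hypothetical.
    """
    return cost_fp * fp + cost_fn * fn

# Hypothetical error counts for two candidate models over one season.
model_a = {"fp": 12, "fn": 1}  # cautious: closes the beach more often
model_b = {"fp": 3, "fn": 4}   # permissive: rarely closes the beach

for name, m in (("model A", model_a), ("model B", model_b)):
    print(name, total_cost(m["fp"], m["fn"]))
```

Under these weights the cautious model comes out ahead (17 versus 23); with equal weights the ranking flips, which is precisely the kind of judgment call the manager has to make explicit.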
It should be noted that this is a relatively new consideration for decision makers, arising from the more widespread use of predictive models. Under current regulations, beaches are closed when a laboratory result is positive for contamination and they remain closed until a retest shows a negative result. There is no room for interpretation in this context, as it is simply a matter of determining whether the observed value exceeds the regulatory threshold. In the case of the use of a predictive model, there is, to our knowledge, no legislative framework, and it is the beach operators who must exercise their judgment to inform the model developers of the relative impacts of potential model errors.
Now, how would regulatory institutions go about answering these questions? One can argue that there is no universal answer, because of the subjectivity involved. If you asked a group of people how many days at the beach they would be willing to sacrifice to avoid unsafe swimming, their answers would vary because it depends on their relationship to risk and their personal preferences. It is therefore difficult, at this point, to predict which position they would take.
Furthermore, one thing to note about the confusion matrix is that it is an incomplete picture of reality, as the reference value is often also the result of a test, which means it is also subject to error. Indeed, when the value of a parameter is known, it is not necessary to run a model to predict it. The reference value is therefore considered to be the most reliable test. In our case, this reference value is the result of laboratory tests, and when we compare the results obtained by two different laboratories, they may differ significantly.
In conclusion, the risk of illness is directly proportional to the quantity of contaminated water ingested. Simple mitigation measures are easy to suggest and implement, such as keeping the mouth closed while swimming. For people with weakened immune systems, keeping the head entirely out of the water could be a reasonable compromise between summer fun and safety. Informing the public of these simple measures could easily improve public safety at no extra cost.
In this article, we have looked at how artificial intelligence models can be used to predict water quality, at the methods for determining the accuracy of a model, and at a manager's responsibility in using these models.