This is an analysis of forecasts made by participants of the UK COVID-19 Crowd Forecasting Challenge. Over the course of 13 weeks (from May 24 2021 to August 16 2021) participants submitted forecasts using the crowdforecastr prediction platform.
These forecasts were aggregated by calculating the median prediction of all forecasts. These aggregated forecasts (later denoted as “epiforecasts-EpiExpert” or “Median ensemble”) were submitted to the European Forecast Hub.
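As a rough sketch of the aggregation step (not the actual submission pipeline), the Python snippet below computes a quantile-wise median ensemble, assuming forecasts are stored as predictive quantiles in a long-format table; the data frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format forecast data: one row per forecaster,
# target, horizon and quantile level (column names are illustrative).
forecasts = pd.DataFrame({
    "forecaster": ["A", "B", "C", "A", "B", "C"],
    "target": ["cases"] * 6,
    "horizon": [1] * 6,
    "quantile": [0.5, 0.5, 0.5, 0.975, 0.975, 0.975],
    "prediction": [30000, 35000, 32000, 45000, 60000, 50000],
})

# Quantile-wise median ensemble: take the median of the submitted
# predictions separately for every quantile level of every target.
ensemble = (
    forecasts
    .groupby(["target", "horizon", "quantile"], as_index=False)["prediction"]
    .median()
)
print(ensemble)
```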
Participants were asked to make one to four week ahead predictions of the weekly number of reported cases and deaths from COVID-19 in the UK. Figure 1.1 shows a visualisation of daily and weekly observed cases and deaths.
Figure 2.1 shows the median predictions of all participants as well as the crowd ensemble. We can see that there is considerable disagreement between individual forecasters on most forecast dates.
Figure 2.2 compares median predictions from the crowd ensemble against the Hub ensemble. The two ensembles are in closer agreement than the individual crowd forecasters.
Figure 2.3 illustrates the uncertainty around the predictions made by the crowd ensemble. For all two week ahead forecasts, the 50% (darker) and 95% (lighter) prediction intervals are shown. Observed values regularly fall outside these prediction intervals.
Forecasts were evaluated using the weighted interval score (WIS). This score is negatively oriented, meaning that a lower score is better. Intuitively, the WIS can be thought of as a penalty for how far, and how confidently, a forecast misses the observed value.
The weighted interval score is the sum of three components (i.e. three different types of penalties): over-prediction, under-prediction and dispersion. Over-prediction and under-prediction penalties occur if the true observed value falls outside the range of values deemed plausible by a forecast. If a forecast is very uncertain, the range of plausible values is larger and the forecast is less likely to incur over- or under-prediction penalties. The dispersion term, on the other hand, penalises a forecast for being overly uncertain.
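For concreteness, the interval score of a single central \((1-\alpha)\) prediction interval \([l, u]\) given an observation \(y\) is \((u - l) + \frac{2}{\alpha}(l - y)\,\mathbf{1}(y < l) + \frac{2}{\alpha}(y - u)\,\mathbf{1}(y > u)\): the first term is the dispersion penalty, the second the over-prediction penalty and the third the under-prediction penalty. The WIS combines these interval scores (weighted by \(\alpha/2\)) with the absolute error of the median. The Python sketch below implements this standard formulation; it is only illustrative and not the exact scoring code used for this analysis, and the example numbers are made up.

```python
def interval_score(lower, upper, observed, alpha):
    """Interval score of a central (1 - alpha) prediction interval,
    written as the sum of its three penalty components."""
    dispersion = upper - lower
    # forecast too high: observation falls below the interval
    overprediction = (2 / alpha) * max(lower - observed, 0)
    # forecast too low: observation falls above the interval
    underprediction = (2 / alpha) * max(observed - upper, 0)
    return dispersion + overprediction + underprediction


def weighted_interval_score(quantiles, observed):
    """WIS from a dict mapping quantile level -> predicted value, using
    weights alpha_k / 2 for each interval and 1/2 for the median term."""
    lower_levels = sorted(q for q in quantiles if q < 0.5)
    total = 0.5 * abs(observed - quantiles[0.5])  # absolute error of the median
    for q in lower_levels:
        alpha = 2 * q  # e.g. q = 0.025 -> 95% central interval
        upper = quantiles[round(1 - q, 3)]  # round to match the quantile key
        total += (alpha / 2) * interval_score(quantiles[q], upper, observed, alpha)
    return total / (len(lower_levels) + 0.5)


# Illustrative forecast: median 30000, 50% interval [25000, 36000],
# 95% interval [18000, 50000]. The observed value 48000 lies inside the
# 95% but above the 50% interval, so an under-prediction penalty applies.
example = {0.025: 18000, 0.25: 25000, 0.5: 30000, 0.75: 36000, 0.975: 50000}
print(weighted_interval_score(example, observed=48000))
```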
To make forecasts of deaths and reported cases more comparable, we took the logarithm of all forecasts as well as of the observed (“ground truth”) values and then calculated the weighted interval score on these log-transformed quantities.
This is different from the methodology used by the European Forecast Hub, which does not take the logarithm of forecasts and observed values. Taking the logarithm means that forecasts are scored in relative rather than absolute terms. On the natural scale, what matters is whether a forecast is off by, say, 10 or by 1000; on the logarithmic scale, what matters is whether it is off by 5% or by 10%, regardless of the absolute values. This arguably suits epidemic data, where infections grow and decline exponentially. It also allowed us to combine death forecasts and case forecasts and compute a single score to rank forecasters.
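The effect is easiest to see with a small, made-up numerical example: a case forecast that is 4000 off and a death forecast that is 8 off look very different on the natural scale, but both correspond to roughly a 10% error on the log scale.

```python
import numpy as np

# Made-up illustrative values: both forecasts are 10% below the observation.
observed = {"cases": 40000, "deaths": 80}
predicted = {"cases": 36000, "deaths": 72}

for target in observed:
    abs_error = abs(predicted[target] - observed[target])
    log_error = abs(np.log(predicted[target]) - np.log(observed[target]))
    print(f"{target}: absolute error {abs_error}, log-scale error {log_error:.3f}")

# cases:  absolute error 4000, log-scale error 0.105
# deaths: absolute error 8,    log-scale error 0.105
```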
If a forecaster did not submit a forecast for a given forecast date, they were assigned the median score of all participants who submitted a forecast on that day.
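A minimal sketch of this imputation step, assuming scores are stored in a long table with one row per forecaster and forecast date (the column names and numbers are illustrative):

```python
import pandas as pd

# Hypothetical long score table (column names are illustrative).
scores = pd.DataFrame({
    "forecaster": ["A", "A", "B", "C"],
    "forecast_date": ["2021-05-24", "2021-05-31", "2021-05-24", "2021-05-24"],
    "score": [5.0, 6.0, 4.0, 9.0],
})

# Pivot to a forecaster x forecast_date grid; missing submissions become NaN.
grid = scores.pivot(index="forecaster", columns="forecast_date", values="score")

# Replace each missing score with the median score of the forecasters
# who did submit on that forecast date (column-wise median).
grid = grid.fillna(grid.median())
print(grid)
```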
Table 3.1 shows the overall leaderboard with scores summarised over different forecast dates, targets and horizons. To determine the winner of the forecasting competition, scores were averaged to obtain a single performance metric. However, this single number hides considerable variation in performance, as the following sections will show. Averaging across different forecast horizons also deviates from current Forecast Hub practice, as such an average is difficult to interpret meaningfully.
Ranking | Forecaster | Score (mean WIS, log scale) |
---|---|---|
1 | anonymous_Stingray | 4.94 |
2 | seb | 6.41 |
3 | aen | 6.56 |
4 | Trebuchet01 | 6.68 |
5 | habakuk (Rt) | 6.70 |
6 | Gw3n | 6.75 |
7 | aurelwu | 6.77 |
8 | Cantabulous | 6.78 |
9 | seb (Rt) | 6.81 |
10 | olane (Rt) | 6.82 |
Forecasts from individual participants were aggregated using a median. As can be seen in Figure 3.1, the median ensemble generally tended to perform better than the majority of individual forecasters (especially for death forecasts).
Another interesting question is whether individual participants were able to beat the ensemble of all participant forecasts. We can see in Figure 3.2 that the top five forecasters often, but not always, performed better than the ensemble. Especially for deaths, forecasters seem to have struggled to beat the ensemble consistently.
A related question is whether individual participants were able to beat the median forecaster on each forecast date. Interestingly, even the top forecasters were not consistently better than the median forecaster, as shown in Figure 3.3.
Figure 3.4 shows how well different forecasters did as a function of the number of forecasts they submitted. Overall, there does not seem to be a strong effect. More regular forecasters may have done slightly better on deaths in this competition, but the results are inconclusive.
We are also interested in whether self-identified ‘experts’ performed better than ‘non-experts’. As Figure 3.5 suggests, experts do not have a clear edge over non-experts; on the contrary, in this competition experts seemed to perform slightly worse than non-experts. Note that the distinction between ‘expert’ and ‘non-expert’ is by no means clear-cut: there are no fixed criteria, and participants were asked to choose whichever label they felt was most appropriate.
Let’s have a look at how the crowd forecast ensemble performed compared to the European Forecast Hub ensemble and baseline models.
Table 3.2 includes the numeric scores achieved by the crowd forecast ensemble, the Hub-ensemble and the Hub-baseline model.
Model | Target type | Score (WIS, log scale) |
---|---|---|
epiforecasts-EpiExpert | Cases | 3.72 |
EuroCOVIDhub-ensemble | Cases | 3.82 |
EuroCOVIDhub-baseline | Cases | 6.62 |
epiforecasts-EpiExpert | Deaths | 1.84 |
EuroCOVIDhub-ensemble | Deaths | 1.96 |
EuroCOVIDhub-baseline | Deaths | 10.97 |
Participants could submit two different types of forecasts. One was a direct forecast of cases and deaths; the median ensemble of these direct forecasts is called “EpiExpert_direct”. The other was a forecast of \(R_t\), the effective reproduction number. This \(R_t\) forecast was then mapped to cases and deaths using the so-called renewal equation, which models future cases as a weighted sum of past cases multiplied by \(R_t\). The median ensemble that uses only these forecasts is called “EpiExpert_Rt”. The “EpiExpert” ensemble is a median ensemble that uses both direct and \(R_t\) forecasts.
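To make the mapping from \(R_t\) to cases concrete, the sketch below applies the renewal equation with a made-up generation interval. It is not the platform's actual model, which also has to deal with reporting delays and the link from cases to deaths, but it shows the core recursion: each new value is \(R_t\) times a weighted sum of recent values.

```python
import numpy as np

def renewal_projection(past_cases, rt_path, generation_interval):
    """Project future cases from an R_t path using the renewal equation:
    each new value is R_t times a weighted sum of recent cases, with the
    generation interval distribution providing the weights. All inputs
    here are illustrative, not the platform's actual implementation."""
    cases = list(past_cases)
    w = np.asarray(generation_interval, dtype=float)
    w = w / w.sum()                                  # normalise the weights
    for rt in rt_path:
        recent = np.asarray(cases[-len(w):])[::-1]   # most recent value first
        cases.append(rt * float(np.dot(w, recent)))
    return cases[len(past_cases):]

# Example: roughly constant recent incidence and R_t slightly above 1.
past = [30000, 31000, 32000, 33000, 34000, 35000, 36000]
gi = [0.1, 0.2, 0.3, 0.25, 0.15]                     # assumed generation interval weights
print(renewal_projection(past, rt_path=[1.1] * 7, generation_interval=gi))
```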
Figure 3.8 shows scores for these three ensemble types over time.
Summarised scores for the different versions of the crowd forecast ensemble are given in Table 3.3.
Model | Target type | Score (WIS, log scale) |
---|---|---|
epiforecasts-EpiExpert | Cases | 3.72 |
epiforecasts-EpiExpert_Rt | Cases | 3.85 |
epiforecasts-EpiExpert_direct | Cases | 3.87 |
epiforecasts-EpiExpert | Deaths | 1.84 |
epiforecasts-EpiExpert_direct | Deaths | 2.12 |
epiforecasts-EpiExpert_Rt | Deaths | 3.19 |
Table 4.1 shows the ten most active forecasters. The average number of forecasts per participant was 2.74, while most participants dropped out after their first forecast (Table 4.2). Only two participants submitted a forecast for all thirteen forecast dates.
Forecaster | N forecasts |
---|---|
anonymous_Stingray | 13 |
seabbs | 13 |
seabbs (Rt) | 12 |
2e10e122 | 10 |
BQuilty | 10 |
aurelwu | 8 |
RitwikP | 8 |
seb | 8 |
seb (Rt) | 8 |
Sophia | 8 |
Max | Min | Mean | Median |
---|---|---|---|
13 | 1 | 2.74 | 1 |
On average, 21.9 forecasts were submitted each week, with a minimum of 10 and a maximum of 57 (Table 4.3). The distribution of the number of forecasters over time is shown in Figure 4.1.
Max | Min | Mean | Median |
---|---|---|---|
57 | 10 | 21.92 | 21 |
Rankings between different models change depending on how forecasts are evaluated. For example, results change if forecasts are scored on the absolute (natural) scale rather than the logarithmic scale. On the natural scale, it matters how far off a forecast is in absolute rather than relative terms, which means that scores depend on the order of magnitude of the target being forecast. For example, it is very easy to miss reported cases by 1000, whereas errors for deaths will typically be in the tens or hundreds.
If scored on the natural scale, the crowd forecasts no longer outperform the Hub ensemble, as can be seen in Figure 5.1 and in Table 5.1. Crowd forecasts performed worse than the Hub ensemble around the peak of cases in July, when case numbers were highest; errors at high case counts incur a much larger penalty on the natural scale than on the log scale.
Model | Target type | Score (WIS, natural scale) |
---|---|---|
EuroCOVIDhub-baseline | Cases | 616337.71 |
EuroCOVIDhub-ensemble | Cases | 645441.59 |
epiforecasts-EpiExpert | Cases | 742296.85 |
EuroCOVIDhub-ensemble | Deaths | 686.58 |
epiforecasts-EpiExpert | Deaths | 741.42 |
EuroCOVIDhub-baseline | Deaths | 1743.88 |