This is an analysis of forecasts made by participants of the UK COVID-19 Crowd Forecasting Challenge. Over the course of 13 weeks (from May 24 2021 to August 16 2021) participants submitted forecasts using the crowdforecastr prediction platform.
These forecasts were aggregated by calculating the median prediction of all forecasts. These aggregated forecasts (later denoted as “epiforecasts-EpiExpert” or “Median ensemble”) were submitted to the European Forecast Hub.
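As a rough sketch of the aggregation step (not the actual submission pipeline), the Python snippet below computes a quantile-wise median ensemble, assuming forecasts are stored as predictive quantiles in a long-format table; the data frame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical long-format forecast data: one row per forecaster,
# target, horizon and quantile level (column names are illustrative).
forecasts = pd.DataFrame({
    "forecaster": ["A", "B", "C", "A", "B", "C"],
    "target": ["cases"] * 6,
    "horizon": [1] * 6,
    "quantile": [0.5, 0.5, 0.5, 0.975, 0.975, 0.975],
    "prediction": [30000, 35000, 32000, 45000, 60000, 50000],
})

# Quantile-wise median ensemble: take the median of the submitted
# predictions separately for every quantile level of every target.
ensemble = (
    forecasts
    .groupby(["target", "horizon", "quantile"], as_index=False)["prediction"]
    .median()
)
print(ensemble)
```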
Participants were asked to make one to four week ahead predictions of the weekly number of reported cases and deaths from COVID-19 in the UK. Figure 1.1 shows a visualisation of daily and weekly observed cases and deaths.
Figure 2.1 shows the median predictions of all participants as well as the crowd ensemble. We can see that there is considerable disagreement between individual forecasters on most forecast dates.
Figure 2.2 compares median predictions from the crowd ensemble against the Hub ensemble. The two ensembles are in closer agreement than the individual crowd forecasters.
Figure 2.3 illustrates the uncertainty around the predictions made by the crowd ensemble. For all two week ahead forecasts, the 50% (darker) and 95% (lighter) prediction intervals are shown. Observed values regularly fall outside these prediction intervals.
Forecasts were evaluated using the weighted interval score (WIS). This score is negatively oriented, meaning that a lower score is better. Intuitively, the WIS can be thought of as a penalty for how far, and how confidently, a forecast misses the observed value.
The weighted interval score is the sum of three components (i.e. three different types of penalties): over-prediction, under-prediction and dispersion. Over-prediction and under-prediction penalties occur if the true observed value falls outside the range of values deemed plausible by a forecast. If a forecast is very uncertain, the range of plausible values is larger and the forecast is less likely to incur over- or under-prediction penalties. The dispersion term, on the other hand, penalises a forecast for being overly uncertain.
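For concreteness, the interval score of a single central \((1-\alpha)\) prediction interval \([l, u]\) given an observation \(y\) is \((u - l) + \frac{2}{\alpha}(l - y)\,\mathbf{1}(y < l) + \frac{2}{\alpha}(y - u)\,\mathbf{1}(y > u)\): the first term is the dispersion penalty, the second the over-prediction penalty and the third the under-prediction penalty. The WIS combines these interval scores (weighted by \(\alpha/2\)) with the absolute error of the median. The Python sketch below implements this standard formulation; it is only illustrative and not the exact scoring code used for this analysis, and the example numbers are made up.

```python
def interval_score(lower, upper, observed, alpha):
    """Interval score of a central (1 - alpha) prediction interval,
    written as the sum of its three penalty components."""
    dispersion = upper - lower
    # forecast too high: observation falls below the interval
    overprediction = (2 / alpha) * max(lower - observed, 0)
    # forecast too low: observation falls above the interval
    underprediction = (2 / alpha) * max(observed - upper, 0)
    return dispersion + overprediction + underprediction


def weighted_interval_score(quantiles, observed):
    """WIS from a dict mapping quantile level -> predicted value, using
    weights alpha_k / 2 for each interval and 1/2 for the median term."""
    lower_levels = sorted(q for q in quantiles if q < 0.5)
    total = 0.5 * abs(observed - quantiles[0.5])  # absolute error of the median
    for q in lower_levels:
        alpha = 2 * q  # e.g. q = 0.025 -> 95% central interval
        upper = quantiles[round(1 - q, 3)]  # round to match the quantile key
        total += (alpha / 2) * interval_score(quantiles[q], upper, observed, alpha)
    return total / (len(lower_levels) + 0.5)


# Illustrative forecast: median 30000, 50% interval [25000, 36000],
# 95% interval [18000, 50000]. The observed value 48000 lies inside the
# 95% but above the 50% interval, so an under-prediction penalty applies.
example = {0.025: 18000, 0.25: 25000, 0.5: 30000, 0.75: 36000, 0.975: 50000}
print(weighted_interval_score(example, observed=48000))
```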
To make forecasts of deaths and reported cases more comparable, we took the logarithm of all forecasts as well as of the observed (“ground truth”) values and then calculated the weighted interval score on these log-transformed quantities.
This is different from the methodology used by the European Forecast Hub, which does not take the logarithm of forecasts and observed values. Taking the logarithm means that forecasts are scored in relative rather than absolute terms. On the natural scale, what matters is whether a forecast is off by, say, 10 or by 1000; on the logarithmic scale, what matters is whether it is off by 5% or by 10%, regardless of the absolute values. This arguably suits epidemic data, where infections grow and decline exponentially. It also allowed us to combine death forecasts and case forecasts and compute a single score to rank forecasters.
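The effect is easiest to see with a small, made-up numerical example: a case forecast that is 4000 off and a death forecast that is 8 off look very different on the natural scale, but both correspond to roughly a 10% error on the log scale.

```python
import numpy as np

# Made-up illustrative values: both forecasts are 10% below the observation.
observed = {"cases": 40000, "deaths": 80}
predicted = {"cases": 36000, "deaths": 72}

for target in observed:
    abs_error = abs(predicted[target] - observed[target])
    log_error = abs(np.log(predicted[target]) - np.log(observed[target]))
    print(f"{target}: absolute error {abs_error}, log-scale error {log_error:.3f}")

# cases:  absolute error 4000, log-scale error 0.105
# deaths: absolute error 8,    log-scale error 0.105
```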
If a forecaster did not submit a forecast for a given forecast date, they were assigned the median score of all participants who submitted a forecast on that day.
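A minimal sketch of this imputation step, assuming scores are stored in a long table with one row per forecaster and forecast date (the column names and numbers are illustrative):

```python
import pandas as pd

# Hypothetical long score table (column names are illustrative).
scores = pd.DataFrame({
    "forecaster": ["A", "A", "B", "C"],
    "forecast_date": ["2021-05-24", "2021-05-31", "2021-05-24", "2021-05-24"],
    "score": [5.0, 6.0, 4.0, 9.0],
})

# Pivot to a forecaster x forecast_date grid; missing submissions become NaN.
grid = scores.pivot(index="forecaster", columns="forecast_date", values="score")

# Replace each missing score with the median score of the forecasters
# who did submit on that forecast date (column-wise median).
grid = grid.fillna(grid.median())
print(grid)
```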
Table 3.1 shows the overall leaderboard with scores summarised over different forecast dates, targets and horizons. To determine the winner of the forecasting competition, scores were averaged to obtain a single performance metric. However, this single number hides considerable variation in performance, as the following sections will show. Averaging across different forecast horizons also deviates from current Forecast Hub practice, as such an average is difficult to interpret meaningfully.
Ranking | Forecaster | Score (mean WIS, log scale) |
---|---|---|
1 | anonymous_Stingray | 4.94 |
2 | seb | 6.41 |
3 | aen | 6.56 |
4 | Trebuchet01 | 6.68 |
5 | habakuk (Rt) | 6.70 |
6 | Gw3n | 6.75 |
7 | aurelwu | 6.77 |
8 | Cantabulous | 6.78 |
9 | seb (Rt) | 6.81 |
10 | olane (Rt) | 6.82 |
Forecasts from individual participants were aggregated using a median. As can be seen in Figure 3.1, the median ensemble generally tended to perform better than the majority of individual forecasters (especially for death forecasts).
Another interesting question is whether individual participants were able to beat the ensemble of all participant forecasts. We can see in Figure 3.2 that the top five forecasters often, but not always, performed better than the ensemble. Especially for deaths, forecasters seem to have struggled to beat the ensemble consistently.
A related question is whether individual participants were able to beat the median forecaster on each forecast date. Interestingly, even the top forecasters were not consistently better than the median forecaster, as shown in Figure 3.3.
Figure 3.4 shows how well different forecasters did as a function of the number of forecasts they submitted. Overall, there does not seem to be a strong effect. More regular forecasters may have done slightly better on deaths in this competition, but the results are inconclusive.
We are also interested in whether self-identified ‘experts’ performed better than ‘non-experts’. As Figure 3.5 suggests, experts do not have a clear edge over non-experts; on the contrary, in this competition experts seemed to perform slightly worse than non-experts. Note that the distinction between ‘expert’ and ‘non-expert’ is by no means clear-cut: there are no fixed criteria, and participants were asked to choose whichever label they felt was most appropriate.
Let’s have a look at how the crowd forecast ensemble performed compared to the European Forecast Hub ensemble and baseline models.
Table 3.2 includes the numeric scores achieved by the crowd forecast ensemble, the Hub-ensemble and the Hub-baseline model.
Model | Target type | Score (WIS, log scale) |
---|---|---|
epiforecasts-EpiExpert | Cases | 3.72 |
EuroCOVIDhub-ensemble | Cases | 3.82 |
EuroCOVIDhub-baseline | Cases | 6.62 |
epiforecasts-EpiExpert | Deaths | 1.84 |
EuroCOVIDhub-ensemble | Deaths | 1.96 |
EuroCOVIDhub-baseline | Deaths | 10.97 |
Participants could submit two different types of forecasts. One was a direct forecast of cases and deaths; the median ensemble of these direct forecasts is called “EpiExpert_direct”. The other was a forecast of \(R_t\), the effective reproduction number. This \(R_t\) forecast was then mapped to cases and deaths using the so-called renewal equation, which models future cases as a weighted sum of past cases multiplied by \(R_t\). The median ensemble that uses only these forecasts is called “EpiExpert_Rt”. The “EpiExpert” ensemble is a median ensemble that uses both direct and \(R_t\) forecasts.
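To make the mapping from \(R_t\) to cases concrete, the sketch below applies the renewal equation with a made-up generation interval. It is not the platform's actual model, which also has to deal with reporting delays and the link from cases to deaths, but it shows the core recursion: each new value is \(R_t\) times a weighted sum of recent values.

```python
import numpy as np

def renewal_projection(past_cases, rt_path, generation_interval):
    """Project future cases from an R_t path using the renewal equation:
    each new value is R_t times a weighted sum of recent cases, with the
    generation interval distribution providing the weights. All inputs
    here are illustrative, not the platform's actual implementation."""
    cases = list(past_cases)
    w = np.asarray(generation_interval, dtype=float)
    w = w / w.sum()                                  # normalise the weights
    for rt in rt_path:
        recent = np.asarray(cases[-len(w):])[::-1]   # most recent value first
        cases.append(rt * float(np.dot(w, recent)))
    return cases[len(past_cases):]

# Example: roughly constant recent incidence and R_t slightly above 1.
past = [30000, 31000, 32000, 33000, 34000, 35000, 36000]
gi = [0.1, 0.2, 0.3, 0.25, 0.15]                     # assumed generation interval weights
print(renewal_projection(past, rt_path=[1.1] * 7, generation_interval=gi))
```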
Figure 3.8 shows scores for these three ensemble types over time.
Summarised scores for the different versions of the crowd forecast ensemble are given in Table 3.3.
Model | Target type | Score (WIS, log scale) |
---|---|---|
epiforecasts-EpiExpert | Cases | 3.72 |
epiforecasts-EpiExpert_Rt | Cases | 3.85 |
epiforecasts-EpiExpert_direct | Cases | 3.87 |
epiforecasts-EpiExpert | Deaths | 1.84 |
epiforecasts-EpiExpert_direct | Deaths | 2.12 |
epiforecasts-EpiExpert_Rt | Deaths | 3.19 |
Table 4.1 shows the ten most active forecasters. The average number of forecasts per participant was 2.74, while most participants dropped out after their first forecast (Table 4.2). Only two participants submitted a forecast for all thirteen forecast dates.
Forecaster | N forecasts |
---|---|
anonymous_Stingray | 13 |
seabbs | 13 |
seabbs (Rt) | 12 |
2e10e122 | 10 |
BQuilty | 10 |
aurelwu | 8 |
RitwikP | 8 |
seb | 8 |
seb (Rt) | 8 |
Sophia | 8 |
Max | Min | Mean | Median |
---|---|---|---|
13 | 1 | 2.74 | 1 |
On average, 21.9 forecasts were submitted each week, with a minimum of 10 and a maximum of 57 (Table 4.3). The distribution of the number of forecasters over time is shown in Figure 4.1.
Max | Min | Mean | Median |
---|---|---|---|
57 | 10 | 21.92 | 21 |
Rankings between different models change depending on how forecasts are evaluated. For example, results change if forecasts are scored on the absolute (natural) scale rather than the logarithmic scale. On the natural scale, it matters how far off a forecast is in absolute rather than relative terms, which means that scores depend on the order of magnitude of the target being forecast. For example, it is very easy to miss reported cases by 1000, whereas errors for deaths will typically be in the tens or hundreds.
If scored on the natural scale, the crowd forecasts no longer outperform the Hub ensemble, as can be seen in Figure 5.1 and in Table 5.1. Crowd forecasts performed worse than the Hub ensemble around the peak of cases in July, when case numbers were highest; errors at high case counts incur a much larger penalty on the natural scale than on the log scale.
Model | Target type | Score (WIS, natural scale) |
---|---|---|
EuroCOVIDhub-baseline | Cases | 616337.71 |
EuroCOVIDhub-ensemble | Cases | 645441.59 |
epiforecasts-EpiExpert | Cases | 742296.85 |
EuroCOVIDhub-ensemble | Deaths | 686.58 |
epiforecasts-EpiExpert | Deaths | 741.42 |
EuroCOVIDhub-baseline | Deaths | 1743.88 |