A couple of months ago I wondered whether the Colorado State University hurricane forecast was better than drawing a number from a hat. Well, CSU itself has the scoop in this paper, elegantly titled Extended Range Forecast of Atlantic Seasonal Hurricane Activity and Landfall Strike Probability for 2010.
The short answer: CSU does a good job – significantly better than the long-term average – but not great. I re-crunched the numbers and reached some interesting conclusions about a bias in the estimates.
The paper uses a statistic abbreviated NTC, which stands for Net Tropical Cyclone activity, a number that measures the accumulated intensity of all storms in a year. (Wonk: That number is based on the Accumulated Cyclone Energy statistic I posted about yesterday. NTC is the Accumulated Cyclone Energy normalized so that the average year has an NTC of 100. /wonk)
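That normalization can be sketched in a few lines. The ACE values below are invented for illustration (they are not the real data), and this follows the simple description above – ACE rescaled so the average year comes out at 100:

```python
# Hedged sketch: NTC as ACE rescaled so the long-term mean is 100.
# These ACE values are made up for illustration, not real storm data.
ace_by_year = {2005: 250.0, 2006: 79.0, 2007: 74.0, 2008: 146.0}

mean_ace = sum(ace_by_year.values()) / len(ace_by_year)
ntc_by_year = {year: 100.0 * ace / mean_ace for year, ace in ace_by_year.items()}

for year, ntc in sorted(ntc_by_year.items()):
    print(year, round(ntc, 1))
```

By construction, the NTC values average exactly 100, so a year with NTC 182 was about 82% more active than average.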
The CSU researchers used the June 2010 version of their model and ran a hindcast (most actuaries would call it a backtest) on the years 1950 to 2007. The results are below:
Col. 1 tells you how bad a year was. Col. 2 tells you what CSU’s model would have forecast had it been around. (The red numbers represent years where CSU’s hindcast points the wrong way – the hindcast predicts a bad year when it turned out to be a good one, or vice versa.)
Col. 3 shows how much the CSU forecast missed by. Col. 4 shows how a year differed from the long term average – remember that the long-term average is defined as 100, so Col. 4 = Col. 1 – 100.0.
Col. 5 compares the accuracy of CSU’s hindcast with the accuracy of just using the long-range average. Black numbers signify that CSU did better than the long-range average. Red numbers signify that CSU did worse.
Overall, CSU gives itself a pat on the back:
The hindcast went the right way with regards to an above- or below-average season in 42 out of 58 years (72%), while hindcast improvement over climatology occurred in 31 of 58 years (53%).
So the model predicts the correct direction most of the time but does better than the long-term average only a little more than half the time. But that understates its accuracy. Here, I’ve ordered the years from the mildest (lowest NTC) to the harshest:
The red line is the actual intensity of the year. The line slopes upward because, recall, I’ve ordered the years from mildest to harshest.
The green line is CSU’s hindcast. Notice that, though it is rarely spot on, it drifts upward as the red line does. So it’s modeling the behavior OK.
CSU researchers note that their model has an R-squared of 0.62, meaning it explains 62% of the variability from year to year. That’s not terrific, but it’s better than a random guess.
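For readers who want the mechanics: one common way to compute that R-squared is one minus the ratio of unexplained to total variance. The observed and predicted NTC values below are invented (CSU may compute the statistic slightly differently, e.g. as a squared correlation), but the shape of the calculation is standard:

```python
# Sketch of an R-squared calculation on hindcast vs. observed NTC.
# These five (observed, predicted) pairs are invented for illustration.
observed  = [60.0, 85.0, 100.0, 140.0, 210.0]
predicted = [70.0, 90.0, 105.0, 130.0, 190.0]

mean_obs = sum(observed) / len(observed)
ss_tot = sum((o - mean_obs) ** 2 for o in observed)              # total variance
ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # unexplained
r_squared = 1.0 - ss_res / ss_tot
print(round(r_squared, 3))
```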
Next, I looked at the difference between the model and the observed value:
If the model were well behaved, you would see about as many data points above zero as below. I count 31 points above zero and 27 below, so that’s pretty good.
You should also not see a run – a bunch of points in a row above or below zero. But look to the right. The last eight points are above zero. That means when there was a really bad year, the model undershot.
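Both of those sanity checks – the above/below count and the longest same-sign run – are easy to automate. The residuals below are invented for illustration:

```python
# Sketch: count residuals above and below zero, and find the longest run
# of same-sign residuals. These residual values are invented.
residuals = [5.0, -12.0, 3.0, -7.0, -2.0, 8.0, 14.0, 6.0, 9.0]

above = sum(1 for r in residuals if r > 0)
below = sum(1 for r in residuals if r < 0)

longest_run, current_run, prev_sign = 0, 0, None
for r in residuals:
    sign = r > 0
    current_run = current_run + 1 if sign == prev_sign else 1
    longest_run = max(longest_run, current_run)
    prev_sign = sign

print(above, below, longest_run)
```

Run on the real 58 residuals, this would report the 31/27 split noted above, and the trailing run of eight would show up as the longest run.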
In some ways that’s not surprising. It’s the old tail risk problem. The worst years are outliers, and it’s hard for a model to predict outliers.
Now these charts examine how well the model did after the fact. In other words, given a bad year, did the model predict it? Of course, what we really want to know is how well the model does before the fact. If the model predicts a bad year, will there be a bad year? Secondarily, will the year be as bad as the model predicts?
To examine that, I ordered the data from mildest prediction to harshest. (Recall that before I had ordered from mildest observation to harshest.)
As you might expect, the observations track the predictions pretty well, but the difference between observed and predicted doesn’t quite look random. The next chart sort of shows what I’m talking about:
This chart shows the difference between the observation and the prediction. Notice that to the left, when predictions are low, the actual observation tends to be higher than the prediction. The same phenomenon appears on the right of the chart. And there seems to be a dip in the middle.
Anyhow, I cleared this up by looking at the cumulative sum of the deviations. In other words, the lowest prediction was low by five, so the chart plots the point (1, 5). The next prediction was low by seven; since 5 (the deviation at the prior point) plus 7 equals 12, the chart plots (2, 12). The third prediction was low by 28, so the chart plots (3, 40), and so on.
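The running-sum construction looks like this in code. The first three deviations (5, 7, 28) match the worked example above; the last two pairs are invented to round out the sketch:

```python
# Sketch of the cumulative-deviation plot: sort years by the model's
# prediction, then accumulate the running sum of (observed - predicted).
# First three deviations match the text's worked example; the last two
# (prediction, observation) pairs are invented.
pairs = [(60.0, 65.0), (75.0, 82.0), (90.0, 118.0), (130.0, 120.0), (180.0, 160.0)]

pairs.sort(key=lambda p: p[0])          # mildest prediction first
cumulative, points = 0.0, []
for i, (predicted, observed) in enumerate(pairs, start=1):
    cumulative += observed - predicted  # deviation at this point
    points.append((i, cumulative))

print(points)
```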
If the deviation is randomly distributed around zero, the cumulative sum should hover around zero, as the underestimates and the overestimates roughly offset. But that’s not what happens here:
The line on the chart rises for predictions under 100. (Remember 100 is an average year.) That means that when the model predicts a good year, it tends to be too rosy in its prediction.
For predictions over 100, the line drifts down. That means that when the model predicts a bad year, it tends to be too pessimistic, at least until the predictions get dire – over 200 NTC. The little upturn at the far right implies that when the prediction is really bad, the prediction is still not bad enough.
Now be careful! This doesn’t mean that the prediction is way off base. If CSU predicts a good year, it will probably be a good year. And if it predicts a bad year, it will probably be a bad year. But the good years tend to be a little worse than predicted (though still good), and the bad years aren’t quite as bad as predicted (though still bad). And the truly awful years are even more awful than predicted.
This year, CSU forecasts an NTC of 195, which happens to be right at the border between bad and truly awful. That makes it hard to guess how this particular data point is biased. Averaging deviations just smaller and just larger than 195 shows a bias close to zero. The fact remains: a bad, bad year looms.