## School testing: An actuarial analysis, Part 2

[Ed. note: In this series, I use actuarial techniques on data behind the Nation’s Report Card, a regular testing of 9-, 13- and 17-year-olds, to argue that:

• The gap between white and black students has been declining and will continue to decline.
• Whatever gap exists is in place by the age of 9 and stubbornly resists change thereafter.

Part One is here.]

As I wrote in Part One, to learn how well test scores among 9-year-olds translate into test scores among 17-year-olds, we should be testing by cohort – testing a group of kids at 9, then testing them again and again as their education progresses. Since testing occurs at 9, 13 and 17, the tests should be conducted every four years.

Unfortunately, the testing behind the Nation’s Report Card doesn’t do that. Reading has been tested 12 times, and math has been tested 11. The testing has occurred at irregular intervals. Sometimes five years passed between tests; sometimes only two.

There’s a second concern about the data – the people being tested changed in 2004. The test was changed to accommodate Spanish-speaking students and was also administered to more special needs students. So the scores before 2004 are not directly comparable to those in 2004 and later.

However, we can work with the data to handle these problems.

First problem: The test was administered at irregular intervals. Solution: We can interpolate between years. I used a simple linear interpolation, though the analysis is not really sensitive to the method of interpolation. That’s because from test to test, scores didn’t move a tremendous amount.

For example, black 9-year-olds scored 190 on math in 1973. Five years later, black 9-year-olds scored 193, a three-point improvement. (Ten points is roughly equivalent to one grade.) Simple linear interpolation, rounded to the nearest point, gives the following estimates for the interim:

• 1974: 191
• 1975: 191
• 1976: 192
• 1977: 192

Had the test been administered those years, kids might not have hit those averages exactly. But they probably would not have been far off.
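The interpolation in that example can be sketched in a few lines of Python. This is only an illustration, assuming a straight line between the two tested years and rounding to the nearest point. The function name `interpolate_scores` is a label made up for this sketch, not something from the testing database.

```python
def interpolate_scores(year_a, score_a, year_b, score_b):
    """Estimate scores for the untested years strictly between two test years.

    Assumes scores move along a straight line from (year_a, score_a)
    to (year_b, score_b).
    """
    slope = (score_b - score_a) / (year_b - year_a)
    return {
        year: score_a + slope * (year - year_a)
        for year in range(year_a + 1, year_b)
    }


# Black 9-year-olds, math: 190 in 1973 and 193 in 1978 (figures from the text).
estimates = interpolate_scores(1973, 190, 1978, 193)
rounded = {year: round(score) for year, score in estimates.items()}

for year, score in rounded.items():
    print(year, score)
```

The unrounded estimates rise by 0.6 of a point per year, which is why two consecutive years can round to the same whole number.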

Second problem: The test subjects changed in 2004. Solution: Standardize the pre-2004 scores so they are comparable to the later tests. This is fairly simple. The database shows 2004 scores two ways: as tested, and adjusted to be comparable with earlier scores. I simply extended that adjustment proportionately to the earlier years. The impact is pretty small, reducing scores by about a point or two in most cases.
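One way to picture a proportional adjustment is as a single scaling factor taken from the two 2004 figures and applied to every earlier score. The sketch below assumes that reading of "proportionately"; the exact calculation may differ. The 2004 values and the earlier score in it are made-up placeholders, not numbers from the database, and the function names are hypothetical.

```python
def adjustment_factor(score_2004_as_tested, score_2004_old_basis):
    """Ratio that restates an old-basis score on the 2004-and-later basis.

    Assumption: the adjustment is a simple ratio of the two 2004 figures.
    """
    return score_2004_as_tested / score_2004_old_basis


def restate(old_score, factor):
    """Apply the 2004 ratio to a pre-2004 score."""
    return old_score * factor


# Placeholder inputs for illustration only (NOT actual test results).
factor = adjustment_factor(score_2004_as_tested=198.0, score_2004_old_basis=200.0)
restated = restate(210.0, factor)

print(round(factor, 3), round(restated, 1))
```

With these placeholder inputs the factor is just under 1, so the earlier score comes down by about two points, the same order of magnitude as the adjustments described above.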

Neither of these adjustments really gives you any outlandish numbers, so I’m comfortable making them and inferring results from them.

I’d rather not make adjustments like this. However, some data sets are imperfect, and one skill actuaries possess is the ability to make reasonable adjustments to an imperfect data set, while understanding that the conclusions – though not suitable for rigorous statistical analysis – can still provide insight.

There is a third issue. The same kids aren’t tested every time out. In other words, the nine-year-olds tested in 2004 weren’t tested again as 13-year-olds four years later. In that sense, the before-and-after test isn’t perfect.

However, each group of kids is selected randomly. So although we can’t directly measure how the same group of kids performed over time, the random selection lets us conclude that each group reasonably represents all kids.

I made these adjustments to the reading and math scores of white and black 9-, 13- and 17-year-olds going back to 1973. Then I created cohorts based on year of birth. So the 9-year-olds tested in 2008 belong to the 1999 cohort, the 13-year-olds tested that year are part of the 1995 cohort, and the 17-year-olds are part of the 1991 cohort.
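The cohort assignment is just the test year minus the age at testing. A minimal sketch, with `birth_cohort` as a hypothetical helper name:

```python
def birth_cohort(test_year, age):
    """Return the birth-year cohort for students of a given age in a test year."""
    return test_year - age


# The three age groups tested in 2008 fall into three different cohorts.
cohorts_2008 = {age: birth_cohort(2008, age) for age in (9, 13, 17)}

for age, cohort in cohorts_2008.items():
    print(f"age {age} in 2008 -> {cohort} cohort")
```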

The result is four data sets. To give you an idea of what I’m looking at, I’ll show one: reading scores for black students.

Data here.

The first column shows the year the kids were born. The second column shows the scores achieved by each cohort at age 9. The next column shows scores for the cohort at 13. The final column shows scores at 17.

You can read the first row of data this way: Kids born in 1962 scored a 167 at age 9, a 221 at age 13 and a 241 at age 17.

A couple other points: Testing began in 1971, but this data set begins in 1962. That’s because the 9-year-olds tested in 1971 were born in 1962.

Also, most of the scores here are interpolated. The boldface numbers are actual test scores.

And if you look at the bottom of the table, it will seem incomplete. For example, there are only two scores for kids born in 1992. That’s because that group’s third test would have been in 2009, and no test was administered that year. The 1999 cohort has only one entry, for age 9. Those students can’t be tested as 13-year-olds until 2012 and as 17-year-olds until 2016.
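The ragged bottom edge of the table follows mechanically from the test dates. The sketch below assumes that, after interpolation, a score exists for every year from the first reading test (1971) through the most recent one (2008), and for no year outside that span. The constant and function names are made up for this illustration.

```python
FIRST_TEST_YEAR = 1971  # first reading test, per the text
LAST_TEST_YEAR = 2008   # most recent test in this data set
AGES = (9, 13, 17)


def available_ages(birth_year):
    """Ages at which a cohort has an actual or interpolated score.

    A cell is filled only if the cohort reached that age in a year
    between the first and last tests, inclusive.
    """
    return [
        age
        for age in AGES
        if FIRST_TEST_YEAR <= birth_year + age <= LAST_TEST_YEAR
    ]


for cohort in (1962, 1992, 1999):
    print(cohort, available_ages(cohort))
```

The 1962 cohort has all three entries, the 1992 cohort has two, and the 1999 cohort has one, which matches the shape described above.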

Finally, the 9-year-olds tested in 1971 didn’t really score a 167. They scored 170. The difference is the adjustment I had to make so the 1971 scores were comparable to the 2008 scores.

The data set will look familiar to property-casualty actuaries. It looks like standard loss development. (Actuaries call data formatted like this a triangle, as the number of cohorts we look at is generally equal to the number of evaluation points, meaning the data presentation resembles a triangle.)

At this point, we can compare black students and white students by cohort across the years. The results appear in the next part of this series.