Risk Management of Your Exam: Inside the Numbers

Originally Published in Certification Magazine, 1/2005

Would you have predicted that the Pittsburgh Steelers would go 15-1 this season, or that the Indianapolis Colts would be knocked out of the playoffs in the second round? The answer depends on which statistics you used to make your predictions. There are some numbers that are used consistently to predict the outcomes of sporting events, and others that television broadcasters such as Chris Berman use with higher success rates. The same is true in the testing business. We use results from a field test or beta sample to make predictions about how well a test will perform. Do you stop there, or should you be looking at additional data? This month’s article looks inside the numbers: the numbers you should be tracking to monitor the health of your test.

There is a great body of knowledge available about which statistics are appropriate for the initial development of a test. These statistics vary depending on whether you want to use a fixed-form, multi-stage, computer-adaptive, or linear-on-the-fly testing model. Much less information is widely available about which statistics are most commonly used to monitor the health of your test. The literature tends to focus on item-level performance using item drift statistics. While these statistics are important, there are many other statistics about the overall test or item pool that can be monitored more easily before you even think about looking at these more advanced statistics.

There are three key areas that you should be tracking, at a minimum, if your program uses computer-based testing: test scores, data forensics, and response latency. The first two also apply to paper-and-pencil tests. First and foremost, it’s essential that you monitor any changes in test scores over time. You can track the scaled test score or, in the case of licensure and certification testing, simply monitor pass rates. I recommend developing and reviewing a report that includes an analysis of scores over time. You can look at overall test performance across geographies, as well as across countries and test sites in the case of licensure and certification, and across school districts, schools and test administrators in the case of education. You may also want to consider conducting a retake analysis to determine whether there are any extreme changes in test performance by examinees over time.
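
As a rough illustration of the kind of score-over-time report described above, the following Python sketch computes pass rates per reporting period and flags large period-to-period shifts. The record layout, the quarterly period labels, the sample data, and the 10-point threshold are all assumptions made for this example, not recommended values; a real program would set its own threshold and could group by test site or country instead of by period.

```python
from collections import defaultdict

def pass_rates(records):
    """records: iterable of (group, passed) pairs. Returns {group: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for group, passed in records:
        totals[group] += 1
        if passed:
            passes[group] += 1
    return {group: passes[group] / totals[group] for group in totals}

def flag_shifts(rates_by_period, threshold=0.10):
    """Return (previous, current, change) for consecutive periods whose
    pass rate changed by more than `threshold` (a hypothetical cutoff).
    Periods are compared in sorted order of their labels."""
    periods = sorted(rates_by_period)
    flagged = []
    for prev, curr in zip(periods, periods[1:]):
        change = rates_by_period[curr] - rates_by_period[prev]
        if abs(change) > threshold:
            flagged.append((prev, curr, round(change, 3)))
    return flagged

# Made-up sample data: (reporting period, passed?)
records = [
    ("2004-Q1", True), ("2004-Q1", False), ("2004-Q1", False), ("2004-Q1", True),
    ("2004-Q2", True), ("2004-Q2", False), ("2004-Q2", True), ("2004-Q2", False),
    ("2004-Q3", True), ("2004-Q3", True), ("2004-Q3", True), ("2004-Q3", False),
]
rates = pass_rates(records)
flagged = flag_shifts(rates)
print(rates)
print(flagged)
```

Passing a site or country name as the group key instead of a period gives the geographic breakdown with the same `pass_rates` function.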

A second area of analysis that you should investigate is sometimes referred to as data forensics. Just as crime scene investigators use forensic techniques to solve crimes, program managers can use data forensics to detect test fraud. Test fraud can be divided into the following five areas: cheating, collusion, piracy, retake violations and volatile retakes. Depending on what type of test you deliver, one or more of these potential types of test fraud may occur within your testing program. A rise in pass rates may be one indicator of cheating, but irregular patterns of responding (missing easy items and answering difficult ones correctly) are another. Collusion is indicated by similar patterns of responding within the same test administration (such as a test site or classroom). Not being able to complete the test is one indicator of a potential pirate. Retake violators are easy to identify by simply comparing the time between test events to the program policy. Volatile retakes can be identified by extreme changes in test scores for any given examinee.
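
The two retake checks described above are the simplest to automate. The Python sketch below is one possible way to do it, under stated assumptions: the 30-day waiting period, the 20-point score-change cutoff, the tuple layout, and the sample attempts are all hypothetical and stand in for whatever your program policy and data actually specify.

```python
from datetime import date

def retake_flags(attempts, min_wait_days=30, max_score_jump=20):
    """attempts: list of (examinee_id, test_date, score).
    Returns (violations, volatile): examinees who retook sooner than the
    assumed policy allows, and examinees whose score changed by more than
    the assumed cutoff between consecutive attempts."""
    by_examinee = {}
    for examinee, when, score in attempts:
        by_examinee.setdefault(examinee, []).append((when, score))

    violations, volatile = [], []
    for examinee, events in by_examinee.items():
        events.sort()  # chronological order
        for (d1, s1), (d2, s2) in zip(events, events[1:]):
            if (d2 - d1).days < min_wait_days and examinee not in violations:
                violations.append(examinee)
            if abs(s2 - s1) > max_score_jump and examinee not in volatile:
                volatile.append(examinee)
    return violations, volatile

# Made-up sample data
attempts = [
    ("A", date(2004, 3, 1), 55), ("A", date(2004, 3, 10), 58),
    ("B", date(2004, 1, 5), 40), ("B", date(2004, 4, 5), 92),
    ("C", date(2004, 2, 1), 61), ("C", date(2004, 5, 1), 70),
]
violations, volatile = retake_flags(attempts)
print("Retake violations:", violations)
print("Volatile retakes:", volatile)
```

A flag from either check is a reason to look more closely at an examinee's record, not proof of fraud on its own.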

Finally, the time that an examinee takes on an individual item and on the overall exam should be monitored on an ongoing basis, and unusual values may warrant further investigation. If examinees in Beijing are completing exams three times faster than those in Boston, cheating may be occurring. If examinees can complete the exam in less than 40% of the time allotted, compared with the field test, then it’s likely that your exam is overexposed and that your exam content is available on the Internet.
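
A latency check along these lines can be sketched in a few lines of Python. The example below compares each site's median completion time with 40% of the allotted time; the use of the median, the 90-minute time limit, and the sample times are assumptions made for illustration only.

```python
from statistics import median

def fast_sites(times_by_site, allotted_minutes, fraction=0.40):
    """Return (flagged, medians): the sites whose median completion time is
    below `fraction` of the allotted time, and each site's median time."""
    medians = {site: median(times) for site, times in times_by_site.items()}
    cutoff = fraction * allotted_minutes
    flagged = sorted(site for site, m in medians.items() if m < cutoff)
    return flagged, medians

# Made-up completion times in minutes for a hypothetical 90-minute exam
times_by_site = {
    "Boston": [72, 80, 65, 90, 77],
    "Beijing": [25, 30, 22, 28, 35],
}
flagged_sites, medians = fast_sites(times_by_site, allotted_minutes=90)
print("Median minutes by site:", medians)
print("Sites under 40% of allotted time:", flagged_sites)
```

The same function can be applied to per-item times by passing item identifiers in place of site names and the expected item time in place of the exam time limit.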

As I indicated at the opening of this article, it’s important to track test scores and response time on an ongoing basis. If you can also allocate resources to generate any or all of the data forensic analyses mentioned, you will have a much clearer picture of the health of your test.

Alison Foster

Test Security Specialist, Caveon Test Security