# Can you prove cheating on tests using statistics?

There is a children’s game, known variously as “Whisper,” “Secrets,” or “Gossip,” in which a secret is passed from one player to the next. The last player to hear the secret says it aloud, often with hilarious results. The same distortions happen in the news media as journalists cite other reports or each other. One such misquote in the Star-Telegram, concerning additional security announced by the TEA (Texas Education Agency) for the TAKS (Texas Assessment of Knowledge and Skills), caused me to pause and reflect on using statistical evidence to “prove” that someone cheated on a test.

The reporter wrote, “Among other security measures: … Scramble field test questions on tests to provide *proof* if someone is copying someone else’s answer sheet.” (Italics added.) http://www.star-telegram.com/news/story/433614.html. Being well aware of the controversy surrounding the use of statistics *alone* to prove cheating, I immediately doubted the accuracy of that statement. In fact, on June 7, 2007, Shirley Neeley announced that “the Texas Education Agency today will immediately initiate the following: … analyze scrambled blocks of test questions to detect answer copying…” TEA later clarified that the scrambling would involve only field test items. The Dallas Morning News was quick to criticize the scrambling plan, but I applauded TEA’s intent to use statistics to detect cheating behavior.

We naturally ask whether statistical evidence can be relied on to detect cheating. Many authors have expressed the opinion that statistical evidence must be corroborated by eye-witness accounts before making allegations of cheating. I can understand this position if the statistics are not reliable. In my opinion, reliable evidence must meet the following conditions:

- It must be factual,
- It must be objective,
- It must be credible, and
- It must be defensible.

If statistical evidence meets the above conditions, I believe that it can be relied upon, whether corroborating eye-witness accounts are available or not. Statistical evidence is

- factual when it is based on test result data (an actual record of the test event),
- objective when it provides a statistic with a probability statement,
- credible when the statistics have been shown to work because the models accurately depict actual test taking, and
- defensible when the underlying science withstands scrutiny.

A fifth criterion must be met before taking action on a suspected instance of cheating: the evidence must be strong. Statistical evidence is strong when the calculated probabilities are so small that we no longer believe the observed data are the result of normal test taking. Statistics can provide guidance for determining how strong is strong enough to take action, but ultimately the establishment of a probability threshold (i.e., the required strength of the statistic) is a matter of policy that must be decided by the testing program administrator.

It is important with any statistical investigation to choose statistics that are well-suited and designed for the task at hand. For example, if the concern is that answer sheets are being modified, then erasure counts should be analyzed. Having analyzed over one hundred data sets for a wide variety of clients including state Departments of Education, admissions tests, certification programs, and licensure exams, I can unequivocally state that answer copying is the predominant means of cheating on tests. Therefore, it is especially relevant in this discussion concerning the reliability of statistical evidence to discuss answer copying and statistics that are designed to detect answer copying.

As you reflect upon the principles that I have outlined, I would ask you to consider the data in Table 1. The table contains differing probability values that a testing program administrator might be asked to evaluate. These are sampled answer-copying statistics (i.e., counts of identical answers) from a test having 240 items. With this many items on the test, the central limit theorem will generally apply, so I have included a Z-score in the table as a point of reference.

**Table 1: Sampling of test similarity statistics**

| Number of identical answers | Expected number of identical answers | Standard deviation | Z-score | Probability index |
|---:|---:|---:|---:|---:|
| 168 | 81.3 | 7.2 | 12.0 | 30.3 |
| 171 | 102.3 | 7.4 | 9.3 | 19.9 |
| 130 | 76.4 | 7.1 | 7.5 | 12.4 |
| 154 | 107.7 | 7.4 | 6.3 | 9.5 |
| 128 | 87.9 | 7.3 | 5.5 | 7.3 |
| 108 | 74.3 | 7.1 | 4.7 | 5.5 |
| 107 | 75.1 | 7.1 | 4.5 | 5.0 |
| 120 | 89.4 | 7.3 | 4.2 | 4.6 |
| 115 | 86.1 | 7.3 | 4.0 | 4.2 |
| 128 | 103.9 | 7.4 | 3.3 | 3.1 |
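Where do the expected counts and standard deviations in a table like this come from? They come from a probability model of normal test taking. As an illustrative sketch only (not TEA's or Caveon's actual method), suppose we can estimate, for each of the 240 items, the probability that two independent test takers would mark the same answer by chance; the total number of identical answers then follows a generalized binomial distribution:

```python
import math

def similarity_z(observed_matches, match_probs):
    """Z-score for a count of identical answers under an illustrative
    independence model: item i matches by chance with probability
    match_probs[i], so the total match count is generalized binomial
    with mean sum(p_i) and variance sum(p_i * (1 - p_i))."""
    expected = sum(match_probs)
    sd = math.sqrt(sum(p * (1 - p) for p in match_probs))
    return expected, sd, (observed_matches - expected) / sd

# Hypothetical inputs: a uniform 0.43 chance-match probability per item
# on a 240-item test (real per-item probabilities would vary by item).
probs = [0.43] * 240
expected, sd, z = similarity_z(128, probs)
```

With these hypothetical inputs, the expected count is about 103 and an observed 128 identical answers yields a z-score of about 3.2, comparable in magnitude to the last row of Table 1.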

At Caveon we deal with extremely small probability values, so we typically express them using an “index,” where the probability is one in 10 to the power of the index (p = 10^(-index)). The most extreme case in Table 1 has a probability of one in 10 to the thirtieth power. These data are definitely not due to normal test taking.
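To make the index concrete: it is simply the negative base-10 logarithm of the probability. Here is a minimal sketch, using the standard normal tail probability as a rough point of reference (the actual copying statistic's distribution need not be exactly normal, so values computed this way will not reproduce Table 1 exactly):

```python
import math

def probability_index(p):
    """Index such that p = 10**(-index)."""
    return -math.log10(p)

def normal_tail_index(z):
    """Index for the upper-tail probability P(Z > z) of a standard
    normal -- a reference approximation, not the exact statistic."""
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z)
    return probability_index(p)
```

For example, a probability of one in 100,000 corresponds to an index of 5; an index of 30 corresponds to odds so long that chance agreement is no longer a believable explanation.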

Assuming that you accept the statistical evidence as reliable, the decision you, the testing program administrator, must make is how far down Table 1 to go. Where do you set the cut point? These data illustrate the trade-off: if you set the cut point too low, you might accuse some individuals of answer copying without strong evidence; if you set it too high, you might allow several individuals who cheated to escape discipline.

I will elaborate more on this topic next time. Until then, may your tests remain secure.