Use of Statistics for Detecting Cheating on Tests
Occasionally I search for the latest thinking about how to prevent and detect cheating on tests. I recently saw a presentation from the 2007 Annual Conference of the Arizona State Board of Nursing (BON) and Statewide Nurse Educators (URL is below). In my opinion this presentation is very good and provides a great deal of perspective on dealing with test security issues.
Using test result data to detect and prevent cheating was not discussed in this presentation. I think there are good reasons for the omission: (1) cheating detection software has mostly been created for large testing programs and is not readily accessible to anyone who administers tests, and (2) many people are not comfortable with using statistics to make inferences about cheating. My purpose in writing is to discuss this second issue.
Being a statistician, I admit to having specific ideas about data and test scores. Some of these ideas are not generally accepted and may not be popular. However, the idea of using statistics to detect problems with a test administration seems natural and reasonable. Anyone who would accept test scores as valid and reliable, but would not use test result data to make inferences about the quality of the test administration, holds an inconsistent position. I say this because the very act of administering a test and obtaining a test score is a statistical procedure with the intent of making a statistical inference. When we give tests, we are not interested in the test taker’s performance on the particular questions that were presented. Instead, we are interested in inferring or estimating the test taker’s knowledge or competence in the tested domain. Making such an inference implicitly acknowledges that the test score is a statistical measure and subject to uncertainty: if other questions had been presented, the test scores would no doubt have been different.
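To make that uncertainty concrete, here is a minimal simulation sketch. All the numbers are made up for illustration: a single test taker with a fixed 75% chance of answering any item correctly is scored on many randomly assembled 50-item forms, and the scores vary by several percentage points purely because different items were sampled.

```python
import random

random.seed(42)

TRUE_ABILITY = 0.75   # hypothetical probability of answering any item correctly
TEST_LENGTH = 50      # items per form (made-up value)
N_FORMS = 1000        # number of alternate random forms to simulate

# Score the same test taker on many randomly assembled forms.
scores = []
for _ in range(N_FORMS):
    correct = sum(1 for _ in range(TEST_LENGTH) if random.random() < TRUE_ABILITY)
    scores.append(100.0 * correct / TEST_LENGTH)

mean = sum(scores) / N_FORMS
sd = (sum((s - mean) ** 2 for s in scores) / N_FORMS) ** 0.5
print(f"mean score: {mean:.1f}%, standard deviation across forms: {sd:.1f}%")
```

The standard deviation here comes out to roughly six percentage points, even though the test taker's underlying ability never changed: that spread is the statistical uncertainty the paragraph above describes.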
If you do not agree with the above perspective, you will not agree with the corollary that I now present: the best and most reliable record of the testing session is the actual set of recorded responses (along with any other measurements that can be obtained, such as erasures or response times). These data are more reliable than proctoring observations, video recordings, or any other externally derived measure of the testing session. If you trust the recorded responses enough to calculate a test score and make decisions about a test taker’s future, you should be equally comfortable using them to make inferences about the quality of the testing session and whether testing irregularities may have occurred.
Because many statistical techniques may appear arcane or even “mystical,” the statistician must be very careful to select techniques that are based on solid statistical principles. Statistics will be most easily defended if they are derived from a probability model that describes the behavior being observed and if they provide objective probability statements concerning the extremeness of any observation. These criteria are rather stringent and naturally exclude many techniques that researchers have investigated. For example, person-fit statistics are ideal for describing whether a test taker’s response pattern is consistent with the normal pattern of test taking (at Caveon we usually use the word “aberrant” to describe inconsistent response patterns). However, even though there is a large literature on person-fit statistics, no researcher has yet published a method for making objective probability statements about aberrant test taking. Without statistically sound inferential models, the practitioner must devise ad hoc methods that are empirically derived from analysis of the data. There are two problems with this approach: (1) the judgment of what constitutes an extreme observation is subjective and may vary with the situation, and (2) the modeling technique itself is not easily defended or replicated. I think these problems are fundamental reasons why test administrators have been uncomfortable using statistics to make inferences about cheating.
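To illustrate the kind of index the person-fit literature works with (this is a classical textbook index, not Caveon's method, and the item difficulties and response patterns below are invented), consider counting Guttman errors: with items ordered from easiest to hardest, an error occurs whenever a harder item is answered correctly while an easier one is missed. A response pattern that misses easy items but aces hard ones racks up many errors and looks aberrant, but note that the raw count alone carries no objective probability statement, which is exactly the gap described above.

```python
def guttman_errors(responses, difficulties):
    """Count Guttman errors: pairs where an easier item is wrong (0)
    but a harder item is right (1). responses aligns with difficulties."""
    # Order responses from easiest item to hardest.
    ordered = [r for _, r in sorted(zip(difficulties, responses))]
    errors = 0
    for i in range(len(ordered)):
        for j in range(i + 1, len(ordered)):
            if ordered[i] == 0 and ordered[j] == 1:
                errors += 1
    return errors

# Hypothetical 8-item test with made-up difficulties.
diffs = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
typical  = [1, 1, 1, 1, 1, 0, 0, 0]   # misses only the hardest items
aberrant = [0, 0, 0, 1, 0, 1, 1, 1]   # misses easy items, aces hard ones

print(guttman_errors(typical, diffs))   # → 0
print(guttman_errors(aberrant, diffs))  # → 15
```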
At Caveon, we have worked very hard to create algorithms that can compute probabilities for the statistics we use in data forensics work. Part of that work involves understanding the probability models and the assumptions that underlie them. For example, “answer-copying” statistics that are based on the idea of similarity and excess similarity should be derived from probability models. One such example is the class of answer-copying statistics presented by van der Linden and Sotaridona (2006), “Detecting answer copying when the regular response process follows a known response model,” Journal of Educational and Behavioral Statistics, 31, 283-304. In that paper the authors assume that tests are taken independently when deriving the probability model for the number of identical responses (the statistic of interest). We have currently implemented person-fit statistics (for detecting aberrance), similarity statistics (for detecting collusion, test coaching, answer copying, and proxy test taking), erasure statistics (for detecting test tampering), gain-score statistics (for detecting unusual learning patterns), and response latency statistics (for detecting content exposure), and we continue to explore other statistics. I will discuss each of these later, as time permits.
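A sketch may help show how the independence assumption does the work. This is a simplified illustration, not the van der Linden and Sotaridona statistic itself, and the option probabilities are made up: if two test takers respond independently, their chance of matching on an item is the sum over options of the product of their option probabilities, and the total number of identical responses then follows a Poisson-binomial distribution, from which an objective tail probability can be computed exactly.

```python
def match_prob(p_a, p_b):
    """Probability that two independent test takers pick the same option,
    given each one's distribution over the answer options."""
    return sum(pa * pb for pa, pb in zip(p_a, p_b))

def poisson_binomial_tail(match_probs, observed):
    """Exact P(number of matches >= observed) under independence,
    computed by dynamic programming over the per-item match probabilities."""
    dist = [1.0]  # dist[k] = P(k matches among items processed so far)
    for m in match_probs:
        new = [0.0] * (len(dist) + 1)
        for k, p in enumerate(dist):
            new[k] += p * (1 - m)      # item does not match
            new[k + 1] += p * m        # item matches
        dist = new
    return sum(dist[observed:])

# Hypothetical 20-item test, 4 options per item, with made-up option
# probabilities for the two test takers (identical on every item here,
# for simplicity).
p_a = [0.7, 0.1, 0.1, 0.1]
p_b = [0.6, 0.2, 0.1, 0.1]
m = match_prob(p_a, p_b)               # per-item match probability (0.46)
tail = poisson_binomial_tail([m] * 20, 18)
print(f"P(>= 18 of 20 identical responses under independence) = {tail:.2e}")
```

If the observed number of identical responses has a tiny probability under the independence model, that model becomes hard to sustain, which is the logic behind treating excess similarity as evidence of copying.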