

Are identical answers to exam questions proof of cheating on tests?


Monday, February 18th, 2008

When it comes to supporting an allegation of cheating on tests, there is rarely better statistical evidence than two (or more) tests with identical sets of responses. Having a great interest in this topic, I have carefully read the abstracts of Rice University Honor Council meetings, where these types of allegations are taken very seriously. In several instances of alleged academic fraud, the Honor Council has found the evidence of identical solutions and identical answers to be compelling.

“The Rice Honor System was created by students in 1916. That it has functioned so well for so long is a reflection of the trust and respect that Rice students show to one another and to the University. It is one of Rice’s most highly valued traditions and a vital part of your education–education in responsibility and integrity.” http://honor.rice.edu/

In one instance, the Council minutes read:

Witness 1, the professor for the class, stated that he believed the similarities between the True / False answers and the essay answers given by Student A and Student B to be strikingly similar. He … presented a statistical analysis of the probability of this occurring in certain situations.

In the above case, despite having a probability analysis, the Honor Council did not find that the honor code had been violated (i.e., cheating was not found).

In another instance, the Honor Council had a different finding:

Some members felt that the identical answers on some portions of the exam were beyond coincidence or having similar notes or studying together. Members were suspicious of the fact that these similarities would arise after the students used different sources of information when answering the questions. … Some members were not convinced by the explanations …

Despite denials of cheating in the above situation, both students were found in violation of the honor code.

Here’s a Google search link if you wish to read some of these abstracts.

It is evident from these two abstracts that the Honor Council looks for plausible explanations for identical answers and excessive similarities between tests. It is also evident that the Honor Council may act without having definitive proof. As an example of the degree of “proof” or evidence that may be required to take action in a case of suspected cheating, consider this statement from the University of Western Ontario:

It is particularly important to understand that the conclusion that a student committed a scholastic offence does not have to be supported by evidence beyond a reasonable doubt. In an exam writing situation, that means that a decision maker may conclude that cheating took place, even if it is possible that two people got some identical answers by chance.

The observation that two tests have identical answers is very reliable evidence as defined by the criteria I proposed in my most recent post, because the observation is (1) factual, (2) objective, (3) credible, and (4) defensible. We require that the evidence have one additional attribute before believing that cheating probably occurred: the evidence must be strong.

To evaluate the strength of evidence of identical answers on tests, we need the probability of the observed responses. At Caveon, this probability is estimated using item response theory. We multiply together the probabilities of the selected responses (assuming the selected responses are conditionally independent) and then normalize the product by the marginal probability of the observed score. Formulas for computing exact probabilities are difficult to derive and program, which means that most practitioners who encounter these situations will rely upon judgment and intuition in the same way the Rice Honor Council does.
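The computation described above can be illustrated with a minimal sketch. It is not Caveon's implementation. It assumes that a fitted response model has already supplied, for one test taker, the probability of selecting each option on each item; the names `option_probs`, `selected`, `key`, and both function names are hypothetical, and the example probabilities are invented for illustration. The marginal probability of the observed score is obtained with the usual recursion for the number-correct score of independent items.

```python
from math import prod


def score_distribution(p_correct):
    """Distribution of the number-correct score for independent items."""
    dist = [1.0]  # before any item is seen, score 0 has probability 1
    for p in p_correct:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1.0 - p)  # item answered incorrectly
            new[score + 1] += mass * p      # item answered correctly
        dist = new
    return dist


def pattern_probability_given_score(option_probs, selected, key):
    """P(observed response pattern | observed number-correct score).

    option_probs[i][k] is the assumed model probability of choosing
    option k on item i; selected[i] is the option chosen; key[i] is
    the correct option. Items are assumed conditionally independent.
    """
    joint = prod(probs[choice] for probs, choice in zip(option_probs, selected))
    observed_score = sum(choice == k for choice, k in zip(selected, key))
    p_correct = [probs[k] for probs, k in zip(option_probs, key)]
    return joint / score_distribution(p_correct)[observed_score]


# Hypothetical 18-item test, four options per item, option 0 keyed correct.
option_probs = [[0.7, 0.1, 0.1, 0.1] for _ in range(18)]
key = [0] * 18
all_correct = [0] * 18
one_wrong = [1] + [0] * 17

print(pattern_probability_given_score(option_probs, all_correct, key))
print(pattern_probability_given_score(option_probs, one_wrong, key))
```

With every item answered correctly, the product of the selected-response probabilities equals the probability of a perfect score, so the conditional probability is one. With a single wrong answer, the result depends on which item was missed and which wrong option was chosen.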

I have pasted in a table of sampled probabilities for an 18-item test, below. The probabilities are calculated conditional on the score that was obtained on the test. So, if we know a person answered all 18 items correctly, the probability that another person who answered all 18 items correctly would match is equal to one. Correct answers are highlighted in gold in the table.

Probabilities of identical tests

Even though I routinely evaluate these types of probabilities, I have been surprised by some instances of identical response data. For example, the probability of an identical test when all items are answered correctly is 1 (as in the first row of the table). But the probability of an identical test when all but one or two questions are answered correctly may be as high as .10 or .25 (see the second and fourth rows of the table). On the other hand, if several questions are answered incorrectly, the probability of an identical test may be 1 in 100 million or even smaller. The wide variation in these probabilities is a function of the number of correctly answered test questions and the selected responses.

If the probabilities of some test response patterns are sufficiently high (because the tests are easy or the examinees are very proficient) and if we have a large enough group, we might expect to see many identical tests. Probability computations for the number of observed identical tests can be very difficult. This is an instance of the “birthday problem” with unequal probabilities.
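To give a rough feel for the birthday-problem effect, here is a sketch that swaps the exact calculation for a simple approximation. It assumes test takers respond independently and that the probability of each possible response pattern is known; `pattern_probs`, `group_size`, and the function names are hypothetical. The expected number of identical pairs is exact under those assumptions, while the probability of at least one identical pair uses a Poisson approximation, not the exact result.

```python
from math import comb, exp


def expected_identical_pairs(pattern_probs, group_size):
    """Expected number of pairs of test takers with identical tests."""
    # Chance that two independent test takers produce the same pattern.
    collision = sum(p * p for p in pattern_probs)
    return comb(group_size, 2) * collision


def prob_at_least_one_identical_pair(pattern_probs, group_size):
    """Poisson approximation to P(at least one identical pair)."""
    return 1.0 - exp(-expected_identical_pairs(pattern_probs, group_size))


# Sanity check against the classic birthday problem: 365 equally likely
# "patterns" and a group of 23.
uniform = [1.0 / 365] * 365
print(prob_at_least_one_identical_pair(uniform, 23))

# Unequal probabilities: one very likely pattern (an easy test) makes
# identical tests far more common in the same size of group.
skewed = [0.25] + [0.75 / 364] * 364
print(prob_at_least_one_identical_pair(skewed, 23))
```

If only some of the possible patterns are listed in `pattern_probs`, the collision probability is understated, so the results should be read as lower bounds.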

At the beginning of this discussion, it appeared that we had a relatively straightforward and simple problem. In statistics, apparently simple problems often become very complex, very quickly, and the analysis of identical answers for two exams is one of those problems. The answer to the question with which we began the discussion must be this: we cannot prove that cheating occurred when we have identical answers for two test instances, but in many situations we can obtain very strong, reliable evidence leading us to conclude that cheating occurred, and that conclusion would nearly always be right.



Use of Statistics for Detecting Cheating on Tests


Friday, November 16th, 2007

Occasionally I search for the latest thinking about how to prevent and detect cheating on tests. I saw this presentation from the Annual Conference (2007) of the Arizona State BON (Board of Nursing) and Statewide Nurse Educators (URL is below). In my opinion this presentation is very good and provides a lot of perspective for dealing with test security issues.

http://www.azbn.gov/documents/news/Statewide%20Educators%20Academic%20Dishonesty.10.05.07.pdf

Using test result data to detect and prevent cheating was not discussed in this presentation. I think there are good reasons for the omission: (1) cheating detection software has mostly been created for large testing programs and is not readily accessible to anyone who administers tests, and (2) many people are not comfortable with using statistics to make inferences about cheating. My purpose in writing is to discuss this second issue.

Being a statistician, I admit to having specific ideas about data and test scores. Some of these ideas are not generally accepted and may not be popular. However, the idea of using statistics to detect problems with the test administration seems natural and reasonable. Anyone who would accept test scores as being valid and reliable but would not use test result data to make inferences about the quality of the test administration holds an inconsistent position. I say this because the very act of administering a test and obtaining a test score is a statistical procedure with the intent of making a statistical inference. When we give tests we are not interested in the test taker’s performance on the actual questions that were presented. Instead, we are interested in inferring or estimating the test taker’s knowledge or competence in the tested domain. Making such an inference implicitly acknowledges that the test score is a statistical measure and subject to uncertainty. If other questions had been presented, there is no doubt that the test scores would have been different.

If you do not agree with the above perspective you will not agree with the corollary that I now present. Despite disagreements, I now stipulate that the best and most reliable record of the testing session is the actual set of recorded responses (and any other measurements that can be obtained such as erasures or response times). These data are more reliable than proctoring observations, or video recordings, or any other externally derived measure of the testing session. If you can trust the recorded responses to calculate a test score and make decisions about a test taker’s future, you should be equally comfortable using the recorded responses to make inferences about the quality of the testing session and whether testing irregularities may have occurred.

Because many statistical techniques may appear to be arcane or even “mystical,” the statistician must be very careful in selecting and using techniques that are based on solid statistical principles. Statistics will be most easily defended if they are derived from a probability model that describes the behavior being observed and if they provide objective probability statements concerning the extremeness of any observation. These criteria are rather stringent and lead to the natural exclusion of many techniques that have been investigated by researchers. For example, person-fit statistics are ideal for describing whether a test taker’s response pattern is consistent with the normal pattern of test taking (at Caveon we usually use the word “aberrant” to describe inconsistent response patterns). However, even though there is a large literature on person-fit statistics, no researcher has yet published how to make objective probability statements about aberrant test taking. Without statistically sound inferential models, the practitioner must devise ad hoc methods that are empirically derived from the analysis of the data. There are two problems with this approach: (1) the judgment of what constitutes an extreme observation is subjective and may vary depending upon the situation, and (2) the modeling technique itself is not easily defended or replicated. I think these problems are fundamental reasons why test administrators have been uncomfortable with using statistics to make inferences about cheating.

At Caveon, we have worked very hard to create algorithms that are capable of computing probabilities for the statistics that we use in data forensics work. Part of that work involves understanding the probability models and the assumptions that underlie them. For example, “answer-copying” statistics that are based on the idea of similarity and excess similarity should be derived from probability models. One such example is the class of answer-copying statistics presented by van der Linden and Sotaridona (2006), “Detecting answer copying when the regular response process follows a known response model,” Journal of Educational and Behavioral Statistics, 31, 283-304. In this paper the authors assume that tests are taken independently when deriving the probability model for the number of identical responses (the statistic of interest). We have currently implemented person-fit statistics (for detecting aberrance), similarity statistics (for detecting collusion, test coaching, answer copying and proxy test taking), erasure statistics (for detecting test tampering), gain-score statistics (for detecting unusual learning patterns), and response latency statistics (for detecting content exposure), and we continue to explore other statistics. I will discuss each of these later, as time permits.
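As an illustration of what a probability statement for the number of identical responses can look like, here is a minimal sketch in the spirit of that independence assumption. It is not the procedure from the paper or from Caveon's software. It assumes a response model has already supplied, for each item, the probability that two independently working test takers give the same response; `match_probs`, `observed_matches`, and the function name are hypothetical. Under independence the number of matches is a sum of independent but unequal Bernoulli trials, so its upper-tail probability can be computed exactly by recursion.

```python
def matches_upper_tail(match_probs, observed_matches):
    """P(number of identical responses >= observed_matches) under independence."""
    dist = [1.0]  # distribution of the match count before any item
    for p in match_probs:
        new = [0.0] * (len(dist) + 1)
        for matches, mass in enumerate(dist):
            new[matches] += mass * (1.0 - p)  # responses differ on this item
            new[matches + 1] += mass * p      # responses agree on this item
        dist = new
    return sum(dist[observed_matches:])


# Hypothetical per-item match probabilities for a 10-item test.
match_probs = [0.6, 0.5, 0.7, 0.4, 0.5, 0.6, 0.3, 0.5, 0.4, 0.6]
print(matches_upper_tail(match_probs, 10))  # all ten responses identical
print(matches_upper_tail(match_probs, 5))   # five or more identical
```

A small upper-tail probability says the observed agreement would be unusual for independent test takers; how small is small enough remains a policy decision.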

 


