Archive for the 'Cheating detection' Category


Trojan Items and Answer-key Arbitrage


Sunday, March 2nd, 2008

Today is the first day of the annual ATP Conference (Association of Test Publishers). This afternoon I will present a workshop titled, “Strategies and Tactics for Limiting Item Exposure.” We will be exploring innovative ideas for protecting tests and items from theft. It’s easy to understand why test publishers are concerned about test theft. High-quality items are expensive to produce and represent a substantial investment. Item development costs of $1,000 or higher per item are not unusual. In an afternoon, a thief can compromise an investment of $250,000 or more, easily. Most testing professionals will state that item theft is their number one security concern. I discussed this previously in: What is your top security concern?

I can’t share the entire workshop content with you in this short essay. But, I can share with you Gene Radwin’s (of EMC Corporation) intriguing idea of answer-key arbitrage and Trojan items. The idea was briefly mentioned in: Student outwits FCAT with secret pattern. Just as the Trojan horse was the Greeks’ surprise weapon for outwitting the people of Troy, we hope to outsmart users of brain-dump content using Trojan items.

The basic idea of the Trojan item as developed and presented to me by Gene Radwin (email: radwin_gene at emc.com) is to place very easy items on the test which are miskeyed. If a test taker gives the miskeyed answers (and not the correct, easy answers) we have strong evidence that braindump content is being used. The fundamental principle is to create a test-within-a-test to detect test fraud. We booby trap selected items by changing them so that a different answer choice is now correct, and the compromised answer is incorrect. Without knowing which items are booby-trapped, the brain-dump user proceeds in ignorance, until detected. Just to illustrate, consider a math item that I “borrowed” from the SAT practice test.

Table 1: Example of a Trojan item

We do not expect the brain-dump user who has memorized the “Exposed” item to notice the small change in the “Trojan” item. As a result, the cheater will give the originally correct, but now incorrect, answer “C,” and at the same time the honest test taker will give the correct answer “E.” The change in the answer key gives us a leverage or arbitrage point, creating a powerful difference in the statistical expectations.

In order to be effective, several Trojan items will be required on the exam. I haven’t done a rigorous analysis of the statistical power of the procedure, but my current intuition suggests that ten to twelve questions will be needed.
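As a rough illustration of why several Trojan items are needed, the sketch below computes the chance that an honest test taker would give the old (now incorrect) answer on at least k of n Trojan items purely by accident. It assumes the items are answered independently and that an honest test taker picks the old answer with the same probability q on every item. The value q = 0.05 is made up for illustration and is not an estimate from any data.

```python
from math import comb

def prob_at_least(k: int, n: int, q: float) -> float:
    """Upper-tail binomial probability: P(X >= k) for X ~ Binomial(n, q)."""
    return sum(comb(n, j) * q**j * (1 - q) ** (n - j) for j in range(k, n + 1))

# Hypothetical chance that an honest test taker picks the old answer on one item.
Q_HONEST = 0.05

for n_items in (3, 6, 10, 12):
    # Chance that an honest test taker gives the old answer on every Trojan item.
    print(n_items, prob_at_least(n_items, n_items, Q_HONEST))
```

Under these assumptions the accidental-match probability shrinks geometrically as Trojan items are added, which is the intuition behind wanting more than a handful of them.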

We recently analyzed data where one individual was suspected of having prior access to the test content. Six miskeyed items were present on the exam and we found that the suspect gave the keyed answer on all six miskeyed items (i.e., the answer that the wrong answer key scored as correct). Using item response models, we analyzed the “score” for the miskeyed items. (We do not use standard regression techniques because the data are not normally distributed, being highly constrained and skewed.) These data are shown in Figure 1.

Figure 1: Analysis of 6 miskeyed items

We see two extreme data points in Figure 1, corresponding to the suspected exam and another exam (they had probabilities of one in 5,000 and one in 1,000, respectively). The expected score on the miskeyed items was approximately two. We note that there is no correlation between the raw score on the test and the score on the miskeyed items.

In the above example, analysis of miskeyed items detected a potential testing irregularity. When Trojan items are specifically designed as described above, we expect to see a strong negative relationship between the Trojan items and the total score. In other words, high scoring individuals will provide the correct answer and not the original answer. This negative relationship improves our ability to detect users of brain-dump content.

In addition to my own analyses, one of our clients has told me of great success in using these techniques. For obvious reasons, the client does not want brain-dump users to know which tests are treated with Trojan items and how their cheating is being detected. When cheaters realize they are being punished for using brain-dump content, they will quit using the content. Then we will be satisfied. We just want test takers to do their own work and demonstrate their own ability when they take tests.



Are identical answers to exam questions proof of cheating on tests?


Monday, February 18th, 2008

When it comes to supporting an allegation of cheating on tests, there is rarely better statistical evidence than having two (or more) tests with identical sets of responses, or identical answers. Having a great interest in this topic, I have read carefully the abstracts of Rice University Honor Council meetings where these types of allegations are taken very seriously. In several instances of alleged academic fraud, the Honor Council has found the evidence of identical solutions and identical answers to be compelling.

“The Rice Honor System was created by students in 1916. That it has functioned so well for so long is a reflection of the trust and respect that Rice students show to one another and to the University. It is one of Rice’s most highly valued traditions and a vital part of your education–education in responsibility and integrity.” http://honor.rice.edu/

In one instance, the Council minutes read:

Witness 1, the professor for the class, stated that he believed the similarities between the True / False answers and the essay answers given by Student A and Student B to be strikingly similar. He … presented a statistical analysis of the probability of this occurring in certain situations.

In the above case, despite having a probability analysis, the Honor Council did not find that the honor code had been violated (i.e., cheating was not found).

In another instance, the Honor Council had a different finding:

Some members felt that the identical answers on some portions of the exam were beyond coincidence or having similar notes or studying together. Members were suspicious of the fact that these similarities would arise after the students used different sources of information when answering the questions. … Some members were not convinced by the explanations …

Despite denials of cheating in the above situation, both students were found in violation of the honor code.

Here’s a Google search link if you wish to read some of these abstracts.

It is evident from these two abstracts that the Honor Council attempts to find plausible explanations for identical answers and excessive similarities between test questions. It is also evident that the Honor Council may act without having definitive proof. As an example of the degree of “proof” or evidence that may be required to take action in a case of suspected cheating, consider this statement from the University of Western Ontario:

It is particularly important to understand that the conclusion that a student committed a scholastic offence does not have to be supported by evidence beyond a reasonable doubt. In an exam writing situation, that means that a decision maker may conclude that cheating took place, even if it is possible that two people got some identical answers by chance.

The observation that two tests have identical answers is very reliable evidence as defined by the criteria I proposed in my most recent post, because the observation is (1) factual, (2) objective, (3) credible, and (4) defensible. We require that the evidence have one additional attribute before believing that cheating probably occurred. The evidence must be strong.

In order to evaluate the strength of evidence of identical answers on tests, we require the probability of the observed responses. At Caveon, the probability for the observed item responses is estimated using item response theory. We compute this probability by multiplying all the probabilities together of the selected responses (we assume the selected responses are conditionally independent) and then normalizing the product by the marginal probability of the observed score. Formulas for computing exact probabilities are difficult to derive and program, which means that most practitioners who encounter these situations will rely upon judgment and intuition in the same way the Rice Honor Council does.
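The calculation described above can be sketched in a few lines. This is a simplified illustration, not Caveon's implementation: it assumes a single fixed ability level, so each item has known probabilities for every answer choice, and it normalizes by the probability of the observed number-correct score computed with a standard recursion. The option probabilities, the key, and the function names are all hypothetical.

```python
from math import prod

def score_distribution(p_correct):
    """P(number correct = k) for independent items, built one item at a time."""
    dist = [1.0]
    for p in p_correct:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1 - p)   # this item answered incorrectly
            nxt[k + 1] += mass * p     # this item answered correctly
        dist = nxt
    return dist

def pattern_prob_given_score(option_probs, key, responses):
    """P(exact response pattern | observed number-correct score)."""
    # Product of the probabilities of the selected responses (conditional independence).
    joint = prod(probs[r] for probs, r in zip(option_probs, responses))
    score = sum(r == k for r, k in zip(responses, key))
    p_correct = [probs[k] for probs, k in zip(option_probs, key)]
    # Normalize by the probability of obtaining that score at all.
    return joint / score_distribution(p_correct)[score]

# Hypothetical three-item test with four answer choices per item.
option_probs = [
    {"A": 0.70, "B": 0.10, "C": 0.10, "D": 0.10},
    {"A": 0.20, "B": 0.60, "C": 0.15, "D": 0.05},
    {"A": 0.05, "B": 0.05, "C": 0.80, "D": 0.10},
]
key = ["A", "B", "C"]

print(pattern_prob_given_score(option_probs, key, ["A", "B", "C"]))  # all correct
print(pattern_prob_given_score(option_probs, key, ["A", "B", "D"]))  # one miss
```

When every item is answered correctly there is only one pattern with that score, so the conditional probability is one.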

I have pasted in a table of sampled probabilities for an 18 item test, below. The probabilities are calculated knowing the score that was obtained on the test. So, if we know a person answered all 18 items correctly the probability that another person who answered all 18 items correctly would match is equal to one. If the answer was correct, it is highlighted in gold in the table.

Probabilities of identical tests

Even though I routinely evaluate these types of probabilities, I have been surprised by some instances of identical response data. For example, the probability of an identical test when all items are answered correctly is 1 (as in the first row of the table). But, the probability of an identical test when all but one or two questions are answered correctly may be as high as .10 or .25 (see the second and fourth rows of the table). On the other hand, if several questions are answered incorrectly, the probability of an identical test may be 1 in 100 million or even smaller. The wide variation in these probabilities is a function of the number of correctly answered test questions and the selected responses.

If the probabilities of some test response patterns are sufficiently high (because the tests are easy or the examinees are very proficient) and if we have a large enough group, we might expect to see many identical tests. Probability computations for the number of observed identical tests can be very difficult. This is an instance of the “birthday problem” with unequal probabilities.
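Although the full distribution of the number of identical tests is hard to compute, one quantity is easy: the expected number of identical pairs, which follows from linearity of expectation. The sketch below assumes test takers respond independently and all draw from the same distribution over response patterns; `pattern_probs` is a hypothetical list of those pattern probabilities.

```python
from math import comb

def expected_identical_pairs(pattern_probs, group_size):
    """Expected number of pairs of test takers with identical response patterns."""
    # Probability that two independent test takers produce the same pattern.
    p_match = sum(p * p for p in pattern_probs)
    # Every one of the C(n, 2) pairs matches with that probability.
    return comb(group_size, 2) * p_match

# Classic birthday problem as a sanity check: 365 equally likely "patterns", 23 people.
print(expected_identical_pairs([1 / 365] * 365, 23))

# Unequal probabilities: one very common pattern makes matches far more likely.
print(expected_identical_pairs([0.5] + [0.5 / 1000] * 1000, 23))
```

A single high-probability pattern (for example, a perfect score on an easy test) dominates the sum of squares, so many identical pairs can appear in a large group without any cheating.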

At the beginning of this discussion, it appeared that we had a relatively straightforward and simple problem. It often occurs with statistics that many apparently simple problems become very complex, very quickly. The analysis of identical answers for two exams is one of those problems. The answer to the question with which we began the discussion must be: We cannot prove that cheating occurred when we have identical answers for two test instances, but in many situations we can obtain very strong, reliable evidence leading us to conclude that cheating occurred and the conclusion would be right, nearly always.



Can you prove cheating on tests using statistics?


Monday, February 11th, 2008

There is a children’s game known by various names, such as “Whisper,” “Secrets,” or “Gossip,” where a secret is shared and passed from one player to the next. The last player hearing the secret says it aloud, often with hilarious results. These same distortions happen in the news media, as journalists cite other reports or each other. Such a misquote from the Star-Telegram concerning additional security announced by the TEA (Texas Education Agency) for the TAKS (Texas Assessment of Knowledge and Skills) caused me to pause and reflect about using statistical evidence to “prove” that someone cheated on a test.

The reporter wrote, “Among other security measures: … Scramble field test questions on tests to provide proof if someone is copying someone else’s answer sheet.” (Italics added.) http://www.star-telegram.com/news/story/433614.html. Being well aware of the controversy surrounding the use of statistics, alone, to prove cheating, I immediately doubted the accuracy of the above statement. Actually, on June 7, 2007, Shirley Neeley announced that “the Texas Education Agency today will immediately initiate the following: … analyze scrambled blocks of test questions to detect answer copying…” TEA later clarified that the scrambling would only involve field test items. The Dallas Morning News was quick to criticize the scrambling plan, but I applauded TEA’s intent to detect cheating behavior using statistics.

We naturally ask whether statistical evidence can be relied on to detect cheating. Many authors have expressed the opinion that statistical evidence must be corroborated by eye-witness accounts before making allegations of cheating. I can understand this position if the statistics are not reliable. In my opinion, reliable evidence must meet the following conditions:

  1. It must be factual,
  2. It must be objective,
  3. It must be credible, and
  4. It must be defensible.

If statistical evidence meets the above conditions, I believe that it can be relied upon, whether corroborating eye-witness accounts are available or not. Statistical evidence is

  1. factual when it is based on test result data (an actual record of the test event),
  2. objective when it provides a statistic with a probability statement,
  3. credible when the statistics have been shown to work because the models accurately depict actual test taking, and
  4. defensible when the underlying science withstands scrutiny.

An additional fifth criterion the evidence must meet for taking action on a suspected instance of cheating is that the evidence must be strong. Statistical evidence is strong when the calculated probabilities are so small that we no longer believe the observed data are the result of normal test taking. Statistics can provide guidance for determining how strong is strong enough to take action, but ultimately the establishment of a probability threshold (i.e., the strength of the statistic) is a matter of policy that must be answered by the testing program administrator.

It is important with any statistical investigation to choose statistics that are well-suited and designed for the task at hand. For example, if the concern is that answer sheets are being modified, then erasure counts should be analyzed. Having analyzed over one hundred data sets for a wide variety of clients including state Departments of Education, admissions tests, certification programs, and licensure exams, I can unequivocally state that answer copying is the predominant means of cheating on tests. Therefore, it is especially relevant in this discussion concerning the reliability of statistical evidence to discuss answer copying and statistics that are designed to detect answer copying.

As you reflect upon the principles that I have outlined, I would ask you to consider the data in Table 1. The table contains differing probability values that a testing program administrator might be asked to evaluate. These are sampled answer-copying statistics (i.e., counts of identical answers) from a test having 240 items. With this many items on the test, the central limit theorem will generally apply so I have included a Z-Score in the table, as a point of reference.

Table 1: Sampling of test similarity statistics

Number of identical answers   Expected number of identical answers   Standard Deviation   Z-Score   Probability Index
168                           81.3                                   7.2                  12.0      30.3
171                           102.3                                  7.4                  9.3       19.9
130                           76.4                                   7.1                  7.5       12.4
154                           107.7                                  7.4                  6.3       9.5
128                           87.9                                   7.3                  5.5       7.3
108                           74.3                                   7.1                  4.7       5.5
107                           75.1                                   7.1                  4.5       5.0
120                           89.4                                   7.3                  4.2       4.6
115                           86.1                                   7.3                  4.0       4.2
128                           103.9                                  7.4                  3.3       3.1

At Caveon we deal with extremely small probability values, so we typically express those using “an index” where the probability is one in 10 to the power of the index (p = 10^-index). The most extreme case in Table 1 has a probability of one in 10 to the thirtieth power. These data are definitely not due to normal test taking.
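The relationship between the Z-Score and the index can be sketched as follows. The Z-Score is the observed count minus the expected count, divided by the standard deviation, and the index is the negative base-10 logarithm of the probability. The normal upper-tail function is included only as a point of reference; it is an assumption for illustration and does not exactly reproduce the index values in Table 1, which were presumably computed by a different method. All function names are hypothetical.

```python
from math import erfc, log10, sqrt

def z_score(observed, expected, sd):
    """Standardized distance of the observed count from its expectation."""
    return (observed - expected) / sd

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable (reference approximation only)."""
    return 0.5 * erfc(z / sqrt(2))

def probability_index(p):
    """Index such that p = 10 ** (-index)."""
    return -log10(p)

def index_to_probability(index):
    """Inverse of probability_index."""
    return 10.0 ** (-index)

# First row of Table 1: 168 identical answers, 81.3 expected, standard deviation 7.2.
z = z_score(168, 81.3, 7.2)
print(round(z, 1))
print(probability_index(normal_upper_tail(z)))  # normal-approximation index, for reference
print(index_to_probability(30.3))               # probability implied by the table's index
```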

Assuming that you accept the statistical evidence as being reliable, the decision needed by you, the testing program administrator, is how low in Table 1 should you go? Where do you set the cut point? These data illustrate that if you set the cut point too low, you might accuse some individuals of answer copying without having strong evidence. If you set the cut point too high, you might allow several individuals who have cheated to escape discipline.

I will elaborate more on this topic, next time. Until then, may your tests remain secure.



Trouble in Section K


Thursday, February 7th, 2008

Elf mistress Heloise entered Elvin’s office (Head of Section K) quickly. “For the eighth week in a row, the reject rate from Section K is three times the rate from the previous twelve months,” she said, handing the weekly quality report to Elvin. She continued, “I was so impressed when your section scored higher on the elf proficiency exam than any other section in the Mechanical Doll Department nine weeks ago that I awarded your elves with assemblage of gears and levers, but this is unacceptable.” Heloise crossed her arms and waited for a reply.

Elvin wrinkled his brow and frowned ruefully. This was unwelcome, but not unexpected, news. He picked up a thick folder and opened it. He leafed through one report after another and muttered, “We have eliminated transportation, storage, tools, assembly, parts, fatigue, and sabotage as explanations. There’s only one conclusion. At least one, and maybe several, of the elves in Section K is incompetent. But how can that be? Is the proficiency exam flawed?”

“Let’s find out,” replied Heloise. And together, they visited the proficiency exam designer. After explaining the problem, the proficiency exam designer shook her head and said, “You need to see the data forensics analyst.” The data forensics analyst listened with deep concentration, scanned page after page of test results, whistled softly, and finally exclaimed, “It looks like elves in Section K have cheated on the elf proficiency exam. Now, how to prove it?” he said mysteriously, and then immersed himself in complex symbols and calculations. Heloise and Elvin excused themselves, but the data forensics analyst didn’t even turn his head as they left. Much later, the proficiency exam designer listened intently while the data forensics analyst described his plan for catching the cheaters in Section K.

Three weeks later, the schedule for the quarterly elf proficiency exam was posted throughout the Mechanical Doll Department. On the day of the test, elf examiners throughout Santa’s workshop reported to a different department than usual to conduct the examination. For example, elf examiners from Remote-Controlled Toys reported to the Games and Puzzles Department. It so happened that an elf examiner from each of the other departments reported to the Mechanical Doll Department. Some administered the elf proficiency exam, and others just watched and waited. All test responses were recorded meticulously. After a long and grueling day, all the elves had been tested.

The data forensics analyst worked all night, making calculations and graphs and charts. At the break of day, Heloise and Elvin knocked at his door. “Enter!” they heard. They stepped into a bizarre scene: scraps of paper were strewn about, charts with bars and circles were plastered on the walls, and a wizened elf was humming in the midst of chaos. “Done!” he shouted. “Oh, it’s you. Well, I have the answer,” he said with absent-minded aplomb.

Then noticing their impatient expressions, he said, “Oh, let me explain.”

“None of the examiners are involved. I know this because there are no patterns of inconsistent answering associated with the examiners. It was important that no examiner give the test to any elf with whom he or she normally associates.

“There were extremely similar test answers between four elves in Section K. It is almost certain that they did not take the tests independently,” the data forensics analyst concluded.

“But, how can that be?” queried Heloise. “They were all watched carefully. There was no way that they could have shared answers or communicated during the test!”

The data forensics analyst minutely explained, “I suspected this might be the case. So, I asked the proficiency exam designer to create two test forms. She very carefully changed a few of the questions between the first and second test forms, so that the correct answers would be close, but not the same. The master test booklet for the first form was locked away in test booklet storage. The proficiency exam designer kept the master test booklet for the second form with her at all times. Even though the elves in the Mechanical Doll Department were given the second form of the test, our four culprits answered all the changed questions with answers from the first form of the test. There is no doubt in my mind. They broke into test booklet storage and memorized the test answers!”

Elvin brought the four suspected cheaters into Heloise’s office. Each elf vigorously denied any wrongdoing. At that point, the data forensics analyst dimmed the lights. He splayed an infrared beam across the hands of each suspected cheater. All of their hands glowed eerily with a blotchy red hue. Then, using gloves to handle the master test booklet from storage, he shone the beam on the pages. They glowed red. He touched the booklet pages against his bare arm. When he shone the beam on his arm, it also glowed with a blotchy red hue. Heloise barked, “You are red-handed! Now stand still while I consider your punishment!”

“Tomorrow,” pronounced Heloise. “You will report to the master of the Quality Department for ‘R and R,’ where you will begin the repair and refurbishment of all toys in the Rejected Toy Warehouse. You will work there until all the broken toys are operating perfectly and to the satisfaction of the master of quality.”

“Elvin,” Heloise continued. “Section K can no longer be responsible for assemblage of gears and levers. Your section must repair its damaged reputation from producing so many rejected mechanical dolls. Even though you will not receive replacements for these culprits, your production quota will remain the same.”

Elvin wrinkled his brow and frowned ruefully. This was unwelcome, but not unexpected, news. He remembered another time, when he was an impetuous, lazy elf; and when he had cheated. The punishment seemed harsh, but he had learned his lesson and was glad that the cheaters had been apprehended.

Moral: Just as dishonesty betrays the cheater, it injures all who are around him.

Addendum: The cheating detection and prevention techniques described in this story are among best practices. I have described use of the data forensics methodologies in two actual cases we have analyzed at Caveon: The case of the waylaid answer key and The case of the befuddled answer copier.

The State of Mississippi has put together a very nice PowerPoint presentation on test administration auditing and monitoring: www.mde.k12.ms.us/ACAD/osa/DTC_Test_Security_Fall_07.pps

If you are interested in learning more about these or other solutions to test fraud please contact us, at Caveon Test Security.



Moore’s law favors the cheater


Monday, January 21st, 2008

In 1965, Gordon Moore (who later co-founded Intel) observed that transistor densities were doubling roughly every year; in 1975 he revised the rate to roughly every 2 years. Since then the exponential nature of faster, smaller and more powerful computational units has continued. Initially, the observation was a remarkable statement of trends. Later, it became an expectation. And, it is now considered an unrelenting challenge for high technology. http://en.wikipedia.org/wiki/Moore’s_law

The trend of faster, smaller and more powerful electronic devices has spilled over from computers into all forms and types of electronics. Notably, consumer electronics commonly used by cheaters on tests are no exception. While Internet-capable PDAs have been available for some time, it was in 2007 that Apple introduced the iPhone, a cellular phone integrated with a browser and digital camera. It would be surprising if iPhones and text-messaging are not replaced with even more sophisticated cheating technology within the next few years. Those who administer tests must anticipate the appearance of these newer, faster, and more easily concealed cheating devices.

Small, fast devices appeal to two broad classes of consumers: (1) persons who want mobile and wearable electronic devices, and (2) persons who have a need for spy gadgetry. Wearable computing (http://www.media.mit.edu/wearables/) trends are very interesting, including smaller keyboards (http://www.frogpad.com/), head-mounted displays (http://en.wikipedia.org/wiki/Head-mounted_display), USB watches (http://www.amazon.com/Timex-Data-Link-Watch-T5C291/dp/B000B545B4), and PDAs and ultra-small computers (examples are: Nokia’s Internet Tablet http://reviews.cnet.com/pdas/nokia-n800-internet-tablet/4505-3127_7-32309517.html and OQO’s Model 02 http://en.wikipedia.org/wiki/OQO).

Spy gadget shops sell tiny pin-hole cameras, but our research at Caveon indicates that the tiny digital cameras have insufficient resolution to capture high quality images of test questions. (See this review of the Casio WQV-1CR Wristwatch camera http://reviews.cnet.com/watches-and-wrist-devices/casio-wqv-1cr-wristwatch/4505-3512_7-2660570.html.) While we found that the pin-hole spy cameras did not have sufficient resolution to steal a high-quality image of a test, we did confirm that the hand-held scanner DocuPen (http://planon.com/) could be used very easily to steal a paper-and-pencil test. There is a clear trend for higher resolution digital cameras in smaller packages, such as the BenQ 8 megapixel camera which is 4 inches by 2.5 inches by one-half inch thick (http://blogs.zdnet.com/digitalcameras/?p=151). We expect to see eight megapixel cameras in cell phones before long due to Samsung’s announcement of a CMOS package for cell phones (http://blogs.zdnet.com/ip-telephony/?p=2737).

In 2007, we saw the introduction of ExamEar, an earpiece with a radio that was specifically marketed to cheaters on tests. This caused a lot of concern in Great Britain (http://news.bbc.co.uk/1/hi/education/6951524.stm, see also http://www.engadget.com/2007/08/20/examear-helping-students-make-the-best-of-exam-day/) and the website owners decided to cease operations. The ExamEar domain is now for sale. But, it would be very surprising if this technology does not resurface. In fact, two Chinese students were recently caught cheating on a test when they couldn’t remove their earpieces and needed medical attention (http://www.chinadaily.com.cn/china/2007-12/31/content_6361740.htm). We don’t know where they obtained these earphones, but they may have been ExamEar models.

Cheaters are usually engaged in one of four behaviors which may be bolstered by technology. These are:

  1. Communicate with or copy from another (requires a miniature radio, cell phone, or other signaling device),
  2. Smuggle test taking aids into the testing event (requires a miniature high-capacity data retrieval device with visual display, such as a PDA, iPod, or DataLink wristwatch),
  3. Steal a copy of the test content (requires a miniature camera), and
  4. Engage in impersonation (requires an ability to tamper with or defeat identification safeguards).

Many of the current devices used by cheaters (e.g., cell phones, DocuPens, and PDAs) can be easily slipped past most test administrators, because they are so small. One of the gadgets shown at the 2008 CES (Consumer Electronics Show) which may cause concern for test administrators is the Bug Labs do-it-yourself modular electronics kit (http://gizmodo.com/346789/bug-labs-store-launches-monday-minus-wi+fi). It seems that the device will not include Wi-Fi initially, but it has support for a wide range of other functions, including cameras and cell phones.

Another recent innovation is the Bionic Eye (http://www.msnbc.msn.com/id/22731631/). This is a contact lens that features LCD circuitry which allows projection of an image into the wearer’s field of view. Researchers at the University of Washington have tested it successfully on rabbits. These researchers are the same people who developed the virtual retinal display (http://en.wikipedia.org/wiki/Virtual_retinal_display). It will be some time before these contact lenses are used by people, but the technology is fascinating.

Another interesting product introduced in 2007 was the FlyPen, a pen-top computer. The company’s marketing literature states, “Meet the FLY Fusion Pentop Computer, the only pentop platform to offer a complete set of high-speed homework solutions and innovative note-taking applications for students of all ages. This next-generation FLY™ system harnesses the same sophisticated Anoto technology as its predecessor, enhanced by PC connectivity, four times the memory, on-the-go calculating functionality, and a 1,000-word Spanish dictionary. Best of all, students can upload handwritten notes and drafts, digitizing them instantly into Microsoft Word documents or emails.” (See http://www.flyworld.com/presskit.pdf.) It will be interesting to see if students use this device for stealing test content.

Because consumer electronics are changing and adapting so quickly, it is very important that testing program administrators review current policies, procedures, and practices to ensure that these devices are not used by cheaters to gain an unfair advantage.



Improving your odds at winning the lottery


Friday, December 28th, 2007

Beginning New Year’s Day 2008, lottery ticket retailers in Ontario will have a new set of rules to follow if they want to continue selling lottery tickets. “Most of the changes are the result of Ontario ombudsman Andre Marin and his scathing investigation of the province’s lottery corporation.”

http://canadianpress.google.com/article/ALeqM5jEvfDbJoJ7C3KoaNxekmT8DuUDNA

The previous set of rules allowed lottery ticket retailers to steal lottery winnings from those to whom they sold the tickets. An example of the scam is described in this story where after three years, bilked lottery ticket purchasers were finally awarded their prize.

http://www.ctv.ca/servlet/ArticleNews/story/CTVNews/20071219/opp_lottery_071219/20071219?hub=CTVNewsAt11

In the above situation, the retailer apparently exchanged a non-winning ticket for the winning ticket when the purchasers presented the ticket to claim their prize. The problem is that the retailer is in a position to game the system because the retailer performs two functions: selling the tickets and verifying the tickets. A clever and practiced cheater can manipulate such a situation.

This “man-in-the-middle” attack illustrates an obvious weakness in most paper-and-pencil testing scenarios. An answer sheet may be misdirected or even falsified by an adult who is acting in a trusted test administration position.

For example, it is common practice in elementary schools for teachers to review the students’ answer sheets and make sure that the marked answers are dark, legible, and between the lines on the scan sheet. This practice allows a teacher to not only “clean up stray marks” but also to tamper with the answer sheet. An example of the procedure is described in this document from Dallas Independent School District: http://www.window.state.tx.us/tspr/dallas/ch02h.htm

Another example is more blatant. A teacher could very easily fill out blank answer sheets for students and then replace the students’ answer sheets with the prepared answer sheets. Erasure or light marks analyses are routinely performed on answer sheets that are scored, but it is unlikely that “fouled” answer sheets (which would also be returned) are subjected to the same analysis.

As a variation of the above exploit, it is well-known that a certification exam can be manipulated by a proxy test taker in a similar manner. The test taker and the proxy test taker both appear at the test site. They have both registered to take the test, and both will take the test. They switch names on the answer sheets (e.g., the proxy test taker puts the name of his or her employer on the answer sheet). If the answer sheets are controlled by document identifiers, the two can breach the security by exchanging answer sheets if they are together when they receive their test materials.

The above vulnerabilities (and others that use the same theme) may be addressed with revised procedures, just as procedures are being revised for the Ontario lottery. For example, instead of stray marks being cleaned up at the school they may be cleaned up at the processing center (where those reviewing the answer sheets do not have a motive for tampering). All returned answer sheets could be scanned, allowing for any fouled answer sheets to be detected. If the answer sheets have document control numbers provided using a readable encoding (such as a bar code), then every control number should be accounted for and none should be duplicated (this prevents unauthorized destruction of fouled answer sheets).
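The control-number check described above can be sketched in a few lines of Python. This is only an illustration: the function name `reconcile_control_numbers` and the sample control numbers are hypothetical, and a real processing center would read the numbers from its scanning system.

```python
from collections import Counter


def reconcile_control_numbers(issued, returned):
    """Compare issued control numbers with those scanned on return.

    Returns control numbers that never came back, that came back more
    than once, and that were scanned but never issued.
    """
    issued_set = set(issued)
    counts = Counter(returned)
    return {
        "missing": sorted(issued_set - set(counts)),
        "duplicated": sorted(n for n, c in counts.items() if c > 1),
        "unknown": sorted(set(counts) - issued_set),
    }


# Hypothetical data: four sheets issued, four scanned on return.
issued = ["A001", "A002", "A003", "A004"]
returned = ["A001", "A003", "A003", "A999"]
report = reconcile_control_numbers(issued, returned)
print(report)
```

Any non-empty list in the report is a prompt for follow-up, not proof of tampering; a sheet can be lost or mis-scanned innocently.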

To prevent document exchange (such as in the above scenario with the proxy test taker), a digital scan of the test taker signature on the answer sheet may be preserved. This allows for verification of the signature on the answer sheet with the signature on the application. Another way to prevent document exchange between two test takers is to distribute test taking materials to candidates after all are seated, and to collect testing materials from candidates before any leave their seats at the end of the testing session.

While preventative measures are usually the best, analysis of the data may detect these types of attacks. For example, analysis of lottery wins by retailers should have detected there was a problem long before the complaints started to pile up. In the same way, it is very difficult for a person who is tampering with the test results to conceal the effect of their work.

In summary, every aspect of a test administration system and procedure should be carefully reviewed under the assumption that some individual will attempt to exploit that system, and then reasonable security measures should be taken.



Can unproctored online assessments be trusted?


Wednesday, December 19th, 2007

As more and more online courses are developed and offered, instructors of online courses need to consider the potential for cheating on the assessments. The following article describes some measures being implemented by FGCU (Florida Gulf Coast University):

http://www.nbc-2.com/articles/readarticle.asp?articleid=16460&z=3&p=

One of the measures is to track IP addresses and determine if more than one test is being submitted from the same computer. Other measures include randomization of answer choices and random selection of items from an item bank. The software also prevents the test questions from being printed. Kathleen Davey, Dean of Academic Technology, said, “You can’t prevent everything from happening. You must rely on the integrity of the individual students up to a certain point.”
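As a rough illustration of the IP-address measure (not FGCU's actual software, whose details the article does not give), the sketch below groups submissions by IP address and reports any address used by more than one student. The function name, the student identifiers, and the addresses are all hypothetical.

```python
from collections import defaultdict


def shared_ip_groups(submissions):
    """Return {ip: sorted student ids} for IPs used by two or more students.

    `submissions` is an iterable of (student_id, ip_address) pairs.
    """
    by_ip = defaultdict(set)
    for student_id, ip in submissions:
        by_ip[ip].add(student_id)
    return {ip: sorted(ids) for ip, ids in by_ip.items() if len(ids) > 1}


# Hypothetical submissions using documentation-only IP ranges.
submissions = [
    ("s1", "192.0.2.10"),
    ("s2", "198.51.100.7"),
    ("s3", "192.0.2.10"),
    ("s1", "192.0.2.10"),  # a repeat by the same student is not flagged
]
flagged = shared_ip_groups(submissions)
print(flagged)
```

A shared address is only a lead. Students in the same computer lab or household can legitimately appear behind one IP address, so a flag like this calls for review rather than a conclusion.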

Ultimately, the above statement is true. If a test taker is sufficiently determined he or she will be able to successfully cheat on the test or steal the test content.

I have been very interested lately in the security of online assessments. They are becoming more prevalent and indications are that they will become a dominant technology in testing if security concerns can be adequately addressed. The problem is that most online assessments are essentially unproctored assessments. Until unproctored Internet tests can be delivered securely, they should not be used for high-stakes exams. By definition, an exam has high stakes if passing or failing the exam has significant life consequences for the test taker. Usually this means getting a job, getting licensed in a profession, getting admitted to a school, getting a diploma, etc.

Recently, the Boston Globe released an investigative report concerning Army Correspondence Courses. Yesterday, Senator Edward M. Kennedy, Chairman of the Armed Services Committee, reacted strongly to the report, writing, “I was shocked to read of one website that provides answer keys and boasts that ‘[w]ith cheap prices and fast service, you can be wearing that E-5 [sergeant] rank before you know it.’”

http://www.boston.com/news/nation/washington/articles/2007/12/19/kennedy_urges_army_to_deter_cheating_on_promotional_exams/

The essential problem is that the assessments being used for the correspondence courses are unproctored Internet tests.

I remember taking unproctored tests as a student at the university. We called them “take home” tests. Our take-home tests had implicit security built into them:

  1. They were really hard. You couldn’t just find the answer to the questions in the university library.
  2. You might find someone to take the test for you or help you out, but eventually you would take a few in-class tests (where you couldn’t use your friend).
  3. The tests were written in your own handwriting, which was easily compared with prior copies of your handwritten assignments.

Later, when I was an instructor at the university, we added another twist to take-home tests: Every student got the same problems but with different data and different answers.

The above simple principles highlight the issues that must be addressed to administer a test securely online in an unproctored setting:

  1. Biometrics should be used to authenticate test taker identity.
  2. The questions must not be answerable using simple “Google” searches.
  3. A verification process needs to be in place that allows the unproctored test result to be trusted.
  4. Other security measures may assist with authenticating that the test taker actually did his or her own work.
  5. Algorithms that produce item clones or variants can reduce the ability of test takers to share test content or profit from another’s answers.
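The “same problem, different data” idea behind item clones can be sketched as follows. This is a minimal illustration under stated assumptions, not any vendor's algorithm: the item template (a simple-interest question), the function name `make_variant`, and the identifiers are hypothetical. Seeding the generator with the exam and student identifiers means each student's variant can be regenerated later for scoring.

```python
import random


def make_variant(student_id, exam_id="exam-1"):
    """Build one student's variant of a simple-interest item.

    The random generator is seeded from the exam and student ids, so the
    same student always receives the same numbers and answer.
    """
    rng = random.Random(f"{exam_id}:{student_id}")
    principal = rng.randrange(1000, 10000, 500)  # dollars, multiples of 500
    rate_pct = rng.choice([2, 3, 4, 5, 6])       # annual rate in percent
    years = rng.randint(2, 9)
    question = (
        f"A deposit of ${principal:,} earns {rate_pct}% simple interest "
        f"per year. How much interest is earned after {years} years?"
    )
    # Principal is a multiple of 500, so this division is always exact.
    answer = principal * rate_pct * years // 100
    return question, answer


question, answer = make_variant("alice")
print(question)
print(answer)
```

Because every student sees different numbers, a copied answer is usually wrong for the copier, and a copied answer that matches someone else's variant is itself evidence of sharing.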

I remember the day that I took my oral exams. There was no faking. There was no cheating. I was in a room, face-to-face, with three professors. Each of them had taught me in at least one course. Of course, it is not realistic to do this for every single individual being certified in a profession or being admitted into the university. But, it demonstrates the importance of having several observations which together confirm that the candidate does indeed possess the requisite competence.

There has been interesting progress in the area of secure administrations of unproctored Internet tests. I will mention just a few items that I can recall readily:

  1. Kryterion (www.kryteriononline.com) is using data forensics and biometrics to establish that a test is being taken properly.
  2. SHL (www.shlgroup.com) is using an initial unproctored test followed by a verification test in a proctored setting to ensure that the test results can be trusted.
  3. An instructor named Simon at the School of DCIT, University of Newcastle, used an innovative detection system with online unproctored tests that relied on font colors in Word documents to detect cheaters: http://crpit.com/confpapers/CRPITV42Simon.pdf

At this URL: http://www.westga.edu/~distance/ojdla/summer72/rowe72.html you will find a paper that is very interesting in this context.

Two things are clear: (1) online assessment is here to stay, and (2) ubiquitous security solutions are needed if online assessments are to be trusted.



Student outwits FCAT with secret pattern


Friday, December 14th, 2007

A senior from Manatee High School passed the FCAT (Florida Comprehensive Assessment Test) in ten minutes by using a “secret pattern” after flunking the test three times. His score was invalidated, but apparently not because he used a pattern. Carla Frazier told the news, “FCAT rules do not prohibit students completing the test using any patterns, nor does the test have a minimum time requirement.”

http://www.bradenton.com/local/story/242473.html

We don’t know why the principal invalidated the score. We don’t know what “secret pattern” was used by the student. But, I have an idea what it might have been: “a-n-s-w-e-r-k-e-y.” Ok, I admit to being a cynic and a skeptic at times. This is one of those times.

Consider the facts, and then decide for yourself if you believe the student’s story.

  1. Test publishers are very careful to make answer keys as unpredictable as possible. They are well aware of the guesser’s adage, “If you don’t know, choose ‘C’.”
  2. Item writers and item reviewers are careful in writing distractors and answer choices to prevent guessers from gaming the test and gaining an advantage. They know that guessers will attempt to deduce the correct answer by analyzing the answer choice lengths and details.
  3. Having analyzed a lot of high school exit exam data, I know that pass rates go down with every make-up test. Students who fail three times are very lazy, easily confused, or just not proficient. Passing the test in ten minutes is not consistent with any of these.
  4. Cheaters are often very creative liars and they prey on our gullibility. The news reporter was gullible in writing the story and, for some reason, expects us to be equally gullible.

There are a lot of ways to detect cheating. In this particular case we might have seen any of the following:

  1. An extremely high score after having flunked three times previously would be a clear warning sign to the principal.
  2. The FCAT, according to the district FCAT coordinator, often contains pilot questions. If the student did very well on all the questions, except the pilot questions, and the answers to those questions matched the answer key from a different form of the test, then the principal would definitely have a “smoking gun.”
  3. Sometimes the answer sheet can be modified after the fact. With the right inducement, an insider may be persuaded to change the answers. Erasure analysis would detect this kind of tampering. Perhaps the principal was suspicious and saw a lot of erasures on the answer sheet.
  4. It is often the case that the cheaters boast of their exploits and in this case the principal may have gotten wind of the boasting.

Being a student of statistics, I imagine that the student could have finally gotten lucky and passed the test. Distribution theory states that the maximum observed value in a distribution has a much higher mean than the distribution from which the value was drawn. In this case, we have repeated scores on the FCAT for the student. Just by chance alone, if the student’s expected score is reasonably close to passing, after repeatedly taking the test a passing score will be observed eventually.
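The “lucky eventually” argument can be made concrete with a short calculation. The sketch assumes that attempts are independent and that the student has the same chance of passing each time; the 20% figure is purely hypothetical and is not taken from FCAT data.

```python
def prob_pass_at_least_once(p_single, attempts):
    """Chance of at least one pass in `attempts` independent tries,
    each with pass probability `p_single`."""
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical student with a 20% chance of passing on any one attempt.
for n in range(1, 5):
    print(n, round(prob_pass_at_least_once(0.20, n), 4))
```

Under these assumptions a student with a one-in-five chance per attempt has a better than even chance of passing at least once in four tries, which is why a single late pass is not by itself evidence of cheating.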

But, suppose that in my skepticism I am correct. Suppose the student did have the answer key. How would the forensics analyst detect that an answer key had been stolen and used? I have seen three answer-key arbitrage techniques used for exam security purposes that could be used in similar situations.

  1. The FCAT coordinator disclosed that pilot questions are often used on the exam. Scoring the pilot questions with alternate keys could provide probability evidence that an answer key was in play.
  2. I know of a situation where items were intentionally miskeyed and left unscored with the goal of determining whether the answer key had been stolen and used.
  3. In another situation, the exam contained a few poorly written questions where the provided answer was ambiguous (This often happens on exams). These questions were exploited in a similar manner to compute probability evidence that an answer key was stolen and used.
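One way to turn techniques like these into “probability evidence” is a binomial tail calculation: if an honest test taker would rarely choose the miskeyed (or alternate-key) answer, how likely is it that someone chooses it on most of the trap items by chance? The sketch below is an assumption about how such evidence might be computed, not a description of any particular publisher's method. The number of trap items, the chance rate `q`, and the observed count are hypothetical.

```python
from math import comb


def binomial_tail(n, k, q):
    """P(X >= k) when X counts matches on n independent trap items,
    each matched by an honest test taker with probability q."""
    return sum(comb(n, i) * q**i * (1 - q) ** (n - i) for i in range(k, n + 1))


# Hypothetical case: 10 miskeyed items; an honest test taker picks the
# miskeyed option 5% of the time; this test taker picked it on 8 items.
p_value = binomial_tail(n=10, k=8, q=0.05)
print(f"{p_value:.3e}")
```

The smaller this probability, the harder it is to explain the response pattern without access to the stolen key. The independence assumption and the value of `q` would need to be checked against real response data.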

The test publisher has many tools and techniques that can be used to trap the unsuspecting cheater. Answer-key arbitrage is one of those.



Suing to prevent cheating vs. suing to allow cheating


Friday, November 23rd, 2007

A student at the Dayton School of Law is suing the school because it did not fix a glitch in the test administration software that allowed other students to upload pre-written exam answers during the exam. He feels disadvantaged because he had to type in his answers, while others uploaded their answers electronically.

http://www.daytondailynews.com/n/content/oh/story/news/local/2007/11/22/ddn112307lawschool.html

The student said, “I’m not upset with my actual grade, but that other students used the technique to get better grades and the law school didn’t try to prevent it.” (quote from article.)

This position seems rather tenuous to me. Without further information concerning the test, there is no reason to suspect that the other students would have received lower grades if they were required to type in their answers also. Cheating occurs when one or more individuals gain an unfair advantage. From the information presented, it is not obvious that these students have gained an unfair advantage in being awarded a higher grade.

(The company that produced the software used by the Dayton School of Law is ExamSoft: http://www.examsoft.com/)

The lawsuit does raise an interesting question. What conditions need to be present to hold an organization which administers tests accountable for preventing cheating?

On the other hand, last March several high school students filed suit against TurnItIn.com for adding their term papers to a massive anti-plagiarism database. Ostensibly, the students claimed copyright infringement. The link to the article is below.

http://www.washingtonpost.com/wp-dyn/content/article/2007/03/28/AR2007032802038.html

From the article we have the following quote: “Kevin Wade, that plaintiff’s father, said he thinks schools should focus on teaching students cheating is wrong. ‘You can’t take a person’s work and run it through a computer and make an honest person out of them,’ Wade said. ‘My son’s major objection is that he does not cheat, and this assumes he does. This case is not about money, and we don’t expect to get that.’”

Admittedly, this is old news. I couldn’t determine the current status of the lawsuit. When I first saw this story, I thought to myself that the use of metal detectors at airports doesn’t presume everyone is a terrorist and, similarly, that scanning term papers does not presume that all papers are the result of plagiarism. I was left wondering, “What rational motive do students have for preventing a term paper (which could be republished and redistributed despite having been added to TurnItIn’s database) from being detected as a potential source of plagiarism?” I thought that perhaps the students were planning careers as ghostwriters and wanted to amass their collection of papers, but the students say that they don’t cheat, so it couldn’t be that. (http://en.wikipedia.org/wiki/Essay_mill)

But, I find it interesting that one student sues to prevent cheating and others sue to allow cheating.

Other references on the TurnItIn lawsuit:

http://volokh.com/posts/1175270161.shtml

https://turnitin.com/static/pdf/us_Legal_Document.pdf



Benefits-payment cheater caught using statistics


Tuesday, November 20th, 2007

The other day a woman in the UK was caught in a lie where she fabricated the existence of seven children to receive government benefits. She claimed to have given birth to quadruplets in 2005, to twins (who were delivered one week apart) in the same year, and then to a seventh child in 2007. None of these children existed.

http://www.dailymail.co.uk/pages/live/articles/news/news.html?in_article_id=494261&in_page_id=1770

The article starts: “Any mother who has given birth to quadruplets needs all the help she can get. So benefits staff were happy to provide support for Victoria Young in raising babies Kier, Kie, Kyla and Conrad. There was just one problem – none of them existed. …”

The benefits staff got suspicious on the seventh child and investigated the crime. By that time, Victoria Young “had swindled more than £40,000 in benefits payments with her bogus brood of seven babies in the space of 18 months.” (direct quote)

It’s natural to ask how data forensics techniques could be applied to this situation. We start with models that describe the population. To test the above claims we need to know about multiple birth probabilities, fertility rates, and birth spacing statistics. I found the needed statistics at a government website: http://www.statistics.gov.uk/downloads/theme_population/FM1_32/FM1no32.pdf

In 2003, only one set of quadruplets survived birth, which puts the probability of live quadruplets at approximately 1 in 600,000 (Table 6.4 from the government report; see also Multiple Births in Wikipedia: http://en.wikipedia.org/wiki/Multiple_Births). From the table of statistics, the probability of twins is about 9,001 in 615,787, of triplets about 127 in 615,787, and of quadruplets about 3 in 615,787. If we use these values and assume that birth multiplicity is independent across maternities, then we can test Victoria Young’s claims with the conditional probabilities in Table 1 (computed using standard convolution equations).

Table 1: Conditional Probabilities of number of maternities given family size

Number of    Number of Children
Maternities  1            2            3            4            5            6            7
1            1.00000      0.014836942  0.000209     4.95E-06     *            *            *
2                         0.985163058  0.029234     0.000629     1.59E-05     1.88E-07     2.03976E-09
3                                      0.970557     0.043201     0.001251     3.57E-05     6.89055E-07
4                                                   0.956165     0.056747     0.002064     6.70445E-05
5                                                                0.941987     0.069882     0.003059676
6                                                                             0.928019     0.082614512
7                                                                                          0.914258077

The conditional probabilities are read down the columns. (Asterisks are used to indicate values that could not be estimated from the government statistics.) For example, the probability of three maternities given seven children is in row 3 and column 7 and is equal to 6.89055E-07. (This number is in scientific notation and indicates the value of 0.00000068905507, or one in 1,450,000.)
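For readers who want to check the convolution step, the sketch below should reproduce the Table 1 values from the multiple-birth counts quoted above. It rests on two assumptions that the text does not state explicitly: that every maternity other than twins, triplets, and quadruplets is a single birth, and that each column is normalized by giving equal weight to every possible number of maternities. The cells shown with asterisks come out as zero here, because the quoted counts include no births of five or more.

```python
# Maternity counts by multiplicity, as quoted in the text.
TOTAL = 615_787
COUNTS = {2: 9_001, 3: 127, 4: 3}
COUNTS[1] = TOTAL - sum(COUNTS.values())  # assumption: the rest are singles
P = {k: c / TOTAL for k, c in COUNTS.items()}

MAX_N = 7  # largest family size in the table


def children_given_maternities(max_n=MAX_N):
    """like[m][n] = P(n children | m maternities), by repeated convolution
    of the per-maternity distribution P with itself."""
    like = {0: {0: 1.0}}
    for m in range(1, max_n + 1):
        cur = {}
        for n_prev, pr in like[m - 1].items():
            for k, pk in P.items():
                n = n_prev + k
                if n <= max_n:
                    cur[n] = cur.get(n, 0.0) + pr * pk
        like[m] = cur
    return like


def maternities_given_children(max_n=MAX_N):
    """table[(m, n)] = P(m maternities | n children), with each column
    normalized over m = 1..max_n (assumed equal prior weight on m)."""
    like = children_given_maternities(max_n)
    table = {}
    for n in range(1, max_n + 1):
        column = {m: like[m].get(n, 0.0) for m in range(1, max_n + 1)}
        total = sum(column.values())
        for m, value in column.items():
            table[(m, n)] = value / total
    return table


table = maternities_given_children()
print(f"{table[(1, 2)]:.9f}")  # one maternity, two children
print(f"{table[(3, 7)]:.5e}")  # three maternities, seven children
```

Keys of `table` are (maternities, children) pairs, so `table[(3, 7)]` corresponds to row 3, column 7 of Table 1.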

We see that Victoria’s initial claim of quadruplets was very extreme (even though the data show that quadruplets are delivered in the UK), with a probability of about one in 200,000 (this is the sort of value that we typically find with extreme occurrences in Caveon Data Forensics). Her claim of six children with two maternities is even more extreme, with a probability of about one in 5.3 million (1.88E-07 in Table 1). And her final claim of seven children in three or fewer maternities has a probability of about one in 1.4 million.

The claimed birth spacing is very unusual also. Victoria claimed the twins were born eight months after the quadruplets in September 2005. Birth spacing statistics from the UK website only provide a median statistic of 37 months between the first and second maternity and 42 months between the second and third maternity (Table 11.3 from the UK government report). We don’t have a lot of statistical information but for the purposes of this exposition we assume the birth spacing data follow an exponential distribution (waiting time distribution; this assumption should be tested in practice). The median will be a good estimator for the mean. Using this estimate we find that the probability of having a second maternity within 8 months or less is about one chance in one trillion. We also find that the probability of having two maternities within 18 months or less (we need the distribution of the sum, so we add the medians together) is 1 in 10^25 (roughly one trillion squared).

We have found that it is always useful to combine the probability evidence together. After all, Victoria’s motive was to acquire a large family as quickly as possible so as to maximize benefits payments. Using techniques developed at Caveon, we evaluate her final claim of seven children with three or fewer maternities in 18 months or less. The estimated probability is one chance in 10^31 (roughly one in ten billion cubed). Yes, the benefits people were justified in being suspicious. If their systems had implemented these types of probability analysis for fraud detection, they might have been able to save the UK some embarrassment and expense by catching a cheater more quickly.

In data forensics work we proceed just as I have illustrated above. We create population models. We assume the data conform to the models (i.e., there is no cheating). We test the anomalous data against the model and eventually compute probabilities. It is nearly always the case that the data do not conform precisely to the model, but the models provide sufficient guidance that objective statements concerning the improbability of the extreme data may be made.


