Archive for the 'data forensics methods' Category


Trojan Items and Answer-key Arbitrage


Sunday, March 2nd, 2008

Today is the first day of the annual ATP Conference (Association of Test Publishers). This afternoon I will present a workshop titled, “Strategies and Tactics for Limiting Item Exposure.” We will be exploring innovative ideas for protecting tests and items from theft. It’s easy to understand why test publishers are concerned about test theft. High-quality items are expensive to produce and represent a substantial investment. Item development costs of $1,000 or higher per item are not unusual. In an afternoon, a thief can compromise an investment of $250,000 or more, easily. Most testing professionals will state that item theft is their number one security concern. I discussed this previously in: What is your top security concern?

I can’t share the entire workshop content with you in this short essay. But, I can share with you Gene Radwin’s (of EMC Corporation) intriguing idea of answer-key arbitrage and Trojan items. The idea was briefly mentioned in: Student outwits FCAT with secret pattern. Just as the Trojan horse was the Greeks’ surprise weapon for outwitting the people of Troy, we hope to outsmart users of brain-dump content using Trojan items.

The basic idea of the Trojan item, as developed and presented to me by Gene Radwin (email: radwin_gene at emc.com), is to place very easy items on the test which are miskeyed. If a test taker gives the miskeyed answers (and not the correct, easy answers) we have strong evidence that brain-dump content is being used. The fundamental principle is to create a test-within-a-test to detect test fraud. We booby-trap selected items by changing them so that a different answer choice is now correct, and the compromised answer is incorrect. Without knowing which items are booby-trapped, the brain-dump user proceeds in ignorance, until detected. Just to illustrate, consider a math item that I “borrowed” from the SAT practice test.

Table 1: Example of a Trojan item

We do not expect the brain-dump user who has memorized the “Exposed” item to notice the small change in the “Trojan” item. As a result, the cheater will give the originally correct, but now incorrect, answer “C,” and at the same time the honest test taker will give the correct answer “E.” The change in the answer key gives us a leverage or arbitrage point, creating a powerful difference in the statistical expectations.

In order to be effective, several Trojan items will be required on the exam. I haven’t done a rigorous analysis of the statistical power of the procedure, but my current intuition suggests that ten to twelve questions will be needed.
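To make the counting idea concrete, here is a minimal Python sketch. It is not a rigorous power analysis and it is not an item response model. It simply assumes, for illustration only, that an honest test taker picks the exposed (now incorrect) answer on each Trojan item independently with some small probability `q`. The function names, the sample answer strings, and the value of `q` are all hypothetical.

```python
from math import comb

def count_exposed_answers(responses, exposed_key):
    """Count Trojan items answered with the old, exposed (now incorrect) key."""
    return sum(r == k for r, k in zip(responses, exposed_key))

def honest_tail_probability(n_items, n_matches, q):
    """P(at least n_matches exposed answers out of n_items) for an honest
    test taker, under the assumed independent-binomial model."""
    return sum(comb(n_items, k) * q**k * (1 - q)**(n_items - k)
               for k in range(n_matches, n_items + 1))

# Hypothetical example: 10 Trojan items, all answered with the exposed key.
exposed_key = list("CABDCEABDC")
responses = list("CABDCEABDC")
matches = count_exposed_answers(responses, exposed_key)
print(matches, honest_tail_probability(len(exposed_key), matches, q=0.05))
```

Under these assumptions, each additional Trojan item makes a complete set of exposed answers less plausible for an honest test taker, which is the intuition behind wanting ten to twelve items rather than two or three.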

We recently analyzed data where one individual was suspected of having prior access to the test content. Six miskeyed items were present on the exam and we found that the suspect gave the keyed answer to every one of them (i.e., the suspect’s answers matched the erroneous answer key, even though those answers were actually wrong). Using item response models, we analyzed the “score” for the miskeyed items. (We do not use standard regression techniques because the data are not normally distributed, being highly constrained and skewed.) These data are shown in Figure 1.

Figure 1: Analysis of 6 miskeyed items

We see two extreme data points in Figure 1, corresponding to the suspected exam and another exam (they had probabilities of one in 5,000 and one in 1,000, respectively). The expected score on the miskeyed items was approximately two. We note that there is no correlation between the raw score on the test and the score on the miskeyed items.

In the above example, analysis of miskeyed items detected a potential testing irregularity. When Trojan items are specifically designed as described above, we expect to see a strong negative relationship between the Trojan items and the total score. In other words, high scoring individuals will provide the correct answer and not the original answer. This negative relationship improves our ability to detect users of brain-dump content.

In addition to my own analyses, one of our clients has told me of great success in using these techniques. For obvious reasons, the client does not want brain-dump users to know which tests are treated with Trojan items and how their cheating is being detected. When cheaters realize they are being punished for using brain-dump content, they will quit using the content. Then we will be satisfied. We just want test takers to do their own work and demonstrate their own ability when they take tests.



Are identical answers to exam questions proof of cheating on tests?


Monday, February 18th, 2008

When it comes to supporting an allegation of cheating on tests, there is rarely better statistical evidence than having two (or more) tests with identical sets of responses, or identical answers. Having a great interest in this topic, I have read carefully the abstracts of Rice University Honor Council meetings where these types of allegations are taken very seriously. In several instances of alleged academic fraud, the Honor Council has found the evidence of identical solutions and identical answers to be compelling.

“The Rice Honor System was created by students in 1916. That it has functioned so well for so long is a reflection of the trust and respect that Rice students show to one another and to the University. It is one of Rice’s most highly valued traditions and a vital part of your education–education in responsibility and integrity.” http://honor.rice.edu/

In one instance, the Council minutes read:

Witness 1, the professor for the class, stated that he believed the similarities between the True / False answers and the essay answers given by Student A and Student B to be strikingly similar. He … presented a statistical analysis of the probability of this occurring in certain situations.

In the above case, despite having a probability analysis, the Honor Council did not find that the honor code had been violated (i.e., cheating was not found).

In another instance, the Honor Council had a different finding:

Some members felt that the identical answers on some portions of the exam were beyond coincidence or having similar notes or studying together. Members were suspicious of the fact that these similarities would arise after the students used different sources of information when answering the questions. … Some members were not convinced by the explanations …

Despite denials of cheating in the above situation, both students were found in violation of the honor code.

Here’s a Google search link if you wish to read some of these abstracts.

It is evident from these two abstracts that the Honor Council attempts to find plausible explanations for identical answers and excessive similarities between test questions. It is also evident that the Honor Council may act without having definitive proof. As an example of the degree of “proof” or evidence that may be required to take action in a case of suspected cheating, consider this statement from the University of Western Ontario:

It is particularly important to understand that the conclusion that a student committed a scholastic offence does not have to be supported by evidence beyond a reasonable doubt. In an exam writing situation, that means that a decision maker may conclude that cheating took place, even if it is possible that two people got some identical answers by chance.

The observation that two tests have identical answers is very reliable evidence as defined by the criteria I proposed in my most recent post, because the observation is (1) factual, (2) objective, (3) credible, and (4) defensible. We require that the evidence have one additional attribute before believing that cheating probably occurred. The evidence must be strong.

In order to evaluate the strength of evidence of identical answers on tests, we require the probability of the observed responses. At Caveon, the probability for the observed item responses is estimated using item response theory. We compute this probability by multiplying together the probabilities of the selected responses (we assume the selected responses are conditionally independent) and then normalizing the product by the marginal probability of the observed score. Formulas for computing exact probabilities are difficult to derive and program, which means that most practitioners who encounter these situations will rely upon judgment and intuition in the same way the Rice Honor Council does.
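The calculation just described can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not Caveon’s implementation: the probability of each answer choice is taken as given (for example, from a fitted model at one ability level), items are treated as independent, and the names `score_distribution` and `match_probability` and the three-item test are hypothetical.

```python
from math import prod

def score_distribution(p_correct):
    """Distribution of the number-correct score for independent items
    (a simple recursion over items)."""
    dist = [1.0]
    for p in p_correct:
        new = [0.0] * (len(dist) + 1)
        for score, mass in enumerate(dist):
            new[score] += mass * (1 - p)      # item answered incorrectly
            new[score + 1] += mass * p        # item answered correctly
        dist = new
    return dist

def match_probability(option_probs, key, responses):
    """P(this exact response pattern | its number-correct score)."""
    p_pattern = prod(probs[r] for probs, r in zip(option_probs, responses))
    p_correct = [probs[k] for probs, k in zip(option_probs, key)]
    score = sum(r == k for r, k in zip(responses, key))
    return p_pattern / score_distribution(p_correct)[score]

# Hypothetical 3-item test; each dict gives assumed option probabilities.
option_probs = [
    {"A": 0.8, "B": 0.1, "C": 0.1},
    {"A": 0.1, "B": 0.7, "C": 0.2},
    {"A": 0.05, "B": 0.05, "C": 0.9},
]
key = ["A", "B", "C"]
print(match_probability(option_probs, key, ["A", "B", "C"]))  # all correct
print(match_probability(option_probs, key, ["A", "C", "C"]))  # one wrong
```

When every item is answered correctly, the pattern is the only way to earn that score, so the normalized probability is one. With one wrong answer, the result depends on which item was missed and which wrong choice was selected.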

I have pasted in a table of sampled probabilities for an 18 item test, below. The probabilities are calculated knowing the score that was obtained on the test. So, if we know a person answered all 18 items correctly the probability that another person who answered all 18 items correctly would match is equal to one. If the answer was correct, it is highlighted in gold in the table.

Probabilities of identical tests

Even though I routinely evaluate these types of probabilities, I have been surprised by some instances of identical response data. For example, the probability of an identical test when all items are answered correctly is 1 (as in the first row of the table). But, the probability of an identical test when all but one or two questions are answered correctly may be as high as .10 or .25 (see the second and fourth rows of the table). On the other hand, if several questions are answered incorrectly, the probability of an identical test may be 1 in 100 million or even smaller. The wide variation in these probabilities is a function of the number of correctly answered test questions and the selected responses.

If the probabilities of some test response patterns are sufficiently high (because the tests are easy or the examinees are very proficient) and if we have a large enough group, we might expect to see many identical tests. Probability computations for the number of observed identical tests can be very difficult. This is an instance of the “birthday problem” with unequal probabilities.
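One piece of this problem is easy to compute: the expected number of identical pairs. The Python sketch below assumes examinees respond independently and that the probability of each possible response pattern is known. The pattern probabilities and function names are hypothetical, chosen only to show the arithmetic.

```python
from math import comb

def pairwise_match_probability(pattern_probs):
    """P(two independent examinees produce the same pattern) = sum of q_k^2."""
    return sum(q * q for q in pattern_probs)

def expected_identical_pairs(n_examinees, pattern_probs):
    """Expected number of examinee pairs with identical tests."""
    return comb(n_examinees, 2) * pairwise_match_probability(pattern_probs)

# Hypothetical easy test: a few very likely patterns and many rare ones.
pattern_probs = [0.5, 0.2, 0.1] + [0.2 / 1000] * 1000
print(expected_identical_pairs(30, pattern_probs))
```

The expectation follows from adding up the match probability over all pairs. The full distribution of the number of identical tests is the hard part, because the pairs are not independent of one another.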

At the beginning of this discussion, it appeared that we had a relatively straightforward and simple problem. It often occurs with statistics that many apparently simple problems become very complex, very quickly. The analysis of identical answers for two exams is one of those problems. The answer to the question with which we began the discussion must be: We cannot prove that cheating occurred when we have identical answers for two test instances, but in many situations we can obtain very strong, reliable evidence leading us to conclude that cheating occurred and the conclusion would be right, nearly always.



Can you prove cheating on tests using statistics?


Monday, February 11th, 2008

There is a children’s game known variously as “Whisper,” “Secrets,” or “Gossip” where a secret is shared and passed from one player to the next. The last player hearing the secret says it aloud, often with hilarious results. These same distortions happen in the news media, as journalists cite other reports or each other. Such a misquote from the Star-Telegram concerning additional security announced by the TEA (Texas Education Agency) for the TAKS (Texas Assessment of Knowledge and Skills) caused me to pause and reflect about using statistical evidence to “prove” that someone cheated on a test.

The reporter wrote, “Among other security measures: … Scramble field test questions on tests to provide proof if someone is copying someone else’s answer sheet.” (Italics added.) http://www.star-telegram.com/news/story/433614.html. Being well aware of the controversy surrounding the use of statistics, alone, to prove cheating, I immediately doubted the accuracy of the above statement. Actually, on June 7, 2007, Shirley Neeley announced that “the Texas Education Agency today will immediately initiate the following: … analyze scrambled blocks of test questions to detect answer copying…” TEA later clarified that the scrambling would only involve field test items. The Dallas Morning News was quick to criticize the scrambling plan, but I applauded TEA’s intent to detect cheating behavior using statistics.

We naturally ask whether statistical evidence can be relied on to detect cheating. Many authors have expressed the opinion that statistical evidence must be corroborated by eye-witness accounts before making allegations of cheating. I can understand this position if the statistics are not reliable. In my opinion, reliable evidence must meet the following conditions:

  1. It must be factual,
  2. It must be objective,
  3. It must be credible, and
  4. It must be defensible.

If statistical evidence meets the above conditions, I believe that it can be relied upon, whether corroborating eye-witness accounts are available or not. Statistical evidence is

  1. factual when it is based on test result data (an actual record of the test event),
  2. objective when it provides a statistic with a probability statement,
  3. credible when the statistics have been shown to work because the models accurately depict actual test taking, and
  4. defensible when the underlying science withstands scrutiny.

An additional fifth criterion the evidence must meet for taking action on a suspected instance of cheating is that the evidence must be strong. Statistical evidence is strong when the calculated probabilities are so small that we no longer believe the observed data are the result of normal test taking. Statistics can provide guidance for determining how strong is strong enough to take action, but ultimately the establishment of a probability threshold (i.e., the strength of the statistic) is a matter of policy that must be answered by the testing program administrator.

It is important with any statistical investigation to choose statistics that are well-suited and designed for the task at hand. For example, if the concern is that answer sheets are being modified, then erasure counts should be analyzed. Having analyzed over one hundred data sets for a wide variety of clients including state Departments of Education, admissions tests, certification programs, and licensure exams, I can unequivocally state that answer copying is the predominant means of cheating on tests. Therefore, it is especially relevant in this discussion concerning the reliability of statistical evidence to discuss answer copying and statistics that are designed to detect answer copying.

As you reflect upon the principles that I have outlined, I would ask you to consider the data in Table 1. The table contains differing probability values that a testing program administrator might be asked to evaluate. These are sampled answer-copying statistics (i.e., counts of identical answers) from a test having 240 items. With this many items on the test, the central limit theorem will generally apply so I have included a Z-Score in the table, as a point of reference.

Table 1: Sampling of test similarity statistics

Number of identical answers   Expected number of identical answers   Standard Deviation   Z-Score   Probability Index
168                           81.3                                   7.2                  12.0      30.3
171                           102.3                                  7.4                  9.3       19.9
130                           76.4                                   7.1                  7.5       12.4
154                           107.7                                  7.4                  6.3       9.5
128                           87.9                                   7.3                  5.5       7.3
108                           74.3                                   7.1                  4.7       5.5
107                           75.1                                   7.1                  4.5       5.0
120                           89.4                                   7.3                  4.2       4.6
115                           86.1                                   7.3                  4.0       4.2
128                           103.9                                  7.4                  3.3       3.1

At Caveon we deal with extremely small probability values, so we typically express those using “an index” where the probability is one in 10 to the power of the index (p = 10^(-index)). The most extreme case in Table 1 has a probability of one in 10 to the thirtieth power. These data are definitely not due to normal test taking.
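The arithmetic behind the Z-Score column and the index notation can be sketched as follows. The function names are hypothetical. The normal-tail calculation is included only as a point of reference: the Probability Index values in Table 1 come from Caveon’s own probability calculation, so an index derived from the normal approximation should not be expected to match that column exactly.

```python
from math import erfc, log10, sqrt

def z_score(observed, expected, sd):
    """Standardized difference between observed and expected identical answers."""
    return (observed - expected) / sd

def probability_index(p):
    """Index such that p = 10 ** (-index)."""
    return -log10(p)

def normal_upper_tail(z):
    """Upper-tail probability of a standard normal, for reference only."""
    return 0.5 * erfc(z / sqrt(2))

z = z_score(168, 81.3, 7.2)   # first row of Table 1
print(round(z, 1))
print(round(probability_index(normal_upper_tail(z)), 1))
```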

Assuming that you accept the statistical evidence as being reliable, the decision needed from you, the testing program administrator, is how low in Table 1 you should go. Where do you set the cut point? These data illustrate that if you set the cut point too low in the table, you might accuse some individuals of answer copying without having strong evidence. If you set the cut point too high in the table, you might allow several individuals who have cheated to escape discipline.

I will elaborate more on this topic, next time. Until then, may your tests remain secure.



Trouble in Section K


Thursday, February 7th, 2008

Elf mistress Heloise quickly entered the office of Elvin, Head of Section K. “For the eighth week in a row, the reject rate from Section K is three times the rate from the previous twelve months,” she said, handing the weekly quality report to Elvin. She continued, “I was so impressed when your section scored higher on the elf proficiency exam than any other section in the Mechanical Doll Department nine weeks ago that I awarded your elves with assemblage of gears and levers, but this is unacceptable.” Heloise crossed her arms and waited for a reply.

Elvin wrinkled his brow and frowned ruefully. This was unwelcome, but not unexpected, news. He picked up a thick folder and opened it. He leafed through one report after another and muttered, “We have eliminated transportation, storage, tools, assembly, parts, fatigue, and sabotage as explanations. There’s only one conclusion. At least one, and maybe several, of the elves in Section K is incompetent. But how can that be? Is the proficiency exam flawed?”

“Let’s find out,” replied Heloise. And together, they visited the proficiency exam designer. After explaining the problem, the proficiency exam designer shook her head and said, “You need to see the data forensics analyst.” The data forensics analyst listened with deep concentration, scanned page after page of test results, whistled softly, and finally exclaimed, “It looks like elves in Section K have cheated on the elf proficiency exam. Now, how to prove it?” he said mysteriously, and then immersed himself in complex symbols and calculations. Heloise and Elvin excused themselves, but the data forensics analyst didn’t even turn his head as they left. Much later, the proficiency exam designer listened intently while the data forensics analyst described his plan for catching the cheaters in Section K.

Three weeks later, the schedule for the quarterly elf proficiency exam was posted throughout the Mechanical Doll Department. On the day of the test, elf examiners throughout Santa’s workshop reported to a different department than usual to conduct the examination. For example, elf examiners from Remote-Controlled Toys reported to the Games and Puzzles Department. It so happened that an elf examiner from each of the other departments reported to the Mechanical Doll Department. Some administered the elf proficiency exam, and others just watched and waited. All test responses were recorded meticulously. After a long and grueling day, all the elves had been tested.

The data forensics analyst worked all night, making calculations and graphs and charts. At the break of day, Heloise and Elvin knocked at his door. “Enter!” they heard. They stepped into a bizarre scene: scraps of paper were strewn about, charts with bars and circles were plastered on the walls, and a wizened elf was humming in the midst of chaos. “Done!” he shouted. “Oh, it’s you. Well, I have the answer,” he said with absent-minded aplomb.

Then noticing their impatient expressions, he said, “Oh, let me explain.”

“None of the examiners are involved. I know this because there are no patterns of inconsistent answering associated with the examiners. It was important that no examiner give the test to any elf with whom he or she normally associates.

“There were extremely similar test answers between four elves in Section K. It is almost certain that they did not take the tests independently,” the data forensics analyst concluded.

“But, how can that be?” queried Heloise. “They were all watched carefully. There was no way that they could have shared answers or communicated during the test!”

The data forensics analyst minutely explained, “I suspected this might be the case. So, I asked the proficiency exam designer to create two test forms. She very carefully changed a few of the questions between the first and second test forms, so that the correct answers would be close, but not the same. The master test booklet for the first form was locked away in test booklet storage. The proficiency exam designer kept the master test booklet for the second form with her at all times. Even though the elves in the Mechanical Doll Department were given the second form of the test, our four culprits answered all the changed questions with answers from the first form of the test. There is no doubt in my mind. They broke into test booklet storage and memorized the test answers!”

Elvin brought the four suspected cheaters into Heloise’s office. Each elf vigorously denied any wrongdoing. At that point, the data forensics analyst dimmed the lights. He splayed an infrared beam across the hands of each suspected cheater. All of their hands glowed eerily with a blotchy red hue. Then, using gloves to handle the master test booklet from storage, he shined the beam on the pages. They glowed red. He touched the booklet pages against his bare arm. When he shined the beam on his arm, it also glowed with a blotchy red hue. Heloise barked, “You are red-handed! Now stand still while I consider your punishment!”

“Tomorrow,” pronounced Heloise. “You will report to the master of the Quality Department for ‘R and R,’ where you will begin the repair and refurbishment of all toys in the Rejected Toy Warehouse. You will work there until all the broken toys are operating perfectly and to the satisfaction of the master of quality.”

“Elvin,” Heloise continued. “Section K can no longer be responsible for assemblage of gears and levers. Your section must repair its damaged reputation from producing so many rejected mechanical dolls. Even though you will not receive replacements for these culprits, your production quota will remain the same.”

Elvin wrinkled his brow and frowned ruefully. This was unwelcome, but not unexpected, news. He remembered another time, when he was an impetuous, lazy elf; and when he had cheated. The punishment seemed harsh, but he had learned his lesson and was glad that the cheaters had been apprehended.

Moral: Just as dishonesty betrays the cheater, it injures all who are around him.

Addendum: The cheating detection and prevention techniques described in this story are among best practices. I have described use of the data forensics methodologies in two actual cases we have analyzed at Caveon: The case of the waylaid answer key and The case of the befuddled answer copier.

The State of Mississippi has put together a very nice PowerPoint presentation on test administration auditing and monitoring: www.mde.k12.ms.us/ACAD/osa/DTC_Test_Security_Fall_07.pps

If you are interested in learning more about these or other solutions to test fraud please contact us, at Caveon Test Security.



The case of the waylaid answer key


Thursday, January 17th, 2008

Recently there have been many reports of lost databases, stolen computers, and misplaced documents. Is it any wonder that tests and exams are also experiencing the same problems? For example, last November in New Zealand the home of an employee of the Qualification Authority was burglarized and a laptop containing math items for the National Certificate of Educational Achievement was stolen. Despite assurances of password protection, the Qualification Authority revised and reprinted 150,000 test booklets: http://www.stuff.co.nz/stuff/4331442a7694.html

As another example, the completed answer sheets from an exam for the Arkansas State Board of Cosmetology were lost or misaddressed in the FedEx shipment to the scoring agency. Ninety candidates will have to retake the exam: http://www.nwanews.com/adg/News/213242/

Two years ago Caveon’s assistance was sought in dealing with a similar situation. The car of an employee of a major test publisher was stolen. In the car were secured test materials, including an answer key to an upcoming state-wide public school examination. When the car was recovered the answer key was missing. There was not enough time to revise the test. The exam would be administered as scheduled. Our client wanted to know if the answer key was being distributed and if the integrity of the test administration had been compromised.

As we discussed the situation with the client, I was confident that we could detect a widespread breach. But, could we detect a situation when just a few students used the lost answer key? There was no doubt in my mind if the thief knew the market value of the answer key that it would be sold on the Internet. I knew this from first-hand experience. While I was teaching at the University, a dual-campus administration of the test coupled with a time lag between administrations led to the answer key being disclosed. Three of my students obtained the answer key to the exam through a Yahoo chat room. They scored 100% on all the questions, except the essay question, which they refused to answer.

The client gave us the following details about the test. There were 54 questions on the exam with 10 field test items and 44 core items. There were about 2 dozen different forms of the test. The forms all contained the same core items in the same locations, with form differences due to different sets of field test items. Slowly an analysis plan began to emerge. Because the answer key for only one of the forms was lost, we could score the field test items for all the other forms using the waylaid answer key. Scores on the field test items would be the keystone of the analysis.

We assumed that any student using the stolen answer key would not know which items were field test items and which were core items. We also assumed that the student would answer all the items (with potentially a few mistakes) using the stolen answer key. It was easy to determine that a widespread dissemination of the answer key had not occurred. Statistical methodology dictates that statistical tests are performed assuming the null hypothesis (i.e., the answer key was not in play) is true. Under this assumption we found that less than 2% of the tests had “high scores” (i.e., scores above the 95th percentile of the distribution), when 5% were expected. This was very good news. There was not a wide-spread dissemination of the answer key.
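The scoring step can be illustrated with a short Python sketch. The answer strings, the cutoff, and the function names are hypothetical. In the actual analysis the cutoff was the 95th percentile of the score distribution under the null hypothesis, which is not reproduced here.

```python
def stolen_key_score(field_responses, stolen_field_key):
    """Number of field test answers that match the stolen key."""
    return sum(r == k for r, k in zip(field_responses, stolen_field_key))

def share_above_cutoff(scores, cutoff):
    """Fraction of tests whose stolen-key score exceeds the cutoff."""
    return sum(score > cutoff for score in scores) / len(scores)

# Hypothetical data: a 10-item stolen field test key and three answer strings
# taken from forms whose true field test items differ from the stolen form.
stolen_field_key = "ABCDABCDAB"
tests = ["ABCDABCDAB", "BBCAACDDAB", "DCBADCBADC"]
scores = [stolen_key_score(t, stolen_field_key) for t in tests]
print(scores, share_above_cutoff(scores, cutoff=6))
```

If the share of tests above the cutoff is no larger than the share expected when the key is not in play, there is no sign of widespread use of the key.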

Next, we hypothesized that a few teachers or school administrators might have received and used the stolen answer key. Using a probability inversion formula, we rank ordered the schools by the proportion of tests where more than six correct answers on the field test items (using the stolen answer key) were found. We found that the proportion of schools in the upper tail (above 10%) was less than 7% when 10% were expected. This was good news. It meant that if the answer key was disseminated, it was not likely to have occurred through teachers or administrators. (We also visually inspected the 30 most extreme schools for “perfect” scores of 10 on the field test items for all the other forms except the one associated with the lost answer key. Nothing untoward was found in any of those schools.)
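A simplified version of the school-level ranking is sketched below in Python. It uses raw proportions and leaves out the probability inversion step, whose details are not given here. The record layout `(school_id, stolen_key_score)` and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def school_flag_rates(records, threshold=6):
    """Rank schools by the proportion of tests with more than `threshold`
    field test answers matching the stolen key.

    records: iterable of (school_id, stolen_key_score) pairs.
    """
    totals = defaultdict(int)
    flagged = defaultdict(int)
    for school, score in records:
        totals[school] += 1
        if score > threshold:
            flagged[school] += 1
    rates = {school: flagged[school] / totals[school] for school in totals}
    # Highest proportion first, so the most extreme schools lead the list.
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)

# Hypothetical records for two schools.
records = [("A", 7), ("A", 3), ("B", 2), ("B", 10), ("B", 8)]
print(school_flag_rates(records))
```

The schools at the top of such a list are the natural candidates for the visual inspection described above.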

Finally, searching for the proverbial needle in the haystack, we hypothesized that a few isolated students may have been able to receive the answer key through personal contact with the thief on the Internet. In order to attack this problem we created a Bayesian probability model, where we estimated the probability that the stolen answer key was used by a particular student conditional upon the test score. Using this model we inferred a 95% upper bound on the proportion of students who used the answer key to be less than .09% (or nine in ten thousand). The five most extreme tests were visually inspected, and not one of them had a “perfect” score on the field test items, using the lost answer key.
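The flavor of this kind of calculation can be shown with a toy application of Bayes’ rule in Python. This is not the model used in the analysis, and it does not compute the 95% upper bound. It assumes, purely for illustration, that field test scores against the stolen key are binomial, with one match rate for a student using the key and another for a student who is not. The prior and both match rates are hypothetical values.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_key_use(score, n_items=10, prior=0.001,
                      p_match_user=0.95, p_match_honest=0.25):
    """P(student used the stolen key | stolen-key score), by Bayes' rule.

    All default parameter values are assumptions for illustration only.
    """
    like_user = binomial_pmf(score, n_items, p_match_user)
    like_honest = binomial_pmf(score, n_items, p_match_honest)
    numerator = prior * like_user
    return numerator / (numerator + (1 - prior) * like_honest)

for score in (2, 6, 10):
    print(score, posterior_key_use(score))
```

Even with a very small prior, a perfect score against the stolen key pushes the posterior probability close to one under these assumptions, which is why the most extreme tests were the ones inspected by eye.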

The results of the analysis gave our client sufficient confidence to trust the integrity of the test administration. In order to place perspective on these statistical estimates, we note that the estimated bound (i.e., .09%) on answer key compromise is much, much lower than the actual proportion of students who copy from each other in the normal test taking situation. While we could not prove that the stolen answer key had not been used, we concluded the following:

If any students have gained access to the answer key, the data indicate the answer key has not been shared with friends. And, if the answer key was used, its use was isolated.

With 95% confidence, no more than .09% of students used the compromised answer key. It is very likely, in fact, that no student actually used the compromised answer key.

The above situations illustrate the importance of properly securing test materials. They also illustrate that by using innovative and defensible statistical analyses, testing program administrators may know the degree of security risk that is present. The analysis of the waylaid answer key illustrates the power of data forensics in protecting and maintaining exam and test security.



No-Fly List shenanigans


Monday, January 14th, 2008

Just last week a five-year-old boy was detained by TSA (Transportation Security Administration) because his name was similar to that of a suspected terrorist on the no-fly list. The reporter wrote, “A five-year-old boy was taken into custody and thoroughly searched at Sea-Tac because his name is similar to a possible terrorist alias. As the Consumerist reports, ‘When his mother went to pick him up and hug him and comfort him during the proceedings, she was told not to touch him because he was a national security risk. They also had to frisk her again to make sure the little Dillinger hadn’t passed anything dangerous weapons or materials to his mother when she hugged him.’”

http://www.schneier.com/blog/archives/2008/01/fiveyearold_boy.html

On the other hand, 13 News in Indianapolis interviewed a woman, Lisa Skaggs, who described an incident two rows in front of her, where a man occupied the same seat that was assigned to another passenger. The man refused to produce his ID, only showing his boarding pass with the same seat number. The plane was finally evacuated in order to remove the recalcitrant passenger. http://www.wthr.com/Global/story.asp?S=7369309&nav=menu188_6

A United Airlines representative confirmed that the passenger’s name did not match the boarding pass. In my opinion, the most shocking statement about this incident came from a TSA official. “TSA’s Christopher White believes the system worked. ‘The fact that one of two million may not have a boarding pass that does not match and I.D., does not overly concern us when they’re exposed to all these other layers of security,’ said White.”

It’s not illegal to fly without having an ID. In fact TSA’s regulations explicitly allow for passengers to board an aircraft without an ID. You might find the experience and perspective of Joby Weeks to be interesting in this context: http://www.thetraveljunkie.ca/articles.php?articleid=146

The fact that boarding passes are an element of TSA’s security and that boarding passes may be printed from home represents a security hole in TSA’s security rules and regulations. This was documented by Senator Charles Schumer of New York, who vividly described how “Joe Terrorist” circumvents the no-fly list, in a letter dated February 11, 2005 to TSA officials.

http://www.csoonline.com/read/020106/caveat021706_pf.html

The insecurity of “print-from-home” boarding passes was demonstrated convincingly a year ago by Christopher Soghoian, a Ph.D. student in Computer Science at Indiana University. “The FBI raided the home of Indiana University grad student Christopher Soghoian, who created a Web site that lets users forge their own airline boarding passes. Soghoian said he intended to call attention to an airport security loophole.”

http://www.slate.com/id/2152507/ See Christopher’s description of the FBI raid here: http://paranoia.dubfire.net/2006/10/fbi-visit-2.html

There are several security principles that are illustrated in the above scenario:

  1. If security is not implemented properly and has glaring security weaknesses, your organization may receive intense negative attention.
  2. If security is not designed into the overall system but is added after the fact, security holes will be present that are difficult to patch.
  3. A proper view of security requires understanding the true risk that is represented by anomalous and unusual behaviors (such as understanding what a one-in-one-million anomaly potentially represents).
  4. Simple lists and blindly followed ad-hoc rules (such as detaining five-year-olds) can make your organization look ridiculous.
  5. When you use elements in your security system that were not designed to provide security (such as print-from-home boarding passes), you are likely to have security holes.

We don’t know why the passenger without the ID refused to present his identification documents. Here are some possible scenarios.

  1. He could have learned how to hack United Airlines’ reservation system.
  2. He could be an actual wanted fugitive who paid for or fabricated a false boarding pass.
  3. He could be a terrorist who was probing airline security in order to learn how to board an airplane without presenting an ID and without drawing attention to himself.

All of these possibilities show the inanity of the TSA comment: “The fact that one of two million may not have a boarding pass that does not match and I.D., does not overly concern us when they’re exposed to all these other layers of security.” We have learned at Caveon that the unusual circumstance is that which requires the greatest care and scrutiny.

A few years ago a large number of test booklets were lost. Even though the large number of lost booklets was a very small percent of the total number of printed booklets, the fact remained that those lost test booklets represented a substantial security risk to the testing program. It only takes one lost booklet to compromise an entire exam. It only takes one or two terrorists out of a million flyers to represent a significant security risk to the public safety.

Caveon Data Forensics is based on the premise that unusual and extremely anomalous data are those that should receive the greatest scrutiny. We are extremely concerned when test takers go outside the country to take tests. We are especially vigilant when tests are extremely similar, even when, or especially when, they represent a very small proportion of the total tests administered. In my view, the unusual and the anomalous data are those that should receive our highest attention. The comment from the TSA official suggested that such data do not represent a significant worry. In my opinion, such an attitude is short-sighted and imprudent.



‘Sabermetrics,’ baseball and steroids


Tuesday, January 8th, 2008

Prognostications are that Mark McGwire will again not be inducted into Baseball’s Hall of Fame this year, because of admitted steroid use. Here is the URL to the article:

http://www.nationalpost.com/sports/story.html?id=221516

In 2005, McGwire ducked the direct question whether he had used steroids or performance-enhancing drugs (PEDs). Many statisticians think that steroids do not improve performance, because “most baseball skills depend primarily on reaction times and judgments, factors unaffected (or even degraded) by these drugs.” Those who study the numbers, “sabermetricians,” (coined from SABR – Society for American Baseball Research) “think the writers should set aside their biases and moral indignation and look at the facts: there’s simply no evidence steroids or other PEDs actually improve performance in baseball.”

One of the quotes in the article states, “While Bonds’ home run output rose significantly in the years after he supposedly started taking drugs, his profile is strikingly similar to Babe Ruth’s high performance level almost right until the [end] of his storied career, they say.” The actual data do not support this statement, as you can see in Figure 1, which compares Barry Bonds’ offensive performance against that of three other great hitters of the game: Babe Ruth, Ted Williams and Ty Cobb. I used http://www.baseball-reference.com/ as the source for my statistics.

Figure 1: Offensive performance comparison

Comparison of hitters

The OPS+ statistic is a normalized statistic that is adjusted for opponents’ defensive strengths and ball park friendliness to hitters. A value of 100 is average performance. Figure 1 shows that Barry Bonds’ performance was below that of the compared hitters for the first 15 years of his career and then rose suddenly and dramatically for the remaining years, surpassing all of his prior years at a stage when the offensive performance of the other hitters was clearly declining. Admittedly, this is armchair forensics, but the data suggest that steroid use did improve Barry Bonds’ performance.
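For readers who want to see how a normalized statistic of this kind is built, here is a minimal sketch in Python. It assumes the commonly published form of OPS+, which compares a player's on-base and slugging percentages with the league averages, and it leaves out the ball-park adjustment entirely. The function name and the sample numbers are hypothetical and are not taken from the data behind Figure 1.

```python
def ops_plus(obp, slg, league_obp, league_slg):
    """Unadjusted OPS+ sketch: 100 means league-average hitting.

    Assumption: OPS+ = 100 * (OBP / lgOBP + SLG / lgSLG - 1).
    The park adjustment applied by published figures is omitted here.
    """
    return 100.0 * (obp / league_obp + slg / league_slg - 1.0)


# Hypothetical league averages and a hypothetical strong hitter.
league_average_hitter = ops_plus(0.330, 0.420, 0.330, 0.420)
strong_hitter = ops_plus(0.450, 0.700, 0.330, 0.420)
print(round(league_average_hitter), round(strong_hitter))
```

Because each component is a ratio to the league average, a hitter who exactly matches the league scores 100, which is why seasons from different eras can be placed on one chart.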

Roger Clemens has emphatically denied that he took steroids. His trainer, Brian McNamee, is reported in the Mitchell Report as stating that he injected Clemens with steroids from 1998 to 2001. Clemens is scheduled to testify before Congress, and there are allegations of defamation of character being “batted” around.

http://www.bloomberg.com/apps/news?pid=20601079&sid=a0z.L9DGg68A&refer=home

Figure 2 compares Roger Clemens’ ERA (earned run average) performance against that of three other great pitchers of their time.

Figure 2: ERA comparison

Comparison of pitchers

The ERA+ statistic is a normalized earned-run-average statistic which has been adjusted for opponents’ strengths and other factors. A value of 100 is average. Clemens’ first season was 1984, so the four-year period from 1998 to 2001 corresponds to his 15th through 18th years of play. The data show that during this time, Clemens’ performance was average. However, these data are unusual because some of Roger Clemens’ best years came after he turned forty, an age when nearly all players have retired from baseball, and several years after the alleged steroid use.

While I did not expect to arrive at a definitive answer concerning these two players, I found it intriguing to apply forensic thinking to the current allegations of cheating and doping that are being circulated.

 



Anatomy of the meltdown of a forensic procedure


Friday, December 7th, 2007

The CBS News program “60 Minutes” and the Washington Post aired an investigative report on November 16 criticizing the FBI for failing to notify relevant jurisdictions that hundreds of inmates have been jailed using a flawed forensic methodology. Despite discontinuing the use of “bullet lead” analysis in 2005 because of validity concerns, the FBI had taken no action to inform the courts that some defendants were potentially innocent and wrongfully imprisoned.

http://www.cbsnews.com/stories/2007/11/16/60minutes/main3512453.shtml

Bullet lead analysis was first used in the investigation of the assassination of JFK, and was routinely used in the 1980’s when bullets were so misshapen that ballistic evidence was unobtainable. The essential idea is that trace elements in lead vary naturally and that bullets could be “matched” as coming from the same source (i.e., the same box of bullets) by comparing the compositions of these trace elements. In the 2005 press release, the FBI stated, “One factor significantly influenced the Laboratory’s decision to no longer conduct the examination of bullet lead: neither scientists nor bullet manufacturers are able to definitively attest to the significance of an association made between bullets in the course of a bullet lead examination.”

http://www.fbi.gov/pressrel/pressrel05/bullet_lead_analysis.htm

We naturally ask, “How is it possible that a procedure could be trusted for 40 years, be invoked in 2,500 investigations, be used as testimony in about 500 of those cases, and then be discredited?” The FBI commissioned an independent review of the procedure by the National Research Council in 2002. The resulting report, completed in 2004, is comprehensive and fascinating to read. A copy may be purchased at the following URL: http://www.nap.edu/catalog.php?record_id=10924. The findings of this report convinced the FBI to discontinue bullet lead analysis.

After browsing through this report and reading the findings and recommendations, it is clear that the FBI procedure devised in the 1960’s could not withstand public scrutiny. From my perspective, the most troubling aspect of the analysis was that it was (and is) unknown how many compositionally similar bullets were produced and where they were distributed. This means that a probability statement concerning the likelihood of a false positive (i.e., saying the bullets came from the same box when they didn’t) was impossible. Without such a statement the forensic examiner cannot state with any reliability or objectivity that the bullet found at the crime scene came from the same box as bullets found in the possession of the suspect.

The NRC also indicated that the method of computing the statistical match should be revised. From my perspective this is because the FBI’s computational procedure was not based on a statistic. It was computed using statistical ideas, but not supported with statistical distribution theory. This procedure falls into the realm of “ad-hoc analytics.” It seemed good at the time. There wasn’t a better idea. But, there was no way to determine error rates and probabilities associated with the procedure. I have seen a lot of ad-hoc statistical procedures in my day and they nearly always fail eventually because they are based on some statistical idea but they have no statistical theory that supports them. In the long run, the queen of statistics (i.e., natural variability) overwhelms all procedures that do not estimate probability models from empirical data.

A good friend of mine often quoted the maxim, “Models before algorithms.” By this he meant that you should analyze the processes that generate the data, and the variability associated with the data, before you build detection methodologies. I have tried to follow this rule assiduously in devising detection methodologies for Caveon Data Forensics. Without the guidance of reasonable probability models, statistical interpretations of the data are subjective and indefensible.



Benefits-payment cheater caught using statistics


Tuesday, November 20th, 2007

The other day a woman in the UK was caught in a lie where she fabricated the existence of seven children to receive government benefits. She claimed to have given birth to quadruplets in 2005, to twins (who were delivered one week apart) in the same year, and then to a seventh child in 2007. None of these children existed.

http://www.dailymail.co.uk/pages/live/articles/news/news.html?in_article_id=494261&in_page_id=1770

The article starts: “Any mother who has given birth to quadruplets needs all the help she can get. So benefits staff were happy to provide support for Victoria Young in raising babies Kier, Kie, Kyla and Conrad. There was just one problem – none of them existed. …”

The benefits staff got suspicious on the seventh child and investigated the crime. By that time, Victoria Young “had swindled more than £40,000 in benefits payments with her bogus brood of seven babies in the space of 18 months.” (direct quote)

It’s natural to ask how data forensics techniques could be applied to this situation. We start with models that describe the population. To test the above claims we need to know about multiple birth probabilities, fertility rates, and birth spacing statistics. I found the needed statistics at a government website: http://www.statistics.gov.uk/downloads/theme_population/FM1_32/FM1no32.pdf

In 2003, only one set of quadruplets survived birth, putting the probability of live quadruplets at approximately 1 in 600,000 (Table 6.4 of the government report; see also Multiple Births in Wikipedia: http://en.wikipedia.org/wiki/Multiple_Births). From the table of statistics, the probability of twins is about 9,001 in 615,787, of triplets about 127 in 615,787, and of quadruplets about 3 in 615,787. If we use these values and assume that the multiplicity of each maternity is independent of the others, then we can test Victoria Young’s claims with the conditional probabilities in Table 1 (computed using standard convolution equations).

Table 1: Conditional Probabilities of number of maternities given family size

Rows give the number of maternities; columns give the number of children.

Maternities  1            2            3            4            5            6            7
1            1.00000      0.014836942  0.000209     4.95E-06     *            *            *
2                         0.985163058  0.029234     0.000629     1.59E-05     1.88E-07     2.03976E-09
3                                      0.970557     0.043201     0.001251     3.57E-05     6.89055E-07
4                                                   0.956165     0.056747     0.002064     6.70445E-05
5                                                                0.941987     0.069882     0.003059676
6                                                                             0.928019     0.082614512
7                                                                                          0.914258077

The conditional probabilities are read down the columns. (Asterisks are used to indicate values that could not be estimated from the government statistics.) For example, the probability of three maternities given seven children is in row 3 and column 7 and is equal to 6.89055E-07. (This number is in scientific notation and indicates the value of 0.00000068905507, or one in 1,450,000.)
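The convolution behind Table 1 can be sketched in a few lines of Python. This is a reconstruction under stated assumptions rather than the original calculation: the singleton probability is taken as whatever remains after twins, triplets and quadruplets, each maternity is treated as independent, and every possible number of maternities is weighted equally before each column is normalized. The equal weighting is my assumption; the function names are illustrative.

```python
def convolve(a, b, size):
    """Distribution of the sum of two independent counts, truncated to `size` entries."""
    out = [0.0] * size
    for i, pa in enumerate(a):
        for j, pb in enumerate(b):
            if i + j < size:
                out[i + j] += pa * pb
    return out


def maternities_given_children(per_maternity, max_children):
    """Return table[n][m] = P(m maternities | n children).

    per_maternity[k] is the probability that one maternity yields k children
    (index 0 is unused and set to 0.0).
    Assumption: each number of maternities gets equal prior weight.
    """
    size = max_children + 1
    base = (list(per_maternity) + [0.0] * size)[:size]
    joint = {}  # joint[m][n] = P(n children | m maternities)
    dist = base
    for m in range(1, max_children + 1):
        joint[m] = dist
        dist = convolve(dist, base, size)  # add one more maternity
    table = {}
    for n in range(1, max_children + 1):
        column_total = sum(joint[m][n] for m in joint)
        table[n] = {m: joint[m][n] / column_total
                    for m in joint if joint[m][n] > 0.0}
    return table


TOTAL = 615_787  # denominator quoted from the government statistics
twins, triplets, quads = 9_001 / TOTAL, 127 / TOTAL, 3 / TOTAL
single = 1.0 - twins - triplets - quads  # assumption: the remainder are singletons
table = maternities_given_children([0.0, single, twins, triplets, quads], 7)

p = table[7][3]  # three maternities given seven children
print(f"{p:.3e}, about one in {1 / p:,.0f}")
```

Under these assumptions the entry for three maternities given seven children comes out close to the 6.89055E-07 shown in the table. Cells that cannot occur with at most four children per maternity are simply left out of the result; they correspond to the asterisks.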

We see that Victoria’s initial claim of quadruplets was very extreme (even though the data show that quadruplets are delivered in the UK), with a probability of about one in 200,000. This is the sort of value that we typically find with extreme occurrences in Caveon Data Forensics. Her claim of six children with two maternities is even more extreme, with a probability of about one in 5.3 million (1.88E-07 in Table 1). And her final claim of seven children in three or fewer maternities has a probability of about one in 1.4 million.

The claimed birth spacing is very unusual also. Victoria claimed the twins were born eight months after the quadruplets in September 2005. Birth spacing statistics from the UK website only provide a median statistic of 37 months between the first and second maternity and 42 months between the second and third maternity (Table 11.3 from the UK government report). We don’t have a lot of statistical information but for the purposes of this exposition we assume the birth spacing data follow an exponential distribution (waiting time distribution; this assumption should be tested in practice). The median will be a good estimator for the mean. Using this estimate we find that the probability of having a second maternity within 8 months or less is about one chance in one trillion. We also find that the probability of having two maternities within 18 months or less (we need the distribution of the sum, so we add the medians together) is 1 in 10^25 (roughly one trillion squared).

We have found that it is always useful to combine the probability evidence together. After all, Victoria’s motive was to acquire a large family as quickly as possible so as to maximize benefits payments. Using techniques developed at Caveon, we evaluate her final claim of seven children with three or fewer maternities in 18 months or less. The estimated probability is one chance in 10^31 (roughly one in ten billion cubed). Yes, the benefits people were justified in being suspicious. If their systems had implemented these types of probability analysis for fraud detection, they might have been able to save the UK some embarrassment and expense by catching a cheater more quickly.

In data forensics work we proceed just as I have illustrated above. We create population models. We assume the data conform to the models (i.e., there is no cheating). We test the anomalous data against the model and eventually compute probabilities. It is nearly always the case that the data do not conform precisely to the model, but the models provide sufficient guidance that objective statements concerning the improbability of the extreme data may be made.


