Archive for the 'Statistics' Category


You can manage and you can measure!


Thursday, March 6th, 2008

The Association of Test Publishers (ATP) Conference of 2008 ended yesterday. As always, it was a good conference. In 2004 we stated, “You can’t manage what you don’t measure.” As a sponsor of the conference, we placed a bag of M&M’s (i.e., manage and measure) in each attendee’s conference packet. We also printed the message on the hotel room key cards.

I have just completed analyses for three testing programs and I am so impressed with what they have done that I want to share their results with you. Good news concerning exam security is refreshing in the midst of so many cheating stories. We recognize dramatic acts of heroism, but often ignore the good that happens with steady, persistent progress. I am so proud of these three programs. They are achieving their common goals: “Reduce cheating, strengthen exam security and emphasize ethical test taking.” The data demonstrate this convincingly. Caveon’s message at ATP this year was, “The answer is in the data.” So let’s look at the data.

Figure 1: Percent of anomalous tests for three programs (side-by-side comparison of cheating reduction)

Let me describe the data in Figure 1. The percent of anomalous tests in successive analyses is plotted for each program. A trend line has been fit to the data to aid your eye in visualizing the trend pattern. An anomalous test is one that deviates from normal test taking, and will exhibit at least one of the following: aberrance (answering hard questions correctly and missing easy questions), large numbers of erasures, inexplicable score changes from a previous test score, or excessive similarity in the selected answers with at least one other test. An anomalous test does not mean the test taker cheated. For example, when we observe excessively similar tests it is very likely that one person cheated (the copier) and the other person did not (the source). The percent of anomalous tests does not measure the precise number of people who have cheated, but it is highly correlated with that number.

These data are important because they demonstrate that all high-stakes testing programs, irrespective of industry or application, can effectively reduce cheating. They illustrate that reductions in cheating can occur with persistence and dedication. Let me briefly describe each program and some of the positive steps they have taken.

Program 1: This program provides a professional certification with high security. We estimate that there was a 45% reduction in cheating in three years. They have followed up on every case that appeared to be a security violation and every test site that appeared to have lax security. They have emphasized proctor training. They are now reviewing their test taker agreements, proctor training, identification procedures, and physical security with the intent of using the best known security protocols.

Program 2: This program is a public education program. We estimate that there was a 72% reduction in cheating in two years. They have rewritten their test administration manuals and have begun test administration monitoring. They assign a conditional status to extremely anomalous test results and require local review of those test results. They are receiving reports that the students being flagged are admitting to having cheated.

Program 3: This program administers tests in the service industry. We estimate that there was a 78% reduction in cheating in one year. They have stressed ethical test taking. They have revised their test taking agreements and strengthened test administration policies to allow for scores to be invalidated with an appeals process. They have refreshed test forms which appeared to be exposed. They are researching the next phase of security improvements: test site monitoring and appropriate disciplinary measures for test administration personnel who may be helping test takers inappropriately.

These very different programs were the same in one important way: They started where they were, they created a plan, and they were not discouraged. Each was taken aback by the first data forensics report (we always find something disconcerting), but they pressed forward and executed their plan. Best practices used by these programs include: test site monitoring, emphasis on ethical test taking, invalidating scores as per policy, refreshing tests which appear to be overexposed, and updating their security procedures.

Let’s give credit where credit is due. The numbers are impressive and the data do not lie. These programs have earned our respect and admiration.



Trojan Items and Answer-key Arbitrage


Sunday, March 2nd, 2008

Today is the first day of the annual ATP Conference (Association of Test Publishers). This afternoon I will present a workshop titled, “Strategies and Tactics for Limiting Item Exposure.” We will be exploring innovative ideas for protecting tests and items from theft. It’s easy to understand why test publishers are concerned about test theft. High-quality items are expensive to produce and represent a substantial investment. Item development costs of $1,000 or higher per item are not unusual. In an afternoon, a thief can compromise an investment of $250,000 or more, easily. Most testing professionals will state that item theft is their number one security concern. I discussed this previously in: What is your top security concern?

I can’t share the entire workshop content with you in this short essay. But, I can share with you Gene Radwin’s (of EMC Corporation) intriguing idea of answer-key arbitrage and Trojan items. The idea was briefly mentioned in: Student outwits FCAT with secret pattern. Just as the Trojan horse was the Greeks’ surprise weapon for outwitting the people of Troy, we hope to outsmart users of brain-dump content using Trojan items.

The basic idea of the Trojan item as developed and presented to me by Gene Radwin (email: radwin_gene at emc.com) is to place very easy items on the test which are miskeyed. If a test taker gives the miskeyed answers (and not the correct, easy answers) we have strong evidence that brain-dump content is being used. The fundamental principle is to create a test-within-a-test to detect test fraud. We booby-trap selected items by changing them so that a different answer choice is now correct, and the compromised answer is incorrect. Without knowing which items are booby-trapped, the brain-dump user proceeds in ignorance, until detected. Just to illustrate, consider a math item that I “borrowed” from the SAT practice test.

Table 1: Example of a Trojan item

We do not expect the brain-dump user who has memorized the “Exposed” item to notice the small change in the “Trojan” item. As a result, the cheater will give the originally correct, but now incorrect, answer “C,” and at the same time the honest test taker will give the correct answer “E.” The change in the answer key gives us a leverage or arbitrage point, creating a powerful difference in the statistical expectations.

In order to be effective, several Trojan items will be required on the exam. I haven’t done a rigorous analysis of the statistical power of the procedure, but my current intuition suggests that ten to twelve questions will be needed.
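As a rough illustration of why several Trojan items are needed, the sketch below treats each Trojan item as an independent trial and computes a binomial tail probability. It is not the rigorous power analysis mentioned above. The per-item probability `Q_HONEST` (the chance that an honest test taker happens to choose the old, now-incorrect answer) is a made-up value for illustration, as are the function name and the item counts.

```python
from math import comb

def binomial_tail(n, k, q):
    """P(X >= k) when X ~ Binomial(n, q)."""
    return sum(comb(n, j) * q**j * (1 - q)**(n - j) for j in range(k, n + 1))

# ASSUMPTION: an honest test taker picks the old (now wrong) answer on an
# easy Trojan item with probability 0.05. This value is hypothetical.
Q_HONEST = 0.05

for n_items in (6, 10, 12):
    # Chance that an honest test taker gives the old answer on every Trojan item
    p_all = binomial_tail(n_items, n_items, Q_HONEST)
    print(f"{n_items} Trojan items: P(all old answers by chance) = {p_all:.3e}")
```

Under these assumptions the chance explanation weakens quickly as Trojan items are added, which is the intuition behind asking for ten to twelve of them. A real analysis would estimate the per-item probabilities from data rather than fix them.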

We recently analyzed data where one individual was suspected of having prior access to the test content. Six miskeyed items were present on the exam, and we found that the suspect gave the keyed (but actually incorrect) answer on all six. Using item response models, we analyzed the “score” for the miskeyed items. (We do not use standard regression techniques because the data are not normally distributed, being highly constrained and skewed.) These data are shown in Figure 1.

Figure 1: Analysis of 6 miskeyed items

We see two extreme data points in Figure 1, corresponding to the suspected exam and another exam (they had probabilities of one in 5,000 and one in 1,000, respectively). The expected score on the miskeyed items was approximately two. We note that there is no correlation between the raw score on the test and the score on the miskeyed items.

In the above example, analysis of miskeyed items detected a potential testing irregularity. When Trojan items are specifically designed as described above, we expect to see a strong negative relationship between the Trojan items and the total score. In other words, high scoring individuals will provide the correct answer and not the original answer. This negative relationship improves our ability to detect users of brain-dump content.

In addition to my own analyses, one of our clients has told me of great success in using these techniques. For obvious reasons, the client does not want brain-dump users to know which tests are treated with Trojan items and how their cheating is being detected. When cheaters realize they are being punished for using brain-dump content, they will quit using the content. Then we will be satisfied. We just want test takers to do their own work and demonstrate their own ability when they take tests.



Can you prove cheating on tests using statistics?


Monday, February 11th, 2008

There is a children’s game known variously as “Whisper,” “Secrets,” or “Gossip” where a secret is shared and passed from one player to the next. The last player hearing the secret says it aloud, often with hilarious results. These same distortions happen in the news media, as journalists cite other reports or each other. Such a misquote from the Star-Telegram concerning additional security announced by the TEA (Texas Education Agency) for the TAKS (Texas Assessment of Knowledge and Skills) caused me to pause and reflect about using statistical evidence to “prove” that someone cheated on a test.

The reporter wrote, “Among other security measures: … Scramble field test questions on tests to provide proof if someone is copying someone else’s answer sheet.” (Italics added.) http://www.star-telegram.com/news/story/433614.html. Being well aware of the controversy surrounding the use of statistics, alone, to prove cheating, I immediately doubted the accuracy of the above statement. Actually, on June 7, 2007, Shirley Neeley announced that “the Texas Education Agency today will immediately initiate the following: … analyze scrambled blocks of test questions to detect answer copying…” TEA later clarified that the scrambling would only involve field test items. The Dallas Morning News was quick to criticize the scrambling plan, but I applauded TEA’s intent to detect cheating behavior using statistics.

We naturally ask whether statistical evidence can be relied on to detect cheating. Many authors have expressed the opinion that statistical evidence must be corroborated by eye-witness accounts before making allegations of cheating. I can understand this position if the statistics are not reliable. In my opinion, reliable evidence must meet the following conditions:

  1. It must be factual,
  2. It must be objective,
  3. It must be credible, and
  4. It must be defensible.

If statistical evidence meets the above conditions, I believe that it can be relied upon, whether corroborating eye-witness accounts are available or not. Statistical evidence is

  1. factual when it is based on test result data (an actual record of the test event),
  2. objective when it provides a statistic with a probability statement,
  3. credible when the statistics have been shown to work because the models accurately depict actual test taking, and
  4. defensible when the underlying science withstands scrutiny.

An additional fifth criterion the evidence must meet for taking action on a suspected instance of cheating is that the evidence must be strong. Statistical evidence is strong when the calculated probabilities are so small that we no longer believe the observed data are the result of normal test taking. Statistics can provide guidance for determining how strong is strong enough to take action, but ultimately the establishment of a probability threshold (i.e., the strength of the statistic) is a matter of policy that must be answered by the testing program administrator.

It is important with any statistical investigation to choose statistics that are well-suited and designed for the task at hand. For example, if the concern is that answer sheets are being modified, then erasure counts should be analyzed. Having analyzed over one hundred data sets for a wide variety of clients including state Departments of Education, admissions tests, certification programs, and licensure exams, I can unequivocally state that answer copying is the predominant means of cheating on tests. Therefore, it is especially relevant in this discussion concerning the reliability of statistical evidence to discuss answer copying and statistics that are designed to detect answer copying.

As you reflect upon the principles that I have outlined, I would ask you to consider the data in Table 1. The table contains differing probability values that a testing program administrator might be asked to evaluate. These are sampled answer-copying statistics (i.e., counts of identical answers) from a test having 240 items. With this many items on the test, the central limit theorem will generally apply so I have included a Z-Score in the table, as a point of reference.

Table 1: Sampling of test similarity statistics

Identical answers   Expected identical answers   Standard deviation   Z-score   Probability index
168                 81.3                         7.2                  12.0      30.3
171                 102.3                        7.4                  9.3       19.9
130                 76.4                         7.1                  7.5       12.4
154                 107.7                        7.4                  6.3       9.5
128                 87.9                         7.3                  5.5       7.3
108                 74.3                         7.1                  4.7       5.5
107                 75.1                         7.1                  4.5       5.0
120                 89.4                         7.3                  4.2       4.6
115                 86.1                         7.3                  4.0       4.2
128                 103.9                        7.4                  3.3       3.1

At Caveon we deal with extremely small probability values, so we typically express those using “an index” where the probability is one in 10 to the power of the index (p = 10^(-index)). The most extreme case in Table 1 has a probability of one in 10 to the thirtieth power. These data are definitely not due to normal test taking.
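The Z-score and index arithmetic can be sketched in a few lines. The code below recomputes the Z-score for two rows of Table 1 and converts a normal-approximation tail probability into an index. This is only the point-of-reference calculation: the probability index printed in Table 1 evidently comes from a different model, so the normal-approximation index will not match it exactly. The function names are illustrative.

```python
from math import erfc, log10, sqrt

def z_score(observed, expected, sd):
    """Standardized distance of the observed count from its expectation."""
    return (observed - expected) / sd

def normal_upper_tail(z):
    """P(Z >= z) for a standard normal variable."""
    return 0.5 * erfc(z / sqrt(2))

def probability_index(p):
    """Index such that p = 10 ** (-index)."""
    return -log10(p)

# First and last rows of Table 1: (identical, expected, standard deviation)
rows = [(168, 81.3, 7.2), (128, 103.9, 7.4)]
for observed, expected, sd in rows:
    z = z_score(observed, expected, sd)
    # ASSUMPTION: normal approximation; used here only as a reference point
    approx_index = probability_index(normal_upper_tail(z))
    print(f"z = {z:.1f}, normal-approximation index = {approx_index:.1f}")
```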

Assuming that you accept the statistical evidence as being reliable, the decision needed by you, the testing program administrator, is how low in Table 1 should you go? Where do you set the cut point? These data illustrate that if you set the cut point too low, you might accuse some individuals of answer copying without having strong evidence. If you set the cut point too high, you might allow several individuals who have cheated to escape discipline.
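One way to think about the cut point is to ask how many pairs of tests would be flagged by chance alone at a given index. The sketch below assumes, purely for illustration, that every pair of test takers is compared and that the index accurately reflects the probability under normal test taking. The program size of 1,000 test takers is hypothetical.

```python
def expected_chance_flags(n_test_takers, index_cut):
    """Expected number of pairs flagged by chance at a given index cut point.

    ASSUMPTIONS: all pairs are compared, and each pair exceeds the cut point
    by chance with probability 10 ** (-index_cut).
    """
    n_pairs = n_test_takers * (n_test_takers - 1) // 2
    return n_pairs * 10.0 ** (-index_cut)

# Hypothetical program with 1,000 test takers
for index_cut in (3, 5, 7, 9):
    flags = expected_chance_flags(1000, index_cut)
    print(f"index cut {index_cut}: about {flags:.4g} chance flags expected")
```

Under these assumptions a low cut point produces many chance flags simply because so many pairs are compared, which is why the threshold is ultimately a policy decision informed by program size.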

I will elaborate more on this topic, next time. Until then, may your tests remain secure.



‘Sabermetrics,’ baseball and steroids


Tuesday, January 8th, 2008

Prognostications are that Mark McGwire will again not be inducted into Baseball’s Hall of Fame this year, because of suspected steroid use. Here is the URL to the article:

http://www.nationalpost.com/sports/story.html?id=221516

In 2005, McGwire ducked the direct question whether he had used steroids or performance-enhancing drugs (PEDs). Many statisticians think that steroids do not improve performance, because “most baseball skills depend primarily on reaction times and judgments, factors unaffected (or even degraded) by these drugs.” Those who study the numbers, “sabermetricians,” (coined from SABR – Society for American Baseball Research) “think the writers should set aside their biases and moral indignation and look at the facts: there’s simply no evidence steroids or other PEDs actually improve performance in baseball.”

One of the quotes in the article states, “While Bonds’ home run output rose significantly in the years after he supposedly started taking drugs, his profile is strikingly similar to Babe Ruth’s high performance level almost right until the [end] of his storied career, they say.” The actual data do not support this statement, as you can see in Figure 1, which compares Barry Bonds’ offensive performance against three of the other great hitters of the game: Babe Ruth, Ted Williams and Ty Cobb. I used http://www.baseball-reference.com/ as the source for my statistics.

Figure 1: Offensive performance comparison

Comparison of hitters

The OPS+ statistic is a normalized statistic that is adjusted for opponents’ defensive strengths and ball park friendliness to hitters. A value of 100 is average performance. The above statistic shows that Barry Bonds’ performance was below that of the compared hitters for the first 15 years of his career. Then, suddenly and dramatically, his performance soared for the remaining years of his career, surpassing all prior years, at a stage when the offensive performance of the other hitters was definitely declining. Admittedly, this is arm-chair forensics, but the data suggest that steroid use did improve Barry Bonds’ performance.

Roger Clemens has emphatically denied that he took steroids. His trainer, McNamee, is reported in the Mitchell Report as stating that he injected Clemens with steroids from 1998 to 2001. Clemens is scheduled to testify before Congress and there are allegations of defamation of character being “batted” around.

http://www.bloomberg.com/apps/news?pid=20601079&sid=a0z.L9DGg68A&refer=home

Figure 2 compares Roger Clemens’ ERA (earned run average) performance against three other great pitchers of their time.

Figure 2: ERA comparison

Comparison of pitchers

The ERA+ statistic is a normalized earned-run-average statistic which has been adjusted for opponents’ strengths and other factors. A value of 100 is average. Clemens’ first year of baseball is 1984 and the four-year period of 1998 to 2001 corresponds to his 15th through 18th years of play. The data show that during this time, Clemens’ performance was average. However, these data are unusual because some of Roger Clemens’ best years came after he turned forty, an age when nearly all players have retired from baseball and several years after the alleged steroid use.

While I did not expect to arrive at a definitive answer concerning these two players, I found it intriguing to apply forensic thinking to the current allegations of cheating and doping that are being circulated.

 



Anatomy of the meltdown of a forensic procedure


Friday, December 7th, 2007

The CBS News program “60 Minutes” and the Washington Post aired an investigative report on November 16 criticizing the FBI for failing to notify relevant jurisdictions that hundreds of inmates have been jailed using a flawed forensic methodology. Despite discontinuing the use of “bullet lead” analysis in 2005 because of validity concerns, the FBI had taken no action to inform the courts that some defendants were potentially innocent and wrongfully imprisoned.

http://www.cbsnews.com/stories/2007/11/16/60minutes/main3512453.shtml

Bullet lead analysis was first used in the investigation of the assassination of JFK, and was routinely used in the 1980s when bullets were so misshapen that ballistic evidence was unobtainable. The essential idea is that trace elements in lead vary naturally and that bullets could be “matched” as coming from the same source (i.e., the same box of bullets) by comparing the compositions of these trace elements. In the 2005 press release, the FBI stated, “One factor significantly influenced the Laboratory’s decision to no longer conduct the examination of bullet lead: neither scientists nor bullet manufacturers are able to definitively attest to the significance of an association made between bullets in the course of a bullet lead examination.”

http://www.fbi.gov/pressrel/pressrel05/bullet_lead_analysis.htm

We naturally ask, “How is it possible that a procedure could be trusted for 40 years, be invoked in 2,500 investigations, be used as testimony in about 500 of those cases, and then be discredited?” The FBI commissioned an independent review of the procedure in 2002 by the National Research Council. Their report, completed in 2004, is comprehensive and fascinating to read. A copy may be purchased at the following URL: http://www.nap.edu/catalog.php?record_id=10924. The findings of this report convinced the FBI to discontinue the bullet lead analysis.

After browsing through this report and reading the findings and recommendations, it is clear that the FBI procedure devised in the 1960s could not withstand public scrutiny. From my perspective, the most troubling aspect of the analysis was that it was (and is) unknown how many compositionally similar bullets were produced and where they were distributed. This means that a probability statement concerning the likelihood of a false positive (i.e., saying the bullets came from the same box when they didn’t) was impossible. Without such a statement the forensic examiner cannot state with any reliability or objectivity that the bullet found at the crime scene came from the same box as bullets found in the possession of the suspect.

The NRC also indicated that the method of computing the statistical match should be revised. From my perspective this is because the FBI’s computational procedure was not based on a statistic. It was computed using statistical ideas, but not supported with statistical distribution theory. This procedure falls into the realm of “ad-hoc analytics.” It seemed good at the time. There wasn’t a better idea. But, there was no way to determine error rates and probabilities associated with the procedure. I have seen a lot of ad-hoc statistical procedures in my day and they nearly always fail eventually because they are based on some statistical idea but they have no statistical theory that supports them. In the long run, the queen of statistics (i.e., natural variability) overwhelms all procedures that do not estimate probability models from empirical data.

I have a good friend who quoted the maxim, “Models before algorithms” often. By this he meant that you should analyze the processes that generate the data and the variability associated with the data before you build detection methodologies. I have tried to follow this rule assiduously in devising detection methodologies for Caveon Data Forensics. Without the guidance of reasonable probability models, statistical interpretations of the data are subjective and indefensible.



Benefits-payment cheater caught using statistics


Tuesday, November 20th, 2007

The other day a woman in the UK was caught in a lie where she fabricated the existence of seven children to receive government benefits. She claimed to have given birth to quadruplets in 2005, to twins (who were delivered one week apart) in the same year, and then to a seventh child in 2007. None of these children existed.

http://www.dailymail.co.uk/pages/live/articles/news/news.html?in_article_id=494261&in_page_id=1770

The article starts: “Any mother who has given birth to quadruplets needs all the help she can get. So benefits staff were happy to provide support for Victoria Young in raising babies Kier, Kie, Kyla and Conrad. There was just one problem – none of them existed. …”

The benefits staff got suspicious on the seventh child and investigated the crime. By that time, Victoria Young “had swindled more than £40,000 in benefits payments with her bogus brood of seven babies in the space of 18 months.” (direct quote)

It’s natural to ask how data forensics techniques could be applied to this situation. We start with models that describe the population. To test the above claims we need to know about multiple birth probabilities, fertility rates, and birth spacing statistics. I found the needed statistics at a government website: http://www.statistics.gov.uk/downloads/theme_population/FM1_32/FM1no32.pdf

In 2003, only one set of quadruplets survived birth, which puts the probability of live quadruplets at approximately 1 in 600,000 (Table 6.4 from the government report; see also Multiple Births in Wikipedia: http://en.wikipedia.org/wiki/Multiple_Births). From the table of statistics, the probability of twins is about 9,001 in 615,787, of triplets is about 127 in 615,787, and of quadruplets is about 3 in 615,787. If we use these values and assume that birth multiplicity is independent of each occurrence of maternity, then we can test Victoria Young’s claims with the conditional probabilities in Table 1 (computed using standard convolution equations).

Table 1: Conditional Probabilities of number of maternities given family size

Number of                                  Number of Children
Maternities   1         2             3          4          5          6          7
1             1.00000   0.014836942   0.000209   4.95E-06   *          *          *
2                       0.985163058   0.029234   0.000629   1.59E-05   1.88E-07   2.03976E-09
3                                     0.970557   0.043201   0.001251   3.57E-05   6.89055E-07
4                                                0.956165   0.056747   0.002064   6.70445E-05
5                                                           0.941987   0.069882   0.003059676
6                                                                      0.928019   0.082614512
7                                                                                 0.914258077

The conditional probabilities are read down the columns. (Asterisks are used to indicate values that could not be estimated from the government statistics.) For example, the probability of three maternities given seven children is in row 3 and column 7 and is equal to 6.89055E-07. (This number is in scientific notation and indicates the value of 0.00000068905507, or one in 1,450,000.)
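The convolution behind Table 1 can be sketched as follows. The multiple-birth rates are the ones quoted above from the government report. The sketch assumes that maternities are independent, that no maternity yields more than four children, and that each column is normalized over the possible numbers of maternities (in effect treating every number of maternities as equally likely beforehand). That last assumption is mine, chosen because it appears to reproduce the tabled values. The function names are illustrative.

```python
TOTAL_MATERNITIES = 615_787

# Probability that one maternity yields k children (rates quoted in the text)
BIRTH_PROB = {
    2: 9_001 / TOTAL_MATERNITIES,  # twins
    3: 127 / TOTAL_MATERNITIES,    # triplets
    4: 3 / TOTAL_MATERNITIES,      # quadruplets
}
BIRTH_PROB[1] = 1.0 - sum(BIRTH_PROB.values())  # singletons

def children_distribution(n_maternities, max_children):
    """P(total children = n) after n_maternities independent maternities."""
    dist = {0: 1.0}
    for _ in range(n_maternities):
        new_dist = {}
        for total, prob in dist.items():
            for k, p_k in BIRTH_PROB.items():
                if total + k <= max_children:
                    new_dist[total + k] = new_dist.get(total + k, 0.0) + prob * p_k
        dist = new_dist
    return dist

def maternities_given_children(n_children):
    """One column of Table 1: P(m maternities | n_children children).

    ASSUMPTION: each possible number of maternities gets equal prior weight.
    """
    weights = {
        m: children_distribution(m, n_children).get(n_children, 0.0)
        for m in range(1, n_children + 1)
    }
    norm = sum(weights.values())
    return {m: w / norm for m, w in weights.items()}

column_7 = maternities_given_children(7)
print(f"P(3 maternities | 7 children) = {column_7[3]:.5e}")
```

Cells marked with an asterisk in Table 1 come out as zero in this sketch, because the model has no births larger than quadruplets.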

We see that Victoria’s initial claim of quadruplets was very extreme (even though the data show that quadruplets are delivered in the UK), with a probability of about one in 200,000. This is the sort of value that we typically find with extreme occurrences in Caveon Data Forensics. Her claim of six children with two maternities is even more extreme, with a probability of about one in 5.3 million (1.88E-07 in Table 1). And her final claim of seven children in three or fewer maternities has a probability of about one in 1.4 million.

The claimed birth spacing is very unusual also. Victoria claimed the twins were born eight months after the quadruplets in September 2005. Birth spacing statistics from the UK website only provide a median statistic of 37 months between the first and second maternity and 42 months between the second and third maternity (Table 11.3 from the UK government report). We don’t have a lot of statistical information but for the purposes of this exposition we assume the birth spacing data follow an exponential distribution (waiting time distribution; this assumption should be tested in practice). The median will be a good estimator for the mean. Using this estimate we find that the probability of having a second maternity within 8 months or less is about one chance in one trillion. We also find that the probability of having two maternities within 18 months or less (we need the distribution of the sum, so we add the medians together) is 1 in 10^25 (roughly one trillion squared).

We have found that it is always useful to combine the probability evidence together. After all, Victoria’s motive was to acquire a large family as quickly as possible so as to maximize benefits payments. Using techniques developed at Caveon, we evaluate her final claim of seven children with three or fewer maternities in 18 months or less. The estimated probability is one chance in 10^31 (roughly one in ten billion cubed). Yes, the benefits people were justified in being suspicious. If their systems had implemented these types of probability analysis for fraud detection, they might have caught the cheater more quickly and saved the UK some embarrassment and expense.

In data forensics work we proceed just as I have illustrated above. We create population models. We assume the data conform to the models (i.e., there is no cheating). We test the anomalous data against the model and eventually compute probabilities. It is nearly always the case that the data do not conform precisely to the model, but the models provide sufficient guidance that objective statements concerning the improbability of the extreme data may be made.



Use of Statistics for Detecting Cheating on Tests


Friday, November 16th, 2007

Occasionally I search for the latest thinking about how to prevent and detect cheating on tests. I saw this presentation from the Annual Conference (2007) of the Arizona State BON (Board of Nursing) and Statewide Nurse Educators (URL is below). In my opinion this presentation is very good and provides a lot of perspective for dealing with test security issues.

http://www.azbn.gov/documents/news/Statewide%20Educators%20Academic%20Dishonesty.10.05.07.pdf

Using test result data to detect and prevent cheating was not discussed in this presentation. I think there are good reasons for the omission: (1) cheating detection software has mostly been created for large testing programs and is not readily accessible to anyone who administers tests, and (2) many people are not comfortable with using statistics to make inferences about cheating. My purpose in writing is to discuss this second issue.

Being a statistician, I admit to having specific ideas about data and test scores. Some of these ideas are not generally accepted and may not be popular. However, the idea of using statistics to detect problems with the test administration seems natural and reasonable. Anyone who would accept test scores as being valid and reliable but would not use test result data to make inferences about the quality of the test administration holds an inconsistent position. I say this because the very act of administering a test and obtaining a test score is a statistical procedure with the intent of making a statistical inference. When we give tests we are not interested in the test taker’s performance on the actual questions that were presented. Instead, we are interested in inferring or estimating the test taker’s knowledge or competence in the tested domain. Making such an inference implicitly acknowledges that the test score is a statistical measure and subject to uncertainty. If other questions had been presented, there is no doubt that the test scores would have been different.

If you do not agree with the above perspective you will not agree with the corollary that I now present. Despite disagreements, I now stipulate that the best and most reliable record of the testing session is the actual set of recorded responses (and any other measurements that can be obtained such as erasures or response times). These data are more reliable than proctoring observations, or video recordings, or any other externally derived measure of the testing session. If you can trust the recorded responses to calculate a test score and make decisions about a test taker’s future, you should be equally comfortable using the recorded responses to make inferences about the quality of the testing session and whether testing irregularities may have occurred.

Because many statistical techniques may appear to be arcane or even “mystical,” the statistician must be very careful in selecting and using techniques that are based in solid statistical principles. Statistics will be most easily defended if they are derived from a probability model that describes the behavior being observed and if they provide objective probability statements concerning the extremeness of any observation. These criteria are rather stringent and lead to the natural exclusion of many techniques that have been investigated by researchers. For example, person-fit statistics are ideal for describing whether a test taker’s response pattern is consistent with the normal pattern of test taking (at Caveon we usually use the word “aberrant” to describe inconsistent response patterns). However, even though there is a large literature on person-fit statistics, no researcher has yet published how to make objective probability statements about aberrant test taking. Without having statistically sound inferential models, the practitioner must devise ad-hoc methods that are empirically derived from the analysis of the data. There are two problems with this approach: (1) the judgment of what constitutes an extreme observation is subjective and may vary depending upon the situation, and (2) the modeling technique, itself, is not easily defended or replicated. I think these problems are fundamental reasons why test administrators have been uncomfortable with using statistics to make inferences about cheating.

At Caveon, we have worked very hard to create algorithms that are capable of computing probabilities for the statistics that we use in data forensics work. Part of that work involves understanding the probability models and assumptions that underlie the models. For example, “answer-copying” statistics that are based on the idea of similarity and excess similarity should be derived from probability models. One such example is the class of answer-copying statistics presented by van der Linden and Sotaridona (2006), “Detecting answer copying when the regular response process follows a known response model,” Journal of Educational and Behavioral Statistics, 31, 283-304. In this paper the authors assume that tests are taken independently when deriving the probability model for the number of identical responses (the statistic of interest). We have currently implemented person-fit statistics (for detecting aberrance), similarity statistics (for detecting collusion, test coaching, answer copying and proxy test taking), erasure statistics (for detecting test tampering), gain-score statistics (for detecting unusual learning patterns), and response latency statistics (for detecting content exposure), and we continue to explore other statistics. I will discuss each of these later, as time permits.
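To make the idea of a probability model for the number of identical responses concrete, here is a minimal sketch. It is neither the published statistic nor Caveon’s implementation. It assumes that, under independent test taking, each item has some known probability that two test takers choose the same answer, and it computes the exact chance of seeing at least the observed number of matches. The per-item match probabilities used below are hypothetical. In practice they would come from a fitted response model.

```python
def identical_count_tail(match_probs, observed):
    """P(number of identical answers >= observed) under independence.

    match_probs[i] is the (assumed known) probability that the two test
    takers give the same answer to item i when working independently.
    """
    dist = [1.0]  # dist[k] = probability of exactly k matches so far
    for p in match_probs:
        new_dist = [0.0] * (len(dist) + 1)
        for k, prob in enumerate(dist):
            new_dist[k] += prob * (1.0 - p)   # item does not match
            new_dist[k + 1] += prob * p       # item matches
        dist = new_dist
    return sum(dist[observed:])

# HYPOTHETICAL example: 240 items, each with a 0.35 chance of a match
match_probs = [0.35] * 240
print(f"P(at least 130 identical answers) = {identical_count_tail(match_probs, 130):.3e}")
```

Because the tail probability comes directly from the model rather than from an ad-hoc threshold, it supports the kind of objective probability statement discussed above.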

 


