Randomly Parallel Tests
AN ANNOTATED BIBLIOGRAPHY

Written by David Foster, Ph.D.
INTRODUCTION
The purpose of this annotated bibliography is to introduce the reader to the history, principles, and dominant themes presented in the literature around domain-referenced (DR) testing and randomly parallel (RP) tests, particularly in their formative years in the mid-20th century. Many of the ideas presented in these early papers on DR and RP testing, while theoretically solid and promising, were not practically feasible at the time they were presented because the technology necessary to implement them was unavailable or too limited. Around the late 1970s, these ideas were largely forgotten, until recently.
Research on DR and RP tests, and on relevant psychometric models for these tests and their scores, has resumed today because people facing the pressures of modern testing (e.g., budget pressures; frequent criticisms of standardized test validity, bias, and fairness; ubiquitous fraud and cheating) have begun to realize that the core of our problems lies with our traditional test design, an outdated design in which the vast majority of the testing industry has been entrenched for over 100 years. While this bibliography is not comprehensive, it should acquaint (or reacquaint) the reader with DR and RP testing and their applications and benefits, as well as the technology that is now available to bring this test design out of the theoretical realm of the mid-century and into reality today.
Background
It is generally agreed that criterion-referenced (CR) testing was introduced by Robert Glaser in 1963. In its purest form, as defined by Glaser, CR testing should “provide information as to the degree of competence attained by a particular student which is independent of reference to the performance of others” (p. 520). In other words, the score obtained through CR testing should represent a measure of the student’s mastery of the content area, and moreover, CR testing differentiates itself from norm-referenced (NR) testing in that the interpretation of the score obtained by the individual should be independent of the interpretation of scores obtained by others.
Meanwhile, Frederic Lord had been exploring ways to calculate test reliability, including the underlying assumptions about error and “parallel” test forms, as early as 1953. Lord formally introduced the concept of “randomly parallel tests” in 1955, defining RP tests simply in this way: “Suppose that a large number of forms of the same test are administered to the same group of examinees, each form consisting of a random sample of items drawn from a common population of items… Test forms constructed [in this way] will be called randomly parallel forms or randomly parallel tests” (1955a, p. 1).
In 1968, H.G. Osburn introduced the idea of “universe-defined” tests, where items are randomly sampled from a well-defined universe of items, such that the score produced by this method “provides an unbiased estimate of [the test taker’s] score on some explicitly defined universe of item content” (p. 96). Eventually, researchers began replacing the term “universe-defined” with “domain-referenced.”
There is broad agreement in the literature on the definition of DR testing. Domain-referenced tests are generally agreed to be tests where items are randomly sampled from well-defined domains of items to produce a test (or form) for each examinee, which is considered “randomly parallel”; the score obtained from a DR test form represents an unbiased estimate of the proportion of the domain of items that the examinee would answer correctly.
Moreover, in almost every paper, the condition of a large and well-defined domain, or universe, of items is specified. It seems as if this cannot be emphasized enough, that the domain, and domain strata, be clearly defined and large enough to cover the target domain. In Brennan and Kane (1977), the authors reiterate that the universe of items should be well-defined “so that we may reasonably consider drawing random samples from it” (p. 278). This is critical to ensure that the scores obtained from the random sampling of items can be generalized to the entire domain.
The relationship between DR testing and RP tests is almost inextricable. The fundamental principle of the DR test design is that one creates an extensive, well-defined content domain from which items can be randomly sampled for each person’s test, whereas RP testing could be said to represent the procedure for randomly sampling those items to produce “parallel,” or equivalent, sets of items (or forms) to be administered to the test takers.
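To make the basic procedure concrete, the following is a minimal sketch (in Python, with invented names and an invented domain) of how a randomly parallel form could be drawn for each test taker from a large, well-defined domain of items. It illustrates the sampling idea only, not any particular author’s implementation.

```python
import random

def randomly_parallel_form(domain_items, form_length, rng=None):
    """Draw one randomly parallel form: a simple random sample of items,
    without replacement, from a large, well-defined domain."""
    rng = rng or random.Random()
    return rng.sample(domain_items, form_length)

# Hypothetical domain of 10,000 item identifiers; each test taker receives a unique 40-item form.
domain = [f"item_{i}" for i in range(10_000)]
forms = [randomly_parallel_form(domain, 40) for _ in range(3)]
```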
Researchers at the time attempted to address questions around the appropriate statistical approach for estimating reliability (or dependability, as Brennan and Kane prefer to call it) and validity for DR tests. Lord (1955a) proposes using the KR-21 formula for calculating test reliability, whereas Kane and Brennan (1980) follow the footsteps of Cronbach in their use of Generalizability Theory as the basis of their dependability calculations. Several processes are developed and presented, indicating there are appropriate options for developing the statistical models for these tests.
There also is debate about whether classical empirical item analysis is necessary for items randomly sampled from a large, well-defined domain. Some authors state definitively that item analysis is not necessary. Millman (1974) answers with an emphatic “No!”: item difficulty and discrimination should not factor into the selection of items for RP tests, because the use of item statistics “destroys the random selection process” (p. 339) and diminishes the interpretability of the test score with respect to the domain. However, Millman goes on to say that item statistics may be used for removal of faulty items or to inform revisions to the domain. Continued discussion and research on this topic is expected. Our recommendation is not to expect statistics on individual items to be available, because today’s technology can produce pool sizes for domains that are effectively infinite, precluding the need to collect data on each item to calculate item statistics.
It is not surprising that concerns about the security of test items and cheating are not frequently discussed in the literature from this time, because test fraud, though present, was simpler and less advanced than it is today. Interestingly, while cheating and theft have been enabled by advances in technology over the last 60 years, testing professionals have failed to take advantage of the very same technology to prevent, deter, and/or detect test fraud. We are still designing and administering tests in much the same way we did 100 years ago, yet the world around us has completely changed. Some authors do allude to security concerns, especially with respect to item overuse and disclosure. The ability of DR tests to solve certain security problems is discussed in Hively (1968). Furthermore, Hively (1974) makes a damning statement about NR testing: “Secrecy is the hallmark of NRT” (p. 13). This leads the reader to a natural follow-up question: What if I didn’t have to keep my items secret anymore?
Another theme frequently implied or openly discussed in the papers is the limitations of technology at the time and the practical feasibility of creating enough high-quality items and randomly sampling those items from huge domains. Indeed, several of the papers that propose specific technology, such as computer programs, might seem outdated today, since today we have the technology to auto-generate many high-quality items, randomly sample them, and produce a unique test form for every test taker. In some cases, it may just be interesting to read these papers to understand the challenges facing researchers and professionals at the time. This can help bring perspective to the reader and illustrate why DR testing and RP tests were not readily implemented at the time.
In other cases, these papers may help to illuminate a modern analog that researchers and professionals face today: How did they respond to the availability of computers and advancing technology at the time to improve test design and administration? Likewise, how might we use lessons learned from these past experiences to approach the opportunities provided by AI today?
It is apparent from all of these papers that there are significant benefits to DR testing and RP tests, making this design well worth investigation and consideration. As pointed out in many of these papers, the benefits include:
- Using DR testing and RP tests with modern technology, a program can create the large numbers of items needed, in some cases an effectively infinite pool of items, so that each test taker receives a unique form;
- This allows for the use of sample items or practice tests that are mirror copies of the high-stakes exam, where test takers can take and retake the test as many times as necessary without encountering the same items. This also means that item disclosure ceases to be a concern;
- The scores obtained provide unbiased estimates of the test takers’ capability on the entire domain of items; and
- Given that the scores are generalizable to the domain, and a test taker can take the test many times, the DR/RP test design can be used to track student progress over time and inform/improve future instruction, making this design particularly beneficial in educational assessment.
At Caveon we frequently hear our clients complain that their content is being disclosed at a rate faster than it can be created. Our clients pour their valuable resources into carefully concealing their content and items, and when their items are eventually disclosed, they pour more resources into containing and investigating the breach. Testing programs are forced to retire items and test forms early due to disclosure and to create more new, costly items, which themselves are destined to be disclosed soon after publication.
The DR/RP test design has the potential to solve many of our biggest problems today, including most forms of test fraud and cheating. We have the technology to fully implement this design, and most of the roadblocks that were present in the past have been cleared. Most forms of test fraud rely entirely on the fact that the items on traditionally constructed tests are fixed and repeatedly exposed, which brings us back to the question posed previously: What if your items didn’t have to be secret anymore?
Annotated Bibliography
Two reviewers contributed to this annotated bibliography: Dr. David F. Foster, Chairman and CEO of Caveon; and Jennifer L. Palmer, Operations Project Manager of Caveon’s Data Forensics division. For each of the citations, the reviewer’s initials (DFF or JLP) are provided following the citation. The reviewers offer some opinions about the content of the papers in the annotations. We expect readers to discover more interesting information, hopefully supporting the use of this long-overlooked test design.
The citations are given in chronological order by year and then alphabetical order within year.
1955
Lord, F. M. (1955a). Sampling fluctuations resulting from the sampling of test items. Psychometrika, 20(1), 1-22. DFF
In this paper Lord first introduces the concept of RP tests (p. 1) and the terms “randomly parallel forms” and “randomly parallel tests.” He proposes that “…a large number of forms of the same test are administered to the same group of examinees, each form consisting of a random sample of items drawn from a common population of items” (p. 1). He compares the process with the sampling that occurs when a random sample of examinees is selected for a research study.
He proposes a method for calculating the standard errors for individual test scores, and presents a formula for test reliability, the KR-21. He also describes the shape of the distribution of standard errors as binomial.
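For reference, the KR-21 statistic Lord relies on can be computed from total scores alone. The sketch below states the standard textbook form of the formula; the function name and the use of the population variance are choices made here for illustration, not taken from Lord’s paper.

```python
def kr21(total_scores, n_items):
    """Kuder-Richardson formula 21, computed from total scores only.
    k = number of items, M = mean total score, V = variance of total scores.
    KR-21 = (k / (k - 1)) * (1 - M * (k - M) / (k * V))
    (The population variance is used here; some texts use the sample variance.)"""
    k = n_items
    m = sum(total_scores) / len(total_scores)
    v = sum((x - m) ** 2 for x in total_scores) / len(total_scores)
    return (k / (k - 1)) * (1 - m * (k - m) / (k * v))
```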
DFF Note: This paper is actually a revised version of an earlier, December 1953, paper by Lord, provided as a technical report to the Office of Naval Research based on a contract the ONR had with Educational Testing Service (ETS). The paper was titled, The Standard Errors of Various Test Statistics When the Test Items are Sampled. Therefore, December 1953 is the actual date for the introduction of RP tests.
DFF Note: In a few papers, Lord used different terms for tests that were “constructed” to be “parallel.” He called these rationally parallel, statistically parallel, nominally parallel, and strictly parallel. These terms seem to be interchangeable.
Lord, F. M. (1955b). Estimating test reliability. Educational and Psychological Measurement, 15(4), 325-336. DFF
Lord discusses different issues for persons attempting to estimate test reliability, including the assumptions underlying different statistics, and the importance of the “exact definition of ‘parallel’” (p. 325). To elaborate, he proposes two new definitions of parallel test forms: RP tests and Stratified RP tests. Lord describes RP tests as follows: “If the items in two or more tests may be considered to have been drawn at random from the same large pool of items, the tests are called randomly parallel tests” (p. 328). Meanwhile, Lord referred to the Stratified RP tests as “matched-forms” tests in this paper, defined as stratifying the item pool on one or more characteristics of the items in advance of random sampling. He suggests that these two new definitions of parallelism seem “to have at least as good justification as those usually used, and perhaps better” (p. 325).
DFF Note: We need to remember that the overall testing context for Lord’s writing was, for the most part, traditional paper-and-pencil testing, and it had been for 40 years. Computers were talked about in the 1950’s, but mostly as a future technology. It would have been impractical to build RP tests at that time given the available technology and dominant paper-and-pencil testing approach. Without computers, randomization and having “large pools” of items were simply not feasible for most testing settings, or for testing on a large scale.
1959
Lord, F. M. (1959a). An approach to mental test theory. Psychometrika, 24(4), 283-302. DFF
Here, Lord describes several models, two of which describe the Stratified-RP test and the RP test as he discussed them in earlier articles. Each model is evaluated in terms of its assumptions, its definitions regarding true scores, the distributions of observed scores, and the distributions of measurement errors.
Lord, F. M. (1959b). Randomly parallel tests and Lyerly's basic assumption for the Kuder-Richardson formula (21). Psychometrika, 24(2), 175-177. DFF
This is a short note clarifying the assumptions behind the use of RP tests and the use of the KR-21 reliability statistic.
DFF Note: It is worth careful reading, as it deals with “item equivalence” and provides insight into how the number of items for an RP test affects the comparability of the scores from them: the more items in the RP test, the greater the ability to compare the scores.
Lord, F. M. (1959c). Statistical inferences about true scores. Psychometrika, 24(1), 1-17. DFF
Lord describes the benefits of applying the principles of random sampling of items from a large pool (creating RP tests by Type-2 sampling) to the example of matrix sampling, adding that examinees can also be considered as being randomly sampled from a population of examinees (Type-1 sampling).
Lord, F. M. (1959d). Tests of the same length do have the same standard error of measurement. Educational and Psychological Measurement, 19(2), 233-239. DFF
This is an impressive discussion about standard errors of measurement. Lord provides empirical, data-driven evidence from a wide variety of exams analyzed at some point by ETS scientists. The conclusion is stated in the title, but the paper makes clear that it applies to single individuals who had been or would be administered RP tests, not “rationally equivalent” tests.
DFF Note: In several papers, Lord takes care to point out the mistaken conclusion that a test has a single standard error of measurement. He often points out that standard error of measurement is different for each test score.
1960
Webster, H. (1960). A generalization of Kuder-Richardson reliability formula 21. Educational and Psychological Measurement, 20(1), 131-138. DFF
Webster supports Lord’s notion of RP tests and provides some advice on using the KR-21 as a measure of reliability for them.
1963
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16(2), 137-163. DFF
In comparing approaches to calculating reliability, Cronbach and his co-authors describe the advantages of assuming that items are randomly sampled. They state, “Randomness of sampling guarantees that in the population the means, variances, and intercorrelations of scores will be equal. One set of such randomly generated data is equivalent to another even though the tests individually are not equivalent” (p. 143).
DFF Note: The authors point out only two objections to the random-sampling model. The first is that universes, like domains, are usually vaguely defined; the second is that strict random sampling never occurs in practice.
1964
Lord, F. M. (1964). Nominally and rigorously parallel test forms. Psychometrika, 29(4), 335-345. DFF
This paper does not reference RP tests, but goes into detail on two alternatives, Nominally Parallel Tests and Rigorously Parallel Tests.
DFF Note: This paper is included in this bibliography to help place Lord’s recommendation of RP tests in a broader context: adopting RP tests as a model for test construction and use is a choice that psychometricians have, and that choice rests on weighing practical, statistical, and theoretical advantages and disadvantages.
1965
Lord, F. M. (1965). Item sampling in test theory and in research design. ETS Research Bulletin Series, 1965(2), i-39. DFF
In this paper Lord uses the model term for RP tests, “the item sampling model,” where the items on a test form “are considered as a random sample from a population of items” (p. 1). He writes about the simplicity of the model, its minimal assumptions, and that it “yields many important results” (p. 1). How to derive those important results is the purpose of the paper. He acknowledges a common objection to the item sampling model: simply that sampling items from a population of items is not ordinarily done. Quoting from p. 4, Lord states, “In line with this reasoning [that the statistical properties of random samples are well known], in testing work it will sometimes be essential to actually select items at random. In certain situations, this will be the only way to secure a firm basis for the necessary statistical significance tests and statistical inferences.” The paper contains much more statistical reasoning, and many more practical insights into errors of measurement and true scores.
Rajaratnam, N., Cronbach, L. J., & Gleser, G. C. (1965). Generalizability of stratified-parallel tests. Psychometrika, 30(1). DFF
In this paper the authors contrast the equivalence of test forms supported by classical theory with the item sampling models (with and without stratified random sampling). A part of the introductory paragraph provides the rationale for their work: “An alternative model (DFF Note: “alternative” here meaning an alternative to the rationally equivalent forms or equivalent-composites model from classical theory) which has received increasing attention in recent years regards a given measure as a random sample from a universe of measures whose homogeneity or equivalence is not specified a priori, and a composite test as a random sample of items from a universe of not-necessarily-equivalent items” (p. 39). The authors add a third model, describing the stratified random sampling model. Given the interest of Cronbach and his colleagues in Generalizability Theory, an important outcome of using RP tests or Stratified-RP tests is the enhanced ability to generalize from a test score to the content universe or content domain of interest.
DFF Note: These authors, as well as others in the middle of the last century, dealing with the topic of obtaining actual random samples, generally acknowledge the difficulty, and even the impossibility at that time, of creating a practically useful test by randomly sampling items. While random sampling is a better model for testing in almost every sense, its practical use in operational testing was a significant barrier. Often proposed is an interim assumption and rationale that the items could be considered to have been randomly sampled from a universe or population of items (see p. 43). This paper and others use logic of this sort to circumvent the barrier. Of course, with the technology of the 21st century, pure RP tests are not difficult at all to create.
1968
Hively II, W., Patterson, H.L., & Page, S.H. (1968). A “universe-defined” system of arithmetic achievement tests. Journal of Educational Measurement, 5(4), 275-290. JLP
The authors frame this paper by illustrating how it represents the convergence of two approaches to achievement testing: B.F. Skinner’s “educational behaviorism” and Cronbach’s “Generalizability Theory.” They argue that if subject matter can be organized into well-defined behavioral classes to create tests, then Generalizability Theory can serve as the measurement model. They reference Osburn’s (1968) term, “universe-defined achievement testing.”
In their study, they developed universe-defined arithmetic tests for members of the U.S. Federal Job Corps Program, who received education and work experience at Youth Conservation Centers. They achieved this by categorizing the subject matter (arithmetic) into well-defined “domains” and then creating “item forms” (as also described by Osburn, 1968), which are sets of rules for generating items representative of the diagnostic category. A set of item forms may represent a “universe,” such as “subtraction” or “addition.”
Individual tests for the Corpsmen were generated by randomly selecting one item from each item form and randomizing the order of their presentation on the test. The authors then examine the generalizability of the test scores, including components of variance in both test scores and item scores, using Generalizability Theory. The findings of this study were mixed; overall, they found the test to be highly reliable, but there were concerns about the generalizability of the scores to the domains.
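As a rough illustration of the “item form” idea, the sketch below expresses two arithmetic item forms as small generating rules and assembles a test by drawing one item from each form and shuffling the order. The specific rules and number ranges are invented for illustration and are not taken from Hively et al.

```python
import random

def subtraction_item(rng):
    """Item form: two-digit subtraction with a non-negative answer."""
    a, b = sorted(rng.sample(range(10, 100), 2), reverse=True)
    return {"stem": f"{a} - {b} = ?", "key": a - b}

def addition_item(rng):
    """Item form: sum of two two-digit numbers."""
    a, b = rng.randint(10, 99), rng.randint(10, 99)
    return {"stem": f"{a} + {b} = ?", "key": a + b}

ITEM_FORMS = [subtraction_item, addition_item]  # one generating rule per behavioral class

def universe_defined_test(rng=None):
    """Assemble a test by drawing one item from each item form and randomizing the order."""
    rng = rng or random.Random()
    items = [form(rng) for form in ITEM_FORMS]
    rng.shuffle(items)
    return items
```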
The authors also highlight a security issue that was encountered by the Job Corps program when the tests were administered in the traditional way (presumably fixed), where Corpsmen were skipping the course instruction and just taking the test. When the candidates failed, they asked for feedback on their errors and then immediately retook the test. The frequent item exposure compromised the test, posing a risk that students would pass the test through pre-knowledge rather than mastery of the content. Since the randomly chosen item sets (or forms) from the universe-defined tests administered in the study satisfied the assumptions for parallel tests, the use of universe-defined tests was a way to circumvent this problem, allowing test takers to take the test multiple times without encountering the same items.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores (with contributions by A. Birnbaum). Reading, MA: Addison-Wesley. DFF
Specifically, Chapter 11, titled Item Sampling in Test Theory and in Research Design.
Touted as one of the most important books in the history of psychometrics, this book was originally published in 1968 and republished in 2008. The chapter of interest, which covers item sampling (the random sampling of items for tests), is Chapter 11, “Item Sampling in Test Theory and in Research Design.” The chapter begins,
“This chapter deals with the case where the test score x_a of examinee a is the sum of the item scores y_ga, g = 1, 2, …, n, and where the n test items are considered as a random sample from a population of items. This item-sampling model makes no other assumptions about the nature of the test. In spite of this, the model yields many important results not obtainable from the classical model” (p. 234).
The chapter makes a strong case for the use of RP tests in a wide range of testing circumstances.
DFF Note: Lord and the other authors provide an interesting footnote on the first page of this chapter. It reads, “Reading of this chapter can be omitted without loss of continuity” (p. 234). To me, this suggestion highlights the unique nature of RP tests, which in 1968 didn’t mesh well with classical theories and were still not feasible as part of an operational testing program, whether for large-scale or small-scale testing. However, it is clear that the authors believe it deserved an important place in discussions of theory and practice. They may have expected that with the coming of computerization of testing, it would be useful to researchers and practitioners sooner rather than later.
Osburn, H. G. (1968). Item sampling for achievement testing. Educational and Psychological Measurement, 28(1), 95-104. DFF
Osburn proposes a way to build RP tests or Stratified RP tests using item forms (DFF: similar to AIG item models or Caveon’s SmartItems™) with the goal of obtaining test scores that generalize to a universe or domain. He points out the importance of describing well the universe or domain.
1969
Shoemaker, D. M., & Osburn, H. G. (1969). Computer-aided item sampling for achievement testing: a description of a computer program implementing the universe defined test concept. Educational and Psychological Measurement, 29(1), 165–172. JLP
This article introduces a computer program designed to implement the “universe-defined” test concept. A definition of a universe-defined test is provided, referencing Osburn, 1968, as a “test constructed and administered in such a way that an examinee’s score on the test provides an unbiased estimate of his score on some explicitly defined universe of item content” (p. 165). This concept involves two key elements: 1) a well-defined content population of items, and 2) the selection of items through random sampling, or stratified random sampling, from this universe of items.
Essentially, a universe-defined test is what we now refer to as a domain-referenced test. The paper details a computer program that utilizes “item forms,” defined as a “principle or procedure for generating a subclass of items having a definite syntactical structure” (p. 166). These item forms allow for the random, or stratified random, sampling of items from the content population, creating RP tests. In essence, item forms act as templates, each capable of generating multiple items based on predefined rules.
The authors argue that the DR testing approach, especially when each test taker receives a unique form (i.e., RP tests), offers numerous advantages. These include ensuring that the test accurately reflects the course content; enabling the creation of countless forms, which allows sample items to be provided to students before the test; facilitating the easy generation of make-up exams; and permitting individuals to retake the test multiple times without encountering the same items. The computer program utilizing item forms represents a significant advancement in making DR testing and randomly parallel forms practically feasible.
1971
Prosser, F., & Jensen, D. D. (1971). Computer generated repeatable tests. In Proceedings of the May 18-20, 1971, spring joint computer conference (pp. 295-301). DFF
These authors describe several problems of traditional testing in higher education. In that context they recommend using computers and Stratified-RP tests to create paper tests that are unique, can be administered more frequently, provide immediate feedback, and are repeatable. Repeatable, in this paper, refers to a process whereby students can take unique, just-printed test forms on a course topic as often as desired to meet instructional goals. The items are created in advance, about 6x to 10x the number of items needed for any particular student’s test form. The paper did not provide research data from actual use of the recommended process in a university course.
DFF Note: Creating unlimited and unique repeatable tests is viewed by these authors as well as Lord (1977) himself as a significant benefit of using RP tests or Stratified RP tests. The concept of test forms being unique and equivalent (parallel), which can be created easily by a computer, should be highly attractive in any instructional setting, as it removes the typical restriction of synchronous use for traditional tests. Unlimited, repeatable tests would be valuable in all areas of testing.
1972
Cronbach, L. J., Gleser, G. C., Nanda, H., & Rajaratnam, N. (1972). The dependability of behavioral measurements: Theory of generalizability for scores and profiles. New York, NY: John Wiley & Sons. DFF
With this book, Cronbach and his co-authors develop and justify Generalizability Theory. In Chapter 11, titled Contributions and Controversy – A Summing Up, RP tests are mentioned as coming from Lord’s work and as sharing assumptions of random sampling similar to Cronbach’s own proposals (see p. 357). This reference is included also because it details some criticisms of the process of randomly selecting items for a test from a universe or population of items, including some concerns from R. L. Thorndike. Those arguments are presented and commented on by Cronbach et al., mainly on pages 376 to 383.
DFF Note: I often wonder why RP tests were never viable, at least until now, as a part of operational testing programs. No doubt the unavailability of computer technology played a role, but such criticisms from proponents of traditional testing approaches may also have inhibited testing professionals and other practitioners from changing test designs that had been “traditional” for decades.
1973
Millman, J. (1973). Passing scores and test lengths for domain-referenced measures. Review of Educational Research, 43(2), 205-216. DFF
Millman’s paper provides practical advice for anyone desiring to create tests where resulting scores indicate the proportion mastered of a domain of content (represented by many items). Sets of randomly sampled items from the pool of items can be considered RP tests as Lord has described them. Millman discusses how mastery scores can be derived and how test lengths can be determined using the RP testing model. In this paper Millman also describes “sequential testing” as computerized testing where the number of items in the test is continuously monitored and the scores compared with a passing standard.
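Because the items on a DR test are a random sample from the domain, the number-correct score follows a binomial distribution given the examinee’s true domain score, and the chance of a mastery classification at any test length and cut score can be computed directly. The sketch below illustrates this logic with invented numbers; it is not Millman’s own procedure or tables.

```python
from math import comb

def prob_pass(true_domain_score, n_items, cut_items):
    """P(number correct >= cut_items) under the binomial model, for an examinee
    whose true domain score (proportion of the domain known) is true_domain_score."""
    p = true_domain_score
    return sum(comb(n_items, x) * p**x * (1 - p)**(n_items - x)
               for x in range(cut_items, n_items + 1))

# Illustration: a 20-item DR test with a passing standard of 16 correct (80%).
print(prob_pass(0.70, 20, 16))  # roughly 0.24 for an examinee who truly knows 70% of the domain
print(prob_pass(0.90, 20, 16))  # roughly 0.96 for an examinee who truly knows 90% of the domain
```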
DFF Note: Millman is creating a bridge between RP tests, bringing along their statistical and theoretical advantages, to the growing (at that time) field of DR testing. His paper also presents an early benefit of using computers for test administration.
1974
Emerson, P. L. (1974). Experience with computer generation and scoring of tests for a large class. Educational and Psychological Measurement, 34(3), 703-709. DFF
This paper describes an RP test where the item pool consisted of about 500 items (about 50 per chapter of a textbook) supplied by a textbook publisher. A computer system generated chapter tests by randomly selecting 20 items from the chapter-based strata of the pool. In some circumstances, students were given the opportunity to re-test. Unique final exam forms of 50 items were created using a Stratified RP test procedure (five items randomly selected from each chapter). Even in 1974, the author concluded that the cost associated with this process was no more than it would have been for conventional testing of course learning, and he closed by recommending the process to other instructors.
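Emerson’s stratified procedure is easy to picture in code. The following minimal sketch assumes a hypothetical pool of about 50 items per chapter and draws a fixed number of items from each chapter stratum; the pool, names, and counts are illustrative only.

```python
import random

def stratified_rp_form(pool_by_stratum, items_per_stratum, rng=None):
    """Stratified randomly parallel form: randomly sample a fixed number of items
    from each stratum (here, each textbook chapter) and shuffle the result."""
    rng = rng or random.Random()
    form = []
    for stratum, items in pool_by_stratum.items():
        form.extend(rng.sample(items, items_per_stratum))
    rng.shuffle(form)
    return form

# Hypothetical pool: ~50 items per chapter across 10 chapters.
pool = {f"ch{c}": [f"ch{c}_item{i}" for i in range(50)] for c in range(1, 11)}
final_exam = stratified_rp_form(pool, items_per_stratum=5)  # a 50-item final, five items per chapter
```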
Hively, W. (Ed.). (1974). Domain-referenced testing. Englewood Cliffs, NJ: Educational Technology Publications. JLP
This book is an excellent resource for anyone looking to familiarize themselves with the principles and applications of DR testing. It aims to provide a comprehensive overview of how DR testing can enhance educational assessments and outcomes. The book is divided into three parts:
In Part One, “Basic Ideas,” Hively introduces the book with strong language, highlighting the shortcomings and pitfalls of educational testing, particularly regarding its ability to inform about the test taker’s ability level and progress in specific areas of knowledge and skill. Hively states, “…we know that standardized tests are essentially worthless for evaluating instruction or helping individuals learn more efficiently,” and “We are constantly confused about the relationship of behavioral goals to test construction and score interpretation” (p. 5). Hively then provides basic definitions of items, domains, and item forms and offers a panoramic comparison between DR testing and NR testing, concluding with the powerful statement: “Secrecy is the hallmark of NRT” (p. 13), alluding to the security issues associated with NR testing (JLP Note: or more broadly, the security issues associated with any test designed where the items are fixed and therefore must be kept secret prior to administration of the test).
Hively argues that DR testing, which involves creating “extensive” item pools that “represent, in miniature, the basic characteristics of some important part of the original universe of knowledge” (p. 8), produces scores that can be generalized to the domains. This allows for the evaluation of test takers’ ability levels, tracking student progress over time, and informing instruction, capabilities that NR testing lacks.
In the second article of Part One, Eva Baker elaborates on using DR testing to provide both student assessment data and information to help teachers improve instruction. Baker outlines principles for defining domains and critical elements for domain specification.
Jason Millman concludes Part One with an article on designing efficient item sampling plans for DR tests. Millman emphasizes that all item sampling within a domain or domain stratum must be random and that knowing item difficulty is not necessary to estimate domain scores.
Part Two, “Applications and Innovations,” features several authors discussing DR testing from various perspectives and describing different scenarios where it has been applied, including teacher, program, and product evaluation, as well as performance contracting and behavioral modification tracking.
In Part Three, “Perspectives,” Hively summarizes the preceding chapters and offers final thoughts on DR testing, including misconceptions, problems, and practical advice. The book concludes with a bibliography for further research.
Millman, J. (1974). Criterion-referenced measurement. In W. J. Popham (Ed.), Evaluation in Education: Current Applications (pp. 311-397). Berkeley, CA: McCutchan Publishing Corporation. JLP
Millman provides an in-depth overview of DR testing in a section (pp. 327-362) of the chapter titled “Criterion-Referenced Measurement” in this book. The author defines DR tests as “tests intended to describe the current status of an examinee with respect to a well-explicated set of performance tasks called a domain. A random, or stratified random, sample of items from a domain will be called a domain-referenced test (DRT)” (p. 327).
Millman discusses several aspects of DR tests, including:
- Defining the item population, or domains
- Selecting items
- Defining a cut score
- Establishing a domain score
- Determining the test length
- Evaluating the DR test.
The author provides a comprehensive treatment of the development of domains and items (e.g., item forms and amplified objectives). Regarding the definition of good objectives or domains, Millman states, “the goal is to build meaning into the test so that when it is reported that a student answered 95 percent of the items correctly, such a score can be interpreted in terms of a set of tasks on which the student has demonstrated high proficiency. Such an interpretation is not possible unless the population of items is clearly identified” (p. 328).
Later in the chapter, Millman discusses procedures for item selection, specifying that the test should be constructed by choosing a random, or stratified random, sample of items, and saying, “This task has been accomplished by a computer so that different samples of items, that is, tests, are administered to different students” (p. 339). The author emphasizes that item difficulty and discrimination should not factor into item selection, as this “destroys the random selection process” (p. 339), but that item statistics may be appropriate for identifying faulty items and/or informing revisions to the domain.
Millman also explores proposed methods for determining cut scores, estimating the domain score, determining test length, and evaluating DR tests in terms of reliability and validity, stating that traditional methods for assessing these are probably not appropriate for DR tests.
1976
Huynh, H. (1976). On the reliability of decisions in domain-referenced testing. Journal of Educational Measurement, 13(4), 253–264. JLP
This paper introduces a method for calculating the kappa reliability index for a single test administration. The author suggests that this method is suitable when data from multiple administrations is unavailable and a binary decision is being made based on the scores (e.g., mastery or non-mastery). As the number of test items increases, calculating the kappa coefficient using this method becomes more labor-intensive. Therefore, the author also provides an estimation method for kappa when large item numbers make the standard calculation too time-consuming. (JLP Note: Of course, this would not be necessary today, as computers can perform these calculations.)
The author discusses factors that impact kappa, including test score heterogeneity, the number of items on the test, and the cutoff score. Given the various ways these factors affect kappa, both positively and negatively, the author concludes that there is “no unique kappa for a given test” (p. 263) and recommends reporting the conditions under which kappa is computed along with the value.
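For readers unfamiliar with the statistic, the general form of a decision-consistency kappa is shown below. This is only the standard definition (observed agreement corrected for chance agreement); Huynh’s contribution is estimating these quantities from a single administration, which this sketch does not reproduce, and the illustrative proportions are made up.

```python
def kappa(p_observed, p_chance):
    """Decision-consistency kappa: agreement beyond chance for mastery/non-mastery
    classifications across two (randomly parallel) administrations."""
    return (p_observed - p_chance) / (1 - p_chance)

# Made-up illustration: 88% consistent classifications observed, 70% expected by chance.
print(kappa(0.88, 0.70))  # 0.6
```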
JLP Note: At the time this paper was published, there was debate among authors about the similarities and differences between DR and CR testing. In this paper, the author implies that DR testing is a subcategory of CR testing, describing DR testing, referencing Millman (1974), as follows: “a probability sampling procedure is used to select items from a well-defined universe” (p. 253), where each test form represents an “independent sample” of items “drawn from a specified universe” (pp. 254-255).
1977
Brennan, R. L., & Kane, M. T. (1977). An index of dependability for mastery tests. Journal of Educational Measurement, 14(3), 277–289. JLP
The authors develop and examine an index of dependability for mastery tests, which they define as a “domain-referenced test with a single cutting score” (p. 277). They assume that a mastery test consists of a random sample of items from an infinite universe.
Two assumptions are made about the universe of items: 1) it is well-defined, and 2) the sample drawn from it is small relative to its size. This implies that the universe of items needs to be quite large, if not infinite. The authors reiterate that the universe of items should be well-defined “so that we may reasonably consider drawing random samples from it” (p. 278).
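The index the authors develop is usually written today as Phi(lambda). The sketch below states the population form as it is commonly presented in the generalizability-theory literature; the function name and arguments are choices made here, and the paper itself should be consulted for the estimation formulas, which include a bias correction not shown.

```python
def index_of_dependability(var_persons, var_abs_error, mean_score, cut_score):
    """Index of dependability for a mastery test with cut score lambda:
        Phi(lambda) = (sigma^2(p) + (mu - lambda)^2)
                      / (sigma^2(p) + (mu - lambda)^2 + sigma^2(Delta))
    where sigma^2(p) is universe-score (person) variance and sigma^2(Delta) is the
    absolute error variance for the persons-by-items design."""
    signal = var_persons + (mean_score - cut_score) ** 2
    return signal / (signal + var_abs_error)
```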
JLP Note: In this paper, the authors assume that a mastery test is “obtained by drawing a random sample of items from some universe or domain of items” (p. 278); however, while the items are pulled randomly from the domain, the randomly chosen items are compiled onto a single test form, and the same test form (i.e., the same set of items) is administered to every test taker. This is not consistent with Lord’s vision of a true RP test (Lord, 1955a). Although this paper does not deal with true RP tests, it is included in this bibliography because of its compelling and clear language describing DR tests and the importance of large, well-defined domains.
Hahn, C. T. (1977). Domain referenced testing: An annotated ERIC bibliography. Princeton, NJ: ERIC Clearinghouse on Tests, Measurement, and Evaluation. ERIC Number: ED152803. JLP
This annotated bibliography explores references on DR testing sourced from the ERIC database and four other databases available at the time. It was compiled to provide teachers, researchers, and evaluators with access to valuable information about DR testing. The author highlights the usefulness of DR testing, particularly in identifying learners’ strengths and weaknesses.
Lord, F. M. (1977). Some item analysis and test theory for a system of computer-assisted test construction for individualized instruction. Applied Psychological Measurement, 1(3), 447-455. DFF
Following up on the advantages of unlimited repeatable tests from RP tests, and with computers becoming more viable in the test development and test administration efforts, Lord provides direction as to the analysis of item statistics, test reliability, test scores and standard errors. In one data set, he showed that standard errors were lower for RP tests, explaining that “the pool of items is much better represented when each examinee takes a different set of 41 items than when one set of 41 items is used for everyone” (p. 454). Lord also estimates that the number of items in the pool should be 10x to 40x the number of items chosen for any test form.
In this paper Lord also alludes to the security benefits of administering different items to different test takers: “In general, there must be little overlap between the items administered to any two students or to the same student on two different testings. This is necessary in order to prevent a student from obtaining a high score simply by memorizing the scoring key for a test taken by a friend” (p. 447).
DFF Note: As computers become more common and useful, Lord and others describe more situations where using RP tests will be practical and beneficial. The reader should keep in mind that the perspective of these theorists and researchers is firmly based in test administration that is paper-based and where item pools are formed in advance. It isn’t until a few years later that computers are viewed as a way to administer exams, enabling concepts such as computerized adaptive testing, linear-on-the-fly (LOFT) tests, items that can be created when needed, and even SmartItems that can be used to render “items” on the fly during an exam.
Millman, J. (1977). Creating Domain-Referenced Tests by Computer. Paper presented at the Annual Meeting of the American Educational Research Association, New York, April 4-8, 1977. DFF
This paper seems to be an earlier version of Millman & Outlaw (1978). However, it contains a description of how Millman created 132 “item programs” that serviced seven RP mastery tests for a course on statistics. The paper also reveals tips on how to evaluate the quality of the items produced on-demand by the item program.
1978
Millman, J., & Outlaw, W. S. (1978). Testing by computer. AEDS Journal, 11(3), 57-72. DFF
The authors describe using computers to create RP tests. As a unique feature, the study stored “item programs” rather than actual items in the system. The items themselves were generated for the students’ unique tests just prior to the tests being printed and administered in paper-and-pencil form. This is a different approach to RP tests, using the random elements of item programs to create tests, rather than creating a large pool of items from which the items were randomly selected. Advantages of RP tests listed are familiar by now, including repeatable tests, providing low-cost and faithful practice assessments, control over cheating, providing make-up tests, and others.
DFF Note: The concept of “item programs” from this paper is similar to AIG item models and SmartItems. AIG item models are built to create items for security reasons, to use in traditional tests. This means that the items automatically generated are stored in item banks and follow the same procedures as SME-created items to qualify them for use on operational tests. SmartItems are used directly on tests and render versions of themselves (called renderings, and sometimes, items) in real time. These renderings never see the inside of an item bank and are not “stored” except for research and legal purposes. Quality assurance steps are taken to make sure that SmartItems create item renderings on the fly that are of consistent and sufficient quality.
1979
In this paper, the author proposes using the sequential analysis statistical method to determine the number of items needed to assess a candidate’s mastery of a content area on a DR test. The advantage of this approach is that it requires presenting only the minimum number of items necessary to ascertain the test taker’s mastery. The author presents this as an easy and time-saving method, which can be performed by the teacher or others during the test, and the results can be used to inform further instruction for the student.
While sequential analysis has traditionally been used in the military, industry, business, and engineering, the author suggests that this method is also suitable for educational measurement, particularly for DR tests, because they consist of items randomly sampled from a domain, and the test taker’s performance on those items generalizes to the entire domain.
1980
Kane, M. T., & Brennan, R. L. (1980). Agreement coefficients as indices of dependability for domain-referenced tests. Applied Psychological Measurement, 4(1), 105-126. JLP
In this paper, the authors explore and compare several indices of dependability proposed by various researchers for DR tests. They demonstrate that these indices can be categorized into two broad groups based on how they account for error (or loss) and their underlying assumptions. Additionally, they compare these indices to norm-referenced generalizability coefficients. The authors prefer the term “dependability” over “reliability” to distinguish these indices from the classical understanding of reliability used for NR tests.
The authors discuss the impact of item sampling methodology on the indices, considering the random selection of a set of items from the domain for each individual test taker. This is referred to as the “nested within” design. In a later paper (Brennan, 2024), Brennan refers to this sampling method as “Random Items within Persons with Immediate Scoring.” (JLP Note: These are essentially RP tests.) The authors mention security as a reason for administering different forms to each test taker.
The authors conclude that using different sets of items for different test takers improves estimates of group means. They therefore recommend this nested item sampling method, whereby every test taker receives a unique set of randomly selected items, since the gain in group-mean estimation comes without loss in the dependability estimates of the test takers’ universe scores.
1989
Millman, J., & Westman, R. S. (1989). Computer-assisted writing of achievement test items: toward a future technology. Journal of Educational Measurement, 26(2), 177-190. JLP
The authors introduce five methods for utilizing computers in the creation of test items. They categorize these methods based on their level of sophistication and the extent to which the computer aids in item development. They provide a scheme and descriptions of the functions of an item writing system using the third method described: “computer-supplied prototype items” (p. 181). While acknowledging the financial and logistical challenges of developing such computer programs at the time, the authors express hope that their paper will “expand the reader’s vision” (p. 177) and “push forward the state of the profession’s thinking about writing achievement test items with the assistance of the computer” (pp. 186-187).
JLP Note: This article is remarkably forward-thinking in its exploration of technology for item generation. Artificial intelligence is briefly mentioned on page 177 in connection with the fifth method. The first method involves using the computer merely as an “inanimate slave” (p. 177), where the item writer conceives the entire item and then digitizes it using the computer. The fifth method, termed “Discourse Analysis,” involves providing text, such as lesson or course content, to the computer, which then transforms the text into test questions. Although the authors do not explicitly discuss DR testing, the technology described is essential for developing the extensive domains required for such testing.
While the technical aspects of this paper are outdated, many of the principles and ideas remain relevant. For instance, one could replace the term “computer” with “AI” in the article. The authors encourage readers to think beyond merely digitizing human tasks and to consider using computers to “analyze” information and propose test items. In the present day, we have a similar relationship with AI: we can use generative AI to perform tasks traditionally performed by humans, faster and more efficiently, yet we also have the opportunity to think beyond this and to use AI creatively to develop and execute processes that may be beyond human comprehension, achieving them more quickly and effectively.
1990
Cronbach, L. J. (1990). Essentials of Psychological Testing. New York, NY: Harper Collins Publishers. DFF
This is the 5th and final edition of a popular textbook on principles of psychological measurement. Cronbach describes RP tests in Chapter 2, in the section titled, Testing in the Computer Age, mainly on pages 46 and 47. He begins the section with a statement about standardization: “Because of its consistency, the computer carries standardization to an extreme, yet it can achieve standardized measurement while presenting different questions (and personalized feedback) to every test taker” (p. 46). He then describes a context of testing for civil service jobs in which unique tests can be built for each job candidate, and suggests that a computerized item model or item program could “arbitrarily” (e.g., on the fly during an exam) alter rates, make-up of shipments, and rules, adding a comment that the test taker gains no advantage from knowing the content of questions presented to a friend who had tested a few days earlier.
DFF Note: Cronbach clearly provides, in 1990, a useful and refreshing view of standardization in the coming computer age, describing how the process might work in a manner similar to Lord’s RP tests or Caveon’s SmartItems, along with how it would prevent cheating by pre-knowledge.
2019
This is a set of simulations of SmartItems, and, therefore, RP tests. All of the simulations compare SmartItems (or RP tests) with a test form with fixed items. The basic simulation allows you to vary the number of items on a test, the number of test takers, and the range of difficulty of the renderings from SmartItems. (Some SmartItems can cover a narrow domain where there would be less variability among renderings; others cover broader domains with a greater range of difficulty in the renderings.) Some test statistics, charts comparing estimated and true ability, and test information curves are produced. The second simulation, labeled Basic: Length, explores the effects of test length on reliability and error (median absolute deviation). The third and fourth simulations are similar to the first two but allow for variation of item exposure and test taker pre-knowledge and measure the effects of pre-knowledge on the output variables. In general, the simulations indicate good comparability of test statistics between tests using SmartItems and fixed-item tests. They also show the poorer performance of fixed-item tests, compared to SmartItem-based tests, under various pre-knowledge conditions. A simulation similar to this, but based on the logic, theory, and analyses for Lord’s RP tests, is currently in production.
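The simulations described above are Caveon’s own. As a rough illustration of why the comparison matters, the toy simulation below (all parameters invented) contrasts the fraction of a new test taker’s items that are pre-known when earlier test takers harvest and share what they saw, under a fixed form versus randomly parallel forms.

```python
import random

def preknowledge_overlap(domain_size=10_000, form_length=40, harvesters=50,
                         design="randomly_parallel", rng=None):
    """Items seen by earlier test takers are harvested and shared; return the fraction
    of a new test taker's form that is pre-known under the given design."""
    rng = rng or random.Random(7)
    domain = list(range(domain_size))

    if design == "fixed":
        fixed_form = rng.sample(domain, form_length)
        harvested = set(fixed_form)          # every earlier taker saw this same form
        new_form = fixed_form                # and so does the new test taker
    else:
        harvested = set()
        for _ in range(harvesters):          # each earlier taker saw a unique random form
            harvested.update(rng.sample(domain, form_length))
        new_form = rng.sample(domain, form_length)

    return sum(item in harvested for item in new_form) / form_length

print(preknowledge_overlap(design="fixed"))              # 1.0 -- the whole form is pre-known
print(preknowledge_overlap(design="randomly_parallel"))  # typically around 0.2 with these settings
```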
2020
A couple of years ago I wrote a booklet about SmartItems. The booklet discussed providing unique exams to each test taker with the goal of making theft of exams and other forms of cheating impossible. SmartItems are programmed to cover a content or skill domain in breadth and depth, or to cover a stratum of a domain. Caveon’s testing system, Scorpion, facilitates the development of SmartItems in several different ways and provides the means to administer tests comprised of SmartItems. It wasn’t until after the booklet was released that I became aware of Lord’s writing and realized that SmartItems fit easily into his description of RP tests.
SmartItems can be a reasonable way to use technology to implement RP tests.
2024
Brennan, R. L. (2024). Current psychometric models and some uses of technology in educational testing. Educational Measurement: Issues and Practice, 0(0), 1–5. JLP
This paper is primarily a discussion about the uses of current psychometric models for technology-based educational testing and about whether the models should be modified and/or new models developed. The author provides an overview of three common psychometric models (Classical Test Theory, Generalizability Theory, and Item Response Theory), discussing the advantages and disadvantages of each. Additionally, the author addresses concerns related to cheating and item exposure, noting how certain practices can facilitate cheating and disadvantage legitimate test takers.
The technology-based test design “Random Items within Persons with Immediate Scoring” (p. 3) is discussed, where items are randomly sampled and unique forms are administered to every test taker. The author indicates Generalizability Theory would be the most suitable model for tests of this design.
The author acknowledges that one cannot take for granted that the use of technology in test design itself improves the validity of tests and scores; more research should be performed, and the models may require revision.
Some DFF Notes on Current and Possible Technology-Based RP Testing Approaches
DFF Note on LOFTs and RP Tests
A LOFT or Linear-On-the-Fly Test is a computerized exam that is built during an exam sitting by drawing items randomly (by strata or not) from a pool, usually a pool that is stratified by content, item statistics or parameters, exposure rates, etc. In that sense, LOFTs are exactly what Lord described as RP tests in his initial papers written in 1955, which is the earliest description of LOFT that I have read. I have not seen people who refer to LOFT credit Lord for its genesis, so the fact that Lord invented LOFT may not be well known.
The only difference that I can see between Lord’s description of LOFT and how LOFT is described and used today is that Lord qualified that the pool had to be “large,” recommending in one paper (Lord, 1977), that the pool be sized in the range of 10x to 40x the number of items in the test.
He made this recommendation for at least two reasons: 1) Test security – Providing unique tests to individuals would make cheating more difficult and 2) Repeatability – Having more items in the pool would allow the test to be repeatable, even for an individual. This would be helpful in educational settings where the test could be used to measure a student’s learning before, during, and after instruction.
Today’s LOFTs do not qualify as RP tests because the pool is too small, limited primarily by the perceived need to collect item statistics on the items.
DFF Note on AIG and RP Tests
AIG, or Automated Item Generation, is becoming more popular as a way to increase the number of items for a testing program, mainly to support security activities, such as the creation of more equivalent forms, increasing the size of a CAT pool, or replacing compromised items in existing operational exams. AIG can be generally characterized as the development of item models or templates designed to automatically generate items appropriate for the tests an organization wishes to build. An item model is like a manufacturing assembly line, combining item components with data sources according to rules, with the result being up to tens of thousands of new, useful items. Items produced using AIG are stored in item banks and may need to undergo additional quality steps, such as field testing or expert reviews.
It’s not much of a leap to see that an AIG process could create every item for the large pool of items Lord envisioned for RP tests.
There is perhaps another use of item models beyond the one described above. Once vetted for the quality of item production, each item model could serve on an operational exam to produce items in real time as needed by individual test takers. This is similar to how SmartItems are used.
DFF Note on CATs, RP Tests and SmartItems
CATs already produce relatively unique tests for examinees. The larger the pool of items supporting the CAT, the greater the likelihood that individual exams will be unique. Creating forms based on test taker ability also contributes to that uniqueness. A smaller pool results in more overlap of items across test takers. A certain amount of overlap encourages harvesting and sharing of items and contributes to the effectiveness of cheating on CATs. An early scandal in one of the first large-scale uses of a CAT exposed the problem of small pools (https://www.baltimoresun.com/1995/01/01/kaplan-illegally-obtained-test-questions-suit-says/).
CATs can be considered a variation of RP tests where the items are randomly drawn from strata organized at least by difficulty, but also by exposure rate and content. Another difference is that test taker ability is a major determiner of which items are selected from the pool. Random selection is used as part of many CAT selection algorithms to prevent the higher quality (most informative) items from being used too often.
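As a sketch of how that random component can enter item selection (sometimes called “randomesque” exposure control), the code below ranks the unseen items of a hypothetical 2PL pool by Fisher information at the current ability estimate and then chooses at random among the top few. The pool and parameters are invented, and operational CAT algorithms add content balancing and exposure constraints not shown here.

```python
import math
import random

def information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def next_item(theta, pool, administered, top_k=5, rng=None):
    """Randomesque selection: rank unseen items by information at the current ability
    estimate, then pick at random among the top_k so the most informative items
    are not given to every test taker."""
    rng = rng or random.Random()
    candidates = [item for item in pool if item["id"] not in administered]
    candidates.sort(key=lambda item: information_2pl(theta, item["a"], item["b"]),
                    reverse=True)
    return rng.choice(candidates[:top_k])

# Invented 2PL pool for illustration.
pool = [{"id": i, "a": random.uniform(0.7, 2.0), "b": random.uniform(-2.0, 2.0)}
        for i in range(500)]
print(next_item(theta=0.4, pool=pool, administered=set()))
```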
Lord’s rules for the size of LOFT pools of 10x to 40x would apply equally to CATs. Adjusting those multipliers for today’s security threats and technology would suggest a pool size of 100x or even more.
SmartItems function as actual items: within a content domain, some SmartItems would be more difficult than others, and statistical calibrations would provide the IRT parameters needed for the CAT. As an example, if the four primary mathematical operations were the domain, then a SmartItem producing addition renderings would likely be easier overall than a SmartItem producing multiplication renderings. In that case, those two SmartItems and others could be combined in a CAT item pool to support the adaptive measurement of a young student’s abilities in mathematics operations.