While many of the terms below have both mathematically precise definitions and lay connotations, the meanings given below are the ones appropriate to test development.
|
Accreditation
|
Evaluation by a professional body that an exam or exam
development process meets established standards of quality.
The two most common exam accreditations in
the U.S. are NCCA and ANSI.
|
|
Adaptive Testing
|
A method of administering items so that a person’s
ability can be estimated from a subset of the items available.
An item or set of items is administered
and scored. Then the next item or set
of items is selected depending on the success with prior items.
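The select-score-reselect loop described above can be sketched as follows. This is a minimal illustration, not a production algorithm: the numeric difficulty scale, the halving step size, and the names `run_adaptive_test`, `pool`, and `answers_correctly` are all assumptions introduced here.

```python
def run_adaptive_test(item_difficulties, answers_correctly, max_items=5):
    """Administer up to max_items items, choosing each one by the current estimate.

    item_difficulties: dict mapping item id -> difficulty (hypothetical scale).
    answers_correctly: callable(item_id) -> bool, standing in for the examinee.
    """
    remaining = dict(item_difficulties)
    estimate = 0.0  # starting ability estimate (assumed midpoint of the scale)
    step = 1.0      # how far the estimate moves after each response
    administered = []
    while remaining and len(administered) < max_items:
        # Select the unused item whose difficulty is closest to the estimate.
        item_id = min(remaining, key=lambda i: abs(remaining[i] - estimate))
        del remaining[item_id]
        correct = answers_correctly(item_id)
        administered.append((item_id, correct))
        # Success moves the estimate up; failure moves it down.
        estimate += step if correct else -step
        step /= 2   # smaller adjustments as evidence accumulates
    return estimate, administered


# Hypothetical pool; this examinee answers correctly when difficulty <= 0.6.
pool = {"q1": -2.0, "q2": -1.0, "q3": 0.0, "q4": 1.0, "q5": 2.0, "q6": 0.5}
estimate, administered = run_adaptive_test(pool, lambda i: pool[i] <= 0.6, max_items=4)
```

Only four of the six items are administered, yet the estimate settles between the hardest item passed and the easiest item failed.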
|
|
Anchor items
|
Items that appear across all exam forms of a
specialty. These items are used to
equate forms and calibrate items.
|
|
Appeals process
|
A systematic way of reviewing the test scoring and
administration process to respond to questions of the test’s fairness in
evaluating an examinee.
|
|
|
A way of preserving the results of the performances
tested so that there is evidence on which to base a review or appeals
process.
|
|
|
Behavior
Acting with the same range of available performance as
in the job setting. Technically, making a choice on a piece of paper is a
behavior, but it does not meet the
definition above, nor does writing an essay or discussing a topic.
|
|
Bias
|
A skewed distribution of scores attributable to a
trait not demonstrably linked to the ability a test is designed to
measure. Courts usually accept as de
facto evidence of bias a group pass rate below 80% of
the rate that would be predicted if the group's members were not linked to
the trait. Tests of performances
directly linked to job performance have not been required by the courts to
provide evidence regarding bias.
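The 80% comparison above can be sketched as a simple ratio of pass rates. This sketch assumes the predicted pass rate is approximated by the observed pass rate of a reference group; that substitution, and the function names, are illustrative assumptions rather than a legal standard.

```python
def pass_rate_ratio(group_passed, group_total, reference_passed, reference_total):
    """Ratio of a group's pass rate to the reference (predicted) pass rate."""
    group_rate = group_passed / group_total
    reference_rate = reference_passed / reference_total
    return group_rate / reference_rate


def below_80_percent(ratio, threshold=0.80):
    """True when the group passes at less than 80% of the reference rate."""
    return ratio < threshold


# Hypothetical counts: 30 of 60 pass in the group, 70 of 100 in the reference group.
ratio = pass_rate_ratio(30, 60, 70, 100)
flagged = below_80_percent(ratio)
```

Here the group passes at about 71% of the reference rate, so the result would be flagged for review.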
|
|
Binary scoring
|
Scoring in which items are scored only correct or
incorrect.
|
|
Certification
|
A formal evaluation program which assesses a person’s
ability to fulfill specific job duties.
A legal certification program must evaluate a representative subset of
the job’s requirements and be unbiased toward the gender, racial and ethnic
background of applicants. Cf. exam,
test, credential, licensure.
|
|
Computerized testing
|
Administration of an exam on a computer, as opposed to
paper and pencil or any other form of administration.
|
|
Concepts
|
Any logical or structural constructs that direct
behavior. Attainment of concepts
cannot be measured directly; it can
only be measured indirectly by behavior consistent with the concept.
|
|
Confounded
|
Mixing the evaluation of two distinct abilities in one
score.
|
|
Controls
|
The user-interface mechanisms that let the examinee
influence the path taken through a test.
Controls usually include Restart, Directions, Finished, and Stop
Test. Additional controls may be More
Info, Skip, Next, or Back.
|
|
Credential
|
An assertion by a credentialing body that a person is
capable of performing professional duties.
Earning the credential often involves a number of components such as
classwork, mentoring, and passing certification exams.
Cf. exam, test, certification,
licensure.
|
|
Criterion
|
The standard by which performance on an exam is judged
to be acceptable or not acceptable.
|
|
Criterion referenced test
|
See Domain-referenced test.
|
|
Cutpoint
|
The number of items the examinee must answer correctly
in order to receive a passing score on the exam.
In a weighted exam, this is the item-weighted score an examinee
must achieve in order to pass the exam.
See pass/fail exam.
|
|
Descriptive exam
|
An exam which attempts to accurately assess the
candidate’s ability at all points along the ability scale.
The ability scale is typically a normed
scale. See Norm referenced test.
|
|
Dichotomous scoring
|
Scoring each item right or wrong.
See also polytomous scoring.
|
|
Directions time (item)
|
The time from the presentation of the stimulus to the time
the user initiates action to complete the item.
|
|
Domain referenced test
|
A test which is scored depending on the candidate’s
mastery of a domain of knowledge or performance, as opposed to being scored
compared to how others perform. See
norm referenced test.
|
|
EEOC Guidelines
|
Testing guidelines set up by the Equal Employment
Opportunity Commission, established in 1986 and revised in 2000.
These guidelines provide for development
and administration of tests that are demonstrably unbiased for all population
groups.
|
|
Elapsed time (item)
|
The time from the presentation of the stimulus to the
time the examinee indicates the item has been completed.
Cf. Involved time.
|
|
Exam
|
A test administered to human subjects.
|
|
Exam Pool
|
Scored items on an exam.
|
|
Exam review
|
A process of evaluating the performance of an exam
over a specific period of time.
Issues typically addressed in an exam review are item performance
characteristics (P-Value, point-biserial), exam performance characteristics
(reliability), and content currency.
|
|
Gantt chart
|
A management chart which depicts tasks on a timeline
according to start time for the task and the time required to complete the
task.
|
|
Inter-rater reliability
|
The consistency in scoring between two observers.
This is usually measured as a correlation
between observers’ ratings of a specific set of observations.
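As a minimal sketch of the correlation described above, the following computes a Pearson correlation between two observers' ratings of the same observations. The ratings are made-up illustrative data, and Pearson's r is only one of several statistics used for this purpose.

```python
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length lists of ratings."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 ratings")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariation = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a rater gives constant ratings")
    return covariation / sqrt(spread_x * spread_y)


# Hypothetical ratings of the same five performances by two observers.
rater_a = [3, 4, 2, 5, 4]
rater_b = [3, 5, 2, 4, 4]
inter_rater_r = pearson_correlation(rater_a, rater_b)
```

For these ratings the correlation is roughly 0.81; identical ratings would give 1.0.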
|
|
Involved time
|
The time required to respond to the item.
This is measured from the examinee’s first
action on an item to the indication of item completion.
We would have preferred to use the term Response
time, but it has so many psychological associations we chose the more
test-specific term Involved time.
Cf. Elapsed time.
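The three item timings in this glossary (Directions time, Involved time, Elapsed time) can be derived from three timestamps. The sketch below assumes the moment the user "initiates action" is the same event as the examinee's "first action"; the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ItemTiming:
    """Timestamps (in seconds) recorded for one item; field names are illustrative."""
    presented_at: float     # stimulus is presented
    first_action_at: float  # examinee's first action on the item
    completed_at: float     # examinee indicates the item is complete

    @property
    def directions_time(self) -> float:
        return self.first_action_at - self.presented_at

    @property
    def involved_time(self) -> float:
        return self.completed_at - self.first_action_at

    @property
    def elapsed_time(self) -> float:
        return self.completed_at - self.presented_at


timing = ItemTiming(presented_at=0.0, first_action_at=12.5, completed_at=47.0)
```

Under that assumption, elapsed time is the sum of directions time and involved time (12.5 + 34.5 = 47.0 seconds here).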
|
|
Item
|
A unit of scoring which includes a stimulus situation
(including directions), an opportunity to respond, and a method of scoring
the correctness of the response. An
item may contain one or more tasks.
An item is the smallest scored unit.
|
|
Item Development Workshop (IDW)
|
A meeting or meetings at which items may be authored,
edited, reviewed, and accepted or rejected for inclusion on an exam.
|
|
Item Pool
|
Items that are available for use on an exam.
|
|
Job Task Analysis
|
The process by which an exam development group
establishes the link between practice and the skills tested on the exam.
The job task analysis is typically a
survey which assesses specifically what people do on the job and how often
they do it.
|
|
Licensure
|
The process by which a state or governmental body
approves an individual to perform specific practices.
In licensed professions, it is illegal to
practice the profession without a license.
|
|
Low stakes test
|
A test which doesn’t involve monetary or promotion
consequences. Examples include a test of
prerequisites for a course or a test
to determine the starting point in a course.
Typically, these exams are not subject to EEOC Guidelines.
|
|
Multiple-Choice Test
|
A test that constrains examinees’ ability to respond
to a situation to an artificially small number of choices.
Sometimes called a selected-response
test.
|
|
Norm-referenced test
|
A test in which an individual’s score is given
relative to the performance of other individuals.
See also Domain referenced test.
|
|
Omega
|
A measure of predictive validity per unit time. Only
relevant for a specific domain. The omega is the last letter of the Greek
alphabet; in this case, Omega is the last word in test worthiness.
|
|
Pass/Fail exam
|
An exam in which the results are used only to classify
a candidate, and in which points on the score scale other than the cutpoint(s) are not
normed or evaluated.
|
|
Performance Test
|
A test in which the response modality is essentially
identical to the response modality of the target task.
|
|
Performance-Based Test
|
A test of a person’s ability to indicate how they believe
they would act in a given situation.
|
|
Point-Biserial Correlation
|
The correlation between a dichotomous (2-valued)
variable and a continuous variable.
In testing, it’s typically the correlation between a
right / wrong item score and a total test score.
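A minimal sketch of the calculation, using the standard formula r = (M1 − M0) / s · sqrt(p · q), where M1 and M0 are the mean total scores of examinees who got the item right and wrong, s is the population standard deviation of total scores, and p and q are the proportions right and wrong. The data below are invented for illustration.

```python
from math import sqrt


def point_biserial(item_scores, total_scores):
    """Point-biserial correlation between 0/1 item scores and total test scores."""
    n = len(item_scores)
    right = [t for i, t in zip(item_scores, total_scores) if i == 1]
    wrong = [t for i, t in zip(item_scores, total_scores) if i == 0]
    if not right or not wrong:
        raise ValueError("item must have at least one right and one wrong response")
    p = len(right) / n                      # proportion answering correctly
    mean_all = sum(total_scores) / n
    sd = sqrt(sum((t - mean_all) ** 2 for t in total_scores) / n)  # population SD
    if sd == 0:
        raise ValueError("total scores have no variance")
    mean_right = sum(right) / len(right)
    mean_wrong = sum(wrong) / len(wrong)
    return (mean_right - mean_wrong) / sd * sqrt(p * (1 - p))


# Hypothetical data: six examinees, one item, and their total test scores.
item = [1, 1, 0, 1, 0, 0]
totals = [9, 8, 4, 7, 5, 3]
r_pb = point_biserial(item, totals)
```

Here the examinees who answered the item correctly also have the higher totals, so the correlation is strongly positive (about 0.93).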
|
|
Polytomous scoring
|
Scoring items on a scale.
The clearest common example of this is giving partial credit to
a response.
|
|
P-Value
|
The percent of examinees who pass the item in a
calibration sample or during a specific time period.
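As an illustration, the P-Value of each item can be computed from a matrix of right/wrong scores. The layout (one row per examinee, one 0/1 column per item) and the function name are assumptions made for this sketch.

```python
def item_p_values(responses):
    """Percent of examinees answering each item correctly.

    responses: list of rows, one per examinee, each a list of 0/1 item scores.
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    return [
        100.0 * sum(row[j] for row in responses) / n_examinees
        for j in range(n_items)
    ]


# Hypothetical calibration sample: four examinees, three items.
sample = [
    [1, 0, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
]
p_values = item_p_values(sample)
```

In this sample the first item is the easiest (75% correct) and the second the hardest (25% correct).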
|
|
Raw score
|
The number of items correct on the
exam, or alternatively the percent of items
correct on the exam. The percent
score can be converted into the number correct if the number of scored items
is known, and vice versa.
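The conversion between the two forms of raw score is simple arithmetic, sketched here with hypothetical function names. Rounding to the nearest whole item is an assumption that guards against percents reported with limited precision.

```python
def number_correct_to_percent(number_correct, scored_items):
    """Percent of scored items answered correctly."""
    return 100.0 * number_correct / scored_items


def percent_to_number_correct(percent_correct, scored_items):
    """Number of scored items answered correctly, rounded to a whole item."""
    return round(percent_correct * scored_items / 100)


percent = number_correct_to_percent(42, 60)
count = percent_to_number_correct(percent, 60)
```

On a 60-item exam, 42 items correct corresponds to 70 percent, and converting 70 percent back yields 42.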
|
|
Ready mode
|
The default state of the application which allows the
scoring program to begin. It is
typically also the state at which the examinee starts each item.
|
|
Recertification
|
The process by which a person who has been certified
is certified as competent to continue practice.
|
|
Reliability
|
The correlation between two administrations of the
test a specified time interval apart, or between equivalent forms of the same
test.
|
|
Return on Investment (ROI)
|
The measure of (Return – Expense) / Expense.
In testing, Expense might be the cost of
developing and administering the exam.
Return would be the measurable benefits achieved by using the
exam. For an exam, ROI would more
properly be termed ROX, or return on expenditure, since an investment
typically shows up as an asset on a balance sheet, whereas training is
typically coded as an expense.
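A worked example of the formula (Return – Expense) / Expense, with made-up dollar figures; the function name is hypothetical.

```python
def return_on_investment(total_return, expense):
    """ROI as (Return - Expense) / Expense."""
    if expense <= 0:
        raise ValueError("expense must be positive")
    return (total_return - expense) / expense


# Hypothetical figures: the exam costs 100,000 to develop and administer
# and yields 150,000 in measurable benefits.
roi = return_on_investment(150_000, 100_000)
```

An ROI of 0.5 means the exam returned 50% more than it cost; a negative value would mean the benefits did not cover the expense.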
|
|
ROI
|
Return on Investment.
Calculated by (Value – Cost) / Cost. See Return on Investment (ROI).
|
|
Sample size
|
The predicted size of sample required to achieve
results at the desired level of statistical significance, assuming a sample
distribution of specific characteristics.
In test design, one sample size of importance is the number of
responses from the job task analysis.
Another important sample size is the number of candidates required to
calibrate an exam or item.
|
|
Scaled scoring
|
An arbitrary score reported to a candidate instead of
the raw score of items correct on an exam.
|
|
Simulation
|
A stimulus situation which mimics another and which
contains elements of the other situation.
|
|
Stem
|
In a multiple-choice test, the part of the item that
presents information and asks the question that has to be answered with the
choices.
|
|
Task Analysis
|
See Job Task Analysis.
|
|
Taxonomy
|
A system of classifying behaviors;
in exam development, a system of
classifying item types. While anyone
can make up an arbitrary taxonomy, the most useful taxonomies are those that
help the practitioner establish comprehensive scope, link stimuli and scoring
methodologies, and create appropriate conditions for observing
responses.
|
|
Test
|
A test is a measure of behavior observed within a
predefined context. A test presents
predetermined stimuli to elicit behavior evaluated consistently across
examinees. The stimuli sample a domain with the goal of predicting behavior
in the entire domain.
|
|
Unscored items
|
Items administered on an exam whose scores do not
contribute to the score achieved by the candidate.
|
|
Unscored pool
|
A set of exam items from which a subset is drawn to
administer as unscored items on an exam.
|
|
Validity
|
The degree to which an exam, certification or license
predicts success on the job. This is
often measured by correlation between a test score and a person’s performance
on the job. In theory, validity is the
correlation between true and measured ability.
|
|
Verisimilitude
|
The extent to which a simulation mimics its
template.
|
|
Weighting
|
A process of multiplying the score of each item by a
value (or weight), then summing the resulting item-weighted scores to
compute a score for the test.
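A minimal sketch of weighting, followed by a comparison against a cutpoint. The item scores, weights, and cutpoint are invented, and treating a score equal to the cutpoint as passing is an assumption.

```python
def weighted_score(item_scores, weights):
    """Multiply each item score by its weight and sum the products."""
    if len(item_scores) != len(weights):
        raise ValueError("each item needs exactly one weight")
    return sum(score * weight for score, weight in zip(item_scores, weights))


# Hypothetical four-item exam scored right (1) or wrong (0).
scores = [1, 0, 1, 1]
weights = [2.0, 1.0, 3.0, 1.5]
total = weighted_score(scores, weights)

cutpoint = 6.0               # hypothetical item-weighted passing score
passed = total >= cutpoint   # assumes meeting the cutpoint exactly is a pass
```

The examinee earns 2.0 + 3.0 + 1.5 = 6.5 weighted points, which clears the assumed cutpoint of 6.0.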
|