Monday, March 02, 2015, 1:45 PM – 2:45 PM (PST)


Implementing Automated Scoring in K–12 Assessments: Development and Scoring Considerations


Practice Area Division(s): Education
Topic: Testing, Measurement, and Psychometrics

As K-12 assessments move online, new item types are being developed that can be scored automatically, either through rule-based machine rubrics or through scoring engines that use artificial intelligence to replicate human scoring. The use of these scoring methods has implications for item authoring, development, and review. Online item types currently used in K-12 assessments are most easily defined by student response type: selecting one or more response options in a list or table, entering mathematical numbers or symbols from a keyboard or number palette, manipulating the position of item elements on the screen (drag and drop), creating a graph or diagram by placing points on a grid, or typing a text response on a keyboard. For each of these response types, scoring rules or rubrics must be created for automated scoring, and item writers must determine how complex responses map to item score points. This requires alignment between item authoring, the blueprint, and the item design that goes beyond the traditional item development plan.
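
As a purely illustrative sketch (not a Smarter Balanced specification), a rule-based machine rubric for a multiple-select item might be expressed as a small scoring function that maps a student's selected options to score points; the answer key, partial-credit rules, and point values below are hypothetical:

    # Hypothetical rule-based rubric for a multiple-select item.
    # Key, partial-credit rules, and point values are illustrative only.
    def score_multi_select(selected, key=frozenset({"A", "C", "E"}), max_points=2):
        selected = set(selected)
        if selected == key:
            return max_points                      # exact match: full credit
        correct = selected & key
        extras = selected - key
        # One extra selection beyond a fully correct set, or exactly one
        # correct option missing with no extras, earns partial credit.
        if (correct == key and len(extras) == 1) or (len(key - correct) == 1 and not extras):
            return 1
        return 0

    print(score_multi_select(["A", "C", "E"]))   # 2
    print(score_multi_select(["A", "C"]))        # 1
    print(score_multi_select(["A", "B"]))        # 0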

Automated scoring of constructed responses using artificial intelligence has been the subject of research over the past two decades. Artificial intelligence engines have proven quite good at scoring general writing ability, but have been less effective at scoring shorter responses and at evaluating the quality of content. Nevertheless, given the large numbers of responses to be scored in K-12 assessments, automated constructed-response scoring can offer significant benefits in scoring turnaround time and cost efficiency. Scoring engines may be deployed alongside a human reader, with a second human adjudicating discrepant scores, or multiple scoring engines may be deployed, with human ratings provided when the engines disagree.
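
As an illustrative sketch of such a workflow (the tolerance, resolution rule, and function name below are hypothetical, not any particular vendor's process), the score-resolution step for an engine-plus-human-reader design might look like:

    # Hypothetical score-resolution step: report the human rating when the
    # engine and human agree within a tolerance; otherwise require a second
    # human reader's adjudicated score. Tolerance and rules are illustrative.
    def resolve_score(engine_score, human_score, adjudicated_score=None, tolerance=1):
        if abs(engine_score - human_score) <= tolerance:
            return human_score
        if adjudicated_score is None:
            raise ValueError("discrepant scores require human adjudication")
        return adjudicated_score

    print(resolve_score(3, 3))                        # 3 (exact agreement)
    print(resolve_score(3, 4))                        # 4 (within tolerance)
    print(resolve_score(4, 2, adjudicated_score=3))   # 3 (adjudicated)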

Presenters will discuss their experiences authoring and scoring the 20,000 items field-tested by the Smarter Balanced Assessment Consortium, including the specification and validation of rubrics for machine-scored items and the use of feedback from artificial intelligence scoring to inform item development. Examples of items, machine and human scoring rubrics, and student responses will be shared as illustrations. Implications for K-12 assessment and for assessment in other contexts, such as certification and licensure, will be discussed.

PRESENTERS:
Sally Valenzuela, McGraw-Hill Education
Craig Mills, American Institute of Certified Public Accountants (AICPA)
Kevin Sweeney, The College Board
Vincent Kieftenbeld, CTB/McGraw Hill