Scoring and Evaluation

Overview

SCALE provides training for the scoring of student work and protocols for the moderation of student scores. These processes ensure that scores are fair, reliable and comparable within and across schools. The processes and protocols for scoring student work are customized depending on the purposes of scoring, which may include formative and summative purposes.

Definitions

  • Scoring system: The procedures used for training teachers to score student work reliably within and across schools, and the processes used to evaluate score reliability and comparability across teachers and schools. 
  • Benchmark work samples: Student work samples representing different levels of performance on the rubric, used for training purposes to illustrate and clarify the difference between score levels. 
  • Training of Trainers: An event designed to build the capacity of teacher leaders, administrators, or coaches to serve as lead scorers or trainers, who then provide scorer training directly to teachers in participating schools or districts.
  • Calibration: A method used to monitor the consistency of teachers trained to score. Teachers independently rate a pre-selected work sample previously scored by the Benchmarking Team. The proximity of their scores to the scores assigned by the Benchmarking Team provides an indication of their ability to score accurately or a need for further training. 
  • Audit: A system-wide check of the consistency of scores across schools. Work samples, including all those with borderline and failing scores, are scored a second or third time to produce reliability reports and inform remediation if needed. 
  • Double-score: The practice of assigning a second scorer to independently rate work samples that received failing scores, as a check on agreement between scorers.

System Elements

Benchmark work samples. Work samples representing different levels of performance on the rubrics are used for training purposes to illustrate and clarify different scoring levels. During the piloting of performance tasks, student work samples that are produced in response to performance tasks are collected from piloting teachers. Teams of educators with expertise in the content area and grade level select and pre-score benchmarks to be used in scorer training.

Training of trainers. When it is impractical to directly train all educators involved in scoring, a training of trainers model may be used to support regional or local training and to build capacity within local schools to sustain scorer training in future years. Teacher leaders, administrators, or coaches with expertise in the content area may be selected to become lead scorers or trainers. Lead scorers or trainers must meet a calibration standard to be eligible to work as trainers. The trainers then assume a set of responsibilities that include training, calibrating, and supervising scorers.

Scorer training and calibration. Teachers are trained to score student work with the common scoring rubrics. The training is not task-specific, but content area-specific (meaning that the training allows for transfer across tasks within a content area). A common training module and scoring procedures maximize score reliability and comparability across schools. Trained scorers then independently score several pre-scored tasks to check their ability to score reliably. Those who pass the standard for scoring accurately are considered reliable scorers (calibrated). When the online technology platform allows for online scorer training and calibration, face-to-face training can be replaced with online training and calibration every other year.

In order to ensure consistency and accuracy in scoring, raters must independently score a calibration work sample prior to scoring assessments. Calibration work samples have been previously scored multiple times by experienced raters who come to a consensus on the appropriate scores, and the evidence supporting the scores is fully documented and explained. In order to calibrate, one's set of scores must result in the same pass/fail decision, and be within sufficient proximity to the pre-determined scores, as set out in a Calibration Standard. Facilitators work with scorers who fail to calibrate on the first try until they calibrate.
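The two-part calibration check described above can be sketched in code. This is a minimal illustration only, not SCALE's actual Calibration Standard: the names `CalibrationStandard`, `passing_total`, `max_gap`, and `is_calibrated` are hypothetical, and the sketch assumes that the pass/fail decision is based on the total across rubric dimensions and that proximity is measured per dimension.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CalibrationStandard:
    # Both thresholds are assumptions made for illustration.
    passing_total: int  # minimum total score that counts as passing
    max_gap: int        # largest allowed difference on any one rubric dimension


def is_calibrated(
    scorer: Sequence[int],
    benchmark: Sequence[int],
    standard: CalibrationStandard,
) -> bool:
    """Return True if the scorer's ratings meet both parts of the standard."""
    if len(scorer) != len(benchmark):
        raise ValueError("scorer and benchmark must rate the same dimensions")

    # Part 1: the scorer reaches the same pass/fail decision as the benchmark.
    scorer_passes = sum(scorer) >= standard.passing_total
    benchmark_passes = sum(benchmark) >= standard.passing_total
    same_decision = scorer_passes == benchmark_passes

    # Part 2: every dimension score is close enough to the benchmark score.
    close_enough = all(
        abs(s - b) <= standard.max_gap for s, b in zip(scorer, benchmark)
    )
    return same_decision and close_enough
```

For example, with a hypothetical standard of `passing_total=12` and `max_gap=1`, a scorer who assigns `[3, 4, 3, 2]` against benchmark scores of `[3, 3, 3, 3]` would calibrate, while a scorer who assigns `[3, 2, 3, 3]` would not, because the pass/fail decision differs.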

Local scoring and moderation. Some school organizations may choose a local scoring model and involve all teachers implementing performance assessments in scoring. These teachers are trained and calibrated to score tasks. Teachers score their own students' work as it is completed (as part of their regular work responsibilities). Their scores are regularly audited to check for score reliability. Work samples with failing and borderline scores should be submitted to administrators to be re-scored by another trained teacher in the same school or district. All scores are collected and analyzed. The results inform program review and instructional practice, as well as provide the basis for further revisions of the performance outcomes, rubrics, and tasks.

Annual score audit and moderation. Several strategies may be used to check score reliability and the comparability of scores across teachers and schools. An independent external audit of local school scores may be conducted, or a percentage of student work may be double-scored at the school site. A combination of these two methods may be used to check score reliability within and across schools. The work samples used in audits typically include all work samples with borderline and failing scores, which are double- and triple-scored if necessary, and a small percentage of additional scored work samples that are representative of all implementing teachers within a school. School-level and department-level reliability reports are produced from these data and fed back to school administrators to trigger an audit of scorer training procedures and recommendations for remediation. When a school shows consistently large scoring discrepancies over multiple years, direct supervision and training of local teachers by lead trainers from SCALE can be triggered until sufficient improvements in scoring reliability are achieved.
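The composition of an audit sample described above can be illustrated with a short sketch. It is one possible reading, not a documented SCALE procedure: the function name, the record fields (`id`, `teacher`, `score`), the definition of "borderline" as a score within `borderline_margin` points above the passing score, and the per-teacher random draw are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def select_audit_samples(samples, passing_score, borderline_margin,
                         extra_fraction, seed=0):
    """Return the ids of work samples to re-score in an audit.

    samples: list of dicts with hypothetical keys 'id', 'teacher', 'score'.
    """
    rng = random.Random(seed)  # fixed seed so the selection can be reproduced

    # Every failing or borderline sample is always included.
    cutoff = passing_score + borderline_margin
    must_audit = [s for s in samples if s["score"] <= cutoff]

    # Group the remaining samples by teacher so every teacher is represented.
    by_teacher = defaultdict(list)
    for s in samples:
        if s["score"] > cutoff:
            by_teacher[s["teacher"]].append(s)

    # Draw a small fraction from each teacher, rounding up so that each
    # teacher with remaining work contributes at least one sample.
    extras = []
    for teacher in sorted(by_teacher):
        group = by_teacher[teacher]
        k = min(len(group), math.ceil(extra_fraction * len(group)))
        extras.extend(rng.sample(group, k))

    return [s["id"] for s in must_audit + extras]
```

Sampling within each teacher's work, rather than from the pooled set, is a design choice made here so that the additional samples cover all implementing teachers even when some teachers submit far fewer samples than others.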

Centralized or regional blind scoring. Some school organizations may choose a regional or centralized scoring model in which teachers gather at a central location and score work samples submitted from across participating schools in the region. Work samples are de-identified so that teachers do not know the students, teachers, or schools from which the work was generated. This model of scoring may be preferable in systems where the scores are used for some high-stakes purpose. Blind scoring maximizes the credibility of the scores. Because of limited time (due to the cost of teacher release time and travel/boarding), it may not be possible to score 100% of student work samples on site. In the future, this limitation can be overcome with electronic scoring, where work samples can be assigned to teachers to score remotely.