Article Details

教育心理學報 (Bulletin of Educational Psychology)

Title (Chinese): 從多層面Rasch模式來檢視不同的評分者等化連結設計對參數估計的影響
Title (English): Investigating the Effects of Rater Equating Designs on Parameter Estimates in the Context of Preservice Principal Oral Performance
Author: 謝名娟
Volume/Issue: 52:2
Pages: 415-436
Keywords: Many-Facet Rasch model (MFRM), preservice principal oral performance, rater equating design, rater severity
Publication Date: December 2020
DOI: 10.6251/BEP.202012_52(2).0008

Chinese Abstract

In contexts that involve rating, three approaches to collecting rating data are common. The first is the complete rating network design, in which complete observations exist for the components of all facets. The second is the incomplete rating network design, in which the components are partially but systematically linked. The third is the nonlinked rating network design, in which no systematic linkage exists among the components. Although this last design has potential problems, many important examinations in Taiwan still use it for cost reasons. Using the oral performance rating data of preservice principals as empirical data, this study applied the Many-Facet Rasch model (MFRM) to perform equated parameter estimation and examined how these three rater data collection designs affect the parameter estimates of each facet. The results showed that the weaker the linkage among raters, the less stable the parameter estimates. In particular, under the nonlinked rating network design, even when the MFRM was used for adjustment, the parameter estimates and the rank ordering of examinees' abilities contained substantial error; examination institutions should therefore avoid using this design for rater scoring. Important examinations in the future should adopt at least an incomplete rating network design and use a statistical model (such as the MFRM) to adjust for rater severity.
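For reference, a standard rating-scale formulation of the MFRM (following Linacre's many-facet Rasch model; the notation below is supplied here for illustration, not taken from the paper) can be written as:

```latex
% Rating-scale form of the Many-Facet Rasch model (MFRM).
% P_{njik}: probability that examinee n receives score category k
% from rater j on criterion i (notation supplied for illustration).
\log\!\left(\frac{P_{njik}}{P_{nji(k-1)}}\right)
  = \theta_n - \alpha_j - \beta_i - \tau_k
```

where θ_n is the ability of examinee n, α_j the severity of rater j, β_i the difficulty of criterion i, and τ_k the threshold of category k relative to category k − 1. The model places all three facets on a common logit scale, which is what makes equating across raters possible when the rating design is connected.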

English Abstract

A problem in performance assessments is the degree to which rater severity and leniency can affect examinees' scores. In particular, fairness concerns in performance rating systems include the exchangeability of raters. One way to address rater severity is to have every rater score every examinee's performance, so that differences in rater severity affect every examinee equally; however, such fully crossed rating designs are not always feasible in practice. When a fully crossed design is not feasible, equating procedures create links between raters and can thereby control for differences in rater severity. An effective equating procedure involves a strong statistical model and a systematic data collection approach. The Many-Facet Rasch model (MFRM) is a commonly used approach for adjusting for rater differences. Although the MFRM has gained popularity as an equating approach for rater severity, several key considerations related to data collection design and model-data fit are also crucial. In particular, it is vital to ensure a sufficient level of connectivity in the rating design; that is, the raters must be linkable to other assessment components, such as other raters, examinees, or tasks. Three types of data collection design are commonly used for equating. The first is the complete network design, in which complete data exist for all subjects on all assessment components; this is the ideal design for a rating system. The second is the incomplete network design, in which examinees do not have scores on all assessment components, but a partial and systematic degree of connectivity among raters and tasks produces a connected network of assessment components. The third is the nonlinked network design, in which no systematic linkage exists among the components of the facets. Even though the nonlinked network design has potential problems, many important examinations in Taiwan still use it.

The purpose of this study was to examine how differences in data collection design affect parameter estimation in performance assessment. Using empirical data, this study explicitly showed the central role of the data collection design in the interpretation of results when the MFRM is applied. The study had two main objectives: (1) to examine the impact of different data collection designs on the parameter estimates of examinees' ability, raters' severity, and the difficulty of the scoring criteria, using the infit, outfit, separation, and reliability indices and the chi-square test; and (2) to evaluate the correlations of ability estimates across designs and the magnitude of their impact on the ranking of examinees' performance. The top 10, middle 10, and bottom 10 examinees under the complete network design were selected to evaluate their ranking differences under the other designs.

This study used the MFRM and the oral performance scores of preservice principals to explore the effects of the three data collection designs. Four raters and 85 preservice principals participated. The raters scored each preservice principal's oral performance on seven criteria: content, structure, word usage, attitude, pronunciation, intonation, and time control. Each criterion was graded from 1 to 3, where 1 represents the basic level, 2 the proficient level, and 3 the advanced level. The raters were trained before the official rating: the specification and standard of each grade level were explained, raters completed rating exercises, and anchor videos at various levels were discussed so that the raters could better understand the standards.

Four equating designs were considered. Design 1 was the complete network design, in which the four raters rated all preservice principals. Designs 2 and 3 were incomplete network designs, in which some rating scores overlapped to establish the connectivity of the scoring components: in Design 2 each student received scores from three raters, whereas in Design 3 each student received scores from two raters. Design 4 was the nonlinked network design, in which each rater reviewed only his or her assigned class, so there was no connection between raters' scores.
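As an illustration of these four designs, the sketch below builds hypothetical rater-by-examinee assignment matrices for 4 raters and 85 examinees and checks their connectivity; the rotation scheme used for the incomplete designs is an assumption for illustration, not the paper's actual assignment plan.

```python
# Hypothetical sketch of the four data collection designs: assignment
# matrices (1 = rater scores the examinee) plus a connectivity check that
# treats raters and examinees as nodes of a bipartite graph.
import numpy as np

N_RATERS, N_EXAMINEES = 4, 85

def rotating_design(raters_per_examinee):
    """Incomplete design: rotate which raters score each examinee."""
    m = np.zeros((N_RATERS, N_EXAMINEES), dtype=int)
    for e in range(N_EXAMINEES):
        for r in range(raters_per_examinee):
            m[(e + r) % N_RATERS, e] = 1
    return m

design1 = np.ones((N_RATERS, N_EXAMINEES), dtype=int)  # complete network
design2 = rotating_design(3)                            # incomplete, 3 raters each
design3 = rotating_design(2)                            # incomplete, 2 raters each

# Nonlinked design: each rater scores only one block ("class") of examinees.
design4 = np.zeros((N_RATERS, N_EXAMINEES), dtype=int)
for r, block in enumerate(np.array_split(np.arange(N_EXAMINEES), N_RATERS)):
    design4[r, block] = 1

def n_components(m):
    """Connected components of the bipartite rater-examinee graph (union-find)."""
    n_raters, n_examinees = m.shape
    parent = list(range(n_raters + n_examinees))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for r, e in zip(*np.nonzero(m)):
        parent[find(r)] = find(n_raters + e)  # union rater r with examinee e
    return len({find(x) for x in range(n_raters + n_examinees)})

for name, m in [("complete", design1), ("incomplete-3", design2),
                ("incomplete-2", design3), ("nonlinked", design4)]:
    print(f"{name}: {n_components(m)} connected component(s)")
# The nonlinked design splits into 4 disjoint subnetworks, so rater
# severities cannot be placed on a common scale without extra assumptions.
```

The connectivity count makes the core issue concrete: the first three designs form a single connected network, whereas the nonlinked design yields one isolated subnetwork per rater, which is why its parameters cannot be properly equated.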
The MFRM, a statistical model of the Rasch family, was employed to perform the four equating designs; when estimating an examinee's ability level, rater severity and the scoring criteria were considered simultaneously in the model. This study had two main findings. (1) For the incomplete and nonlinked network designs there were some minor problems with model fit, but overall the infit and outfit indices were close to 1, indicating that the MFRM was feasible for analyzing the data used in this study. However, the reliability and separation indices for the nonlinked network design were low, and some chi-square tests did not reach significance; these results differed markedly from those of the complete network design. (2) The weaker the linkage between assessment components, the less stable the parameter estimates. The complete network design provided the strongest connectivity across all facets (subjects, raters, and criteria) and is the ideal data collection scenario; however, it is costly in rating time and money, which makes it difficult to implement in a large-scale test. By contrast, incomplete network designs are more feasible in large-scale tests, namely by having raters' evaluations overlap for some subjects. The correlation between the complete network design and the nonlinked network design was only 0.69, whereas the correlations between the complete network design and the incomplete network designs rose to 0.79–0.94. Moreover, a clear gap existed in participants' rankings between the ideal complete network design and the nonlinked network design: for example, student #59 ranked 79th under the complete network design but 21st under the nonlinked network design, a ranking gap of 58. These results reveal that even when the MFRM is used for correction, large errors remain in the estimated abilities and rankings of examinees under a nonlinked network design.
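A minimal sketch of these two comparisons (correlating ability estimates across designs and locating the largest per-examinee ranking gap) might look as follows; the placeholder arrays stand in for MFRM ability estimates from a facets analysis of each design, and the choice of Spearman correlation is an assumption, since the abstract does not specify the correlation type.

```python
# Sketch of comparing ability estimates across two rating designs.
# theta_complete / theta_nonlinked are placeholders, not the study's data.
import numpy as np
from scipy.stats import spearmanr, rankdata

rng = np.random.default_rng(0)
theta_complete = rng.normal(size=85)                    # placeholder estimates
theta_nonlinked = 0.7 * theta_complete + rng.normal(scale=0.7, size=85)

rho, _ = spearmanr(theta_complete, theta_nonlinked)
print(f"Rank correlation between designs: {rho:.2f}")

# Rank examinees under each design (rank 1 = highest estimated ability)
# and locate the largest per-examinee ranking gap.
rank_a = rankdata(-theta_complete, method="ordinal")
rank_b = rankdata(-theta_nonlinked, method="ordinal")
gap = np.abs(rank_a - rank_b)
worst = int(np.argmax(gap))
print(f"Examinee {worst}: rank {int(rank_a[worst])} vs. {int(rank_b[worst])} "
      f"(gap {int(gap[worst])})")
```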
This study offers two suggestions. (1) Examination institutions should avoid using the nonlinked network rater design. Carefully constructed network assessment designs, based on effective data collection, make it possible to obtain objective and fair measurements in systems with multiple facets. Many current large-scale tests use no statistical model to correct for rater severity and, moreover, use the nonlinked rater design; an examinee may simply have the bad luck of encountering a severe grader and receive a low score as a result. This study therefore recommends that important examinations adopt a more complete rating plan in the future. (2) This study used empirical data; a simulation study could further examine the impact of different component-connectivity designs on parameter estimates. Different experimental designs are also worth exploring: for example, if examinees are nested within tasks, would this nested relationship affect the parameter estimates of ability and rater severity? The impact of more complex data collection designs is worthy of future research.
