Table 7.2 gives the measurement report for raters. It provides the same information about raters that Table 7.1 provided for students. Table 7.2.2 arranges the data by fit, and we can immediately see that Rater 991 might be the cause of the misfit shown by Student 996. This rater's infit and outfit mean squares of 1.39 and 1.38 indicate some misfit, and the point-measure correlation is .39, compared with an expected value of .55. This suggests that this rater is interpreting the rubric somewhat differently from the average rater.

Table 7.2.2 Raters Measurement Report (arranged by fN).
+-------------------------------------------------------------------------------------------------------------------+
|Total Total Obsvd Fair-M| Model | Infit Outfit |Estim.| Correlation | Exact Agree. | |
|Score Count Average Average|Measure S.E. | MnSq ZStd MnSq ZStd|Discrm| PtMea PtExp | Obs % Exp % | Num Raters|
|-----------------------------+--------------+---------------------+------+-------------+--------------+------------|
| 303 139 2.18 2.21| 1.65 .15 | 1.50 3.7 1.44 3.4| .46 | .54 .61 | 44.4 45.0 | 1 1 |
| 334 129 2.59 2.67| .11 .19 | 1.39 2.6 1.38 2.0| .62 | .39 .55 | 56.4 58.7 | 991 991 |
| 287 130 2.21 2.23| 1.60 .16 | 1.16 1.3 1.20 1.6| .76 | .47 .61 | 43.3 46.2 | 975 975 |
| 286 123 2.33 2.39| 1.11 .17 | 1.10 .8 1.10 .8| .86 | .42 .60 | 46.9 50.9 | 891 891 |
| 345 130 2.65 2.73| -.15 .20 | 1.10 .7 .92 -.3| 1.01 | .65 .54 | 61.5 60.6 | 968 968 |
| 349 130 2.68 2.75| -.25 .20 | 1.08 .6 .91 -.3| 1.01 | .60 .53 | 63.2 61.6 | 790 790 |
| 364 130 2.80 2.85| -.88 .24 | .91 -.4 .91 -.2| 1.04 | .46 .47 | 63.7 63.8 | 949 949 |
| 354 130 2.72 2.81| -.55 .21 | .88 -.8 1.12 .5| 1.03 | .48 .51 | 61.2 61.3 | 996 996 |
| 324 130 2.49 2.57| .50 .17 | .81 -1.6 1.01 .0| 1.15 | .62 .57 | 57.3 55.9 | 823 823 |
| 341 130 2.62 2.69| .04 .19 | 1.07 .5 .80 -1.1| 1.07 | .70 .55 | 62.1 60.2 | 915 915 |
| 269 129 2.09 2.13| 1.87 .15 | .75 -2.3 .77 -2.0| 1.24 | .45 .61 | 35.5 41.3 | 973 973 |
| 350 127 2.76 2.81| -.58 .22 | .78 -1.4 .64 -1.5| 1.18 | .58 .49 | 65.4 63.4 | 847 847 |
| 373 129 2.89 2.94| -1.92 .30 | .65 -1.6 .60 -.8| 1.18 | .54 .39 | 63.9 61.6 | 831 831 |
| 371 130 2.85 2.91| -1.45 .27 | .71 -1.5 .50 -1.4| 1.19 | .57 .43 | 64.8 63.2 | 837 837 |
| 366 130 2.82 2.88| -1.08 .25 | .64 -2.1 .46 -2.0| 1.28 | .66 .46 | 67.1 63.5 | 850 850 |
|-----------------------------+--------------+---------------------+------+-------------+--------------+------------|
| 334.4 129.7 2.58 2.64| .00 .20 | .97 -.1 .92 -.1| | .54 | Mean (Count: 15) |
| 32.3 3.1 .25 .26| 1.12 .04 | .25 1.7 .29 1.5| | .09 | S.D. (Population) |
| 33.4 3.2 .26 .27| 1.16 .05 | .26 1.8 .30 1.6| | .09 | S.D. (Sample) |
+-------------------------------------------------------------------------------------------------------------------+
Model, Populn: RMSE .21 Adj (True) S.D. 1.10 Separation 5.26 Strata 7.34 Reliability (not inter-rater) .97
Model, Sample: RMSE .21 Adj (True) S.D. 1.14 Separation 5.45 Strata 7.60 Reliability (not inter-rater) .97
Model, Fixed (all same) chi-square: 475.2 d.f.: 14 significance (probability): .00
Model, Random (normal) chi-square: 13.5 d.f.: 13 significance (probability): .41
Inter-Rater agreement opportunities: 12559 Exact agreements: 7168 = 57.1% Expected: 7171.3 = 57.1%
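The Infit and Outfit columns in Table 7.2.2 are mean-square residual statistics: outfit is the plain average of a rater's squared standardized residuals, while infit weights each squared residual by its model variance, so it is less affected by a few surprising ratings on off-target performances. The sketch below is not Facets' own code; it assumes that the observed ratings, their model-expected values, and their model variances are already available from an estimated model, and the function name fit_mean_squares is ours.

import numpy as np

def fit_mean_squares(observed, expected, variance):
    """Infit and outfit mean squares for one rater.

    observed -- the ratings this rater actually gave
    expected -- model-expected values for the same observations
    variance -- model variance of each observation
    (expected and variance are assumed to come from an already-estimated
    many-facet Rasch model.)
    """
    observed = np.asarray(observed, dtype=float)
    expected = np.asarray(expected, dtype=float)
    variance = np.asarray(variance, dtype=float)

    residual = observed - expected
    z_squared = residual ** 2 / variance            # squared standardized residuals

    outfit = z_squared.mean()                       # unweighted mean square
    infit = (residual ** 2).sum() / variance.sum()  # information-weighted mean square
    return infit, outfit

Values near 1.00 indicate ratings about as unpredictable as the model expects; values well above 1.00 indicate noise, and values well below 1.00 indicate ratings that are more predictable than the model expects.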
However, in this case, Rater 1 is the teacher and is reported as quite misfitting (infit 1.50), yet gave a strict average rating of 2.18, while four raters are quite overfitting, with mean squares well below 1.00, and very lenient, with average ratings above 2.5 out of a maximum of 3. The apparent consistency of these raters is very likely misleading: they appear to have assigned the maximum rating to anything except a very weak performance. Although they are consistent, they are consistently very lenient, and so provide us with less information about the performances than the more severe raters.
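The sense in which these lenient, overfitting raters carry less information can be made concrete. Under a Rasch-family rating scale model, the statistical information contributed by a single rating equals the model variance of that rating, so ratings piled up in the top category contribute very little. The category probabilities below are invented purely for illustration, and rating_information is our own label for the calculation.

def rating_information(probs):
    """Fisher information of one observed rating under a Rasch-family model,
    which equals the model variance of the rating, given the probabilities
    of each category 0..m for this rater-performance pair."""
    expected = sum(k * p for k, p in enumerate(probs))
    return sum((k - expected) ** 2 * p for k, p in enumerate(probs))

# Hypothetical category probabilities for categories 0-3, purely illustrative:
lenient_rater = [0.01, 0.04, 0.15, 0.80]   # nearly always awards the top category
central_rater = [0.10, 0.40, 0.40, 0.10]   # ratings spread across the middle

print(rating_information(lenient_rater))   # ~0.33: little information per rating
print(rating_information(central_rater))   # ~0.65: about twice as much per rating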
A reliability coefficient (.97, labelled "not inter-rater") is provided at the bottom of Table 7.2, but it can be confusing. It does not report how much the raters agreed; it reports how confident we can be that they are of different severity. In other words, raw scores from different raters are not directly comparable, so we should use adjusted (fair average) scores.
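That reliability figure can be reconstructed from the other footer statistics using the standard Rasch definitions: separation is the adjusted (true) S.D. divided by the RMSE, strata is (4 x separation + 1) / 3, and reliability is separation squared divided by (1 + separation squared). The short check below uses the two-decimal values printed in the population line, so it reproduces the printed 5.26, 7.34, and .97 only approximately; it is a sketch, not Facets' own computation.

# Rough check of the "Model, Populn" footer line using standard Rasch formulas.
rmse = 0.21       # root-mean-square standard error of the rater measures
true_sd = 1.10    # adjusted (true) S.D. of the rater measures

separation = true_sd / rmse                            # printed as 5.26
strata = (4 * separation + 1) / 3                      # printed as 7.34
reliability = separation ** 2 / (1 + separation ** 2)  # printed as .97

print(f"Separation {separation:.2f}, Strata {strata:.2f}, Reliability {reliability:.2f}")
# -> Separation 5.24, Strata 7.32, Reliability 0.96 (differences are rounding only)

The same arithmetic applied to the sample line (RMSE .21, adjusted S.D. 1.14) reproduces its 5.45, 7.60, and .97 in the same approximate way.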
Inter-rater agreement is also reported at the bottom of Table 7.2; in this case the observed exact agreement of 57.1% matched the model-expected value of 57.1% almost exactly, which suggests that the raters were rating independently, as the model assumes, rather than working toward consensus.
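The observed figure is simply the proportion of rating opportunities, pairs of raters who rated the same performance, on which both gave the same category; the expected figure additionally needs the model probabilities of each rater awarding each category, so it is not reproduced here. The sketch below assumes a hypothetical ratings dictionary keyed by (rater, performance) and is not Facets' own counting code.

from itertools import combinations

def observed_exact_agreement(ratings):
    """Percentage of rating opportunities on which two raters who scored
    the same performance gave exactly the same category.

    ratings -- hypothetical dict mapping (rater, performance) -> category
    """
    by_performance = {}
    for (rater, performance), category in ratings.items():
        by_performance.setdefault(performance, []).append(category)

    opportunities = 0
    agreements = 0
    for categories in by_performance.values():
        # every pair of ratings of the same performance is one opportunity
        for a, b in combinations(categories, 2):
            opportunities += 1
            agreements += (a == b)
    return 100 * agreements / opportunities if opportunities else 0.0

# Toy usage: two raters share two performances and agree on one of them.
print(observed_exact_agreement({("A", "p1"): 3, ("B", "p1"): 3,
                                ("A", "p2"): 2, ("B", "p2"): 3}))   # 50.0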