## Statistics of Agreement Between Observers

Pearson's r, Kendall's τ, or Spearman's ρ can be used to measure the pairwise correlation between evaluators using an ordered scale. Pearson's coefficient assumes that the rating scale is continuous; Kendall's and Spearman's statistics assume only that it is ordinal. If more than two evaluators are observed, an average level of agreement for the group can be calculated as the mean of the r, τ, or ρ values from all possible pairs of evaluators.

Rating studies are often used to evaluate a new rating system or instrument. When such a study is conducted during the development phase of the instrument, it may be desirable to analyse the data with methods that identify how the instrument could be modified to improve agreement. However, if an instrument is already available in its final form, the same methods may not be useful.

Consider a situation in which we would like to assess the agreement between haemoglobin measurements (in g/dl) made with a bedside haemoglobinometer and by the formal photometric laboratory technique in ten people [Table 3]. The Bland–Altman diagram for these data shows the difference between the two methods for each person [Figure 1]. The mean difference between the values is 1.07 g/dl (with a standard deviation of 0.36 g/dl), and the 95% limits of agreement are 0.35 to 1.79 g/dl. This means that the haemoglobin level measured for a given person by the bedside method may be anywhere from 0.35 g/dl to 1.79 g/dl higher than the level measured by photometry (this holds for 95% of people; for 5% of individuals, the difference could fall outside these limits). Limits this wide clearly mean that the two techniques cannot be used interchangeably.
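The limits of agreement in the haemoglobin example follow directly from the quoted summary values. A minimal sketch in Python; the helper name and the sample data below are illustrative, not taken from Table 3:

```python
from statistics import mean, stdev

def limits_of_agreement(a, b):
    """Bland-Altman 95% limits of agreement for paired measurements:
    mean difference +/- 1.96 * SD of the differences."""
    diffs = [x - y for x, y in zip(a, b)]
    d, s = mean(diffs), stdev(diffs)  # sample SD of the differences
    return d - 1.96 * s, d + 1.96 * s

# With the summary values quoted in the text (mean difference 1.07 g/dl,
# SD 0.36 g/dl) the limits come out as 1.07 - 1.96 * 0.36 = 0.36 and
# 1.07 + 1.96 * 0.36 = 1.78, close to the reported 0.35 and 1.79 (the
# small discrepancy comes from rounding in the quoted summary values).
```
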

Importantly, there is no single criterion for acceptable limits of agreement; this is a clinical decision that depends on the variable being measured.

Now consider a hypothetical situation in which examiners assign grades simply by tossing a coin: heads = pass, tails = fail [Table 1, situation 2]. In this case, one would expect 25% (= 0.50 × 0.50) of students to receive a “pass” grade from both examiners and 25% to receive a “fail” grade from both, i.e. an overall “expected” agreement rate of 50% (= 0.25 + 0.25 = 0.50). The observed agreement rate (80% in situation 1) must therefore be interpreted in light of the fact that 50% agreement was expected purely by chance. The examiners could have improved on chance by at most 50% (best possible agreement minus chance-expected agreement = 100% − 50% = 50%), but in fact improved on it by only 30% (observed agreement minus chance-expected agreement = 80% − 50% = 30%). Their actual performance in agreeing is therefore 30% / 50% = 60%.

Krippendorff's alpha[16][17] is a versatile statistic that assesses the agreement achieved among observers who categorize, rate, or measure a given set of objects in terms of the values of a variable. It generalizes several specialized agreement coefficients by accepting any number of observers, being applicable to nominal, ordinal, interval, and ratio levels of measurement, being able to handle missing data, and being corrected for small sample sizes.

If the number of categories used is small (e.g. 2 or 3), the likelihood of two evaluators agreeing by pure chance increases dramatically. This is because both evaluators must confine themselves to the limited number of options available, which affects the overall agreement rate, and not necessarily their propensity for “intrinsic” agreement (an agreement is considered “intrinsic” if it is not due to chance).
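The chance correction worked through in the coin-toss example (agreement beyond chance, divided by the best achievable agreement beyond chance) is the logic behind Cohen's kappa. A minimal sketch in Python; the function name is ours:

```python
def chance_corrected_agreement(observed, expected):
    """Agreement beyond chance as a fraction of the achievable agreement
    beyond chance: (observed - expected) / (1 - expected).
    Both arguments are proportions in [0, 1)."""
    return (observed - expected) / (1.0 - expected)

# Worked numbers from the text: 80% observed agreement with 50%
# expected by chance gives 30% / 50% = 0.6.
kappa = chance_corrected_agreement(0.80, 0.50)
```

A value of 1 would mean perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.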