Original Article
When to use agreement versus reliability measures
Introduction
Outcome measures in medical sciences may concern the assessment of radiographs and other imaging techniques, biopsy readings, the results of laboratory tests, the findings of physical examinations, or the scores on questionnaires collecting information, for example, on functional limitations, pain coping styles, and quality of life. An essential requirement of all outcome measures is that they are valid and reproducible or reliable [1], [2].
Reproducibility concerns the degree to which repeated measurements in stable study objects, often persons, provide similar results. Repeated measurements may differ because of biologic variation in persons: even stable characteristics often show small day-to-day differences or follow a circadian rhythm. Other sources of variation may originate from the measurement instrument itself or from the circumstances under which the measurements take place. For instance, some instruments may be temperature dependent, or the mood of a respondent may influence the answers on a questionnaire. Measurements based on assessments made by clinicians may be influenced by intrarater or interrater variation.
This article first presents an example of an interrater study, then describes the concepts underlying various reproducibility parameters, which can be distinguished in reliability and agreement parameters. The primary aim of this article is to demonstrate the relationship and the important difference between parameters of reliability and agreement, and to provide recommendations for their use in medical sciences.
An example
In an interrater study on the range of motion of a painful shoulder, different reproducibility parameters were used to present the results [3]. To assess the limitations in passive glenohumeral abduction movement, the range of motion of the arm was measured with a digital inclinometer, and expressed in degrees. Two physical therapists (PTA and PTB) measured the range of motion of the affected and the nonaffected shoulder in 155 patients with shoulder complaints. Table 1 presents the results in
Conceptual difference between agreement and reliability parameters
In the literature, agreement and reliability parameters are often used interchangeably, although some authors have pointed out the differences [6], [7].
Agreement and reliability parameters focus on two different questions:
1. "How good is the agreement between repeated measurements?" This concerns the measurement error and assesses exactly how close the scores of repeated measurements are.
2. "How reliable is the measurement?" In other words, how well can patients be distinguished from each other,
Agreement parameters are neglected in medical sciences
In the 1980s Guyatt et al. [8] clearly emphasized the distinction between reliability and agreement parameters. They explained that reliability parameters are required for instruments used for discriminative purposes, whereas agreement parameters are required for instruments used for evaluative purposes. With a hypothetical example they eloquently demonstrated that discriminative instruments require a high level of reliability: that is, the measurement error should be small in comparison
Relationship between the agreement and reliability parameters
The relationship between parameters of agreement and reliability can best be illustrated by elaborating on the variances involved in the ICC formulas. We therefore first need to explain the meaning of the variance components [12]. Variance (σ²) is the statistical term used to indicate variability.
The variance in observed scores can be subdivided into the variance in the objects under study, in our example the persons (σ²p), the variance in observers (the two different PTs) (σ²o), and the residual variance (σ²residual).
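The role of these components can be sketched with simulated data (a minimal illustration with made-up variance values, not the shoulder data): persons, observers, and residual error each contribute a variance term, and the agreement ICC is the proportion of total variance attributable to persons.

```python
import numpy as np

rng = np.random.default_rng(0)
n_persons, n_raters = 1000, 2

# Illustrative true variance components (not estimates from the study)
var_p, var_o, var_e = 100.0, 4.0, 16.0

person = rng.normal(0, np.sqrt(var_p), size=(n_persons, 1))        # person effects (sigma^2_p)
rater = rng.normal(0, np.sqrt(var_o), size=(1, n_raters))          # observer effects (sigma^2_o)
error = rng.normal(0, np.sqrt(var_e), size=(n_persons, n_raters))  # residual (sigma^2_residual)
scores = person + rater + error

# ICC for agreement: person variance as a proportion of total variance
icc_agreement = var_p / (var_p + var_o + var_e)
print(round(icc_agreement, 3))  # 0.833
```

With the person variance dominating, the ICC is high; shrinking var_p while keeping the other components fixed would lower the ICC without changing the measurement error.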
Illustration of ICC and SEM calculations in the example
Table 2 presents the values of the variance components for the affected and nonaffected shoulder. The variance components were estimated with SPSS (version 10.1), with the range of motion values as dependent variable and persons and PTs as random factors, using the restricted maximum likelihood method. From these variance components, the above-mentioned SEMs can be calculated. For the affected shoulder:
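For a balanced two-way table (one score per person per rater), the same variance components can be estimated from the ANOVA mean squares, which for balanced data coincide with the REML estimates SPSS produces. A sketch with hypothetical range-of-motion scores (the function name and data are illustrative, not the published values):

```python
import numpy as np

def variance_components(scores):
    """ANOVA (mean-squares) estimates for a balanced persons x raters table."""
    y = np.asarray(scores, dtype=float)
    n, k = y.shape
    grand = y.mean()
    ms_p = k * np.sum((y.mean(axis=1) - grand) ** 2) / (n - 1)   # persons
    ms_o = n * np.sum((y.mean(axis=0) - grand) ** 2) / (k - 1)   # observers
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + grand
    ms_e = np.sum(resid ** 2) / ((n - 1) * (k - 1))              # residual
    var_e = ms_e
    var_p = (ms_p - ms_e) / k
    var_o = (ms_o - ms_e) / n
    return var_p, var_o, var_e

# Hypothetical scores (degrees) for 5 patients measured by 2 PTs
scores = [[90, 93], [110, 108], [75, 80], [120, 118], [95, 99]]
var_p, var_o, var_e = variance_components(scores)
sem_agreement = np.sqrt(var_o + var_e)   # takes systematic PT differences into account
icc_agreement = var_p / (var_p + var_o + var_e)
print(round(sem_agreement, 1))  # 2.4
print(round(icc_agreement, 2))  # 0.98
```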
Three ways to obtain SEM values
To facilitate and encourage the use of agreement parameters we will demonstrate how agreement parameters can be derived from the ICC formula, or can be calculated in other ways.
1. SEM values can easily be derived from the ICC formula, if all variance components are presented. In that case, the reader can calculate the ICC of his/her own choice. SEM is calculated as √(σ²o + σ²residual), which equals σ·√(1 − ICCagreement) with σ² the total variance, if one wishes to take the systematic differences between the PTs into account; otherwise, SEM = √σ²residual.
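The first route reduces to a one-liner once a total standard deviation and an ICC are available (illustrative numbers, not the shoulder data):

```python
import math

# Illustrative values: total SD of the scores and an agreement ICC
sd_total = 11.0        # sqrt(sigma^2_p + sigma^2_o + sigma^2_residual), in degrees
icc_agreement = 0.83

# SEM = sqrt(sigma^2_o + sigma^2_residual) = SD_total * sqrt(1 - ICC_agreement)
sem_agreement = sd_total * math.sqrt(1 - icc_agreement)
print(round(sem_agreement, 1))  # 4.5
```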
Typical parameters for agreement and reliability
For repeated measurements on a continuous scale, as in our example, an ICC is the most appropriate reliability parameter. An extensive overview of the various ICC formulas is provided by McGraw and Wong [11].
In our example, agreement was expressed as the percentage of observations lying within predefined limits (Table 1). Presenting it this way makes sense in clinical practice, because every PT knows what 5° and 10° mean. This measure was chosen because it can easily be interpreted by PTs
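Such a percentage-within-limits summary is trivial to compute from the paired measurements. A sketch with hypothetical paired readings (not the published data):

```python
# Hypothetical paired range-of-motion readings (degrees) by two PTs
pta = [90, 110, 75, 120, 95, 100, 85, 105]
ptb = [93, 108, 80, 118, 99, 101, 92, 104]

diffs = [abs(a - b) for a, b in zip(pta, ptb)]
for limit in (5, 10):
    pct = 100 * sum(d <= limit for d in diffs) / len(diffs)
    print(f"within {limit} deg: {pct:.0f}%")  # within 5 deg: 88%, within 10 deg: 100%
```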
Clinical interpretation
Agreement parameters are expressed on the actual scale of measurement, whereas reliability parameters are dimensionless values between 0 and 1. This is an important advantage for clinical interpretation. If weights are measured in kilograms, the dimension of the SEM is kilograms. For example, if we know that a weighing scale has a SEM of 300 g, we know that we can use it to monitor adult body weight, because changes of less than 1 kg are not important. The smallest detectable change
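A commonly used expression for the smallest detectable change is SDC = 1.96 · √2 · SEM, where the √2 reflects that a change score involves two measurements. A sketch using the weighing-scale SEM above:

```python
import math

sem = 0.3  # kg, the weighing-scale example
sdc = 1.96 * math.sqrt(2) * sem  # smallest change exceeding measurement error (95% confidence)
print(round(sdc, 2))  # 0.83
```

A SEM of 300 g thus gives a smallest detectable change of roughly 0.83 kg, consistent with treating changes under 1 kg as unimportant.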
Conclusion
In this article we have shown the important difference between the parameters of reliability and agreement and their relationship. Agreement parameters will be more stable over different population samples than reliability parameters, as we observed in our shoulder example, in which the SEM was quite similar for the affected and the nonaffected shoulder. Reliability parameters are highly dependent on the variation in the population sample, and are only generalizable to samples with a similar
References (14)
- et al. Measuring change over time: assessing the usefulness of evaluative instruments. J Chronic Dis (1987)
- et al. Defining clinically meaningful change in health-related quality of life. J Clin Epidemiol (2003)
- et al. Health Measurement Scales: a practical guide to their development and use (2003)
- et al. Measuring health: a guide to rating scales and questionnaires (1996)
- et al. Inter-observer reproducibility of range of motion in patients with shoulder pain using a digital inclinometer. BMC Musculoskel Disord (2004)
- et al. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet (1986)
- et al. Psychometric theory (1994)