OBJECTIVES Routine data collection is now considered mandatory. Therefore, staff rated clinical scales that consist of multiple items should have the minimum number of items necessary for rigorous measurement. This study explores the possibility of developing a short form Barthel index, suitable for use in clinical trials, epidemiological studies, and audit, that satisfies criteria for rigorous measurement and is psychometrically equivalent to the 10 item instrument.
METHODS Data were analysed from 844 consecutive admissions to a neurological rehabilitation unit in London. Random half samples were generated. Short forms were developed in one sample (n=419), by selecting items with the best measurement properties, and tested in the other (n=418). For each of the 10 items of the BI, item total correlations and effect sizes were computed and rank ordered. The best items were defined as those with the lowest cross product of these rank orderings. The acceptability, reliability, validity, and responsiveness of three short form BIs (five, four, and three item) were determined and compared with the 10 item BI. Agreement between scores generated by short forms and 10 item BI was determined using intraclass correlation coefficients and the method of Bland and Altman.
RESULTS The five best items in this sample were transfers, bathing, toilet use, stairs, and mobility. Of the three short forms examined, the five item BI had the best measurement properties and was psychometrically equivalent to the 10 item BI. Agreement between scores generated by the two measures for individual patients was excellent (ICC=0.90) but not identical (limits of agreement=1.84±3.84).
CONCLUSIONS The five item short form BI may be a suitable outcome measure for group comparison studies in comparable samples. Further evaluations are needed. Results demonstrate a fundamental difference between assessment and measurement and the importance of incorporating psychometric methods in the development and evaluation of health measures.
- five item Barthel Index
- psychometric methods
- health measurement
- item reduction
Routine data collection for audit purposes is now considered mandatory. In addition, effectiveness studies and large multicentre trials are dependent on outcomes data collection being integrated into daily clinical practice. This formidable task is compounded by a requirement to supplement traditional health indicators that can be collected easily (for example, mortality rates and duration of stay), with measures of patient oriented outcomes that are commonly multi-item scales rated by clinicians (for example, disability levels). These measures, which generate total scores by combining the scores of many items, must be simple, easy to use, and rigorous (reliable, valid, responsive) if they are to be administered routinely, and used to influence patient welfare and guide the expenditure of public funds. Therefore, they should have the minimum number of items necessary for rigorous measurement.
In the development of multi-item measures, the balance between item number and scientific rigor can be achieved using psychometric methods. Briefly, a large pool of items is generated to ensure that all important variables are considered for inclusion in the final instrument,1 and then reduced to its quintessential number on the basis of item performance in empirical field tests.2 Although psychometric methods have been used extensively in the social sciences,3 they have been slow to transfer to medicine. Consequently, many widely used health measures—for example, the Barthel index (BI),4 which is a 10 item measure of physical dependence in personal activities of daily living (PADL)—were developed by choosing items on the basis of their clinical relevance. Whereas this clinical approach to scale development is intuitively sound, it assumes that the items chosen have adequate measurement properties and that all these items are required to measure a construct rigorously.
The fact that the BI was developed clinically raises the question of whether its number of items can be reduced using psychometric methods. Although it has relatively few items, takes only a few minutes to score, and is already recommended for use in elderly populations,5 rehabilitation,6 and patients with stroke,7 there is evidence that a short form BI might be a valuable measure. The 1998 and 1999 Royal College of Physicians National Sentinel Audits of Stroke (n=6894 and 5823) were able to report BI scores for only 59% and 61% of survivors respectively.8 9 Therefore, the objective of this study was to explore the possibility of developing a short form BI that is psychometrically equivalent to the 10 item measure.
PARTICIPANTS AND DATA COLLECTION
All admissions to the neurorehabilitation unit of the National Hospital for Neurology and Neurosurgery in London were studied between May 1993 and March 1999. Data routinely collected were diagnostic and demographic information; admission and discharge disability level measured by the BI (the version of Collin et al 10; recommended by McDowell and Newell11) and functional independence measure (FIM12) rated by staff from observation. Also, as part of an ethically approved multicentre study conducted between 1994 and 1996, the London handicap scale (LHS13) and medical outcomes study 36 item short form health survey (SF-3614) were administered to predetermined participants (first two admissions each week).
The database was randomly divided into two samples. In one sample, short forms were developed by performing an item analysis and selecting those items with the best measurement properties. In the other sample, five measurement properties of the short forms were examined and compared with the 10 item BI. We hypothesised that to improve clinical usefulness significantly while maintaining scientific soundness a short form BI should have a minimum of three and a maximum of five items. Therefore, three short forms (five, four, and three items) were developed and tested.
Development of short forms
The goal was to develop a short form BI that maximised concurrent validity (correlation with the 10 item BI) and responsiveness (ability to detect change in disability). Therefore, items were evaluated on the basis of corrected item total correlations computed from admission scores, and effect sizes computed from change scores (discharge minus admission). The best items were then selected.
Corrected item total correlations are correlations between each item and the sum of the remaining items in the scale. For example, the corrected item total correlation for the transfer item is the correlation between this item and the total score generated by summing the item scores of the other nine Barthel items. Correcting the total score by removing the item of interest prevents spuriously high values due to item overlap. Product-moment correlations were computed for items with polychotomous (three or more) response options, and its equivalent, point biserial correlations, were computed for items with dichotomous (two) response options.15 Corrected item total correlations indicate the extent to which each item relates to the construct measured by the total score. Consequently, higher values indicate better items.2
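As an illustration only, the corrected item total correlation described above can be sketched in Python. The function names and the toy scores are hypothetical and are not taken from the study data. A point biserial correlation is the same product-moment calculation applied to a two-valued item, so a single helper covers both cases.

```python
from math import sqrt

def pearson(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corrected_item_total(scores, item):
    """Correlate one item with the sum of all the OTHER items.

    scores: list of per-patient dicts mapping item name -> item score.
    """
    item_scores = [row[item] for row in scores]
    # Leave the item of interest out of the total to avoid item overlap.
    rest_totals = [sum(v for k, v in row.items() if k != item) for row in scores]
    return pearson(item_scores, rest_totals)

# Hypothetical admission scores for four patients on three items.
toy = [
    {"transfers": 0, "bathing": 0, "stairs": 0},
    {"transfers": 1, "bathing": 0, "stairs": 1},
    {"transfers": 2, "bathing": 1, "stairs": 1},
    {"transfers": 3, "bathing": 1, "stairs": 2},
]
r_transfers = corrected_item_total(toy, "transfers")
```

In this toy data the transfers item tracks the sum of the other two items exactly, so the coefficient is at its maximum; real item data would give lower values.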
Effect sizes are standardised change scores.16 There are many types of effect size calculation.17 Here they are calculated as the mean change score divided by the SD of admission scores.18 Effect sizes indicate the extent to which each item changes due to rehabilitation. Therefore, higher values indicate better items.
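A minimal sketch of this effect size calculation follows. It assumes the sample standard deviation (n−1 denominator), which the text does not specify, and the scores are invented for illustration.

```python
from statistics import mean, stdev

def effect_size(admission, discharge):
    """Mean change score divided by the SD of admission scores."""
    change = [d - a for a, d in zip(admission, discharge)]
    # stdev() uses the n-1 denominator; this is an assumption, not stated in the text.
    return mean(change) / stdev(admission)

# Hypothetical item scores for four patients.
admission = [2, 4, 6, 8]
discharge = [4, 6, 8, 10]
es = effect_size(admission, discharge)
```

The same function applies unchanged to total scores, which is how responsiveness of whole scales would be compared.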
An index of overall item superiority was determined by rank ordering item total correlations and effect sizes (1=best), and then computing the cross product of these rank orderings. Lower values indicate better items. Short forms with five, four, and three items were generated by selecting the best five, four, and three items respectively.
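The selection rule above might be sketched as follows. The item labels and values are hypothetical, and tied values are not handled because the text does not say how ties were ranked.

```python
def rank_descending(values):
    """Rank items so that the highest value gets rank 1 (no tie handling)."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {item: position for position, item in enumerate(ordered, start=1)}

def select_best_items(item_total, effect_sizes, k):
    """Return the k items with the lowest cross product of the two rank orderings."""
    rank_corr = rank_descending(item_total)
    rank_es = rank_descending(effect_sizes)
    product = {item: rank_corr[item] * rank_es[item] for item in item_total}
    return sorted(product, key=product.get)[:k]

# Hypothetical item statistics for four items labelled A to D.
item_total = {"A": 0.80, "B": 0.75, "C": 0.60, "D": 0.40}
effect_sizes = {"A": 0.70, "B": 0.65, "C": 0.30, "D": 0.20}
best_two = select_best_items(item_total, effect_sizes, 2)
```

Multiplying the ranks rewards items that do well on both criteria; an item ranked first on one criterion but last on the other is penalised relative to one that is consistently near the top.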
Psychometric evaluation of short forms
Standard methods were used to examine five psychometric properties: acceptability, reliability, validity, responsiveness, and agreement between scores generated by short form and 10 item BIs.2 19-23 To aid comparison of different versions of the BI which have different numbers of items and therefore different score ranges, the scores for all scales were transformed to have a range of 0–20. This was achieved using the following formula24:
Transformed score = 20 × (observed score − minimum possible score) / (maximum possible score − minimum possible score)
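As a sketch, and assuming the transformation is the usual linear rescaling (the distance from the minimum divided by the possible score range, then multiplied by 20), it could be written as below. The raw range of 0 to 10 in the example is hypothetical.

```python
def transform_score(observed, minimum, maximum, new_max=20):
    """Rescale a raw scale score linearly onto the range 0..new_max."""
    return new_max * (observed - minimum) / (maximum - minimum)

# Hypothetical short form with a raw range of 0 to 10.
midpoint = transform_score(5, 0, 10)
```

Under this rescaling the minimum raw score always maps to 0 and the maximum to 20, whatever the number of items.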
Acceptability is the extent to which the range of health measured by a scale matches the distribution of health in the study sample. It is determined by examining score distributions.20 Ideally, the observed scores from a sample should span the entire range of the scale, the mean score should be near the scale midpoint, and floor and ceiling effects (% of the sample having the minimum and maximum score respectively) should be small. McHorney and Tarlov recommend that floor and ceiling effects should be <15%.25
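Floor and ceiling effects as defined here are simple percentages. A minimal sketch, using invented scores on the 0 to 20 range, is:

```python
def floor_ceiling(scores, minimum=0, maximum=20):
    """Percentage of the sample at the minimum and at the maximum score."""
    n = len(scores)
    floor = 100 * sum(s == minimum for s in scores) / n
    ceiling = 100 * sum(s == maximum for s in scores) / n
    return floor, ceiling

# Hypothetical transformed scores for ten patients.
scores = [0, 5, 10, 10, 15, 20, 20, 12, 8, 3]
floor_pct, ceiling_pct = floor_ceiling(scores)
```

In this invented sample the ceiling effect would exceed the 15% criterion while the floor effect would not.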
Reliability is defined as the extent to which random (measurement) error is associated with a measurement instrument (high reliability=low error).2 20 Reliability is a generic term. Multiple types of reliability (and therefore many reliability coefficients) exist for each instrument; each addresses a different source (or sources) of random error.26 Although clinicians are most familiar with interrater and intrarater reproducibility, internal consistency is considered a superior indicator of reliability for multi-item measures.24 Some of the reasons for this are discussed later. Internal consistency reliability is calculated from the intercorrelations among the items using Cronbach's α coefficients.27 Confidence intervals for α coefficients can be calculated using the formula suggested by Nunnally and Bernstein.28 It is recommended that reliability estimates should exceed 0.80 for group comparison studies, and 0.95 for individual patient clinical decision making.2 Confidence intervals for individual patient scores can be computed from reliability estimates by calculating the standard error of measurement (SEM).2 The SEM is an estimate of the dispersion of scores that would be obtained if a measure were administered to a given individual multiple times.15 The following formulae are used:
SEM=standard deviation of sample scores×√(1−reliability)
95% confidence intervals for individual patient scores=±1.96×SEM
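The reliability calculations can be sketched as follows. Cronbach's α is written in its standard variance form, k/(k−1) × (1 − sum of item variances / variance of total scores), which is an assumption about the computation actually used; the SEM and confidence interval functions follow the two formulae above. All data are invented.

```python
from math import sqrt
from statistics import variance

def cronbach_alpha(items):
    """items: one list of scores per item, all over the same patients."""
    k = len(items)
    totals = [sum(patient) for patient in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))

def sem(sd, reliability):
    """Standard error of measurement: SD of sample scores x sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)

def individual_ci_half_width(sd, reliability):
    """Half width of the 95% confidence interval around one patient's score."""
    return 1.96 * sem(sd, reliability)

# Hypothetical scores on three items for four patients.
items = [[1, 2, 3, 4], [2, 3, 4, 5], [1, 3, 3, 5]]
alpha = cronbach_alpha(items)

# Hypothetical scale with SD 5 and reliability 0.84.
half_width = individual_ci_half_width(5, 0.84)
```

With these invented values the SEM is 2 points, so an individual score carries an interval of roughly ±3.9 points. This illustrates why a reliability that is adequate for group comparisons can still leave wide intervals around individual scores.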
Validity is the extent to which a rating scale measures what it purports to measure.23 In this study, the aim was to determine the extent to which the validity of the short forms and the original BI was similar. Three methods were used. Firstly, the extent to which each short form BI predicted the original 10 item BI (concurrent validity) was determined by examining their intercorrelations. Secondly, the extent to which different forms of the BI related to measures of similar and dissimilar constructs (convergent and discriminant validity28) was determined by comparing the magnitude and pattern of their correlations with four other health measures (FIM, LHS, SF-36 PCS, and SF-36 MCS) and two demographic variables (age and sex). Furthermore, we examined the extent to which these correlations conformed with a priori predictions. We expected BIs to correlate highly (r>0.80) with other measures of dependency (FIM), low to moderately (r=0.10 to 0.50) with measures of handicap (LHS) and health status (SF-36), and be uncorrelated (r<0.10) with age and sex. Thirdly, the extent to which short forms and the 10 item BI are interchangeable was determined by examining the agreement between the admission scores they generated using a random effects model intraclass correlation coefficient (ICC19 29) and the method proposed by Bland and Altman.22
Responsiveness is the ability of an instrument to detect change in the construct being measured.30 This was determined by calculating effect sizes from admission and discharge total scores.18 20 23 Larger values indicate greater responsiveness. Effect sizes for the different forms of the BI were compared.
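The agreement analysis described under validity can be sketched as below. Two assumptions are made that the text does not confirm: the limits of agreement are taken as the mean difference ± 1.96 SD of the differences, and the random effects ICC is taken to be the two-way, single-measure form for two measures. Function names and scores are hypothetical.

```python
from statistics import mean, stdev

def limits_of_agreement(full, short):
    """Bland and Altman: bias and 95% limits for (full - short) differences."""
    diffs = [f - s for f, s in zip(full, short)]
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return bias, bias - half_width, bias + half_width

def icc_two_way_random(full, short):
    """Two-way random effects, single-measure ICC for two measures (assumed form)."""
    n, k = len(full), 2
    rows = list(zip(full, short))
    grand = mean(list(full) + list(short))
    # Mean squares for patients (rows), measures (columns), and residual error.
    ms_rows = k * sum((mean(r) - grand) ** 2 for r in rows) / (n - 1)
    ms_cols = n * sum((mean(c) - grand) ** 2 for c in (full, short)) / (k - 1)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ms_rows * (n - 1) - ms_cols * (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical transformed admission scores for four patients.
full_scores = [3, 5, 7, 10]
short_scores = [2, 4, 6, 8]
bias, lower, upper = limits_of_agreement(full_scores, short_scores)
icc = icc_two_way_random(full_scores, short_scores)
```

The two statistics answer different questions: the ICC summarises agreement across the sample, whereas the limits of agreement show how far the two scores may differ for an individual patient.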
A total of 844 patients were admitted to the rehabilitation unit between 1993 and 1999. Barthel index scores could not be computed for seven patients (0.8%) due to missing data. The characteristics of those people from whom the short forms were developed and those in whom short forms were evaluated were similar (table 1).
ITEM ANALYSIS AND DEVELOPMENT OF SHORT FORMS
Corrected item-total correlations ranged from 0.83 (toilet use and transfers) to 0.34 (bowels), and effect sizes ranged from 0.68 (bathing) to 0.17 (bowels). The five best items were transfers, bathing, toilet use, stairs, and mobility (table 2).
PSYCHOMETRIC EVALUATION OF SHORT FORMS
All short forms showed good variability as scores spanned the full scale range. Mean scores were situated near the scale midpoint and floor and ceiling effects were small (table 3). Only the three item short form failed to satisfy all acceptability criteria as its ceiling effect exceeded the suggested maximum of 15%.
All α coefficients exceeded the suggested minimum criterion of 0.80, but lower limit confidence intervals for the four and three item short forms fell below this standard (table 3). Confidence intervals around individual patient scores were wide and inversely related to the number of items.
Short forms correlated highly (range 0.93 to 0.96; table 4) with the 10 item BI, indicating they were equivalent measures of the same construct. The direction, magnitude, and pattern of correlations with other measures and variables were consistent with predictions and near identical across the four instruments, indicating that they had equivalent convergent and discriminant validity. Intraclass correlation coefficients between the 10 item BI and all short forms were high (range 0.89 to 0.92) and exceeded the standard of 0.75 for “excellent” agreement.21 However, the limits of agreement indicated that scores for individual patients were not identical, and the width of these limits was inversely related to the number of items (table 3).
Effect sizes for the 10, five, and four item versions of the BI were similar indicating equivalent responsiveness (table 4). The effect size for the three item BI was a little smaller.
The goal of this study was to develop a short form BI that satisfies criteria for rigorous measurement and is psychometrically equivalent to the 10 item instrument. Of the three short forms developed, the five item BI (table 5) best meets this goal. Reducing the number of items from 10 to five could decrease the time taken to administer the measure and enter data, and lessen the potential for incomplete data collection. Further studies are required to consider these empirical questions. More importantly, selecting items on their performance has resulted in no significant loss of acceptability, reliability, validity, or responsiveness.
Results from this study suggest that the five item BI could replace the original measure in clinical trials, epidemiological studies, and audit. However, can the two instruments be used interchangeably? As different measurement methods are not expected to generate identical results, the essential question is whether the difference between scores is large enough to affect clinical interpretation.22 The ICC is very high indicating “excellent” agreement between scores.21 Nevertheless, the sample mean scores differ, signifying a small relative bias.22 This is a predictable finding (we have selected items with high item total correlations and, therefore, more symmetric item response distributions31) and if consistent across samples can be adjusted for by adding 1.84 to mean scores generated by the five item measure. The limits of agreement between scores for individual patients may at first sight seem large (±3.84). However, they are smaller than others have reported for the test-retest reproducibility of the 10 item BI, which is widely accepted to be adequate (±4.232). More importantly, health measures such as the BI are recommended for group comparison studies and not individual patient clinical decision making. This is because confidence intervals around individual scores, as demonstrated here, are too wide to be able to make reliable and valid judgements at the level of the individual patient.25
Results from this study underline a fundamental difference between assessment and measurement. When assessing a health construct (for example, a person's dependence in personal activities of daily living), clinicians need to gather as much relevant information as possible. By contrast, measurement requires that this construct be quantified rigorously. There is no doubt that the 10 item BI provides a more comprehensive assessment of physical dependence in personal activities of daily living than the five item short form. Therefore, the 10 item BI is a superior assessment tool. However, this study demonstrates that the two instruments generate equivalent quantitative estimates (measurements) of this construct in this sample. Consequently, the two instruments are equivalent measures. This finding shows that the entire range of clinically relevant items is not required to measure a construct rigorously. In fact, surprisingly few items are needed provided they are chosen on the basis of their empirical performance as measures. Interestingly, previous investigators have generally added clinically chosen items to the BI thinking that its content was too limited and that longer instruments would be superior measures (see McDowell and Newell11 for review of 12, 14, 15, 16, and 17 item BIs).
This difference between assessment and measurement emphasises the importance of a psychometric approach to scale development. That is, a large pool of items should be generated and reduced to form rating scales on the basis of their performance in empirical field tests,3 and not on the sole basis of clinical criteria. However, several methods exist for reducing an item pool to its quintessential number. Our criteria were chosen specifically to develop a measure that predicted the 10 item BI and maximised responsiveness. Other studies have selected items on the basis of linear regression,33 factor analysis,34 interitem correlations,35 equidiscriminatory item-total correlations,36 Rasch item analysis,37 item response theory modelling,38 and patient ratings of item importance and frequency.35 Although some of these methods have been compared,35 36 the impact of different item reduction techniques on the development of multi-item measures has yet to be adequately determined.
One previous study developed a short form version of the BI by selecting the items that best predicted function 6 months after stroke.39 The measurement properties of this four item BI (feeding, grooming, bladder, bowels) have never been reported. In our sample they are limited. The ceiling effect (27.5%) and reliability (α=0.60) fail to satisfy recommended criteria. The correlation (r=0.82) and agreement (ICC=0.69; limits of agreement ±6.24) with the 10 item BI, and responsiveness (effect size=0.52) are notably less than for all of our short forms. Therefore, these four items reflect the 10 item BI to a limited extent and do not constitute a reliable and valid measure of physical independence in personal activities of daily living.
Our study has two limitations. Firstly, test-retest and interrater reproducibility were not examined. Although these data are important, high levels of agreement between the five and 10 item BI indicate good reliability for both measures.2 In addition, previous studies have consistently demonstrated high test-retest and interrater reproducibility for BI items suggesting that the five item short form total score will be reliable.11 Moreover, internal consistency is recognised to be the most important type of reliability for multi-item measures because α coefficients are conservative estimates and the test-retest method generates spuriously high values due to memory effect.2
The second limitation of this study is its generalisability. We have only studied a sample of people with neurological disability undergoing inpatient rehabilitation. Although subgroup analyses show that results are generalisable to stroke (n=125) and multiple sclerosis (n=407), work is needed to determine the applicability to other samples and to define whether these five items are consistently the best. It is also important to note that we have merely shortened an existing instrument and not examined the extent to which these scales are effective outcome measures. Any inherent limitations of the BI remain: for example, its restricted applicability to people with moderate and severe disability, and its failure to measure directly the cognitive and communication impact of disease.
A psychometrically equivalent five item short form BI has been developed. Future studies are now required to determine the generalisability of these results, to establish the limitations of this instrument, and to understand fully the trade-offs involved in using it. Results highlight a fundamental difference between assessment and measurement and the value of a psychometric approach to health measurement.
We thank all the people who participated in this study, the staff of the neurorehabilitation unit who routinely collect data, and Dr Barney Reeves (Royal College of Surgeons, London) for an important discussion. Dr Hobart was funded by a Wellcome Training Fellowship in Health Services Research and a grant obtained by AJT from the NHS Central Audit Fund. The multicentre study was funded by a grant from the North Thames Regional Health Authority Research and Development Responsive Funding Programme (JH PI). There are no conflicts of interest.