Objective To examine dimensionality, reliability and validity of the Amyotrophic Lateral Sclerosis Functional Rating Scale-revised (ALSFRS-R) using traditional classical test theory methods and Rasch analysis in order to provide a rationale for possible improvement of its metric quality.
Methods Methodological research on ALSFRS-R data collected in a consecutive sample of 485 patients with amyotrophic lateral sclerosis (ALS) attending three tertiary ALS centres.
Results The ALSFRS-R items showed good internal consistency but dimensionality analysis argues against the use of ALSFRS-R as a single score because the scale lacks unidimensionality. Parallel analysis and exploratory factor analysis revealed three factors representing the following domains: (1) bulbar function; (2) fine and gross motor function; and (3) respiratory function. Rasch analysis showed that all items in each domain fitted the respective constructs to measure, except for item No 9 ‘climbing stairs’ and item No 12 ‘respiratory insufficiency’. Rating categories did not comply with the criteria for category functioning. Collapsing the scale's 5 level ratings into 3 levels improved its metric quality.
Conclusions The ALSFRS-R fails to satisfy rigorous measurement standards and should be, at least in part, revised. At present, ALSFRS-R should be considered as a profile of mean scores from three different domains (bulbar, motor and respiratory functions) more than a global total score. Further studies on ALSFRS-R using modern psychometric methods are warranted to confirm our findings and refine the metric quality of this scale, through a step by step process.
Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disorder of unknown cause, characterised by progressive impairment of motor function due to degeneration of upper and lower motor neurons. At present, the only approved therapy for ALS is riluzole.1 ,2
The negative results of many recent clinical trials in ALS have raised concern about their design. Poor knowledge of the pathogenesis, unsatisfactory animal models, weak trial designs and biases in patient selection have been claimed as possible causes.3 ,4 Among the criteria for a proper trial design, a crucial issue is the choice of surrogate markers as outcome measures.
The Amyotrophic Lateral Sclerosis Functional Rating Scale (ALSFRS)5 and its revised form (ALSFRS-R)6 are the most widely used surrogate markers of disease progression of ALS in clinical practice and research. The ALSFRS-R showed a strong correlation with disease progression and survival,7 ,8 and thus has been used as a primary or secondary outcome measure of efficacy in several therapeutic trials.1 ,2
To date, both ALSFRS and ALSFRS-R have been analysed using only classical test theory (CTT) procedures, and both have demonstrated good internal consistency, reproducibility and criterion related validity.6 ,9 ,10 However, the CTT approach does not take into account some standard criteria and attributes, concerning both single items and the total score, that must be considered when evaluating the fundamental properties of a measurement tool (eg, unidimensionality is required to place confidence in the total score; otherwise outcomes cannot be unambiguously interpreted).11 Rasch analysis (RA) is increasingly recommended in the development and evaluation of clinical tools for healthcare to verify whether they comply with the theoretical requirements of measurement, including dimensionality analysis and item level scale evaluation.12
The aim of our study was to test the internal validity of the ALSFRS-R (mainly in terms of dimensionality, rating scale functioning, and item technical quality) using both CTT and RA methods.
A sample of 485 subjects with a diagnosis of probable or definite ALS according to the El Escorial revised criteria,13 consecutively attending three tertiary ALS centres, was evaluated with the ALSFRS-R. The study was approved by the local ethics committees.
ALSFRS-R is a simple, easy to administer, disease specific scale consisting of 12 items assessing bulbar, arm, leg and respiratory function. Its score is usually based on a consensus between the patient (or caregiver, if the patient cannot communicate effectively) and clinician.14 The answer to each item is rated according to 5 categories, from 0 (complete dependence) to 4 (normal function), resulting in a total score ranging from 0 to 48. To ensure reliable data acquisition, all evaluators underwent extensive training.
We combined CTT and RA approaches to investigate the following psychometric properties of ALSFRS-R.
Internal consistency and dimensionality
The internal consistency of ALSFRS-R was assessed by means of Cronbach coefficient α and item to total correlation. Given the unclear factorial structure of responses to the ALSFRS-R, an estimate of the number of relevant factors was obtained with parallel analysis (PA).15 Subsequently, an exploratory factor analysis (EFA) for ordinal data16 with orthogonal (Varimax) and oblique (Promax) rotations on a randomly split half of the dataset (n=242) was used to study the contribution of each item to the factors identified by PA. A confirmatory factor analysis (CFA) on the second half of the dataset was used to verify the fit between the data and the model. The following goodness of fit indexes were taken into account: Tucker–Lewis Fit Index (TLI), Comparative Fit Index (CFI), root mean square error of approximation (RMSEA) and the standardised root mean square residual (SRMR). For acceptable fit, TLI and CFI should be >0.95, RMSEA <0.08 and SRMR <0.10.17 As multidimensionality was confirmed, we used the underlying factors and their relation to each item to break the scale down into subscales, the clinical meaningfulness of which was judged by expert opinion. Then, each subscale underwent RA.
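As an illustration of the two internal consistency statistics named above, the following minimal Python sketch computes Cronbach's α and an item to total correlation from a small, invented response matrix. It is not the code used in the study (which relied on dedicated statistical packages); the function names and the toy data are hypothetical, and the correlation is computed in its corrected form (each item against the sum of the remaining items), which is an assumption, since the text does not state whether the corrected or uncorrected form was used.

```python
from statistics import mean, pvariance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows is a list of respondents, each a list of k item scores."""
    k = len(rows[0])
    items = list(zip(*rows))                       # one tuple of scores per item
    sum_item_var = sum(pvariance(col) for col in items)
    total_var = pvariance([sum(r) for r in rows])  # variance of the summed score
    return k / (k - 1) * (1 - sum_item_var / total_var)


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_rest_correlations(rows):
    """Corrected item to total correlation: each item vs the sum of the other items."""
    items = list(zip(*rows))
    totals = [sum(r) for r in rows]
    return [pearson(col, [t - v for t, v in zip(totals, col)]) for col in items]


# Invented toy data: 5 respondents, 3 items rated 0-4 (not study data).
toy = [[4, 4, 3], [3, 3, 3], [2, 2, 1], [1, 0, 1], [0, 1, 0]]
alpha = cronbach_alpha(toy)
correlations = item_rest_correlations(toy)
```

Note that a high α only indicates that items covary; it does not by itself establish unidimensionality, which is why the dimensionality analyses are carried out separately.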
An introduction to RA and related concepts can be found in dedicated textbooks.18 Our analysis was performed on the entire dataset (n=485). We started with a diagnostic assessment of the ALSFRS-R rating categories to investigate whether the response levels to each item in the scale were being used effectively and consistently.19 Based on this diagnostic evaluation and following standardised procedures,19 we collapsed some adjacent categories and recoded response levels. After rating scale modifications, we performed a second series of RA on the three subscales suggested by the preliminary dimensionality analysis. The internal construct validity of each subscale was assessed by evaluating the fit of individual items to the latent trait as per the Rasch model. Infit and outfit mean square statistics for each item were calculated, considering values between 0.8 and 1.2 as an indicator of acceptable fit.18
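The infit and outfit mean squares can be sketched as follows. This is a simplified illustration for a dichotomous Rasch item; the ALSFRS-R items are polytomous and were analysed with WINSTEPS, so the code shows only the logic of the statistics, not the study's computation. Outfit is the unweighted mean of squared standardised residuals, whereas infit weights the squared residuals by the model variance. All function names and example values are hypothetical; only the 0.8 to 1.2 acceptance range comes from the text.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a positive response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def item_fit(responses, abilities, difficulty):
    """Return (infit, outfit) mean squares for one dichotomous item."""
    squared_residuals, variances, squared_z = [], [], []
    for x, theta in zip(responses, abilities):
        p = rasch_probability(theta, difficulty)  # expected response
        w = p * (1.0 - p)                         # model variance of the response
        squared_residuals.append((x - p) ** 2)
        variances.append(w)
        squared_z.append((x - p) ** 2 / w)
    outfit = sum(squared_z) / len(squared_z)          # unweighted mean square
    infit = sum(squared_residuals) / sum(variances)   # information weighted mean square
    return infit, outfit


def acceptable(mnsq, low=0.8, high=1.2):
    """Acceptance range for a mean square, as stated in the text."""
    return low <= mnsq <= high


# Hypothetical person measures (logits) and an item of difficulty 0.
abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
consistent = item_fit([0, 0, 0, 1, 1], abilities, 0.0)  # deterministic pattern
reversed_ = item_fit([1, 1, 1, 0, 0], abilities, 0.0)   # pattern contradicting the model
```

Values well above 1.2 signal more variability in the observed responses than the model predicts; values well below 0.8 signal responses that are more deterministic than expected.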
Subscale reliability was evaluated in terms of person separation reliability, an index similar to Cronbach's α estimating how well one can differentiate between different individuals’ performances on the measured variables: for the range 0–1, coefficients >0.70 are taken as evidence of sufficient reliability and coefficients >0.80 are considered good.18
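A common way to express person separation reliability is as the proportion of the observed variance of person measures that is not attributable to measurement error. The sketch below assumes that definition and uses the population variance of the measures; the exact estimator implemented in the software used for the study may differ in detail. Function names and input values are hypothetical, whereas the 0.70 and 0.80 cut-offs are those given in the text.

```python
from statistics import mean, pvariance


def person_separation_reliability(measures, standard_errors):
    """(observed variance - mean error variance) / observed variance."""
    observed_var = pvariance(measures)                   # variance of person measures
    error_var = mean(se ** 2 for se in standard_errors)  # mean squared standard error
    true_var = max(observed_var - error_var, 0.0)        # cannot be negative
    return true_var / observed_var


def interpret(reliability):
    """Label a coefficient using the cut-offs stated in the text."""
    if reliability > 0.80:
        return "good"
    if reliability > 0.70:
        return "sufficient"
    return "below the 0.70 criterion"


# Hypothetical person measures (logits) with a constant standard error of 0.5.
example = person_separation_reliability([-2, -1, 0, 1, 2], [0.5] * 5)
```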
A principal component analysis (PCA) on the standardised residuals was used to investigate the local independence of items and the presence of subdimensions as an assessment of the unidimensionality of the scale. The following criteria were used to confirm unidimensionality: (a) a cut-off of 50% of the variance explained by the trait that the scale intended to measure (the ‘Rasch factor’); and (b) eigenvalue of the first residual factor smaller than 3.20
In addition, we performed a differential item functioning (DIF) analysis on each subscale to examine the stability of item hierarchy across the following subsamples: men versus women; younger age (≤60 years) versus older age (>60 years); and disease duration (≤2 years vs >2 years). DIF was investigated using an item by item t test for difference in mean measures between the two subgroups (two sided, 1% α). Further technical aspects of our statistical analyses can be found elsewhere.21
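The logic of an item by item DIF comparison can be sketched as follows, assuming that each item has a difficulty estimate and a standard error in each of the two subgroups and that the contrast is divided by the joint standard error. This is a hedged illustration rather than the procedure implemented in the software used here; the function names are hypothetical, and the critical value of 2.576 is the large sample normal approximation to a two sided 1% test, which is an assumption.

```python
import math


def dif_contrast(difficulty_a, se_a, difficulty_b, se_b):
    """Return (contrast in logits, t statistic) for one item in two subgroups."""
    contrast = difficulty_a - difficulty_b
    joint_se = math.sqrt(se_a ** 2 + se_b ** 2)  # standard error of the difference
    return contrast, contrast / joint_se


def flags_dif(t, critical=2.576):
    """Two sided test at the 1% level (normal approximation, assumed)."""
    return abs(t) > critical


# Hypothetical estimates for one item in two subgroups (eg, men vs women).
contrast, t = dif_contrast(0.5, 0.3, 0.1, 0.4)
```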
STATA V.10.1 (StataCorp LP, College Station, Texas, USA) was used to perform PA, Lisrel 8.80 (Scientific Software International Inc, Lincolnwood, Illinois, USA) for CFA and EFA, and WINSTEPS V.3.68.2 (Winsteps.com: Chicago; 2009) for RA.
Demographic and clinical characteristics of patients are shown in table 1. Cronbach's α was 0.88 for ALSFRS-R. Items showed an item to total correlation between 0.45 (item No 2 ‘salivation’) and 0.81 (item No 7 ‘turning in bed’). PA revealed three factors with empirical eigenvalues exceeding those from the random data. These three factors explained 50.4%, 18.1% and 11% of the variance, reaching a cumulative 79.5%. As suggested by PA, we performed an EFA for a three factor solution to investigate the contribution of each item to the scale. The results are presented in table 2; orthogonal and oblique rotations produced very similar results, suggesting the adequacy of the orthogonal solution. These results showed three factors that clearly represent the following domains: (1) bulbar function (item Nos 1–3); (2) fine and gross motor function (item Nos 4–9); and (3) respiratory function (item Nos 10–12). A CFA on this three factor model showed a good fit (TLI, CFI, RMSEA and SRMR were 0.97, 0.98, 0.034 and 0.040, respectively), thus confirming the multidimensionality of ALSFRS-R.
As for RA, rating scale diagnostics showed that response levels of each item (score 0–4) did not comply with the pre-set criteria for category functioning (average measures, thresholds, etc.). Accordingly, the number of levels was revised, adopting a solution able to maximise both statistical performance and clinical meaningfulness. For the first 11 items, the 5 original response levels were reduced to 3, always collapsing level ‘0’ with ‘1’, and level ‘2’ with ‘3’. For item No 12 ‘respiratory insufficiency’, the best solution was obtained collapsing the three central levels, thus obtaining the following three response options: 2=no respiratory insufficiency; 1 = use of BiPAP; 0 = invasive mechanical ventilation. By way of example, a typical graphical presentation of these results is shown in figure 1. In figure 1A, the graph shows that the probability of using response levels ‘1’ and ‘2’ in that item is never modal (ie, higher than that of the other levels). In figure 1B (after combining original level ‘0’ with ‘1’, and ‘2’ with ‘3’), the probability of selecting each of the three revised response levels (score 0–2) is a clear function of performance (patient functional ability minus item difficulty) shown on the x axis. The ‘thresholds’ correspond to the intersections (ie, the probabilistic midpoint) between two adjacent response curves. Whether the responses to the items are consistent with the metric estimate of the underlying construct is indicated by the ordered set of ‘thresholds’ for each item.
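The recoding described above can be written as two lookup tables, one for item Nos 1–11 and one for item No 12. The following Python sketch is only an illustration of that mapping; the names of the tables and of the function are hypothetical.

```python
# Items 1-11: level 0 merged with 1, level 2 merged with 3, level 4 kept as the top level.
STANDARD_RECODE = {0: 0, 1: 0, 2: 1, 3: 1, 4: 2}

# Item 12: the three central levels (1, 2, 3) merged into a single level.
ITEM12_RECODE = {0: 0, 1: 1, 2: 1, 3: 1, 4: 2}


def collapse(responses):
    """Recode 12 original ratings (0-4, item No 1 first) to the 3 level format (0-2)."""
    if len(responses) != 12:
        raise ValueError("expected 12 item ratings")
    recoded = [STANDARD_RECODE[r] for r in responses[:11]]
    recoded.append(ITEM12_RECODE[responses[11]])
    return recoded
```

For example, a patient rated 1 on every item is recoded to 0 on item Nos 1–11 but to 1 on item No 12, because only for that item is the original level 1 merged upwards with the central levels.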
Applying this collapsing procedure, the RA showed that all items included in each of the three subscales (bulbar function: item Nos 1–3; motor function: item Nos 4–9; and respiratory function: item Nos 10–12) fitted the respective constructs to measure, except for item No 9 ‘climbing stairs’ (infit Mnsq=1.64; outfit Mnsq=1.53) and item No 12 ‘respiratory insufficiency’ (outfit Mnsq=1.56), which showed an unexpectedly high variability in the observed data compared with the Rasch model prediction.
The other main results of RA for each subscale are shown in figure 2 and table 3: the distribution of subject functional ability and item difficulty, reliability indices and results regarding the three PCAs of the standardised residuals (analysing the variance explained by the Rasch factor and the first residual factor). The three subscales demonstrated different levels of sample item matching (the best was for bulbar function, the worst for respiratory function). Subject ability span was more than 10 logits in each subscale, whereas item difficulty span was more limited, ranging from 1.89 logits (bulbar function) to 3.84 logits (respiratory function). Item separation reliability was high in the three subscales, while the person separation reliability was sufficient or good in the bulbar and motor subscales, and borderline (0.69) in the respiratory subscale. No PCA of the standardised residuals presented residual correlations >0.30, thus confirming the local independence of the items in each subscale. DIF analysis showed no difference in responses due to gender, age or disease duration.
In clinical trials, we measure constructs (ie, ‘latent’ variables, such as functioning), perform statistical tests on the scales’ raw scores and draw conclusions. The appropriateness of these conclusions strongly depends on the metric quality of the selected measures, and it has a crucial influence on patient care, drug efficacy and health policies.
Unambiguous interpretation requires that a score represent a single attribute (dimension). Otherwise, one could not be sure if two individuals with the same score are, in fact, comparable. This problem hampers understanding of clinical trial outcomes, which in turn has consequences for selecting interventions for individual patients.22
Our main finding is that the ALSFRS-R presents a series of drawbacks that compromise its metric quality.
The ALSFRS-R items showed good internal consistency according to CTT, with a Cronbach's α even higher than in the original paper,4 but our dimensionality analysis argues against the validity of summing the ALSFRS-R items into a single score. Our data clearly indicate the presence of three different domains (bulbar, motor and respiratory function): in each domain an aspect of functional status can be independently assessed but domain scores cannot be simply summed to obtain an overall functional status measure in ALS. These three functions are clinically meaningful and correctly represent the underlying structure of the domains investigated by ALSFRS-R.4 ,23 ,24
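The practical consequence can be illustrated with a short sketch that reports a profile of domain mean scores instead of a single total, using the item allocation found in the factor analysis (bulbar: item Nos 1–3; motor: item Nos 4–9; respiratory: item Nos 10–12). The function name and the two example patients are invented for illustration and are not study data.

```python
# Item numbers (1-based) per domain, as identified by the factor analysis.
DOMAINS = {
    "bulbar": range(1, 4),
    "motor": range(4, 10),
    "respiratory": range(10, 13),
}


def domain_profile(responses):
    """Mean item score per domain from 12 ratings (item No 1 first)."""
    if len(responses) != 12:
        raise ValueError("expected 12 item ratings")
    return {
        name: sum(responses[i - 1] for i in items) / len(items)
        for name, items in DOMAINS.items()
    }


# Two invented patients with the same total score but different impairments.
patient_a = [0, 0, 0] + [4] * 6 + [4] * 3   # severe bulbar involvement only
patient_b = [4, 4, 4] + [2] * 6 + [4] * 3   # moderate motor involvement only
profile_a = domain_profile(patient_a)
profile_b = domain_profile(patient_b)
```

Both invented patients have a total score of 36, yet their profiles differ markedly, which is the ambiguity that a single summed score conceals.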
RA showed that some rating categories of ALSFRS-R did not comply with the set criteria for category functioning. This may be due to rater difficulty in discerning among the five levels of functional ability. As an example, in item No 7 (‘turning in bed and adjusting bed clothes’) the wording of categories ‘0’ (helpless) and ‘1’(can initiate but not turn or adjust sheets alone) does not allow a clearly distinct ranking of functional ability and could introduce error variance rather than metric information into the ratings. The same can be said, in the same item, for the categories ‘2’ (can turn alone or adjust sheets, but with great difficulty) and ‘3’ (somewhat slow and clumsy, but no help needed).
Collapsing the scoring options into a 3 level rating for all items improved the measurement quality of the scale, providing a simpler and more distinct idea of the level of functioning represented by each rating level, without loss of measurement information.18 These findings show that there is space for a refinement of ALSFRS-R by item rewording and/or reduction of option number.
After collapsing the categories, fit statistics showed two misfits: item Nos 9 and 12. The misfit of item 9 (‘climbing stairs’) is in line with the clinical observation that different environmental factors (eg, home architecture) and personal attitudes can produce high variability in this response, which is not expected under the Rasch model. The high outfit value of item 12 ‘respiratory insufficiency’ is due to the presence of subjects without dyspnoea and orthopnoea (less difficult items) but on permanent ventilation (most difficult item); although the finding is clinically understandable, it demonstrates an additional serious bias of the scale.
Thus our results suggest a rethinking of item ‘climbing stairs’ and a clarification of the conceptual framework and measurement strategy of the whole subscale ‘respiratory function’, including the provision of detailed guidelines for its compilation. More generally, there is a lack of standardised methodology for ALSFRS-R administration25 and no formal interview instructions.
Concern about the metric properties of ALSFRS-R has already been expressed.1 Our findings confirm what clinicians know: the interpretation of a total raw score of ALSFRS-R is hampered by ambiguities due to its different metric meanings for the different ALS forms. This problem is likely to be complicated by the presence in ALSFRS-R of a typical phenomenon of ordinal summed rating scales: the relationship between total raw scores and linear Rasch transformed measures of global function is not linear but ogival.26 As patients approach the bottom of the scale, each raw score point represents an increasing metric distance, yet it appears that patients are ‘slowing down’ in their worsening because it becomes increasingly difficult for them to lose further raw score points. This finding would imply a reduced sensitivity of the raw scores to a change occurring in high and low functioning ALS subjects,27 ,28 and it also underlines the complex relationship between progression of disease and modification of ALSFRS-R raw scores.1 ,29
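The ogival relationship can be made concrete with a small numerical sketch. Under simplifying assumptions (12 dichotomous Rasch items with invented, evenly spaced difficulties, rather than the polytomous ALSFRS-R items), the code below inverts the expected raw score curve and compares the logit distance covered by one raw score point near the bottom of the scale with that covered by one point in the middle. All names and values are hypothetical.

```python
import math


def expected_raw_score(ability, difficulties):
    """Expected summed score under the dichotomous Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(ability - b))) for b in difficulties)


def ability_for_raw_score(raw, difficulties, lo=-10.0, hi=10.0):
    """Invert the (strictly increasing) expected score curve by bisection."""
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_raw_score(mid, difficulties) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Invented difficulties: 12 items evenly spaced from -2 to +2 logits.
difficulties = [-2.0 + 4.0 * i / 11 for i in range(12)]

# Logit measure for each non-extreme raw score (0 and 12 have no finite measure).
measures = {raw: ability_for_raw_score(raw, difficulties) for raw in range(1, 12)}

gap_bottom = measures[2] - measures[1]   # one raw point near the floor
gap_middle = measures[7] - measures[6]   # one raw point at mid scale
```

In this simplified setting the logit distance between raw scores 1 and 2 is larger than that between raw scores 6 and 7, so equal raw score changes do not correspond to equal changes on the linear measure.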
Care should be taken in interpreting our data. First, our non-probability sample might compromise the study's external validity. Nevertheless, our sample was a large cross section of patients with a broad spectrum of disease severity, drawn from three different tertiary ALS clinics. Second, we cannot exclude the possibility that some specific (linguistic, cultural or technical) characteristics of the Italian version of the ALSFRS-R could have influenced our results, although our version was checked using a thorough procedure of ‘forward/backward translation’ followed by pilot testing and expert revision, without any semantic difficulties being found.
Our findings suggest that ALSFRS-R fails to satisfy rigorous measurement standards and should be, at least in part, revised. Valid inferences on the efficacy of treatment trials require high quality outcome measures. At present, we believe that ALSFRS-R should be considered as a profile of mean scores from three different domains (bulbar, motor and respiratory functions) more than a global total score. Further studies on ALSFRS-R using modern psychometric methods are warranted to confirm our findings and refine the metric quality of this scale, through a step by step process.
We thank the patients and their families for their collaboration in this study.
Contributors Study concept and design: FF, GM, AG, PV and AC. Acquisition of the data: GM, PV and AC. Analysis and interpretation of the data: FF, GM, AG, PV and AC. Drafting of the manuscript: FF, GM, AG, PV and AC. Critical revision of the manuscript for important intellectual content: FF, GM, PV and AC. Obtained funding: GM and AC. Administrative, technical and material support: FF, GM, AG, PV and AC. Study supervision: FF, GM, PV and AC. AC had full access to all of the data in the study and takes responsibility for the integrity of the data and the accuracy of the data analysis. All authors have approved the submitted version of the paper.
Funding This work was funded in part by Ministero della Salute (Ricerca Sanitaria Finalizzata, RF-MAU-2007-643050) and Centro Nazionale per la Prevenzione e il Controllo delle Malattie (grant 31, 2009). The research leading to these results has received funding from the European Community's Health Seventh Framework Programme (FP7/2007–2013) (grant agreements Nos 259867 and 278611).
Competing interests FF has received research support from the Italian Ministry of Health (Ricerca Finalizzata). GM has received research support from the Italian Ministry of Health (Ricerca Finalizzata). AC serves on the editorial advisory board of Amyotrophic Lateral Sclerosis and has received research support from the Italian Ministry of Health (Ricerca Finalizzata), Regione Piemonte (Ricerca Finalizzata), University of Torino, Federazione Italiana Giuoco Calcio, Fondazione Vialli e Mauro onlus and European Commission (Health Seventh Framework Programme); he serves on a scientific advisory board for Biogen Idec and Cytokinetics.
Ethics approval The study was approved by the local ethics committees of the three clinical centres involved.
Provenance and peer review Not commissioned; externally peer reviewed.