Background Evaluating the long term benefit of therapy in multiple sclerosis (MS) is challenging. Although randomised controlled trials (RCTs) demonstrate therapeutic benefits on short term outcomes, the relationship between these outcomes and late disability is not established.
Methods In a patient cohort from the pivotal interferon β-1b trial, the value of clinical and MRI measures, both at baseline and during the RCT, was analysed for predicting long term physical and cognitive outcome.
Results Baseline disability correlated with both physical (R2=0.22; p<0.0001) and cognitive (R2=0.12; p<0.0001) outcome after 16 years. Accrual of disability during the RCT (R2=0.12; p<0.0001) and annualised relapse rates during the trial correlated with physical outcome (R2=0.12; p<0.0001) but not with cognition. In contrast, baseline MRI measures of atrophy and lesion burden correlated with cognitive (R2=0.21; p<0.0001), but not with physical, outcome. Accumulation of plaque burden measured by MRI did not correlate with late physical disability or with cognitive outcome. Multivariate regression analysis using stepwise elimination demonstrated that baseline variables contributed independently to predicting long term outcomes while trial outcome variables contributed little. Overall, and considerably dependent on baseline measures, the models developed by this method accounted for approximately half of the variance in long term cognitive and disability outcome.
Conclusions Although on-trial change in some short term clinical measures correlated with long term physical and disability outcomes, the proportion of the variance explained by single commonly employed on-study variables was often small or undetectable. Better correlations were observed for several baseline measures, suggesting that long term outcome in MS may be largely determined early in the disease course.
The efficacy of disease modifying therapies in multiple sclerosis (MS) has generally been evaluated by monitoring selected clinical and paraclinical outcomes in relatively short (1–3 years) randomised controlled clinical trials (RCTs). However, despite the widespread adoption of both MRI and clinical markers for use in clinical trials, the relationship between these short term outcomes and longer term outcomes is unclear. Demonstration of the value of short term measures for predicting long term outcome in MS, in particular disability, would help in projecting longer term impacts of therapy on social, economic and medical costs of the untreated versus treated disease.
Formal validation of surrogate outcomes serving as putative predictors in clinical trials involves more than simply establishing correlations between candidate surrogates and trial outcomes.1–5 Nevertheless, demonstration of a correlation between short term outcomes in an RCT and long term outcomes is a crucial starting point in the development of a surrogate for hard long term outcomes, which have unassailable clinical significance.
The first RCT of interferon β-1b (IFNβ-1b) in MS was begun more than two decades ago.6–8 This trial randomised 372 patients to three different treatment arms (IFNβ-1b 250 μg, IFNβ-1b 50 μg and placebo). Unequivocal treatment benefit for the higher dose arm was seen at 2 and 3 years for several short term clinical outcomes, including relapse rate, relapse free interval, time to first relapse and categorical change (ie, change of ≥1 point at the end of the study) on the Kurtzke Expanded Disability Status Scale (EDSS).9 Patients in the higher of the two dose arms also demonstrated benefits on MRI measures of T2 disease burden and new active T2 lesions.6–8 Nevertheless, because few patients reached hard disability outcomes such as EDSS ≥6 by trial completion, the study could not suitably address questions about the effects on unremitting long term disability in MS. Patients participating in the original RCT have been followed since its conclusion, and after more than 16 years from RCT onset and almost 6000 patient-years of follow-up, we can now address important questions about the relationships between the short term clinical and MRI measures used in the RCT and long term disability outcome.
Here we assess the predictive validity10 of several clinical and MRI measures from the pivotal IFNβ-1b RCT for change in physical and cognitive outcomes at the 16 year follow-up. The effect of therapy on long term outcomes will be reviewed elsewhere.11
The design and methods of the original RCT and the 16 year follow-up study have been described in detail previously.6–8 12 13 These other papers were descriptive in nature and did not explore the predictive validity of the different clinical trial outcomes. Briefly, patients participating in the original IFNβ-1b pivotal trial were re-contacted in 2005 (approximately 12 years after completion of the pivotal trial) and asked to participate in the follow-up study.12 Of the 372 patients in the original phase III study, 328 (88.2%) were identified. Among these, 293 were still alive and 260 (70% of the original cohort) consented to detailed assessment of their interim disease course (medical record review and personal interview) and current physical, imaging and cognitive assessments. Physical disability was measured both by the EDSS and the MS Severity Score (MSSS) at the start of the RCT and by EDSS during the RCT and also during the long term follow-up (LTF) period. Cognitive outcome was assessed at LTF by a battery of five neuropsychological tests, consisting of the Paced Auditory Serial Addition Task (PASAT), the Symbol Digit Modality Task (SDMT), the California Verbal Learning Test II (CVLT-II), the Controlled Oral Word Association Task (COWAT) and the Delis–Kaplan Executive Function System (D-KEFS) test. In addition, the Wechsler Test of Adult Reading was used to estimate the premorbid IQ.14
Patients who agreed to participate were assessed over the course of 1–3 clinic visits. When unable to participate in person, patients were offered a home visit by study investigators for their assessment. A comparison of baseline RCT data between those patients who did and those who did not participate in the LTF study showed that the two groups were very similar for all baseline measures and for on-trial behaviour (table 1). As a result, our sample is likely to be representative of the entire RCT population. Ethics approval for the follow-up study was obtained from the institutional review boards or independent ethics committees of the participating centres. All subjects gave written informed consent.
Of relevance to any long term follow-up study of this type, it is important to recognise that the ability to contact patients (at least in the USA) is severely restricted by the Health Insurance Portability and Accountability Act (HIPAA) regulations. Specifically, the HIPAA regulations do not permit any further patient contact after a patient has failed to respond to two written letters requesting their participation. We cannot simply call them, even when we know their address and phone number.
Several clinical and MRI variables were determined, both at the start and over the course of the pivotal trial.6–8 On-RCT variables were defined as those assessed during the first 2 or 3 years of the study. Because the 2 and 3 year analyses led to essentially identical conclusions, only data from the 2 year analysis are presented, as this data set contained a more complete ascertainment of patients enrolled into the RCT. The 3 year data set is less complete because, after completion of the original 2 year protocol, the RCT was extended for an additional year at the request of the US Food and Drug Administration, but several patients (having fulfilled their commitment to take part in a 2 year study) withdrew their consent (see figure 1).
Candidate predictor variables evaluated for their relationship to long term outcome categorically included relapse related, disability, MRI and other variables (table 2). Relapse related variables included pre-trial and on-trial relapse rates. Pre-RCT attacks were determined historically by patient interview and from medical records whereas on-RCT relapses were determined by study investigators every 3 months during the pivotal trial. Disability variables included baseline EDSS, baseline MSSS, a 1 point change in EDSS sustained for 3 months, a categorical change (≥1 point) on the EDSS scale measured from baseline to trial end and the measured EDSS change over the course of the RCT. MRI variables consisted of baseline T2 burden of disease (BOD), defined as the volume (measured in cm2 per slice) of hyperintense lesions seen on T2 weighted images, on-RCT change in the T2 BOD, third ventricular width (measured in mm) and numbers of new and newly enlarging lesions seen on annual T2 weighted images during the RCT. Other variables consisted of duration of disease, age, gender, treatment history, randomisation group in the pivotal trial and the development of neutralising antibodies (NAbs) measured in neutralising units/ml (NU/ml). NAbs were considered present when found at titres ≥20 NU/ml on two or more consecutive occasions during the RCT.
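The NAb positivity rule stated above (titres ≥20 NU/ml on two or more consecutive occasions) can be illustrated with a minimal Python sketch. The function name and the representation of a patient's titres as an ordered list of numbers are assumptions made for illustration only; they are not taken from the trial's analysis code.

```python
from typing import Sequence

NAB_TITRE_THRESHOLD = 20.0  # NU/ml, the threshold stated in the text


def nab_positive(titres: Sequence[float],
                 threshold: float = NAB_TITRE_THRESHOLD) -> bool:
    """Return True if two or more consecutive titres reach the threshold.

    `titres` is assumed to hold one patient's measurements in
    chronological order (a hypothetical data layout).
    """
    run = 0  # length of the current run of consecutive positive titres
    for titre in titres:
        run = run + 1 if titre >= threshold else 0
        if run >= 2:
            return True
    return False
```

Under this rule a patient with isolated positive titres separated by a negative one would not be classed as NAb positive.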
Because of the possibility of a relationship between baseline variables and ultimate outcome, we included all of these variables as candidates for the model. In our analysis, in addition to these baseline variables, we also included the variables of treatment during the pivotal study, total exposure to IFNβ-1b and also changes (during the RCT) in third ventricular width, BOD, T2 activity, EDSS and relapse rate. No variable was forced into the model. In this way, we have addressed all baseline as well as on-trial predictors of importance over the first 2 years under randomised treatment allocation.
Long term outcome
For analysis of LTF physical outcomes, a dichotomous, composite, ‘negative’ measure was used. A ‘negative’ physical outcome was defined when a patient either converted to secondary progressive (SP) MS or reached an EDSS ≥6.0. These outcomes were chosen because they were clinically important and considered unlikely to remit once sustained. SPMS was defined prospectively as a progressive increase in disability (following a relapsing–remitting course), evolving over ≥12 months and not relapse associated. In addition, SPMS patients had to experience an increase of ≥1 point on the EDSS scale over the previous 2 years (or a 0.5 point increase from an EDSS score of 6.0 or 6.5) with or without superimposed exacerbations. To reach EDSS ≥6.0, this level of disability had to be confirmed by two consecutive evaluations (at least 3 months apart) and sustained for the remaining follow-up period. Secondary analyses explored EDSS and SPMS as individual (not composite) physical outcomes at LTF evaluation.
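As a rough illustration of the composite definition above, the following Python sketch flags a 'negative' physical outcome. It is a simplification under stated assumptions: visits are given as (day, EDSS) pairs sorted by day, "3 months" is approximated as 90 days, and confirmation is checked over the final unbroken run of evaluations at EDSS ≥6.0. SPMS conversion is taken as an input flag because its clinical definition cannot be reduced to a simple rule here. All names are hypothetical.

```python
from typing import List, Tuple

CONFIRMATION_DAYS = 90  # assumption: "3 months" approximated as 90 days
EDSS_THRESHOLD = 6.0


def reached_sustained_edss6(visits: List[Tuple[int, float]]) -> bool:
    """visits: (day, EDSS) pairs sorted by day (hypothetical layout).

    True when the final unbroken run of evaluations at EDSS >= 6.0
    contains at least two evaluations spanning >= CONFIRMATION_DAYS,
    so that the level is both confirmed and sustained to the end
    of follow-up.
    """
    start = len(visits)
    # Walk backwards to the start of the final run at or above threshold.
    while start > 0 and visits[start - 1][1] >= EDSS_THRESHOLD:
        start -= 1
    run = visits[start:]
    if len(run) < 2:
        return False
    return run[-1][0] - run[0][0] >= CONFIRMATION_DAYS


def negative_physical_outcome(converted_to_spms: bool,
                              visits: List[Tuple[int, float]]) -> bool:
    """Composite outcome: SPMS conversion or sustained EDSS >= 6.0."""
    return converted_to_spms or reached_sustained_edss6(visits)
```

A patient who touches EDSS 6.0 once and then improves is not counted, which reflects the requirement that the disability level be sustained.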
For analysis of LTF cognitive outcomes, a continuous measure was used. This was the so-called ‘Cognitive Performance Index’ which represented the sum of the patient's z scores on the PASAT, SDMT, CVLT-II, COWAT and D-KEFS tests.
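The summed z score can be sketched as follows. The text does not state which normative means and standard deviations were used to compute the z scores, so the `norms` argument below is a placeholder supplied by the caller, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# The five tests named in the text as components of the index.
TESTS = ("PASAT", "SDMT", "CVLT-II", "COWAT", "D-KEFS")


def cognitive_performance_index(raw: Dict[str, float],
                                norms: Dict[str, Tuple[float, float]]) -> float:
    """Sum of z scores over the five tests.

    raw   : raw score per test for one patient
    norms : (reference mean, reference SD) per test; the reference
            values are an assumption, not taken from the study
    """
    total = 0.0
    for test in TESTS:
        mean, sd = norms[test]
        total += (raw[test] - mean) / sd
    return total
```

For example, a patient scoring exactly one standard deviation below the reference mean on every test would have an index of -5.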
Relationships between candidate predictors and hard physical outcomes at LTF (eg, EDSS ≥6 or SPMS, as well as the composite ‘negative’ physical outcome of both outcomes) were explored using logistic regression modelling. We also looked at models including death as a ‘negative’ outcome. However, because this did not change any of the results and because we only had permission to review the records for seven of the 35 deceased patients,12 we felt it was preferable to exclude this outcome from our composite measure. Relationships with the continuous ‘Cognitive Performance Index’ were explored with linear regression models. Two methods of analysis were undertaken. In the first, we explored ‘univariate’ relationships, in which regression analyses were run with each candidate predictor considered individually. In the second, we developed multivariate regression models using stepwise elimination procedures for model selection, in which all candidate variables were allowed to enter if their coefficient had a significance of p<0.5. A candidate predictor was eliminated from the model as soon as it failed to contribute to the overall R2 for the model at a significance level of p<0.10, with the p values derived from t tests (linear regression) or Wald χ2 tests (logistic regression).
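The two analysis steps described above can be sketched in Python. This is an illustration only, not the authors' statistical code: the first function computes R2 for a single predictor in an ordinary least squares linear model (the logistic case is not reproduced), and the second shows only the elimination loop with the stated p<0.10 retention threshold. The model fitting itself is delegated to a caller-supplied function, `fit_p_values`, which is hypothetical, and the p<0.5 entry criterion is not modelled.

```python
from typing import Callable, Dict, List, Sequence


def univariate_r_squared(x: Sequence[float], y: Sequence[float]) -> float:
    """R2 of a simple linear regression of y on a single predictor x."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each vary")
    # With one predictor, R2 equals the squared Pearson correlation.
    return (sxy * sxy) / (sxx * syy)


def backward_eliminate(candidates: List[str],
                       fit_p_values: Callable[[List[str]], Dict[str, float]],
                       stay_alpha: float = 0.10) -> List[str]:
    """Drop the least significant variable until all have p < stay_alpha.

    fit_p_values(names) must refit the model on `names` and return a
    p value for each (for example from t tests or Wald chi-square tests).
    """
    kept = list(candidates)
    while kept:
        p_values = fit_p_values(kept)
        worst = max(kept, key=lambda name: p_values[name])
        if p_values[worst] < stay_alpha:
            break  # every remaining variable meets the retention threshold
        kept.remove(worst)
    return kept
```

Separating the loop from the fitting routine keeps the sketch independent of any particular statistics library.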
Clinical and MRI disability outcomes during the RCT and at LTF
The mean EDSS score for the entire group, at baseline, was 2.89. By the end of 2 years the mean EDSS had increased by 0.05 points (SD 1.33) and by the LTF it had increased by 2.28 points (SD 2.04). Thus the mean EDSS at the LTF was 5.17 (SD 2.43). The EDSS was available in all 260 patients. At the 2 year point, 21.2% of patients (55/260) had sustained a 1 point confirmed EDSS change from their baseline. At the LTF, 43.5% (113/260) of patients had reached an EDSS of 6.0, 40.0% (104/260) had reached SPMS and 53.8% (140/260) had reached either negative outcome. Cognitive assessment at the LTF was completed in 58.5% (152/260) of the patients and the Cognitive Performance Index had a mean summed z score of –4.52 (SD 4.22). At baseline, the third ventricular width was 4.86 mm (SD 2.28), and by year 2 this had increased by 0.644 mm (SD 0.972). BOD at baseline was 1.96 cm2 (SD 2.02) and by 2 years this had increased by 0.13 cm2 (SD 0.61).
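The percentages and the mean EDSS at LTF quoted above follow directly from the reported counts, as the short Python check below illustrates. The counts and percentages are copied from the text; the variable and function names are illustrative.

```python
N_FOLLOWED = 260  # patients with EDSS available at LTF


def pct(count: int, total: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


# label: (count out of 260, percentage as stated in the text)
reported = {
    "sustained 1 point EDSS change at 2 years": (55, 21.2),
    "EDSS of 6.0 reached at LTF": (113, 43.5),
    "SPMS reached at LTF": (104, 40.0),
    "either negative outcome at LTF": (140, 53.8),
    "cognitive assessment completed": (152, 58.5),
}


def mismatches() -> list:
    """Labels whose recomputed percentage differs from the stated one."""
    return [label for label, (count, stated) in reported.items()
            if pct(count, N_FOLLOWED) != stated]


# Baseline mean EDSS plus the mean increase by LTF.
mean_edss_ltf = round(2.89 + 2.28, 2)
```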
Univariate and multivariate analyses
The exploratory univariate analyses for the relationship of candidate predictors with respect to physical and cognitive function 16 years later are shown in table 2. In these univariate explorations (table 2), several baseline and on-RCT variables (but not others) were significantly correlated with long term disability outcome (either physical or cognitive). Baseline disability correlated significantly with both physical (R2=0.22; p<0.0001) and cognitive (R2=0.12; p<0.0001) outcome after 16 years. Accrual of disability during the RCT (R2=0.11; p<0.0001) and annualised relapse rates during the trial (R2=0.12; p<0.0001) correlated significantly with physical outcome but not with cognition. In contrast, baseline measures of third ventricular width (R2=0.21; p<0.0001), MRI lesion burden (R2=0.21; p<0.0001) and premorbid IQ (R2=0.14; p<0.0001) were correlated with cognitive, but not with physical, outcome. Notably, with the exception of the measure of third ventricular width, a change in MRI over the course of the trial did not correlate with late disability, either cognitive or physical. The actual change in EDSS over the course of the trial was a superior predictor of physical outcome compared with more commonly used measures such as sustained or categorical 1 point EDSS change. Moreover, neither the sustained nor the categorical 1 point EDSS change remained in the multivariate model. These disability measures, however, were all poor predictors of cognitive outcome in the univariate analysis (table 2). Finally, the occurrence of NAbs during the RCT had no relationship to outcome (table 2).
In the principal multivariate analysis, the contribution of each potential predictor variable was tested using a stepwise elimination procedure to estimate a final model for predicting both physical and cognitive outcome (table 3). The most significant predictor of both physical and cognitive outcome after 16 years was baseline EDSS (table 3). Similarly, in the final regression model, the change in EDSS score over the first 2 years of the RCT was an independent predictor of cognitive and (especially) physical outcome. In contrast, MRI measures such as T2 BOD (at baseline) and third ventricular width (both at baseline and change during the RCT) contributed largely (or only) to cognitive outcome. Annualised relapse rates during the RCT contributed only to predicting physical outcome (table 3).
In both multivariate models, explained variance was approximately half of the total variance in long term outcome (table 3) but more so from baseline measures than from on-study surrogates. The amount of the variance explained by any single variable was generally quite small (table 2).
In many chronic disabling diseases it is difficult to establish long term efficacy for any specific therapy and MS is no exception. The protracted observation times necessary for patients to reach hard disability outcomes contrast with the relatively short term formal RCT trial periods that have been successfully executed. RCT designs have not allowed for sufficient time to reach these outcomes. Furthermore, once a drug has been shown to improve patient outcome on measures thought by even a substantial minority to be clinically relevant to the disease process, impediments to continuation arise. Patients may not agree to prolonged exposure to placebo and many clinicians will likely discourage it.15 16 For these reasons, establishing long term efficacy at present rests on the analysis of non-randomised longitudinal data with best available adjustments for the many biases that impact such studies.11
In addition to such alternative analysis strategies, however, it is also imperative to establish that these short term outcome measures correlate with (and predict) long term outcome as an essential first step towards establishing surrogacy. The present study is the first to evaluate the predictive value of short term outcome measures used in MS clinical trials for hard disability endpoints. Much of the predictive power of the final regression models (table 3) came from single baseline measures rather than from on-study changes. Observations of simple correlations between short term measures and long term outcomes fall well short of proving that these measures are true surrogates for long term efficacy.1 Although the short term measures we explored (clinical attacks, disability and MRI lesions) are generally believed to be reflections of the pathological processes (ie, episodic inflammation, demyelination and axonal injury) which underlie permanent disability in MS,17 none (individually) was strongly associated with disability or cognitive outcome, and some widely used and previously influential trial outcomes were completely disassociated.
A therapy might either alter disease course without affecting all of these processes or alter them without affecting outcome. For example, a neuroprotective agent could limit axonal injury or promote oligodendrocyte survival without affecting inflammation. Similarly, an immune suppressant might impact destructive inflammation to a lesser extent than reparative inflammation and thus lead to less (but more disabling) inflammation. Finally, it is also possible that because of either functional redundancy or plasticity within the CNS, the correlation between a particular short term measure and long term outcome may be weak or non-existent.18 Nevertheless, even though true surrogate markers have not been established, any therapy that successfully interrupts one or another of these basic pathogenic mechanisms, which are correlated with long term physical and cognitive outcome in MS, has potential for limiting long term disability. The findings should stimulate efforts to find better surrogate markers.
This study was necessarily restricted to those who agreed to current physical and cognitive assessments and review of their interim disease course (n=260; 70% of the original cohort). The baseline characteristics of those who participated and those who did not are detailed elsewhere.11 In brief, no significant differences were observed between these groups for any clinical and outcome related features. Notably, the percentage of women (69% vs 71%), age of onset (27.3 vs 27.7 years), duration of disease (8.0 vs 8.1 years), entry EDSS (2.9 vs 2.9) and mean on-trial change in EDSS (0.0 vs 0.3) were nearly identical.
Outcomes focused on physical signs arising from CNS inflammation (ie, clinically evident relapses and less clearly disability progression on the EDSS scale) seem to be better predictors of physical rather than cognitive outcome. In contrast, outcomes thought to better measure clinically silent pathology within the CNS (ie, T2 lesions, BOD and atrophy) were better predictors of cognitive than physical outcome but were weak nevertheless. Such disassociation has been suggested previously, based on the belief that spinal cord pathology has a greater impact on physical function whereas intracerebral pathology is more likely to impact cognitive function.19 This suggests that more attention should be paid in future studies to this differential impact.
Although on-RCT behaviour for some outcomes correlated with long term outcome (tables 2 and 3), the strongest associations were actually with simple and single baseline functions as measured by the EDSS (for physical outcome) or the BOD and third ventricular width (for cognitive outcome). This result is consistent with several other reports in the literature.20–23 Thus these baseline measures effectively provide a type of integrated assessment of disease activity that had occurred up to the point of evaluation. In contrast, on-RCT measures provide an estimate of disease activity over a shorter time frame.
Baseline EDSS was significantly related to the development of both physical disability and cognitive decline and, based on the R2 for univariate explorations, better predicted these outcomes than did MSSS (table 2). Similarly, in the multiple logistic regression model for outcome derived with stepwise selection, the measure of MS severity consistently selected for inclusion in the model was baseline EDSS, not MSSS (table 3). This observation is also consistent with our recursive partitioning analysis.10 Indeed, EDSS may be a much better measure of MS severity than its detractors might otherwise suggest.24 25
Notably, the actual change in EDSS over the 2 year course of the trial was a better predictor of long term outcome than either the sustained 1 point EDSS change or the categorical 1 point change at trial end. This observation, combined with the fact that the 1 point change definition of treatment failure was found no more likely to occur than improvement to the same degree in placebo arms, strongly suggests that the current practice of using sustained EDSS change, with only 3–6 months as the time of confirmation of worsening, as a primary disability outcome should be revised.26
Neither MRI T2 burden change nor accumulation of new MRI lesions during relapses was predictive of disability or cognitive change. On the clinical side, on-study attack rates were not predicted by rates prior to entry, confirming the findings of Young and colleagues.27 This raises considerable doubt about the validity of using pre-trial relapse frequency as a criterion for trial entry. In addition, the observation that the pre-study attack rate did not contribute to predicting physical outcome (tables 2 and 3) is consistent with the reported lack of relationship between number of attacks during the relapsing–remitting phase and time to cane, bedridden status or death from MS.28
Many adjustments are necessary to control for biases that can contaminate non-randomised observational studies.11 No relationship between therapy during the RCT and either physical or cognitive outcome emerged in our regression analysis (table 2). Nevertheless, these observations are not definitive because the difference between the three arms (in intent to treat terms) consists of the few years during which therapeutic exposure differed between groups. Post-trial, all participants were offered and encouraged to take the active treatment.
Unlike our longitudinal data for physical disability, our cognitive data are cross sectional. Consequently, these data could not be analysed using the same bias mitigating statistical methods that we applied to the physical disability data.11 Therefore, any data regarding the effect of treatment on cognition will remain contaminated by the natural tendency for patients who are doing well to stay on therapy and for those who are doing poorly to switch or to stop therapy—a source of bias that can adversely affect any non-randomised study.11
In summary, this is the first study to assess the predictive validity of a variety of short term outcome measures for very long term physical and cognitive outcomes of patients with MS. These included the key outcomes used ubiquitously in trials for determination of efficacy. We found that baseline measurement of disability and MRI, and the on-RCT measures of clinical attacks, disability change from entry to exit and atrophy, modestly but independently correlated with physical or cognitive long term outcomes after 16 years. Although nearly half of the long term physical and cognitive outcome was predictable, much of this came from single and simple baseline measures (table 3). In general, the amount of the variance explained by any single variable was quite small. Importantly, the previously influential, expensive and widely used on-trial change in MRI plaque burden (as measured by T2 lesion volume) did not correlate independently with either physical or cognitive outcome.
The authors acknowledge the assistance of Ray Ashton and Maria Bell from PAREXEL MMS (Worthing, UK) with manuscript preparation.
Funding The study was sponsored by Bayer HealthCare Pharmaceuticals. PAREXEL MMS received payment from Bayer HealthCare Pharmaceuticals for editorial support.
Competing interests DSG, GE, AR, DLi, DL, AK, CW, VK and KB have support from Bayer HealthCare for the submitted work, AT has a specified relationship with Bayer HealthCare and might have an interest in submitted work from the previous 3 years. Both the sponsors and the independent investigators were intimately involved in the study design. The researchers DSG, GE, AR, DLi, AT and DL were independent of the funders. The researchers AK, CW, VK and KB either work or previously worked for the funders.
Ethics approval The study obtained ethics approval from the institutional review boards or independent ethics committees of the participating centres before long term follow-up planning, which began in 2004.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement Statistical and data tables are available from the corresponding author ( ). Participants gave written informed consent for data acquisition and data sharing.