## Abstract

**OBJECTIVES** To review the outcome measures commonly used in phase III treatment trials of relapsing-remitting multiple sclerosis and to introduce a method of data analysis which is clinically appropriate for the often reversible disability in this type of multiple sclerosis.

**METHODS** The conventional end point measures for disability change are inadequate and potentially misleading. Those using the disability difference between study entry and completion do not take into account serial data or disease fluctuations. Rigid definitions of “disease progression” based on two measurements of change in disability several months apart assess neither worsening after the defined “end point” nor the significant proportion of erroneous “treatment failures” which result from subsequent recovery from relapses that outlast the end point. Assessing attacks merely by counting their frequency ignores the variation in magnitude and duration. These problems can be largely circumvented by integrating the area under a disability-time curve (AUC), a technique which utilises all serial measurements at scheduled visits and during relapses to summarise the total neurological dysfunction experienced by an individual patient on any particular clinical scale during a study period.

**CONCLUSIONS** The “summary measure” statistic AUC incorporates both transient and progressive disability into an overall estimate of the dysfunction that was experienced by a patient during a period of time. It is statistically more powerful and clinically more meaningful than conventional methods of assessing disability changes, particularly for trials which are too short to expect to disclose major treatment effects on irreversible disability in patients with a fluctuating disease.

- multiple sclerosis
- outcome measures
- disability

Multicentre double blind placebo controlled phase III trials of various immunomodulatory agents (interferon β-1b (IFNβ-1b, Betaseron), interferon β-1a (IFNβ-1a, Avonex), copolymer-1 (Cop-1 or glatiramer acetate, Copaxone) and intravenous immunoglobulin (IVIg)) for relapsing-remitting multiple sclerosis in the past few years have all demonstrated a modest reduction in the mean number of relapses but little or no significant effects on disability.1-4 This should hardly be surprising given the very slow average accumulation of fixed neurological deficits arising from incompletely resolved attacks5 and the low incidence of secondary progression in the patients selected for these trials. Much criticism has been directed at the well recognised limitations of the clinical rating scales used in these trials and considerable efforts are being made to improve the methodology of assessing impairment, disability, and handicap.6 However, we think that there is also room for improvement in the methods of analysis of both relapse and disability data. More appropriate statistical methods may play an important part in the interpretation of results, whatever the inadequacies of the rating scales used.

We firstly discuss some of the problems of analysing serial data in relapsing-remitting multiple sclerosis, and secondly suggest a simple alternative statistical method which is both more appropriate to the natural history of this condition and more likely to capture any treatment effect on the time course of this fluctuating disease.

## Disability data

The validity of using rigidly defined end points to measure disease progression in relapsing-remitting multiple sclerosis is fundamentally flawed due to the frequent remissions that occur early in the course of the disease. The table summarises the clinical end points which have been used in each of the recent major phase III trials under discussion. Regular neurological examinations were carried out at either three-monthly (IFNβ-1b and Cop-1) or six-monthly (IFNβ-1a and IVIg) intervals, but the published disability outcomes were derived either from determining the numbers of patients who sustained a deterioration in their expanded disability status score (EDSS)7 for a period of either three or six months, or from the difference in disability scores between baseline and the end of the study. Considerable amounts of the disability data collected were not utilised in these analyses.

In a disease which characteristically remits after exacerbations, a method for determining “progression” which relies on a certain deterioration of clinical ratings being sustained for three or six months will not incorporate any subsequent worsening that might occur after this defined period. Furthermore, this type of end point will also result in a significant proportion of erroneous “treatment failures”: that is, some patients who have met the criteria for “progression” will subsequently recover to baseline levels after a period of so called “sustained deterioration”. This phenomenon is common at the lower end of the EDSS where most patients recruited for these trials tend to cluster. Natural history studies on cohorts with early multiple sclerosis have shown that up to 24% of relapses last more than three months.8 9 Similar criticisms can be applied to the Kaplan-Meier survival analysis, which has been used to study time to confirmed progression in disability,2 10 as such a method again inappropriately assumes that the exacerbations concerned are irreversible.
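The problem with a “sustained deterioration” end point can be illustrated with a minimal sketch. The function name, the confirmation rule (worsening of at least 1.0 point present at two consecutive six-monthly visits), and the EDSS values are all assumptions chosen for illustration; they are not taken from any of the trials discussed.

```python
def confirmed_progression(scores, step=1.0, confirm_visits=1):
    """Return the index of the first visit at which a worsening of at least
    `step` points over the entry score is still present at the following
    `confirm_visits` scheduled visits, or None if this never happens.

    Hypothetical rule for illustration only.
    """
    baseline = scores[0]
    for i in range(1, len(scores) - confirm_visits):
        window = scores[i:i + confirm_visits + 1]
        if all(s - baseline >= step for s in window):
            return i
    return None


# Hypothetical EDSS scores at six-monthly visits over two years.
remitting = [2.0, 3.0, 3.0, 2.0, 2.0]   # prolonged relapse, then full recovery
stable = [2.0, 2.0, 2.0, 2.0, 2.0]

for name, course in (("remitting", remitting), ("stable", stable)):
    visit = confirmed_progression(course)
    recovered = course[-1] <= course[0]
    print(name, "progression at visit:", visit, "| back at baseline:", recovered)
```

Under these assumptions the remitting patient is classed as having progressed at the first follow-up visit, even though the final score equals the entry score, which is the kind of erroneous “treatment failure” described above.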

The second commonly used end point, which relies on mean change over a trial period (subtracting final from initial scores), is also problematic, as it ignores any transient disability due to attacks experienced during the study, the reduction of which may be the major beneficial effect of the therapy under investigation. This reliance on the difference between assessments at the start and the end of a trial wastes all the intermediate disability data, and is not statistically or clinically meaningful.11

A popular way of presenting serial data is to plot mean group scores as a time series. This has not been carried out explicitly in the published phase III trials, but has been illustrated in a recent MRI study involving patients on two doses of IFNβ-1a.12 In this paper the mean monthly number and volume of MRI enhancing lesions were plotted versus time and analysed by Student’s *t* test. This type of analysis has several problems.13 The curves joining the means may not be representative of individual disease courses, the data from each patient over time is ignored when every time point has been analysed separately, and the means at successive points are not independent as they are to some extent influenced by the values of preceding data.14

Another technique for evaluating serial data is the analysis of variance (ANOVA), but the assumptions for its use in treatment trials may not always be valid; complete data sets are required (analysis of ad hoc assessments associated with relapses presents difficulties) as there are problems with treating missing values15 and the method has also been considered difficult to understand and interpret.16

## Relapse data

The relapse data from published phase III treatment trials has been subject to many criticisms. Unresolved issues include the methods of assessing exacerbations, which are beyond the scope of this article. It has been suggested that the usefulness of relapses as a trial end point would be enhanced if a meaningful measure of the disability caused by each attack was available.17 In three out of the four trials in question, all relapses required objective confirmation by neurological examination.2-4 In the IFNβ-1b study, although subjectively reported events were accepted as relapses, up to 80% of attacks were verified by examination and graded according to severity.1 However, despite this admirable but time consuming and arduous requirement of the trial investigators, the clinical end points were derived solely from the number of counted relapses (table). A substantial amount of data on the severity and the time course of attacks was not utilised. Furthermore, comparisons of “annual exacerbation rates”, although widely used, are conditional on the assumption that an individual patient’s attack rate during the study is independent of his baseline relapse frequency.2

## Use of “summary measure” as outcome measure

An alternative approach is to use a summary measure which captures changes in disease status over the whole study. The essential components are the sets of serial impairment or disability data, preferably with frequent sampling points for each patient, which are integrated to produce a single numerical summary of a particular dysfunction curve over the trial period. This is done by plotting an individual patient’s scores serially against time. The summary measure is then obtained by calculating the area under the curve (AUC).13 This method has the advantage that many of the statistical problems outlined above are avoided. In particular, in treatment trials of relapsing-remitting multiple sclerosis all scores acquired serially can be used and patients with missing data need not be discarded from the analysis. Different AUCs can be calculated for each clinical scale employed in the study.

In practice, there are two main methods used for calculating the AUC. The first integrates the area contained between the plot of consecutively acquired clinical rating scores during the trial and a baseline defined by the zero point on the particular clinical scale, to give a summary measure of the total disability for each patient (see Appendix). The second version calculates the AUCs with respect to the baseline disability score for each patient at trial entry. This method improves the power of the summary measure, particularly if the changes in disability during the trial are small relative to the variance of the cohort at study entry. The main disadvantage is that statistical independence is lost as the normalised AUCs depend on the stability of the baseline scores. Instability due to a recent resolving relapse may be avoided to some extent by ensuring a neurologically stable run in period. This second technique has been utilised in one study of disability in progressive multiple sclerosis to date.18

In the figure, EDSS scores obtained at six-monthly scheduled visits are shown for four hypothetical but typical patient courses, plotted versus time according to the trapezium rule. Note that the outcome in examples A and B would be the same—namely, “no change in disability” if the end point of an alteration in EDSS over two years was employed, whereas A and C would both become treatment failures if “sustained deterioration” of 1.0 point in EDSS was used. The calculated summary measures in these three examples are 5, 4, and 5.25 EDSS-years respectively using raw scores, and +1, 0, and +1.25 respectively, when normalised to entry baseline. If disability is also scored during relapses, then the episodes and the temporary dysfunction arising from them can be incorporated into an individual patient’s summary score for the whole trial period. Unequal time intervals between data points are permitted, which is particularly useful for the unpredictable disease course of relapsing-remitting multiple sclerosis. In the figure, example D has the same “baseline disability” time course as A, but when the EDSS scores during two relapses are incorporated into the AUC, the summary measure is increased from 5 to 5.79 using raw data, and from +1 to +1.79 with data normalised to entry baseline.
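The arithmetic for examples A to C can be sketched as follows. The figure's underlying scores are not reproduced in the text, so the EDSS values below are hypothetical: they were chosen only because they are consistent with the summary measures quoted above (an entry score of 2.0 follows from raw totals of 5 and 4 EDSS-years corresponding to +1 and 0 over two years). The function names are likewise illustrative.

```python
def auc_trapezium(times, scores):
    """Area under the score-time curve by the trapezium rule (raw scores)."""
    if len(times) != len(scores) or len(times) < 2:
        raise ValueError("need at least two (time, score) pairs of equal length")
    points = list(zip(times, scores))
    return sum((t1 - t0) * (y0 + y1) / 2.0
               for (t0, y0), (t1, y1) in zip(points, points[1:]))


def auc_from_baseline(times, scores):
    """AUC measured relative to the entry score rather than to zero."""
    return auc_trapezium(times, scores) - scores[0] * (times[-1] - times[0])


visits = [0.0, 0.5, 1.0, 1.5, 2.0]          # years, six-monthly visits
courses = {                                  # hypothetical EDSS scores
    "A": [2.0, 3.0, 3.0, 2.0, 2.0],
    "B": [2.0, 2.0, 2.0, 2.0, 2.0],
    "C": [2.0, 2.0, 3.0, 3.0, 3.0],
}

for label, edss in courses.items():
    print(label, auc_trapezium(visits, edss), auc_from_baseline(visits, edss))
# A 5.0 1.0
# B 4.0 0.0
# C 5.25 1.25
```

With these assumed scores, A and B have identical entry and exit EDSS yet differ by one EDSS-year, and A and C both meet a “sustained deterioration” criterion yet differ in total disability experienced.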

Ideally, to improve the accuracy of any particular dysfunction curve, more frequent assessments should be made during exacerbations, to acquire more detailed information on the relapse onset and offset as well as the duration and magnitude of any transient disability. As it is cumbersome and inconvenient for patients to have repeated neurological examinations at times of increased disability, alternative scales being developed for rapid and easy administration, such as the Guy’s Neurological disability scale,19 may be particularly useful for this purpose.

For any neurological rating scale (for example, EDSS, Scripps neurological rating scale,20 ambulation index,21 or individual Kurtzke functional system scores7), the AUC obtained from serial scheduled time points can be compared with the AUC incorporating additional measurements obtained during relapses. The difference between the two values can be interpreted, to some extent, as an approximation of the short lived effects of exacerbations (for example, comparing the EDSS summary measures from examples A and D in the figure). Caution is necessary in short trials of two or three years, as fixed neurological deficits are accumulating very slowly, and an increased AUC at the end of a trial may simply represent transient disability which has either resolved or has yet to resolve. Classification into different subgroups for further analysis depending on the individual time point curves may be necessary.13 In addition, differences in sampling frequency between patients and controls can introduce bias which may require weighting adjustment. Also, as for other statistical methods, the problems of the ordinal nature of disability scales such as the EDSS, as well as the “noise” introduced by within rater variability, remain.22 23 Nevertheless, the summation of disability provided by the AUC, whether transient or fixed, provides a more clinically meaningful measure, particularly within the time constraints of these relatively short trials, by which to judge the effectiveness of a new therapy for relapsing-remitting multiple sclerosis.
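The comparison between the scheduled-visit AUC and the AUC that also includes relapse assessments might be sketched like this. The data and the helper names are hypothetical, and the example assumes a single unscheduled assessment per relapse, with no two assessments at the same time point.

```python
def auc_trapezium(points):
    """Trapezium-rule area for a list of (time_in_years, score) pairs."""
    pts = sorted(points)
    return sum((t1 - t0) * (y0 + y1) / 2.0
               for (t0, y0), (t1, y1) in zip(pts, pts[1:]))


def relapse_contribution(scheduled, relapse_visits):
    """Extra area attributable to scores recorded during relapses:
    AUC(all assessments) minus AUC(scheduled assessments only)."""
    return auc_trapezium(scheduled + relapse_visits) - auc_trapezium(scheduled)


# Hypothetical patient: stable EDSS 2.0 at scheduled visits over one year,
# with one relapse assessed at three months (EDSS 4.0).
scheduled = [(0.0, 2.0), (0.5, 2.0), (1.0, 2.0)]
relapse_visits = [(0.25, 4.0)]

print(relapse_contribution(scheduled, relapse_visits))  # 0.5 EDSS-years
```

Because the trapezium rule joins the relapse score to the neighbouring scheduled visits by straight lines, a single relapse assessment implies a gradual onset and offset over the whole interval. More frequent assessments during exacerbations would tighten this estimate.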

Why has this method not been used in treatment trials of relapsing-remitting multiple sclerosis? Summary measure statistics have been in use since 1938,24 although the technique was rarely employed in medicine until the past decade. In neurology, it has been utilised as a primary outcome variable in a headache treatment trial (using serial raw data)25 and in a pilot study for rehabilitation in progressive multiple sclerosis (with summary measures of change of EDSS from baseline).18 Relapsing-remitting multiple sclerosis treatment trials lasting two to three years are probably not long enough to demonstrate any meaningful effects on irreversible disability and evaluating relapses by merely counting them is an oversimplification. Summary measure statistics enable the magnitude and duration of neurological dysfunction caused by exacerbations to be incorporated into an overall disability analysis. Moreover, the inclusion of all serial data in the AUC calculations should reduce the variance which is associated with data obtained at single time points (for example, in a comparison between the initial and final disability scores). This “variance stabilising” effect means that fewer patients should be necessary for the same power to detect a predetermined clinically significant difference. Increasingly, there is a need to consider the cost effectiveness of pharmacotherapies. From the disability-time plots of different treatments being compared, the incremental therapy benefit of the test drug relative to the control can be expressed in terms of the readily interpretable disability-year (difference in AUCs) for use in cost effectiveness studies. For all these reasons we suggest that it is appropriate to employ summary measure statistics to evaluate the effects of new treatments in patients with relapsing-remitting multiple sclerosis.
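Once each patient has been reduced to a single AUC, the treatment comparison becomes a comparison of two sets of numbers. The sketch below uses hypothetical per-patient AUCs and expresses the incremental benefit as a difference in mean disability-years. The exact permutation test is an assumption made for illustration, not a test prescribed in this article, and the function names are invented.

```python
from itertools import combinations
from statistics import mean


def incremental_benefit(control_aucs, treated_aucs):
    """Mean disability-years in controls minus mean in treated patients."""
    return mean(control_aucs) - mean(treated_aucs)


def permutation_p_value(control_aucs, treated_aucs):
    """Two-sided exact permutation test on the difference in mean AUC.
    Enumerates every relabelling, so it is only practical for small groups."""
    observed = abs(mean(control_aucs) - mean(treated_aucs))
    pooled = list(control_aucs) + list(treated_aucs)
    indices = range(len(pooled))
    extreme = total = 0
    for chosen in combinations(indices, len(control_aucs)):
        chosen_set = set(chosen)
        group1 = [pooled[i] for i in chosen_set]
        group2 = [pooled[i] for i in indices if i not in chosen_set]
        if abs(mean(group1) - mean(group2)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical two-year EDSS AUCs (EDSS-years), one value per patient.
control = [5.0, 5.25, 4.5, 6.0]
treated = [4.0, 4.25, 4.5, 5.0]

print("benefit (EDSS-years):", incremental_benefit(control, treated))
print("permutation p value:", permutation_p_value(control, treated))
```

Because EDSS is an ordinal scale, a rank-based comparison of the per-patient AUCs may be preferred to a difference in means. The choice does not affect how the summary measure itself is calculated.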

## Appendix

The AUC is a summation of the areas under the graph between each pair of consecutive scores by the trapezium rule. Disability measures ($y_0, y_1, y_2, \ldots$) are plotted versus their times of assessment ($t_0, t_1, t_2, \ldots$).

The AUC using raw data is calculated as follows:

If we have $n+1$ measurements $y_i$ at times $t_i$ ($i = 0, \ldots, n$), then

$$\mathrm{AUC} = \frac{1}{2}\sum_{i=0}^{n-1}(t_{i+1}-t_i)(y_i+y_{i+1})$$
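The baseline-normalised summary measure described in the main text can be written in the same notation. The expression below is a sketch reconstructed from that description (area measured relative to each patient's entry score $y_0$ rather than to zero) and is not a formula taken from the original appendix:

$$\mathrm{AUC}_{\text{baseline}} = \frac{1}{2}\sum_{i=0}^{n-1}(t_{i+1}-t_i)\bigl[(y_i-y_0)+(y_{i+1}-y_0)\bigr] = \frac{1}{2}\sum_{i=0}^{n-1}(t_{i+1}-t_i)(y_i+y_{i+1}) - y_0\,(t_n-t_0)$$

The second form shows that the normalised value is the raw-score area minus the rectangle the patient would have accumulated had the entry score persisted unchanged. This is consistent with the values quoted for the figure: over two years, raw areas of 5 and 4 EDSS-years correspond to +1 and 0 when the entry score is 2.0, since $5 - 2.0 \times 2 = 1$ and $4 - 2.0 \times 2 = 0$.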

## References
