

Clinical appropriateness: a key factor in outcome measure selection: the 36 item short form health survey in multiple sclerosis
  1. J A Freeman,
  2. J C Hobart,
  3. D W Langdon,
  4. A J Thompson
  1. Institute of Neurology, Department of Clinical Neurology, Queen Square, London WC1N 3BG, UK
  1. Dr JA Freeman, Institute of Neurology, Department of Clinical Neurology, Queen Square, London WC1N 3BG, UK email FreemanJR{at}


OBJECTIVES Understanding the properties of an outcome measure is essential in choosing the appropriate instrument and interpreting the information it generates. The MOS 36 item short form health survey questionnaire (SF-36) is widely acknowledged as the gold standard generic measure of health status; few studies, however, have evaluated its use for clinical trials in multiple sclerosis. Its clinical appropriateness, internal consistency reliability, validity, and responsiveness were investigated across a broad range of patients with multiple sclerosis.

METHODS A prospective study in which 150 adults with clinically definite multiple sclerosis completed a battery of questionnaires evaluating generic health status, disability, handicap, and emotional wellbeing. Of these, 44 patients undergoing inpatient rehabilitation completed the questionnaires before and after intervention to evaluate responsiveness.

RESULTS Score distributions demonstrated significant floor and ceiling effects in four of the eight dimensions which were particularly marked when patient selection was restricted to a narrow band of disease severity (as is the case in most clinical trials). Internal consistency exceeded the standard for group comparisons for all dimensions. Convergent and discriminant construct validity was supported by the direction, magnitude, and pattern of correlations with other health measures. In comparison with instruments measuring associated constructs, the responsiveness of the SF-36 was poor in evaluating change in moderate to severely disabled patients participating in a programme of inpatient rehabilitation.

CONCLUSIONS The SF-36 has some limitations as an outcome measure in multiple sclerosis. The results highlight the need for all instruments to be examined in the specific sample population under question and for the specific research question being investigated. In multiple sclerosis clinical trials, the SF-36 should be supplemented with other relevant measures.

  • multiple sclerosis
  • SF-36
  • quality of life


Numerous clinical trials have been undertaken in the past decade to determine the effectiveness of a range of interventions in multiple sclerosis. Traditionally these trials have evaluated outcome on the basis of clinical end points (for example, relapse rate) and physiological parameters (for example, lesion load on MRI1). In recent years there has been a gradual broadening of the outcomes measured to include aspects of health status.2 3 Alongside this advance, an increasing number of new measures of health status have been developed.4-6 Unfortunately only preliminary information is available about many of these measures, particularly on their use in clinical trials. As a consequence researchers have found that they are faced with greater choice but limited information on which to base their selection.7

It is widely agreed that the choice of outcome measure(s) is crucial to the successful design of a clinical trial.8 An informed decision is reliant on knowledge of the scientific (reliability, validity, and responsiveness) and clinical properties (feasibility, appropriateness to the study sample, respondent burden) of available measures.9 Understanding the purpose of the study is also a key consideration as different questions may require different measures. Will the instrument be used to describe specific characteristics of the population? Will it make comparisons with other samples? Will it evaluate the effectiveness of an intervention? Such information is essential in choosing the most appropriate instrument and interpreting the results generated in a meaningful way.

Two different approaches to the measurement of health status are the generic and the disease specific models. The generic model seeks to assess basic health values thought to be relevant to health status regardless of disease, treatment, or age group.10 By contrast, the disease specific model is not concerned with establishing universal standards but aims instead to reflect factors relevant to the person with a specific disease.

The SF-36 is generally considered as the gold standard generic measure of health status.4 It is available in several languages and has been adopted and disseminated worldwide. A standard United Kingdom version has been developed11 and norms determined for the healthy population both in the United States12 and United Kingdom.13 Although proved to be reliable and valid in a range of patient groups12 relatively few studies have investigated its use in multiple sclerosis. Most have examined its value in describing the impact of multiple sclerosis on quality of life; often comparing their findings to other patient groups and the general population.14-17 Few have used the SF-36 as an outcome measure in clinical trials of multiple sclerosis.18 19

In 1996 we published the results of a cross sectional study which piloted the use of the SF-36 in patients with multiple sclerosis in a rehabilitation unit.15 Our results showed that the SF-36 demonstrated marked floor effects in some dimensions in this group of moderately to severely disabled patients. We suggested that this was likely to limit its potential responsiveness in evaluating any changes that may occur as a result of interventions. We concluded that a systematic evaluation of the SF-36 in a broader range of patients with multiple sclerosis was necessary. This study investigated the appropriateness, reliability, and validity of the SF-36 in a broad range of patients with multiple sclerosis from the newly diagnosed to those in the advanced stages of the disease. It determined its responsiveness in a subgroup of patients undergoing a programme of inpatient rehabilitation.



One hundred and fifty patients with a diagnosis of clinically definite multiple sclerosis20 participated in this ethically approved prospective study. Consecutive patients were recruited from three different sources within a healthcare setting: a weekly outpatient assessment clinic, an inpatient neurorehabilitation unit, and those admitted under a single consultant (AJT) to acute hospital wards. Patients were excluded if they were cognitively impaired such that they were unable to reliably complete the questionnaires; had other diseases such as rheumatoid arthritis which may have influenced their health status; or were non-English speaking.


Data collected solely from one particular setting are often biased. For example, severely disabled patients are likely to make up a larger percentage of those in the acute hospital setting than of those attending a follow up outpatient appointment. To achieve a more even spread of disability within our sample we undertook a stratification procedure to ensure that it comprised equal numbers of patients across the entire range of disease severity. This process involved a neurological registrar assessing all patients with Kurtzke's functional systems scale and expanded disability status scale (EDSS),21 and then categorising them into one of three groups22: mild (EDSS 0–4.5), moderate (EDSS 5.0–6.5), or severe (EDSS 7.0–9.5). Consecutive patients were recruited until there were 50 patients in each category.


Demographic details were collected by interview and diagnostic details derived from the medical records. All patients were rated for level of disease severity as described above. Level of disability was scored by interview using the functional independence measure (FIM)23 administered in accordance with published guidelines. Patients also completed a battery of self reported questionnaires measuring a range of health constructs including generic health status, handicap, emotional wellbeing, and a 0–10 point global rating scale of overall quality of life. Whenever possible this was undertaken independently but when necessary (for example, with visual disturbance, difficulty writing) physical assistance was provided by the researcher. No assistance was given in interpreting the questionnaires.

One hundred and six of the patients were assessed at a single time point. The other 44 subjects, who were all rehabilitation inpatients, were assessed at two time points (admission and discharge) to evaluate responsiveness.


The anglicised version of the SF-3613 was used. This 36-item generic health status questionnaire includes eight multi-item measures of functioning and wellbeing: physical function (PF-10 items), role limitations due to physical (RLP-four items) or emotional (RLM-three items) health problems, social function (SF-two items), emotional wellbeing (MH-five items), bodily pain (BP-two items), energy and fatigue (EV-four items), and general health perceptions (HP-five items). All items are coded, summed, and transformed onto a scale of 0–100 (0=worst health, 100=optimal health).12 In addition, scores on these eight dimensions can be reduced to two summary scores, a physical (PCS) and a mental component (MCS), by means of principal components analyses.24
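The rescaling step described above can be illustrated with a minimal sketch. It assumes a simple linear transformation of the summed raw score onto 0–100; the article does not give the formula, so the function name, its parameters, and the example item coding are hypothetical, and any item recoding that precedes summation is not shown.

```python
def transform_scale(raw_score, lowest_raw, highest_raw):
    """Linearly rescale a summed raw dimension score to 0-100 (0 = worst, 100 = best).

    Assumption: a plain linear rescaling between the lowest and highest
    possible raw scores; this is an illustration, not the published algorithm.
    """
    if highest_raw <= lowest_raw:
        raise ValueError("highest_raw must exceed lowest_raw")
    if not lowest_raw <= raw_score <= highest_raw:
        raise ValueError("raw_score lies outside the possible raw range")
    return (raw_score - lowest_raw) / (highest_raw - lowest_raw) * 100.0


# Hypothetical 10 item dimension with each item coded 1-3 (raw range 10-30).
example = transform_scale(raw_score=20, lowest_raw=10, highest_raw=30)
```

Under these assumptions a respondent who scores every item at its lowest level receives 0 and one who scores every item at its highest level receives 100, which is why clustering at either extreme shows up directly as a floor or ceiling effect.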

A global 0–10 point scale was used to rate overall quality of life (QoL).25

Instruments measuring related health constructs

Information was gathered from the following instruments to enable comparison with some of the SF-36 dimensions.

(1) Functional performance was assessed, by patient interview, using the FIM motor domain. This 13 item, seven level scale measures aspects of daily function in four subscales: self care, sphincter control, transfers, and locomotion. The total score range is 13–91 with higher scores indicating greater levels of independence.

(2) Handicap was assessed using the London handicap scale (LHS).26 This six item, six level scale assesses the disadvantage experienced by the individual patients in the dimensions of mobility, physical independence, occupation, social integration, orientation, and economic self sufficiency. The total score range is 0–100 with higher scores indicating the least level of disadvantage.

(3) Emotional status was assessed using the 28 item general health questionnaire (GHQ).27 This version has four subscales that measure disturbances in the areas of somatic complaints, anxiety, social dysfunction, and depression. The total score range is 0–28, with higher scores indicating greater levels of emotional disturbance.

Each of these instruments has been used in various multiple sclerosis populations and has been shown to be valid, reliable, and responsive within the rehabilitation3 19 and hospital setting.28


The inpatient rehabilitation programme consisted of a structured, goal oriented, multidisciplinary programme specifically aimed at considering the individual needs of the patient.3 This typically included efforts to improve functional independence, mobility, bladder and bowel function, and communication. Advice and education regarding work and leisure pursuits, tone management, fatigue management, and strategies to compensate for memory dysfunction were also regular components of this programme.


Statistical analysis was performed using SPSS.29 Descriptive statistics were used to describe demographic and disease characteristics of the sample.


Appropriateness has been used to define whether the range of the construct measured within the study sample is similar to the range covered by the measurement instrument.30 In essence this reflects how relevant the instrument is to the population being examined. This was assessed by examining the scale score distributions (range, mean, SD, floor (minimum), and ceiling (maximum) scores) of the eight dimensions and the two summary scales of the SF-36, as well as for each of the other measures.
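As an illustration of the floor and ceiling check, the sketch below computes the percentage of respondents at the minimum and maximum of a scale. The function name and the scores are hypothetical; the default threshold of 20% follows the criterion the article cites when reporting its results.

```python
def floor_ceiling_effects(scores, scale_min=0.0, scale_max=100.0, threshold_pct=20.0):
    """Return the percentage of scores at the scale minimum and maximum.

    A floor or ceiling effect is flagged when the percentage exceeds
    threshold_pct (20% by default).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    n = len(scores)
    floor_pct = 100.0 * sum(1 for s in scores if s == scale_min) / n
    ceiling_pct = 100.0 * sum(1 for s in scores if s == scale_max) / n
    return {
        "floor_pct": floor_pct,
        "ceiling_pct": ceiling_pct,
        "floor_flag": floor_pct > threshold_pct,
        "ceiling_flag": ceiling_pct > threshold_pct,
    }


# Hypothetical transformed scores for one dimension (not study data).
hypothetical_scores = [0, 0, 0, 25, 50, 50, 75, 100, 100, 100]
result = floor_ceiling_effects(hypothetical_scores)
```

Running the same function separately on each disease severity subgroup would show how a scale that looks acceptable in a pooled sample can cluster at one extreme once the sample is narrowed.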


One aspect of reliability, internal consistency, was calculated by Cronbach's α statistic.31 Alpha coefficients exceeding 0.7 are considered adequate for group comparison.12
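For readers unfamiliar with the statistic, a minimal sketch of Cronbach's α follows. It uses the standard formula α = k/(k−1) × (1 − Σ item variances / variance of total scores); the function name and the response data are hypothetical, and the study itself used SPSS rather than this code.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    if len(item_scores) < 2:
        raise ValueError("at least two respondents are required")
    n_items = len(item_scores[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    if any(len(row) != n_items for row in item_scores):
        raise ValueError("every respondent must answer the same number of items")
    # Sample variance of each item across respondents.
    item_variances = [
        variance([row[i] for row in item_scores]) for i in range(n_items)
    ]
    # Sample variance of each respondent's summed score.
    total_variance = variance([sum(row) for row in item_scores])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical responses: three respondents, two items (not study data).
alpha = cronbach_alpha([[1, 2], [2, 1], [3, 3]])
meets_group_standard = alpha > 0.7  # the 0.7 standard quoted above
```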

Construct validity

Construct validity is the process used to establish the validity of a measurement instrument when no criterion or universe of content is accepted as entirely adequate to define the attribute being measured.31 It is determined by examining the extent to which empirical data support hypotheses concerning the construct the instrument is purported to measure. We examined the data for evidence of:

(1) Convergent validity—by determining the relation between dimensions on the SF-36 and instruments measuring similar constructs. Pearson product-moment correlations were examined for: SF-36 emotional wellbeing dimensions with the GHQ; SF-36 physical dimensions with the FIM and the EDSS; and SF-36 social and role dimensions with the LHS. To provide evidence of convergent validity we would expect, for example, to see substantial correlations between the SF-36 physical dimensions, the EDSS, and the FIM; and likewise between the SF-36 emotional dimensions and the GHQ.

(2) Discriminant construct validity—by determining the relation between dimensions on the SF-36 and instruments measuring different constructs. Pearson product-moment correlations were examined between the physical and emotional wellbeing dimensions of the SF-36; the FIM and the SF-36 emotional wellbeing dimensions; and the GHQ and the SF-36 physical dimensions. To provide evidence of discriminant validity we would expect, for example, to see weak correlations between the SF-36 emotional dimensions and the FIM; and between the SF-36 mental and physical summary scales.

(3) Group differences construct validity—by examining the differences in SF-36 scores between different groups. We investigated the ability of the mental and physical summary scales to distinguish between different levels of disease severity in multiple sclerosis by using a one way analysis of variance (ANOVA) with post-hoc comparison, adjusting for multiple comparisons using Bonferroni's test, with α=0.05. To provide evidence of group differences construct validity we would expect, for example, that patients categorised into the severe group would report lower scores on both of the summary scales than patients in the mild group.

(4) Hypothesis testing—by examining whether the results produced are consistent with theoretical expectation. The following hypotheses were tested using independent t tests, with α=0.05: (a) patients requiring carer assistance will report lower scores in the SF-36 physical function dimensions than those who are independent in their daily care; (b) patients with relapsing-remitting multiple sclerosis will report higher scores in the physical summary scale of the SF-36 than those with secondary progressive multiple sclerosis; (c) patients scoring ⩾5.0 points on the GHQ (indicating emotional distress as defined by Dalos et al27) will report lower scores on the SF-36 mental summary scale than those scoring <5.0 points.
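The convergent and discriminant checks in items (1) and (2) above rest on the Pearson product-moment correlation. The sketch below shows one way to compute it; the function name and the paired scores are hypothetical and are not taken from the study, which used SPSS.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two paired lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and contain at least two values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares about the means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cross / sqrt(ss_x * ss_y)


# Hypothetical paired scores: a physical function scale (higher = better)
# and a disability rating (higher = worse), so a negative r is expected.
physical_function = [80, 60, 40, 20, 5]
disability_rating = [2.0, 4.0, 6.0, 7.0, 8.5]
r = pearson_r(physical_function, disability_rating)
```

A large negative value of r in an example like this corresponds to convergent evidence, because the two scales run in opposite directions; a value near zero between scales that measure different constructs corresponds to discriminant evidence.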


Responsiveness is the ability of the instrument to measure clinically important change over time.9 This was examined in a subgroup of 44 patients admitted for a short period of inpatient rehabilitation. This intervention has been previously evaluated and was shown to be effective in improving aspects of health status in people with multiple sclerosis in both the short3 and long term.19 For each outcome measure, patients' change scores between admission and discharge were determined, and effect sizes calculated (where effect size=mean change/SD of the initial distribution of scores).32 The criterion proposed by Cohen33 was used to interpret the effect size, where 0.2 is small, 0.5 is moderate, and 0.8 or greater is large. Paired t tests were used to determine the statistical significance of these change scores.
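The effect size definition given above (mean change divided by the SD of the initial scores) can be sketched as follows. The function names and the admission and discharge scores are hypothetical, and labelling values below 0.2 as "negligible" is an assumption made for this illustration rather than part of the quoted criterion.

```python
from statistics import mean, stdev


def effect_size(admission, discharge):
    """Mean change (discharge minus admission) divided by the SD of admission scores."""
    if len(admission) != len(discharge) or len(admission) < 2:
        raise ValueError("scores must be paired and contain at least two patients")
    changes = [d - a for a, d in zip(admission, discharge)]
    baseline_sd = stdev(admission)
    if baseline_sd == 0:
        raise ValueError("admission scores have zero spread")
    return mean(changes) / baseline_sd


def interpret_effect_size(es):
    """Label an effect size using the thresholds 0.2, 0.5, and 0.8."""
    magnitude = abs(es)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # assumption: label for values below 0.2


# Hypothetical admission and discharge scores for three patients.
es = effect_size(admission=[10, 20, 30], discharge=[15, 25, 35])
label = interpret_effect_size(es)
```

Because the denominator is the spread of baseline scores, a scale whose baseline scores cluster at the floor or ceiling can yield a misleading effect size, which is one reason the score distributions are examined before responsiveness.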



Of the 150 patients entered into the study, one did not complete the battery of questionnaires and was excluded from the analyses. Of the remaining 149 people there were no missing data for any items. Table 1 presents the demographic and diagnostic characteristics of the study sample, of which 70% were married, 33% were employed, and 49% required assistance with their daily care. Table 2 shows the mean SF-36 scores for our sample alongside those of two other multiple sclerosis populations.

Table 1

Demographic and diagnostic characteristics

Table 2

SF-36 scores for three different multiple sclerosis (MS) populations


Table 3 presents the score distributions for the SF-36 dimensions and summary scales. In the total sample, scores in all dimensions span virtually the entire range; however, floor effects in three dimensions (physical function, physical and emotional role limitations) and ceiling effects in two dimensions (emotional role limitations and pain) exceed the recommended criteria of 20%.34 When patients are subdivided into groups according to disease severity the distribution of scores within each subgroup, in some cases, alters markedly. For example: (a) the physical function scores span only the bottom 20% of the range for the severe group; (b) the means fall substantially outside the midpoint of the scale for physical function and physical role limitations in the moderate and severe groups (mean PF moderate=22.9, severe=4.4; mean RLP moderate=28.5, severe=10.5); (c) floor effects increase markedly for physical function and role limitations (both emotional and physical), particularly in the severe group. The lowest possible score is reported by 84% of severe patients for physical role limitations and 36% of patients for emotional role limitations.

Table 3

Baseline SF-36 score distributions

Table 3 demonstrates that the differences in score distributions between the total sample and the subgroups were less marked in the summary scales. Of importance, no ceiling or floor effects were present.

Table 4 presents the score distributions for instruments measuring related health constructs. In the total sample scores on the FIM, LHS, and the global rating scale of QoL span virtually the entire scale range; the mean scores were near the midpoint; and the floor and ceiling effects were minimal. This indicated that the scales were appropriate for the total study sample. When patients were subgrouped according to EDSS score the appropriateness of these instruments, while not ideal, remained satisfactory. Although the scores were restricted to a smaller range of the available scale, the floor and ceiling effects remained well below the recommended criteria of 20%.34 By contrast, in both the total sample and each of the subgroups, the mean scores on the GHQ fell below the midpoint of the scale and the ceiling effects were above the recommended upper limit.

Table 4

Baseline score distributions in instruments measuring a range of health related constructs


Internal consistency reliability for each of the eight dimensions and the component summary scales of the SF-36 was high, with α coefficients ranging from 0.77 to 0.94.


Intercorrelations between the SF-36 dimensions

Table 5 reports intercorrelations between SF-36 dimensions. Importantly, none of the correlations were strong (r=0.09–0.61) demonstrating that each dimension was measuring a related but distinct construct. As predicted, related dimensions were more strongly associated than less related dimensions. For example, physical function showed a stronger correlation with physical role limitations (r=0.57) than with emotional wellbeing (r=0.09) or emotional role limitations (r=0.14). Similarly emotional wellbeing showed a stronger correlation with emotional role limitations (r=0.54) than pain (r=0.29). Interestingly, emotional wellbeing showed a much stronger correlation with energy and vitality (r=0.61) than did physical function (r=0.18).

Table 5

Associations between SF-36 dimensions (Pearson's product-moment correlations)

Correlations between SF-36 dimensions and instruments measuring related health constructs

As predicted, associations between SF-36 dimensions and instruments measuring related health constructs were strongest between those measuring similar concepts. For example, physical function correlated strongly with the FIM (r=0.68) and the EDSS (r=−0.82); and emotional wellbeing correlated substantially with the GHQ (r=−0.59). By contrast, associations were weak between instruments measuring unrelated constructs. For example, emotional role limitations was only weakly associated with the FIM (r=0.04), and pain was only weakly associated with the EDSS (r=−0.07). It is notable that the social function dimension correlated more strongly with scales measuring emotional constructs (for example, GHQ r=−0.56) than physical constructs (for example, EDSS r=−0.29; FIM r=0.34).

Group difference construct validity—As expected, statistically significant differences between the patient subgroups occurred in three SF-36 dimensions (social function, physical function, and physical role limitations; p<0.05–0.0001). This finding demonstrates the ability of these dimensions to discriminate between different levels of disease severity. Significant differences were also demonstrated between all subgroups for the physical summary scale (p<0.001), but only between the mild and the moderate group in the mental summary scale (p<0.03).

Hypothesis testing—As predicted: (a) patients requiring carer assistance reported lower scores in the physical role limitations dimension than those who are independent (p<0.0001, mean scores=13.5 and 43.7 respectively); (b) patients with relapsing-remitting multiple sclerosis reported higher scores in the physical summary scale than those with secondary progressive multiple sclerosis (p<0.0001, mean scores=35.7 and 27.4 respectively); (c) patients scoring ⩾5.0 points on the GHQ reported lower scores on the mental summary scale than those scoring <5.0 points (p<0.0001, mean scores=40.4 and 52.2 respectively).


Forty four patients participated in inpatient rehabilitation for an average of 20 days (SD 6; range 13–39). Effect sizes for the SF-36 dimensions ranged from negligible to small (effect sizes 0.01–0.30). The dimensions demonstrating the largest effect size were the emotional role limitations (effect size 0.27) and pain (effect size 0.30). Of the eight dimensions, only pain (p=0.006) and physical function (p=0.01) demonstrated a statistically significant change in scores between admission and discharge. By contrast effect sizes on the FIM, LHS, and GHQ were all moderate in magnitude (effect size 0.56, 0.58, and 0.51 respectively) and statistically significant differences were demonstrated between scores on admission and discharge for each of these measures (p<0.002).


This study provides information about a widely used generic measure of health status—the SF-36. The SF-36 was constructed to compare functional health and wellbeing across patient and general populations, and to evaluate and compare the benefits of alternative treatments.24 The focus of this study was to examine its performance as an outcome measure in multiple sclerosis.

The generalisability of our results is supported by the fact that the demographic and diagnostic characteristics of our sample population are typical of those described in the literature.35 Furthermore, as demonstrated in table 2, the distribution of SF-36 scores is very similar to the results of previous multiple sclerosis studies16 17 25 suggesting that our sample is representative of the general multiple sclerosis population.

The test-retest reliability of the SF-36 was not investigated in this study but the results of others, both in the United Kingdom11 and the United States,12 report excellent results at 2 weeks. In agreement with some other studies, our results demonstrate that the internal consistency for all dimensions of the SF-36 exceeds the 0.7 standard for group comparisons.12 Similarly, convergent and discriminant construct validity are supported by the direction, magnitude, and pattern of correlations with other health measures. Further evidence for construct validity has been provided by support for the clinical hypotheses tested. These data support the internal consistency reliability and validity of the SF-36 as a measure of health status in multiple sclerosis. Consequently it would seem reasonable to choose the SF-36 as an outcome measure in clinical trials evaluating the effectiveness of interventions in multiple sclerosis.

When the data are examined in more detail, however, some limitations of this measure become apparent. For instance the large floor and ceiling effects in four of the eight dimensions indicate that the range of health status measured is unlikely to represent the range experienced by this population, and demonstrate limitations in the ability of the SF-36 to discriminate between individual patients in these dimensions. It is notable that the floor and ceiling effects do not simply apply to patients at the extremes of the disease severity range; the moderate group also exhibit significant floor effects in three dimensions. The data also show a polarisation of responses in the role limitations dimensions. This is perhaps not surprising when the dichotomous format of the questionnaire items is considered. For an example, refer to fig 1, which contains the emotional role limitations question.

Figure 1

Sample item from the SF-36 emotional role limitations dimension

During the past 4 weeks have you had any of the following problems with your work or other regular daily activities as a result of any emotional problems (such as feeling depressed or anxious)? (circle one number on each line):

                                                                          Yes  No
(a) Cut down on the amount of time you spent on work or other activities   1   2
(b) Accomplished less than you would like                                  1   2
(c) Didn't do work or other activities as carefully as usual               1   2

These concerns as to the appropriateness of the SF-36 in multiple sclerosis are heightened when the population is subdivided into groups according to disease severity. This is very important as the selection criteria of most clinical trials will inevitably narrow the range of disease severity of the study sample, sometimes markedly (for example EDSS 1.0<3.536; EDSS 3.0– 6.52; EDSS<6.537). These results highlight the importance of examining the appropriateness of an instrument for the specific population under investigation. Even though an instrument may prove to be appropriate for one group of patients this may not necessarily be the case for a different group, even within the same medical condition.

No floor or ceiling effects occur in the SF-36 mental and physical summary scales, suggesting that these scales may be more appropriate than the individual dimensions for discriminating between individual patients at a single point in time. Additionally they have the advantage of reducing the number of statistical comparisons required in the analysis of results, thereby reducing the role of chance in testing experimental hypotheses.24 A disadvantage, however, is that it is impossible to interpret precisely where any changes have occurred; a common feature of all multidimensional instruments.

The responsiveness of an instrument is of key importance in outcome studies. If the instrument is unable to detect change in health, an intervention that improves health status may show no apparent difference between treated and untreated patients. Unfortunately, this property is often overlooked and information about the responsiveness of the SF-36 in multiple sclerosis trials is scarce. The negligible to small effect sizes demonstrated by the SF-36 show its responsiveness to be poor in evaluating the effectiveness of inpatient rehabilitation in people with moderate to severe disability. Although some may suggest that this is because little or no change has occurred, the moderate effect sizes of the FIM (measuring physical function), the GHQ (measuring emotional health), and the LHS (measuring handicap) show that change has indeed occurred, at least in these select areas. Furthermore, statistically significant changes between admission and discharge were demonstrated on each of these three measures, but in only two of the dimensions of the SF-36. The poor responsiveness of the SF-36 may, in part, be explained by the fact that it measures broad issues of both function and wellbeing, which, taken together, may not give a clear effect. By contrast, the FIM, GHQ, and LHS measure more specific health constructs. We would also suggest that the clustering of scores at one or both ends of the scale, found in half of the SF-36 dimensions, means that the range of the scale is too limited to enable small but possibly clinically significant changes to be recorded, thereby limiting responsiveness. It is stressed, however, that the responsiveness data in this study are restricted to patients with moderate to severe disability undergoing rehabilitation and that the SF-36 has therefore not been assessed in a population representative of the patients included in most multiple sclerosis trials. This is a limitation of this study.

Different approaches to address some of the limitations of generic measures have been used in recent years. For example, the development of disease specific measures for multiple sclerosis has been undertaken either by adapting current measures (for example, the functional assessment measure38 or the multiple sclerosis QoL-54 4); by gathering together a wide range of symptom specific measures (for example, the QoL inventory39); or by identifying key areas and then weighting them according to how important the patient thinks these areas are to their lifestyle (for example, the disability and impact profile40). All of these measures are in the early stages of evaluation.


Understanding the properties of an outcome measure is essential when choosing the most appropriate instrument for a study and interpreting the information it generates. The SF-36 is widely acknowledged as the gold standard generic measure of health status. It is being increasingly used as an outcome measure in a range of clinical trials to determine the effectiveness of interventions. The results of this study highlight some limitations of the SF-36 for this purpose. The marked floor and ceiling effects demonstrated in half of the dimensions, and across the range of disease severity, indicate a limited ability to discriminate between patients with multiple sclerosis at a single point in time. The poor responsiveness of the dimension scores suggests that it is limited in detecting change over time in people with moderate to severe disability. These results highlight the need for “generic” measures to be tested for specific populations and for specific purposes. We suggest that trials evaluating health status in multiple sclerosis should supplement the use of the SF-36 with other relevant and scientifically sound instruments to maximise the validity of health measurement.

