We thank Dr Platt and colleagues for their critical review of our work, especially of the methodology used in this study. It is understandable that comparative studies of treatment effectiveness trigger constructive discussion between industry and academia. We also wholeheartedly agree that rigorous methodology and cautious interpretation of results are mandatory, especially for analyses of observational data.1 2 Therefore, in this letter, we provide additional clarifications in response to the concerns raised.
We appreciate that categories that are underrepresented in multivariable logistic regression models may inflate the estimates of the corresponding coefficients and their variances. Such inflation would, however, result in overly conservative matching rather than the opposite. Because of the caliper, patients with extreme propensity scores cannot be matched to patients within the bulk of the propensity score distribution; such patients were excluded from the matched cohorts.
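The exclusion of extreme propensity scores described above can be sketched as follows. This is only an illustrative greedy 1:1 nearest-neighbour matcher with hypothetical scores and a fixed caliper of 0.1 on the raw propensity score (calipers are often defined on the logit scale instead); it is not the study's actual implementation.

```python
def caliper_match(treated_ps, control_ps, caliper=0.1):
    """Greedy 1:1 nearest-neighbour matching on the propensity score.

    Treated patients whose nearest still-available control lies further
    away than the caliper stay unmatched and are dropped from the matched
    cohort - which is how patients with extreme scores end up excluded.
    """
    available = dict(enumerate(control_ps))  # control index -> score
    pairs, unmatched = [], []
    for t_idx, t_ps in enumerate(treated_ps):
        if not available:
            unmatched.append(t_idx)
            continue
        c_idx = min(available, key=lambda i: abs(available[i] - t_ps))
        if abs(available[c_idx] - t_ps) <= caliper:
            pairs.append((t_idx, c_idx))
            del available[c_idx]  # match without replacement
        else:
            unmatched.append(t_idx)  # extreme score: no control in reach
    return pairs, unmatched

# the patient with propensity score 0.99 finds no control within the caliper
pairs, unmatched = caliper_match([0.30, 0.55, 0.99], [0.32, 0.53, 0.80])
```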
The issue of residual imbalance is important in any non-randomised comparative study. We acknowledge that the standardised mean difference in annualised relapse rates (ARR) between teriflunomide and fingolimod exceeded the nominal threshold of 20%. It is therefore reassuring that the sensitivity analyses in which the residual imbalance fell below the accepted threshold of 20% (patients with prior on-treatment relapses, Cohen’s d 14%; the analysis with no MRI data included, Cohen’s d 16%) confirmed the results of the primary analysis. We chose not to explicitly report the absolute differences in the proportions shown in Table 1; this information is redundant, as the absolute differences can easily be calculated from the proportions in the table.
Dr Platt and colleagues make an important point that the comparison of baseline patient characteristics should account for the weights arising from variable-ratio (one-to-multiple) matching. We have recalculated the differences for the primary analysis and observed that our original standardised mean differences overestimated the true weighted standardised mean differences (see the table below). Reassuringly, the compared groups are in reality more closely aligned than Table 1 in our article would suggest.
TABLE: Recalculated weighted standardised mean differences (Cohen’s d) for continuous baseline characteristics (DMF, dimethyl fumarate; EDSS, Expanded Disability Status Scale)

Characteristic | DMF vs teriflunomide | fingolimod vs DMF | fingolimod vs teriflunomide
Age | 0.006 | 0.01 | 0.02
Disease duration | 0.006 | 0.003 | 0.01
Disability (EDSS) | 0.04 | 0.002 | 0.026
Relapses 12 months pre-baseline | 0.05 | 0.015 | 0.03
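The recalculation described above can be sketched as follows. This is a minimal illustration with hypothetical values, assuming matching weights (e.g. weight 1/n per control in a 1:n matched set) and a pooled-SD denominator; the exact formula used in the letter is not stated.

```python
import math

def weighted_mean(x, w):
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)

def weighted_var(x, w):
    m = weighted_mean(x, w)
    return sum(wi * (xi - m) ** 2 for wi, xi in zip(w, x)) / sum(w)

def weighted_cohens_d(x1, w1, x2, w2):
    """Weighted standardised mean difference (Cohen's d) between two groups.

    With variable-ratio (1:n) matching, giving each matched control a weight
    of 1/n makes every matched set contribute equally to the diagnostic.
    """
    m1, m2 = weighted_mean(x1, w1), weighted_mean(x2, w2)
    pooled_sd = math.sqrt((weighted_var(x1, w1) + weighted_var(x2, w2)) / 2)
    return (m1 - m2) / pooled_sd

# hypothetical baseline ages: a treated group vs controls matched 2:1
d = weighted_cohens_d([38.0, 41.0, 44.0], [1, 1, 1],
                      [37.0, 40.0, 42.0, 45.0], [0.5, 0.5, 0.5, 0.5])
```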
Combining the results of multiple imputation is a standard procedure, inherent in multiple imputation methodology.3 We calculated the mode in order to combine the 17 imputed data sets into one. The resulting variable represents the combined result of the 17 imputed data sets and therefore reflects the values with the greatest support within the imputed data space.
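The combination step described above, taking the modal imputed category for each missing value across the 17 imputed data sets, can be sketched as follows (a minimal illustration with hypothetical values, not the study's code):

```python
from collections import Counter

def combine_by_mode(imputed_values):
    """Collapse one variable's imputed values for a single patient (one per
    imputed data set) into the most frequently imputed category; ties are
    broken by first occurrence."""
    return Counter(imputed_values).most_common(1)[0][0]

# hypothetical categorical MRI variable imputed across 17 data sets
draws = ["active"] * 9 + ["inactive"] * 8
combined = combine_by_mode(draws)
```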
Similar to our previous studies, we chose to match the studied patients on country.4 This is a conservative decision that aims to mitigate systematic differences in patient follow-up (modelled via the surrogate ‘country’ variable). However, there are methods that are considerably more effective in accounting for inter-centre heterogeneity. We took precautions to minimise the heterogeneity in the studied cohort directly – by adjusting for the length and frequency of recorded follow-up as well as its consistency across the participating centres (the requirement of Neurostatus certification at each centre, adjustment for visit frequency, pairwise censoring, and the quality control process). In fact, differences in treating conventions and access to therapies among centres and regions increase the chance that patients will be matched with comparable counterparts who were offered a different therapy for reasons unrelated to their disease severity.
We have calculated ARR in individual patients in order to derive point and interval estimates of the distributions of ARR. The estimates (mean and variance) were weighted for variable one-to-multiple matching and duration of pairwise-censored follow-up. The presented mean ARRs and their 95% confidence intervals are based on this method.
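The weighted point and interval estimation described above can be sketched as follows. This is a hedged illustration: the weights are assumed to encode the variable matching ratio and the follow-up duration, and the use of the Kish effective sample size for the standard error is an assumption made here, since the exact interval formula is not stated in the letter.

```python
import math

def weighted_arr_ci(arrs, weights, z=1.96):
    """Weighted mean of individual annualised relapse rates (ARRs) with a
    normal-approximation 95% confidence interval."""
    wsum = sum(weights)
    mean = sum(w * a for w, a in zip(weights, arrs)) / wsum
    var = sum(w * (a - mean) ** 2 for w, a in zip(weights, arrs)) / wsum
    # effective sample size for weighted data (Kish approximation)
    n_eff = wsum ** 2 / sum(w ** 2 for w in weights)
    se = math.sqrt(var / n_eff)
    return mean, (mean - z * se, mean + z * se)

# hypothetical individual ARRs, weighted by matching ratio and follow-up
mean_arr, (lo, hi) = weighted_arr_ci([0.0, 0.4, 0.2, 0.3], [2.0, 1.0, 1.5, 0.5])
```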
Individual estimation of ARR in a cohort where 75% of patients have a recorded, pairwise-censored follow-up longer than 1 year (with a minimum required follow-up of 6 months) is subject to only a negligible risk of inflation due to short follow-up. More generally, estimation of the ARR of a population from individual observed ARRs in a sample follows standard inferential reasoning within the frequentist framework and enables direct estimation of the ARR in the studied sample. As in our previous studies, we used a negative binomial model to compare the incidence of relapse events throughout the pairwise-censored follow-up between the matched patients.4 5 Here, we used the overall number of relapses from either treatment group and the cumulative follow-up, as appropriate. As stated in the Methods, all analyses used weights to account for the variable matching ratio and either ‘cluster’ or ‘frailty’ terms to account for the paired data structure. We agree that it would have been more accurate to use the term ‘incidence of relapses’ rather than ‘ARR’ in the description of the negative binomial model in the Methods section. Most importantly, all three methods that we employed to evaluate relapse outcomes (the two methods described above and the survival model) showed consistent results in the primary analysis.
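The use of overall relapse counts and cumulative follow-up can be illustrated with a crude incidence rate ratio. This is a deliberately simplified Poisson sketch with hypothetical numbers; the study itself fitted a weighted negative binomial model with a cluster or frailty term for the matched pairs, which this does not reproduce.

```python
import math

def incidence_rate_ratio(events_a, py_a, events_b, py_b, z=1.96):
    """Crude relapse incidence rate ratio from total event counts and
    cumulative person-years, with a log-scale Wald confidence interval
    (Poisson assumption)."""
    irr = (events_a / py_a) / (events_b / py_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lo = math.exp(math.log(irr) - z * se_log)
    hi = math.exp(math.log(irr) + z * se_log)
    return irr, (lo, hi)

# e.g. 100 relapses over 500 patient-years vs 130 over 500 patient-years
irr, ci = incidence_rate_ratio(100, 500, 130, 500)
```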
Appropriately, the authors have closely examined our estimates of variance for ARRs. Unfortunately, the table presented in their communication is difficult to understand. It is unclear what method for the back-calculation of standard deviations from the reported 95% confidence intervals was applied, given that it resulted in two divergent values – reported as ‘SD left’ and ‘SD right’ – for both our study and the cited examples of randomised clinical trials. As described above, all of our analyses used weights to adjust for the variable matching ratio, and these weights were also used to calculate interval estimates for the weighted mean ARRs. It is reasonable to observe variability in the error recorded in each sample, especially in observational studies, which naturally encapsulate more broadly defined cohorts in exchange for greater ecological validity. Furthermore, one would expect the variability in the error to be contingent on the specific selection of study samples as a result of matching to other treatment groups. We agree that the implications of variable standard deviations across studies are of research interest, but it is not clear that their benchmarking against a particular randomised controlled trial is justified.
We thank the authors for pointing out the inconsistency in the reported exposure to therapies between the groups before and after matching. Sixteen patients treated with dimethyl fumarate were previously exposed to teriflunomide as their highest-efficacy treatment, and the corresponding entry in Supplementary Table 6 should be 16 (2%). We apologise for the typographical error.
A well-powered analysis is not a weakness of a study. We are aware that a statistically significant difference and a clinically meaningful difference are complementary but distinct concepts. Studies carried out within the frequentist framework rely on testing the significance of null hypotheses and are destined to provide binary answers – an arbitrary outcome that reflects the framework’s origin in randomised controlled trials. Therefore, a result that rejects a null hypothesis provides its reader with more certainty than a result that fails to do so, even when the difference may be of minimal clinical significance. The role of the clinical readership is to interpret the clinical meaningfulness of the results. We refrained from attempting to develop cut-offs for what represents a clinically meaningful difference and chose to leave this decision to the readers. More robust and intuitive answers to this problem lie in Bayesian methods, of which we are strong proponents, as the interpretation of their results is more intuitive for a clinical reader.
The title of their discussion point suggests that Dr Platt and colleagues consider reports that require further clarification of some of their concepts to be methodologically flawed. In the study under discussion, we utilised the methodology previously used in several studies of comparative effectiveness in the MSBase data set, in particular the comparison of treatment escalation to fingolimod or natalizumab.4 We have systematically addressed the relevant sources of bias, in particular indication bias.1 We take great comfort in the fact that our previous comparative analyses have been highly convergent with the results of pivotal randomised controlled trials.5 6 We have now clarified additional points in response to our colleagues’ review of our article. We value their constructive criticism and believe that our thorough response further strengthens the credibility of the reported results.
1. Kalincik T, Butzkueven H. Observational data: Understanding the real MS world. Mult Scler 2016;22(13):1642-48. doi: 10.1177/1352458516653667 [published Online First: 2016/06/09]
2. Trojano M, Tintore M, Montalban X, et al. Treatment decisions in multiple sclerosis - insights from real-world observational studies. Nat Rev Neurol 2017;13(2):105-18. doi: 10.1038/nrneurol.2016.188 [published Online First: 2017/01/14]
3. Heraud-Bousquet V, Larsen C, Carpenter J, et al. Practical considerations for sensitivity analysis after multiple imputation applied to epidemiological studies with incomplete data. BMC medical research methodology 2012;12:73. doi: 10.1186/1471-2288-12-73 [published Online First: 2012/06/12]
4. Kalincik T, Horakova D, Spelman T, et al. Switch to natalizumab vs fingolimod in active relapsing-remitting multiple sclerosis. Ann Neurol 2015;77:425-35. [published Online First: 2014/12/27]
5. Kalincik T, Brown JWL, Robertson N, et al. Comparison of alemtuzumab with natalizumab, fingolimod, and interferon beta for multiple sclerosis: a longitudinal study. Lancet Neurol 2017;16(4):271-81.
6. He A, Spelman T, Jokubaitis V, et al. Comparison of switch to fingolimod or interferon beta/glatiramer acetate in active multiple sclerosis. JAMA Neurol 2015;72(4):405-13. doi: 10.1001/jamaneurol.2014.4147 [published Online First: 2015/02/11]
We read with interest the article by Kalincik et al. comparing fingolimod, dimethyl fumarate and teriflunomide in a cohort of patients with relapsing-remitting multiple sclerosis (MS). The authors investigated several endpoints and performed various sensitivity analyses, and we commend them for reporting technical details in the online supplementary material. We, however, have some concerns about the design, analysis and reporting of the study.
1. In the primary analyses, three separate propensity score models were developed to construct a matched cohort for each of the three pairwise comparisons. Supplementary Table 6 clearly indicates the existence of zero or low frequencies in some variables (e.g., most active previous therapy and magnetic resonance imaging [MRI] T2 lesions). Yet, those variables were used as covariates in the propensity score models, unsurprisingly resulting in extremely high point estimates and standard errors (SE; as reported in Supplementary Table 7). For example, teriflunomide was not the most active therapy for any patient in the dimethyl fumarate cohort (n=0 from Supplementary Table 6), but that category was nevertheless included in the propensity score model, leading to an unrealistic point estimate of 18.65 with SE of 434.5 (Supplementary Table 7). Even higher SEs (greater than 1000) are observed in the other propensity score models. Propensity scores estimated from these poorly constructed models were then used to create three matched cohorts, which are the basis for the primary analyses in this work. Readers need to be skeptical about any inference (estimated SE of the treatment effect, and consequently the confidence intervals/p-values) made from these cohorts, because of the instability in the propensity score models. Further, while teriflunomide was not the ‘most active previous therapy’ for any patient in the original (unmatched) dimethyl fumarate cohort (n=0 and 0% from Supplementary Table 6), Table 1 reported n=14 (2%) patients with this therapy after matching. Naturally the matched cohort should not produce more patients than originally present in the unmatched cohort for any category.
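The zero-frequency problem described in this point can be checked before fitting the propensity model. The sketch below, with hypothetical labels, flags covariate categories that never occur in one treatment arm; including such categories as indicators in a logistic model yields quasi-separation, with coefficients drifting towards infinity and exploding standard errors (as in the SE of 434.5 noted above).

```python
from collections import Counter
from itertools import product

def zero_cells(treatment, covariate):
    """Return (arm, category) combinations with zero observed patients,
    i.e. the cells that destabilise a logistic propensity score model."""
    counts = Counter(zip(treatment, covariate))
    arms = sorted(set(treatment))
    cats = sorted(set(covariate))
    return [(arm, cat) for arm, cat in product(arms, cats)
            if counts[(arm, cat)] == 0]

# hypothetical data: 'teriflunomide' never appears as prior therapy
# in the DMF arm, so that indicator would cause quasi-separation
empty = zero_cells(
    ["DMF", "DMF", "fingolimod", "fingolimod"],
    ["none", "interferon", "teriflunomide", "none"],
)
```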
2. A threshold lower than 10% or 20% in absolute value for standardized mean differences (i.e., Cohen’s d) is normally used to assess imbalance in baseline covariates. However, in this study, a standardized mean difference of 26% was reported for relapse activity prior to baseline in the fingolimod vs. teriflunomide matched sample (Table 1). Neither the standardized nor the raw difference in proportions was reported for any of the categorical variables in Table 1, even though some of the percentages in the matched cohorts were substantially different (e.g., relapse rate). Large residual differences in the distribution of the covariates (likely due to poorly built propensity score models) will contribute to bias in the resulting estimates. Furthermore, matching by country, a crucial variable which would allow minimizing outcome assessment bias, was not reported in Table 1 but in Supplementary Table 4, which clearly shows that matching by country is far from achieved. Even more important, since the matching was conducted with a variable ratio for the primary analyses, the standardized differences in Table 1 should be replaced with weighted standardized mean or proportion differences to obtain a correct check of residual baseline imbalance after matching.
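The standardized difference in proportions mentioned above can be computed with the formula popularised by Austin (cited below); the numbers here are hypothetical, chosen only to show how a modest raw difference can exceed the 20% threshold.

```python
import math

def std_diff_proportions(p1, p2):
    """Standardised difference for a binary covariate between two groups:
    d = (p1 - p2) / sqrt((p1*(1 - p1) + p2*(1 - p2)) / 2)."""
    return (p1 - p2) / math.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)

# hypothetical proportions of 30% vs 20% give d of about 0.23,
# i.e. above the 20% threshold despite a 10-point raw difference
d = std_diff_proportions(0.30, 0.20)
```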
3. In the primary analysis, missing baseline MRI values were imputed to generate 17 imputed datasets (MRI information was available for only 20-27% of the population as reported in Supplementary Table 6). In a propensity score analysis, multiple imputation (instead of single imputation) would substantially complicate the analysis due to the pooling of estimates from the 17 imputed datasets (using Rubin’s rules). Both Supplementary Tables 7 and 11 included one set of estimates from each stage of the analysis (propensity score analysis and primary analysis of the matched cohorts respectively), making it unclear how the results were pooled. If the results were not pooled and a single imputed dataset was used for the analysis (as suggested by Supplementary Tables 7 and 11), then such a process would fail to account for the uncertainty in the missing values, leading to SEs and p-values that are smaller than expected.
4. We applaud the authors for conducting a series of sensitivity analyses to evaluate the robustness of their findings. However, readers would have more confidence in the findings if the supplementary materials included more details of how those sensitivity analyses were done. For example, when 1:1 matching was done, it is not clear whether and how the authors accounted for the matched-pair design. In particular, despite almost identical sample sizes in some matched cohorts (e.g., comparing the analyses of ‘no MRI data included’ vs. ‘matching on 2-year relapse rate’ for fingolimod vs. dimethyl fumarate in Supplementary Table 11), the high variability in p-values in most cases deserves further explanation.
5. As for the propensity score-adjusted treatment effect analyses, this work claims that individual ARRs were calculated and used in the assessment of the primary endpoint. This approach is controversial. Furthermore, the use of individual ARRs is contradicted in the statistical analysis section, in which the authors state that a weighted negative binomial model accounting for matching was used. It is unclear whether individual ARRs were fed into a negative binomial model and, if they were, the results may be biased. The authors do not make clear whether standard errors and p-values properly accounted for both matching and weighting in all assessed endpoints (they include a cluster term in the negative binomial model, which only accounts for matching). Table 1 presented below reports a back-calculation of the standard deviation (SD) for ARRs, which should correspond to a stable population parameter; the column ‘SD right’ is less prone to the rounding effects present in the original ARR confidence interval values. This standard deviation is benchmarked against a recent work on a new drug for MS. For the OPERA I and OPERA II studies, consistent and stable SD values are obtained (around 1), while highly inconsistent and underestimated SDs are obtained for the MSBase study, especially where the weighting scheme should have attributed, within the matched group, a weight of 1 to the treatment arm represented by a single patient. Our Table 1 shows that the reported standard errors are incorrect (i.e., generally smaller than they should be) owing to a wrong weighting scheme or a failure to account properly for weighting; consequently, the significance of p-values has been dramatically inflated.
6. The large number of patients included renders statistically significant a very small and perhaps not clinically meaningful difference, increasing the risk of overinterpretation of the results. From a clinical standpoint, an ARR difference between 0.20 and 0.26 – that is, an ARR ratio of 0.80 or, more intuitively, 1 relapse over 5 years vs. 1 relapse over roughly 4 years – is close to negligible overall. This represents an effect size that no future trial would likely be powered, or interested, to detect, especially as it comes from an ARR threshold (0.20) quite prone to noise in the detection of a relapse.
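The arithmetic behind this illustration can be checked directly (a worked restatement of the numbers above, not new data; the 0.80 ratio and the "1 relapse over 4 years" figure are roundings of these quantities):

```python
# ARR values quoted in the point above
arr_low, arr_high = 0.20, 0.26

ratio = arr_low / arr_high              # about 0.77, quoted as roughly 0.80
years_per_relapse_low = 1 / arr_low     # 5.0 years between relapses
years_per_relapse_high = 1 / arr_high   # about 3.8 years between relapses
```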
To recapitulate, we wish to highlight the need for caution while interpreting the findings of this paper. Real world evidence (RWE) is an important and necessary component of research to assess the effectiveness and safety of various therapies outside the context of randomized clinical trials. However, because RWE is prone to various sources of bias, rigorous and careful analysis, interpretation and reporting are needed to ensure that results are reliable, reproducible and useful to inform clinical decision making.
Kalincik T, et al. Comparison of fingolimod, dimethyl fumarate and teriflunomide for multiple sclerosis. Journal of Neurology, Neurosurgery & Psychiatry 2019: jnnp-2018-319831. doi: 10.1136/jnnp-2018-319831
Bovis F, et al. Expanded disability status scale progression assessment heterogeneity in multiple sclerosis according to geographical areas. Ann Neurol 2018;84(4):621-625.
Austin PC. Assessing balance in measured baseline covariates when using many-to-one matching on the propensity-score. Pharmacoepidemiology and Drug Safety 2008;17(12):1218-1225.
Suissa S, et al. Statistical treatment of exacerbations in therapeutic trials of chronic obstructive pulmonary disease. American Journal of Respiratory and Critical Care Medicine 2006;173(8):842-846.
Hauser SL, Bar-Or A, Comi G, et al; OPERA I and OPERA II Clinical Investigators. Ocrelizumab versus interferon beta-1a in relapsing multiple sclerosis. N Engl J Med 2017;376(3):221-234.
Table 1 – Back-calculation of ARR standard deviation to benchmark flaws in the reported standard errors and p-values

Study | Drug | n | ARR | ARR 95% CI lower | ARR 95% CI upper | SD left | SD right
OPERA I | Ocrelizumab | 410 | 0.16 | 0.12 | 0.20 | 1.29 | 1.00
OPERA I | Interferon | 411 | 0.29 | 0.24 | 0.36 | 0.85 | 0.97
OPERA II | Ocrelizumab | 417 | 0.16 | 0.12 | 0.20 | 1.30 | 1.01
OPERA II | Interferon | 418 | 0.29 | 0.23 | 0.36 | 1.05 | 0.98
Kalincik, comparison 1 (variable matching ratio 2:1) | DMF | 470 | 0.19 | 0.15 | 0.23 | 1.14 | 0.92
Kalincik, comparison 1 (variable matching ratio 2:1) | Teriflunomide | 355 | 0.22 | 0.18 | 0.26 | 0.84 | 0.70
Kalincik, comparison 2 (variable matching ratio 4:1) | Fingolimod | 910 | 0.18 | 0.16 | 0.21 | 0.79 | 1.03
Kalincik, comparison 2 (variable matching ratio 4:1) | Teriflunomide | 403 | 0.24 | 0.21 | 0.27 | 0.59 | 0.52
Kalincik, comparison 3 (variable matching ratio 5:1) | Fingolimod | 1825 | 0.20 | 0.19 | 0.22 | 0.49 | 0.90
Kalincik, comparison 3 (variable matching ratio 5:1) | DMF | 672 | 0.26 | 0.24 | 0.28 | 0.46 | 0.43
ARR: Annualized relapse rate; SD: standard deviation; DMF: dimethyl fumarate; CI: confidence interval
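One common way to back-calculate an SD from a reported 95% confidence interval of a mean is sketched below. This is only a plausible reconstruction under a normal approximation: the table above does not state its exact method, and trial ARR intervals are typically model-based (negative binomial, on the log scale), so this sketch need not reproduce the tabulated ‘SD left’ and ‘SD right’ values.

```python
import math

def sd_from_ci(n, lower, upper, z=1.96):
    """Back-calculate a standard deviation from a normal-approximation
    95% CI of a mean: SE = (upper - lower) / (2 * z); SD = SE * sqrt(n)."""
    se = (upper - lower) / (2 * z)
    return se * math.sqrt(n)

# e.g. a hypothetical arm with n = 400 and an ARR 95% CI of (0.12, 0.20)
sd = sd_from_ci(400, 0.12, 0.20)
```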
Robert W. Platt, Department of Epidemiology, Biostatistics, and Occupational Health, McGill University, 1020 Pine Ave W, Montreal, Quebec H3A 1A2, Canada.