Background Huntington’s disease (HD) is caused by a CAG repeat expansion mutation in the huntingtin gene. The age of onset of HD is largely determined by the length of the repeat expansion; likewise, the complex though characteristic phenotypic signature of HD is influenced by the size of the CAG repeat expansion. We asked whether genotype and the size of the CAG expansion can be predicted from phenotypic data.
Aims We aimed to develop a predictive model for HD using machine learning techniques. The question was to what extent, and with what precision, the size of the CAG expansion mutation in HD gene expansion mutation carriers (HDGECs) can be predicted from cross-sectional data sets generated through the Enroll-HD study.
Methods As a modelling technique we used Gradient Boosted Trees (GBT) as implemented in the XGBoost library. This computational approach was applied to the Enroll-HD database, periodic public release R2. The Enroll-HD R2 database incorporates data from a total of 4146 subjects, including 2295 manifest HD subjects, 880 pre-manifest HDGECs and 971 healthy controls. We evaluated model performance by 10-fold cross-validation, using root mean squared error (RMSE) as an average quality measure and the 50th and 95th percentiles of the absolute errors (P50, the median error, and P95) as more detailed measures. While pre-processing the data we excluded all variables explicitly connected to the age of onset, including age of HD onset in affected parent(s) and age of symptom onset in a given participant. Overall, 292 variables were analysed for each subject.
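The evaluation scheme described above can be sketched as follows. This is a minimal, standard-library illustration rather than the study's code: the function names (`percentile`, `error_metrics`, `k_fold_indices`, `cross_validate`) are hypothetical, the percentile uses linear interpolation between ranks (the abstract does not say which definition was used), and the per-fold summary as mean and standard deviation is only an assumption about how the reported ± values might be obtained. The mean-predicting placeholder stands in for the GBT model, which would be plugged in through `fit` and `predict`.

```python
import math
import random
import statistics


def percentile(values, q):
    """q-th percentile (0..100), linear interpolation between ranks (assumed definition)."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def error_metrics(y_true, y_pred):
    """RMSE plus the 50th and 95th percentiles of the absolute errors."""
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    rmse = math.sqrt(sum(e * e for e in abs_err) / len(abs_err))
    return {"rmse": rmse, "p50": percentile(abs_err, 50), "p95": percentile(abs_err, 95)}


def k_fold_indices(n, k=10, seed=0):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, summarise each metric as (mean, SD)."""
    per_fold = []
    for test_idx in k_fold_indices(len(y), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held_out]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        y_hat = predict(model, [X[i] for i in test_idx])
        per_fold.append(error_metrics([y[i] for i in test_idx], y_hat))
    summary = {}
    for name in ("rmse", "p50", "p95"):
        scores = [m[name] for m in per_fold]
        summary[name] = (statistics.mean(scores), statistics.stdev(scores))
    return summary


# Placeholder model (NOT the study's GBT): always predicts the training mean.
def fit_mean(X_train, y_train):
    return statistics.mean(y_train)


def predict_mean(model, X_test):
    return [model] * len(X_test)
```

Calling `cross_validate(X, y, fit_mean, predict_mean)` on a feature list `X` and CAG lengths `y` returns a dictionary mapping each metric to a (mean, SD) pair across the ten folds; a gradient boosted regressor could replace the placeholder without changing the evaluation code.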
Results GBT predicted the length of the CAG repeat with the following precision (errors in CAG repeats): for manifest HD subjects, RMSE = 1.77 ± 0.09, P50 = 0.86 ± 0.05, and P95 = 3.77 ± 0.30; for pre-manifest HDGECs, RMSE = 2.10 ± 0.09, P50 = 1.16 ± 0.05, and P95 = 4.45 ± 0.22.
Conclusions The computational analysis of phenotypes using GBT predicted the size of the CAG repeat expansion relatively accurately: the absolute error was about 1 CAG repeat or less for 50% of cases and about 4 CAG repeats or fewer for 95% of cases, both in manifest HD patients and in pre-manifest HDGECs. The results suggest that (1) the autosomal dominant HD mutation can be predicted from complex phenotypic data and that (2) the clinical phenotypic data collected during Enroll-HD visits are sufficient to infer a causative genetic insult from its phenotypic consequences.
- predictive model
- machine learning