Leveraging Machine Learning to Identify Familial Hypercholesterolemia

Machine-learning approaches detected familial hypercholesterolemia (FH) with high accuracy, according to a study published in NPJ Digital Medicine.

In this study, data from 4,027,775 patients in the Clinical Practice Research Datalink, a database of electronic medical records from 836 practices in the United Kingdom, were analyzed. Patients aged >16 years who had a record of total cholesterol measured between 1999 and 2019 were included. Documented FH diagnoses (n=7,928) were predicted using 45 diagnostic features included in the medical records. Individuals were randomly split into training (n=3,020,832) and validation (n=1,006,943) cohorts. The 5 algorithms tested were logistic regression, random forest, gradient boosting, deep-learning neural networks, and ensemble learning.
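The reported cohort sizes correspond to roughly a 75%/25% split between training and validation. As a minimal sketch of such a random split, the Python below shuffles a list of patient identifiers and cuts it at a given fraction. The function name, the seed, and the 0.75 default are assumptions for illustration (the fraction is inferred from the cohort sizes above); this is not the study authors' code.

```python
import random


def random_split(patient_ids, train_fraction=0.75, seed=0):
    """Randomly split identifiers into training and validation lists.

    train_fraction=0.75 is an assumption inferred from the reported
    cohort sizes; the seed is arbitrary and only makes the split repeatable.
    """
    ids = list(patient_ids)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    rng.shuffle(ids)
    cut = round(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


# Share of patients assigned to training, from the reported cohort sizes.
reported_train_fraction = 3_020_832 / 4_027_775
```

Every patient ends up in exactly one cohort, so model performance on the validation cohort is measured on records the models did not see during training.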

The strongest indicators overall of FH were cholesterol concentrations and family history.

The logistic regression algorithm identified highest low-density lipoprotein cholesterol, highest total cholesterol, and baseline statin potency as the best predictors of FH. The random forest and boosting algorithms identified similar features, which included current statin potency, triglyceride concentrations, body mass index, and systolic blood pressure.

The deep-learning algorithm identified exclusion features or secondary causes of increased cholesterol such as kidney disease, diagnosis or family history of coronary heart disease, diabetes, and tendon xanthomata.

When the models developed in the training cohort were applied to the validation cohort, logistic regression was the weakest performer (area under the curve [AUC] c-statistic, 0.812). All other models performed similarly (AUC c-statistic: ensemble, 0.890; random forest, 0.891; boosting, 0.892; deep learning, 0.892).
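The c-statistic can be read as the probability that a randomly chosen patient with FH receives a higher predicted score than a randomly chosen patient without FH. The Python sketch below illustrates that definition with a direct pairwise count. The function name and the toy scores are hypothetical and are not taken from the study; for cohorts of this size a rank-based computation would be far more practical than comparing every pair.

```python
def c_statistic(scores, labels):
    """Return the AUC c-statistic by comparing every case/non-case pair.

    scores: predicted risk for each patient (higher means more likely FH)
    labels: 1 for a documented FH diagnosis, 0 otherwise
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    non_cases = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not non_cases:
        raise ValueError("need at least one case and one non-case")

    concordant = 0.0
    for case_score in cases:
        for non_case_score in non_cases:
            if case_score > non_case_score:
                concordant += 1.0   # case ranked above non-case
            elif case_score == non_case_score:
                concordant += 0.5   # ties count as half
    return concordant / (len(cases) * len(non_cases))


# Toy data (hypothetical): two non-cases followed by two cases.
toy_auc = c_statistic([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect separation, which is the scale on which the reported values of 0.812 to 0.892 sit.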

The positive likelihood ratio (LR+) indicates how much more likely a positive test result is in a patient with FH than in a patient without FH. The ensemble approach (LR+, 45.5; 95% CI, 42.4-48.9) had the highest LR+, followed by boosting (LR+, 14.0; 95% CI, 13.5-14.5), logistic regression (LR+, 11.3; 95% CI, 10.7-12.0), random forest (LR+, 8.7; 95% CI, 8.4-8.9), and deep learning (LR+, 7.2; 95% CI, 7.0-7.4).

The negative likelihood ratio (LR-) indicates how much more likely a negative test result is in a patient with FH than in a patient without FH, so lower values are better. The deep learning approach had the lowest LR- (0.31; 95% CI, 0.28-0.33), followed by random forest (LR-, 0.34; 95% CI, 0.31-0.36), boosting (LR-, 0.44; 95% CI, 0.41-0.46), logistic regression (LR-, 0.65; 95% CI, 0.62-0.67), and ensemble learning (LR-, 0.70; 95% CI, 0.68-0.72).
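Both likelihood ratios follow from sensitivity and specificity under the standard definitions LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. The Python sketch below computes them from confusion-matrix counts and converts a pre-test probability into a post-test probability through odds. The function names are hypothetical, and the study's own counts are not given in this summary.

```python
def likelihood_ratios(tp, fp, fn, tn):
    """Return (LR+, LR-) from confusion-matrix counts.

    Raises ZeroDivisionError if specificity is exactly 0 or 1; a real
    analysis would need to handle those edge cases explicitly.
    """
    sensitivity = tp / (tp + fn)   # share of FH cases flagged positive
    specificity = tn / (tn + fp)   # share of non-cases flagged negative
    lr_positive = sensitivity / (1 - specificity)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Update a pre-test probability with a likelihood ratio via odds."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)
```

As an illustrative calculation only, not a figure reported by the study: if the pre-test probability is taken to be the share of documented FH diagnoses in the whole dataset (7,928 of 4,027,775, about 0.2%), an LR+ of 45.5 corresponds to a post-test probability of roughly 8%. This is why a high LR+ is valuable for a rare condition, and also why a positive result would still call for confirmatory assessment.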

Study limitations include a risk for information bias, as is the case for any study of large aggregates of health care data, and large amounts of missing data, both of which may have affected the predictive models. It also remains unclear how predictors identified using machine-learning approaches would be useful in clinical practice.

The study authors concluded this analysis was successful at identifying possible predictive features of FH.

Disclosure: Multiple authors declared affiliations with industry. Please refer to the original article for a full list of disclosures.


Akyea RK, Qureshi N, Kai J, et al. Performance and clinical utility of supervised machine-learning approaches in detecting familial hypercholesterolaemia in primary care. NPJ Digit Med. 2020;3:142. doi:10.1038/s41746-020-00349-5