Diversity in population cohorts: the elephant(s) in the room?

The neurosciences have now entered the era of big data, aggregating measurements from ever more sites and countries. Openly shared data repositories of human brain structure, connectivity or task responses, as well as participants’ genetics, environment and lifestyle, pave the way for tackling new research questions and developing new analytical methods1,2. The UK Biobank, the world’s largest biomedical dataset, collates deep genetic, brain and phenomic data from ~500,000 individuals. Its measurements also cover cognitive function, medical history, sociodemographic data, and lifestyle and physical measures3. In addition, there are large-scale initiatives such as the Human Connectome Project, the Adolescent Brain Cognitive Development (ABCD) Study, the Enhancing NeuroImaging Genetics through Meta-Analysis (ENIGMA) Consortium and the Healthy Brain Network. However, using these emerging large datasets does not provide a free ticket to neuroscience insights or clinical translation of predictive models.

Here we argue that placing a premium on population diversity has the potential to reveal untapped and more reproducible insights into the human brain. Today, we are in a better position than ever to fully attend to major sources of population stratification that are routinely inaccessible in smaller datasets, cohorts or initiatives.

Eclipsing population stratification: emerging challenges

Parallel to burgeoning neuroscience databases, there has been a growing trend in designing neuroimaging-based prediction models4. These complex, often nonlinear models are poised to provide insights into old challenges in neuroscience in particular and biology in general5. Ensuing predictive models require large datasets to enable accurate predictions in the broader population6. However, acquiring more participants does not always result in deeper neuroscientific insight that could be ultimately translated into everyday life. Concretely, decisions from diagnostic models based on a limited spectrum of an actual patient population and diseases can potentially lead to dangerous consequences in everyday clinical practice, such as inaccurate mortality risk prediction in pneumonia patients7.

Dataset shift is one source of these inaccuracies and mispredictions. In machine learning terminology, dataset shift is at play when the joint distribution of the features and outputs differs between the training and target data. As an example from medicine, predictive models may fail on new participants with differences in biological or cognitive backgrounds such as demographics, handedness, dyslexia diagnosis, disease prevalence or treatment response. In fact, dataset shift is a serious contender for the most critical limitation of genetics in precision medicine8. This challenge is rooted in the overwhelming abundance of studies on participants of European descent, leading to a dearth of well-powered studies in globally diverse populations. Specifically, more than three out of four participants in the thousands of existing genome-wide association studies (GWAS) are of European descent8. However, this ethnicity makes up only 16% of the world’s population (Fig. 1a). Hence, polygenic risk scores — phenotype predictions based on tens of thousands of common genetic variants — estimate individual risk far more accurately in Europeans than in most non-European groups. In a study quantifying the difference in genetic prediction accuracy across 17 anthropometric and blood-panel traits in non-European versus European participants, prediction accuracy relative to Europeans was 38% lower in Hispanic or Latino/Latina Americans, 38% lower in South Asians, 50% lower in East Asians, and 78% lower in Africans on average (Fig. 1b)8. These results point to a systemic generalization failure attributable partly to the dominance of participants of European descent in genetic studies.
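The definition of dataset shift given above can be made concrete with a small simulation. The following is a minimal sketch, assuming a purely synthetic setting: the two cohorts, their intercepts, slopes and noise levels are hypothetical values chosen for illustration and are not taken from any study cited here. A linear model is fitted on a training cohort and then evaluated on a target cohort whose joint distribution of feature and outcome differs.

```python
import random


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def mean_squared_error(intercept, slope, xs, ys):
    """Average squared prediction error of the fitted line."""
    return sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def simulate_cohort(n, intercept, slope, feature_mean, rng):
    """Draw a synthetic cohort: one feature and one noisy outcome per participant."""
    xs = [rng.gauss(feature_mean, 1.0) for _ in range(n)]
    ys = [intercept + slope * x + rng.gauss(0.0, 0.5) for x in xs]
    return xs, ys


rng = random.Random(0)

# Hypothetical training cohort (assumed parameters, for illustration only).
train_x, train_y = simulate_cohort(2000, intercept=0.0, slope=2.0, feature_mean=0.0, rng=rng)

# Hypothetical target cohort: both the feature distribution and the
# feature-outcome relationship differ, so the joint distribution has shifted.
target_x, target_y = simulate_cohort(2000, intercept=1.5, slope=1.0, feature_mean=1.0, rng=rng)

a, b = fit_line(train_x, train_y)
mse_train = mean_squared_error(a, b, train_x, train_y)
mse_target = mean_squared_error(a, b, target_x, target_y)

print(f"error on training cohort: {mse_train:.2f}")
print(f"error on shifted target cohort: {mse_target:.2f}")
```

In this toy setting, enlarging the training cohort would not close the gap, because the error on the target cohort stems from the difference between the two distributions rather than from estimation noise.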

Fig. 1: Dominance of European participants in GWAS degrades phenotype prediction performance in individuals from non-European communities.

a, The ancestry of GWAS participants over time and its comparison with the global population. b, Prediction accuracy of 17 quantitative anthropometric and blood-panel traits relative to European-ancestry individuals across five continental populations in the UK Biobank. Violin plots show distributions of relative prediction accuracies; center dots, mean values; whiskers, s.e.m. Adapted with permission from ref. 8, Springer Nature.

Failures in single-subject prediction caused by dataset shift cannot be remedied by recruiting more participants. Prioritizing diversity instead of enforcing sample homogeneity showed early promise for polygenic risk scores8. By respecting population stratification, we can pinpoint genomic variants that are rare or absent in European populations. Diversity-aware analyses have already yielded insights, including genetic risk variants linked to type 2 diabetes in the Latino/Latina population9 or prostate cancer in African men10. These examples highlight the value of increased diversity in GWAS participants to propel genetic discovery, enhance understanding of genetic diseases, and refine medical care tailored to single patients.

Incomplete knowledge of how predictive models perform in distinct subpopulations hampers neuroscience, genetics and other biomedical research areas. We need further research benchmarking predictive models in minority populations since overfitting to narrow subpopulations increases structural racism and ultimately hurts the quality of patient care. Ethnicity and population diversity are closely interlocked with the genesis and pathophysiology of major brain disorders such as autism spectrum disorder (ASD), schizophrenia and Alzheimer’s disease. These medical conditions exhibit subpopulation-related differences in prevalence, symptoms and treatment response. Alzheimer’s disease is more prevalent among African Americans and Hispanics than white Americans in the United States, with estimates ranging from 14% to almost 100% higher11. Moreover, women are more often diagnosed with Alzheimer’s disease than men12. Furthermore, in schizophrenia, men and women tend to diverge in several clinical parameters, including onset age, symptoms, disease severity and treatment responses13. Similarly, ASD cohorts typically have 3–5 times more male than female members14. Consequently, major brain disorders need to be investigated in a diverse pool of participants. These disorders have complex mechanisms that are, in many cases, interlocked with sex, age, ethnicity and potentially many other social identity factors.

A commitment to illuminating disease mechanisms through the prism of population stratification is even more critical in the aftermath of COVID-19. Demographic status has been linked with outcomes of this public-health crisis (Fig. 2a). Many lines of evidence suggest that we will face more mental health concerns due to chronic social isolation and stress15. Recent evidence documented the detrimental effect of COVID-19 on minority and marginalized strata of the US population16. Charting patterns across >17,000 candidate variables describing the ABCD population cohort, social determinants of inequity, including household income and immigration status, emerged as the primary determinants of negative pandemic experiences (Fig. 2b). Thus, COVID-19 effects can differ by sociological strata (Fig. 2c). That is why modeling the burden on population strata, especially on minority and marginalized racial and ethnic populations, should be part of the first-line approach in analyses of epidemic-related outcomes.

Fig. 2: Major sources of population stratification are entangled with COVID-19 outcomes across ~10,000 US families.

a, Multivariate pattern-learning analysis of how pre–COVID-19 family measures systematically covary with COVID-19 experiences across >17,000 candidate indicators. In the ABCD cohort, household income (x axis) was negatively associated with subsequent COVID-19–associated experiences (y axis) during the pandemic (COVID-19 impact scores). Modeling results are displayed in a hexagonal binning plot: the color scale reflects the number of families with the same baseline COVID-19 association. b, Household environment variable loadings from the multivariate pattern analysis. The bar plot reflects the top-ranked household characteristics in families with the primary explanatory mode of the model (pink, positive associations; purple, negative associations). c, A circular bar plot summarizing experienced racism, sleep hygiene and social media consumption associations with the secondary explanatory mode. Panels a–c were plotted from data associated with ref. 16.

Statistically, ignoring population stratification can lead to a phenomenon called Simpson’s paradox or reversal: a trend appears in several different groups of individuals, but disappears or inverts when the groups are combined. Thus, a treatment that appears effective at the population level may have adverse consequences within specific population subgroups. For instance, a higher drug dosage may appear to be associated with higher recovery rates at the population level. However, within specific population strata, a higher drug dosage may actually result in lower recovery rates. As a real example from COVID-19, stratified into age groups, Belgian men within any age group had a higher COVID-19 infection fatality rate than women. However, in the total population of Belgium, women appeared to show a higher rate. This discrepancy is explained by the fact that there are considerably more older women than men in Belgium17. Ignoring such implications of Simpson’s paradox can generate misleading conclusions, which can be dangerous, such as false claims of vaccine inefficacy. Concurrently, rarely attended dimensions of population diversity may uncover instances of Simpson’s paradox that we are unaware of today.
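The reversal described above can be reproduced with simple arithmetic. The sketch below uses made-up counts, not the Belgian data from ref. 17: the age groups, death counts and infection counts are hypothetical and chosen only so that the pattern mirrors the one in the text, with men showing a higher fatality rate within each age group while women show a higher rate once the groups are pooled.

```python
# Hypothetical (deaths, infections) counts per age group and sex.
# These numbers are invented for illustration and are NOT the Belgian data.
counts = {
    "younger": {"men": (10, 1000), "women": (5, 1000)},
    "older": {"men": (40, 200), "women": (90, 600)},
}


def fatality_rate(deaths, infections):
    """Infection fatality rate as a simple proportion."""
    return deaths / infections


# Rates within each age stratum.
stratum_rates = {
    group: {sex: fatality_rate(*pair) for sex, pair in by_sex.items()}
    for group, by_sex in counts.items()
}


def pooled_rate(sex):
    """Rate after combining all age groups for one sex."""
    deaths = sum(counts[group][sex][0] for group in counts)
    infections = sum(counts[group][sex][1] for group in counts)
    return fatality_rate(deaths, infections)


pooled = {sex: pooled_rate(sex) for sex in ("men", "women")}

for group, rates in stratum_rates.items():
    print(f"{group}: men {rates['men']:.3f}, women {rates['women']:.3f}")
print(f"pooled: men {pooled['men']:.3f}, women {pooled['women']:.3f}")
```

The reversal arises because, in these invented counts, most infected women fall into the older, higher-risk stratum, which is the same mechanism the text attributes to the age structure of the Belgian population.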

Attending to dimensions of population diversity does not reduce to including more variables in the statistical model to be estimated. If the goal is to estimate causal effects, the selection of variables or model specification needs to be based on causal grounds. Investigators must propose and defend a plausible causal structure spelling out the assumed (directional) dependencies among the outcome, input variables and relevant confounding variables, including diversity factors18. Establishing an assumed causal structure at the beginning of a research endeavor requires consideration of aspects outside the dataset at hand, which can often be challenging. Moreover, the ground-truth causal structure involving some variables, such as socioeconomic status (SES), may be particularly daunting. SES is likely interwoven with inter-individual differences in brain function and susceptibilities to mental and other illnesses throughout the lifespan. Nevertheless, the differences in SES are also likely to arise, at least partly, from differences in brain function and mental illness. Thus, including such causally ambiguous diversity dimensions can lead to deceiving estimates in quantitative models.

Adding as many diversity variables to a statistical model as possible typically makes it more difficult, rarely simpler, to discern what the resulting model estimates actually mean. The ensuing ‘causal salad’ refers to the consequences of adding numerous control variables without the necessary attention to causal structure19. A growing set of ‘control variables’ is an invitation to erroneous causal inference. Statistically controlling for inappropriately picked variables can result in collider bias, in which the true relationships between input variables and the target phenotype are distorted. A collider is a variable that is affected by both the input variable and the outcome. For example, seemingly lower mortality was observed for overweight individuals compared with those with average body mass index when controlling for cardiovascular disease (collider). However, increased body mass index is associated with shorter life expectancy. This distortion of the underlying directed causal graph arises because being overweight was associated with a substantially increased risk of developing cardiovascular disease at an earlier age. This, in turn, results in a greater proportion of life with cardiovascular disease morbidity20.
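Collider bias can be illustrated with a toy simulation. The sketch below is not a model of the body mass index example; it assumes a generic, hypothetical setting in which an exposure and an outcome are generated independently and a third variable is affected by both. Restricting the analysis to one level of that third variable, which is a crude stand-in for statistically controlling for it, induces an association that does not exist in the full sample.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient computed from first principles."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)
n = 20000

# Exposure and outcome are independent by construction (hypothetical variables).
exposure = [rng.gauss(0.0, 1.0) for _ in range(n)]
outcome = [rng.gauss(0.0, 1.0) for _ in range(n)]

# The collider is affected by both the exposure and the outcome.
collider = [x + y + rng.gauss(0.0, 0.5) for x, y in zip(exposure, outcome)]

# Association in the full sample: close to zero, as expected.
r_all = pearson(exposure, outcome)

# Association after conditioning on the collider (keeping only high values).
selected = [(x, y) for x, y, c in zip(exposure, outcome, collider) if c > 1.0]
r_selected = pearson([x for x, _ in selected], [y for _, y in selected])

print(f"correlation in full sample: {r_all:.2f}")
print(f"correlation after conditioning on the collider: {r_selected:.2f}")
```

Among participants with a high collider value, a low exposure must be compensated by a high outcome and vice versa, which produces a spurious negative association even though neither variable influences the other.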

Consequences of the diversity blind spot

Respecting distinct population strata can unmask aspects of brain function and organization. For example, existing neuroscientific research emphasizes the importance of handedness in face recognition or language processing. Brain studies that explicitly targeted handedness as a central axis of biology highlighted differences in the functioning of the face-processing network. Based solely on neuroimaging studies that investigated only right-handed participants, face perception was thought to be highly lateralized to the right hemisphere. Therefore, even though earlier investigations deliberately omitted ~10% of the population, the right-hemisphere lateralization of face-sensitive brain areas made it into many neuroscience textbooks21. Only recently have studies on handedness in face recognition tasks found that the fusiform face area is commonly lateralized to the right hemisphere in right-handers, whereas asymmetric hemispheric lateralization appears absent in left-handed subpopulations (Fig. 3a)22.

Fig. 3: Handedness shapes brain function depending on cultural and geographic diversity.

a, Hemispheric lateralization in face perception depends on handedness. The bar plots show the extent of activation response (in the number of voxels on the y axis) in the four investigated brain areas when participants viewed faces or bodies compared with the extent of activation when viewing chairs. An asterisk indicates statistically significant differences between activity in the left versus the right hemisphere. FFA, fusiform face area; EBA, extrastriate body area; FBA, fusiform body area; hMT, human motion area MT. b, Language- and visuospatial processing-sensitive neural activations in left-handers with typical (left) and atypical (right) language lateralization. Regions activated during a word-generation task are shown in green. Regions activated during a visuospatial attention task are shown in blue. Panels a,b adapted with permission from ref. 21. c, Handedness as an example of a major source of population diversity, impacting brain and behavior, ties into cultural and geographical differences due to varying societal pushback. The handedness occurrence in the United States is plotted across time. The bar plot highlights the ten jurisdictions with the highest and the five jurisdictions with the lowest left-handedness rates. Panel c plotted from data associated with ref. 35.

The interdependence between handedness and interindividual differences in hemispheric specialization has also been observed in language processing. Left-handers show more bilateral language processing while right-handers show language processing lateralized to the left hemisphere (Fig. 3b)23. Only a tiny fraction of right-handers (4%) exhibit right-hemispheric dominance of the language network. Yet this share increases to at least 27% in left-handers. Consequently, the whole spectrum of lateralization-related variation will remain hidden as long as we neglect explicit modeling of hand dominance.

Excluding left-handers in neuroscientific research may not be a coincidence. Handedness has been systematically ignored throughout the history of medical studies, on grounds such as to ‘reduce noise’24. However, left-handers make up ~10% of the general population and constitute >800 million people on the planet (Fig. 3c). While the neuroscience field has been in the habit of eclipsing population strata with left-handers, several societies re-educate left-handers early on as children. Handedness rates thus depend on geographical location and historical period. It is estimated25 that in the United States, only about 3–4% of individuals born before 1920 developed as left-handed, compared with about 11–12% of those born after 1950. The conversion rate can further depend on biological sex. In Japan, the proportion of females forced to convert to the more common right-handedness is much higher than that of males (95.1% versus 81.0%)26. Finally, in many African cultures, using the left hand is considered disrespectful and rude, which may be why only 7.9% of people in Abidjan, Ivory Coast, and 5.1% of people in Khartoum, Sudan, are left-handed27. Bias and unrepresentative participant samples led to several erroneous articles, including claims of a reduced lifespan of left-handers28, that shaped public opinion. In summary, cultural and sociological factors can drive inter-individual differences in behavior and its brain basis, which need to be accounted for in population-scale neuroscience studies.

In addition to left-handers being sometimes seen as ‘non-normal’ or needing correction, women were also less often recruited as participants in neuroscientific research over decades. Fortunately, public health recommendations are no longer made on purely male-based datasets such as the Baltimore Longitudinal Study of Aging. This study, which began in 1958 and explored ‘normal human aging’, did not enroll any women for the first 20 years of its execution29. Furthermore, the Physicians’ Health Study concluded in 1989 that daily aspirin might reduce the risk of heart disease, based on 22,071 men and 0 women30.

Similarly to left-handedness, homosexuality was listed as a mental disorder in the Diagnostic and Statistical Manual of Mental Disorders until 1973. The notion of a disorder can hence be a product of the zeitgeist, scaffolded by historical and social conventions. More broadly, some investigators employ the term ‘neurotypicality’ or ‘normative brain’ to describe someone with the brain functions, behaviors and processing considered standard or average. However, studying neurotypicality may go against efforts to embrace diversity. Instead of hunting the illusion of a normal or typical brain, future neuroscience research may benefit from acknowledging key dimensions of ‘neurodiversity’. In a broader context, the extended notion calls for the appreciation of all brains, including those with dyslexia, dyspraxia, dyscalculia, synesthesia, attention-deficit/hyperactivity disorder and ASD. Rich dimensions of population diversity should be respected as a natural form of human variation.

The future starts today, not tomorrow

The diversity dimensions available in a dataset play a crucial role, which can be demonstrated by the dependence of machine learning algorithm success on population diversity (Fig. 4)31. In the biomedical sciences, it is often challenging to know which aspects of human diversity drive differences in a biomedical dataset or research question before they are studied in the context of an actual phenotype of interest. That is, even given a single dataset or an identical participant cohort, the most relevant diversity dimensions may depend on the research goals of a particular investigator. Opting for pertinent dimensions can build on prior evidence of the dimensions' implication in the target phenotype under study. However, there are a large number of potentially relevant variables that could be considered to capture aspects of diversity. Operationally, we can only explicitly model those dimensions of diversity available to the investigator. The UK Biobank, ABCD and other population datasets with deep phenotyping and broad participant recruitment may offer early opportunities to confront such questions.

Fig. 4: Population diversity is linked to more brittle predictive patterns especially in the higher association cortex.

a, Relationship between diversity (prediction out of distribution (OOD) versus within distribution (WD)) and prediction accuracy of classifying diagnosis with ASD using functional connectivity profiles. Prediction accuracy is based on AUC (top) and F1 score (bottom) and in two different population cohorts: Autism Brain Imaging Data Exchange (ABIDE, left columns) and Healthy Brain Network (HBN, right columns). Diversity here denotes the mean absolute difference in propensity scores between the participants of the training set and those in the held-out strata with unseen participants. The second column displays the performance for every single stratum in the holdout strata based on a tenfold cross-validation strategy. b, Anatomical hot spots where the machine-learning-extracted predictive rules are most at risk of becoming brittle in the face of diversity. Each scatterplot depicts the relationship between network-aggregated coefficients of the predictive model and the level of population diversity. Seven scatterplots correspond to model coefficients averaged for each of the seven intrinsic connectivity networks. Brain rendering indicates regions whose coefficients show a significant association with diversity. ASD, autism spectrum disorder; AUC, area under the curve; DAN, dorsal attention network; DMN, default mode network; FPN, frontoparietal network; LSN, limbic system network; SMN, somatomotor network; VAN, ventral attention network; VN, visual network. Adapted with permission from ref. 31, © 2022 Benkarim et al., CC-BY 4.0.
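The diversity measure named in the caption, a mean absolute difference in propensity scores, can be sketched in a few lines. This is one possible reading rather than the procedure of ref. 31: the function name, the choice to average over all pairs of training and held-out participants, and the example scores are assumptions made for illustration, and the propensity scores themselves are taken as given.

```python
def mean_abs_propensity_gap(train_scores, heldout_scores):
    """Mean of |p_train - p_heldout| over all (training, held-out) pairs.

    Averaging over all pairs is an assumed reading of the caption; the
    propensity scores are assumed to have been estimated beforehand.
    """
    if not train_scores or not heldout_scores:
        raise ValueError("both groups need at least one propensity score")
    total = sum(abs(p - q) for p in train_scores for q in heldout_scores)
    return total / (len(train_scores) * len(heldout_scores))


# Hypothetical propensity scores for a training set and two held-out strata.
train = [0.2, 0.3, 0.4]
similar_stratum = [0.25, 0.35]
distant_stratum = [0.8, 0.9]

gap_similar = mean_abs_propensity_gap(train, similar_stratum)
gap_distant = mean_abs_propensity_gap(train, distant_stratum)

print(f"diversity relative to similar stratum: {gap_similar:.3f}")
print(f"diversity relative to distant stratum: {gap_distant:.3f}")
```

Under this reading, a held-out stratum whose participants resemble the training set yields a small value, and a stratum with very different propensity scores yields a large one, which corresponds to the within-distribution and out-of-distribution ends of the axis in panel a.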

Purposefully broadening recruitment and retention of participants from diverse backgrounds is increasingly recognized as a necessary improvement. However, recruiting from marginalized communities still faces substantial challenges, such as mistrust of government entities, economic constraints, or exploitation of a vulnerable population32. To address some of these challenges, the Canadian Longitudinal Study on Aging took the initiative from the beginning: geographic areas that contain people with less education and lower SES, on average, were oversampled during the recruitment process33. Furthermore, scientific societies called for the inclusion of existing measurements from emerging regions of the world in data initiatives34.

In conclusion, the emergence of large-scale population datasets marks a watershed event in twenty-first-century neuroscience. Deep profiling across phenome, brain measurements and genetics puts us in a better position than ever to sensitize studies to sources of brain diversity as we march toward single-subject prediction and precision medicine. Specifically, we argue that it will pay off to treat diversity factors as variables of interest rather than nuisance variables. The future of neuroscience lies in celebrating diversity rather than perpetuating the elusive notion of a ‘normal brain’.