Main

Social and language development are inextricably linked in typical infants as a result of experience during caregiver–infant interactions1,2. It is theorized that social and language development and learning, as well as affect and emotion development, in infants are experience-expectant processes that involve highly affective parental speech, such as infant-directed speech or motherese, as an early input during caregiver–infant interactions3. Motherese is a highly compelling form of positive affective speech; it has human-unique and special characteristics including higher pitch, slower tempo and exaggerated speech contours, accompanied by heightened positive affect. Motherese occurs in many diverse cultures, and some theorize that this type of affective speech has a genetic basis that emerged evolutionarily in humans4,5,6. The prevailing view, then, is that, across cultures and languages, strongly affective speech creates and maintains a caregiver–infant social and language interaction which promotes learning. Implicit is that the infant’s seemingly ‘automatic’ neural response to motherese mediates this development and learning. Typically developing (TD) infants attend to and prefer motherese over other forms of adult speech3,7,8,9,10, and a small number of behavioural and neuroimaging studies suggest that TD infants may process motherese differently from non-speech sounds11,12,13,14,15. If such attention enhances social and language learning, then it would be predicted that enhanced or reduced neural responsiveness to such affective speech might be associated with enhanced or reduced early-age social and language ability. However, this hypothesis remains largely untested.

Deficits in joint caregiver–child social interactions and in social and language learning and development are early-age signs of autism spectrum disorder (ASD), so it is essential to understand the reasons for these deficits16,17,18. While the behavioural literature has thoroughly described these diagnostically core deficits in ASD19,20, there is remarkably scant evidence about how these deficits arise and whether a key factor may be abnormally reduced neural responsiveness of infants with ASD to affective speech, including motherese. There are no studies that characterize impairments in neural processing of motherese and related types of affective speech in ASD, nor any carefully controlled eye tracking studies of attention to motherese in ASD. Potentially, if neural responses to motherese and other affective speech in infants with ASD are reduced, then attention to highly affective speech, such as motherese, as well as social and language abilities might likewise be reduced.

Therefore, the main study hypothesis is that, across TD toddlers and toddlers with ASD, enhanced or reduced attention to affective speech would be associated with enhanced or reduced neural responsiveness to affective speech as well as with better or poorer early-age social and language abilities. To investigate this hypothesis, we measured functional magnetic resonance imaging (fMRI) responses to three different levels of affective speech—mild affect speech, moderate affect speech and motherese—during natural sleep in large samples of toddlers with ASD and TD toddlers. Because motherese is a strongly positive and, for infants, compelling form of affective speech, by including motherese along with milder affective speech we have a strong test of whether neural and behavioural responses to positive affective speech are reduced in toddlers with ASD. Additionally, by obtaining fMRI activation to affective speech during natural sleep, we were able to measure how the toddler’s brain ‘intrinsically’ responds to motherese and non-motherese affective speech independent of volition, arousal, interest, motivation, attention, awareness, expectation and cooperation, which could alter neural responses recorded during wakefulness. We also determined whether such neural responses are correlated with early-age social and language abilities in toddlers with ASD and TD toddlers. Then, to determine whether neural responses to affective speech as well as social and language abilities were associated with behavioural preference for motherese, we used active, gaze-contingent eye tracking, which provides strong evidence of volitional preference as opposed to passive looking and objectively quantifies preference for a female speaking motherese. We chose to test eye tracking on motherese because it provides a strong test of behavioural responsiveness to age-relevant, highly compelling affective speech, as opposed to eye tracking to mildly affective speech.

As shown schematically in Fig. 1, we first tested whether neural responses to affective speech including motherese are in fact reduced in toddlers with ASD. Next, we tested the prediction that neural responses to affective speech are correlated with social and language developmental measures across toddlers with ASD and TD toddlers. Lastly, we tested our main hypothesis above using a two-stage procedure. In the first stage, we used a data-driven, unbiased method (that is, similarity network fusion (SNF)21,22) to identify clusters of individuals whose combined neural and clinical measures are maximally similar to each other and maximally different from individuals in other clusters. In the second stage, we tested whether individuals in different neural–clinical clusters differed in preference for and attention to motherese based on the gaze-contingent eye tracking preferential looking paradigm. In this SNF approach, rather than determining diagnostic categories a priori through clinical measures, the similarity and dissimilarity in patterns of the different kinds of measures across all individuals drive subject subgroups/clusters independent of diagnosis. This method takes into account all neural and clinical data equally in each individual and reveals bio-behavioural dimensionality.

Fig. 1: Experimental design and data analysis flow chart.

The fMRI experimental design included three language paradigms: mild affect speech, moderate affect speech and motherese. Three data analyses were performed: (1) ASD versus TD t-tests; (2) correlations between neural responses to the three language paradigms and Vineland communication and socialization subtests across TD toddlers and toddlers with ASD; (3) similarity network fusion (SNF) and clustering analysis.

Results

Variable affective speech levels across fMRI language paradigms

We collected fMRI data from three language paradigms with variable affect levels: mild affect speech (the story language paradigm), moderate affect speech (the Karen language paradigm) and motherese (the motherese paradigm). For detailed descriptions of these language paradigms, see the Methods section. The speech stimuli in each language paradigm included female voices speaking with varying levels of affective valence and speech. To determine whether there were perceptual differences in affect levels across the three language paradigms, we conducted two computer-based surveys in TD adults (survey 1, n = 19; survey 2, n = 15) and compared the ratings of affect levels between language paradigms using paired two-tailed t tests. Results from both survey 1 (Supplementary Fig. 1a; motherese versus moderate affect speech: t(18) = 20.5, P < 0.001, Cohen’s d = 4, 95% CI 2.82–5.2; motherese versus mild affect speech: t(18) = 20.2, P < 0.001, Cohen’s d = 5.4, 95% CI 3.25–7.47; moderate affect versus mild affect speech: t(18) = 11.7, P < 0.001, Cohen’s d = 1.9, 95% CI 1.34–2.42) and survey 2 (Supplementary Fig. 1b; motherese versus moderate affect speech: t(14) = 47.7, P < 0.001, Cohen’s d = 17.9, 95% CI 8.15–27.6; motherese versus mild affect speech: t(14) = 73.6, P < 0.001, Cohen’s d = 31.2, 95% CI 12.03–50.43; moderate versus mild affect speech: t(14) = 20.4, P < 0.001, Cohen’s d = 10.3, 95% CI 2.68–17.91) demonstrated that TD adults rated the motherese vignettes as having the strongest positive affect, the Karen language vignettes as having moderate affect, and the story language vignettes as having mild affect.
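The paired comparison used for the survey ratings can be sketched as follows. This is a minimal illustration rather than the analysis code behind the reported values: the function name `paired_t` and the five ratings per condition are hypothetical, and the sketch computes only the t statistic and its degrees of freedom (n − 1, which matches the reported t(18) for n = 19 and t(14) for n = 15).

```python
import math
from statistics import mean, stdev


def paired_t(x, y):
    """Return (t, df) for a paired t test on matched ratings x and y."""
    if len(x) != len(y):
        raise ValueError("paired samples must have equal length")
    # Work on the within-rater differences.
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical affect ratings from five raters (NOT data from the surveys).
motherese = [9, 8, 9, 10, 9]
mild = [3, 2, 4, 3, 3]

t, df = paired_t(motherese, mild)
print(f"t({df}) = {t:.2f}")
```

Because each rater scores every paradigm, the test operates on per-rater differences, which removes between-rater variability in how the rating scale is used.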

Similar activation patterns in TD adults and TD toddlers

We collected a total of 241 fMRI language datasets (see Supplementary Table 1 for scan details): 200 fMRI datasets from 71 toddlers (see Table 1 for demographic information and clinical scores) and 41 fMRI datasets from 14 adults. We first compared whole-brain activation patterns in sleeping TD toddlers and awake TD adults, and found no statistically significant differences in activation patterns for each of the three language paradigms (Supplementary Fig. 2a). To quantify the per cent signal change to language stimuli, we used two language-relevant regions of interest (ROIs) from the meta-analytic activation map in Neurosynth (https://neurosynth.org/) with the term ‘language’ that included left and right temporal regions. These ROIs were identical to those used in previous papers23,24. Per cent signal change was significantly lower in sleeping TD toddlers than in awake TD adults (Supplementary Fig. 2b). These results show that, during sleep, TD toddlers have similar temporal cortical activation patterns during language processing, albeit with reduced activation strength compared with that of awake passively listening adults.

Table 1 Demographic information and clinical test scores for toddlers with ASD and TD toddlers

Reliability of affective speech activation in toddlers

We also evaluated the test–retest reliability of brain activation to language paradigms with varying levels of affective prosody within individual toddlers with ASD and TD toddlers across time. The test–retest scans were divided into two groups based on intervals between initial and retest scans: short-term retest (1–4 months after initial scans) and long-term retest (12–15 months after initial scans). The overall test–retest reliability (initial scans versus all retest scans) was quantified with intraclass correlation coefficients, which showed moderate to good reliability for moderate affect speech and motherese paradigms, and moderate reliability for the mild affect speech paradigm in the left temporal ROI but poor reliability in the right temporal ROI (Supplementary Fig. 3).

Reduced neural response to speech in toddlers with ASD

We found robust and significant activation in temporal language regions in TD toddlers but reduced activation in toddlers with ASD in all three levels of affective prosody (Fig. 2a). There were no significant differences in whole-brain activation between TD toddlers and toddlers with ASD. However, an ROI-based analysis, using two-sample two-tailed t tests, demonstrated significant group differences (TD versus ASD) across three language paradigms in both left temporal ROI (mild affect speech: t(52) = 2.99, P = 0.005, Cohen’s d = 0.89, 95% CI 0.31–1.47; moderate affect speech: t(62) = 2.61, P = 0.012, Cohen’s d = 0.68, 95% CI 0.17–1.2; motherese: t(60) = 2.3, P = 0.026, Cohen’s d = 0.59, 95% CI 0.06–1.12) and right temporal ROI (mild affect speech: t(52) = 2.7, P = 0.011, Cohen’s d = 0.81, 95% CI 0.24–1.39; moderate affect speech: t(62) = 2.74, P = 0.009, Cohen’s d = 0.73, 95% CI 0.21–1.25; motherese: t(60) = 2.48, P = 0.017, Cohen’s d = 0.66, 95% CI 0.12–1.19) (Fig. 2b). Thus, TD toddlers had stronger temporal cortical activation across all three types of affective speech as compared to toddlers with ASD, who had significantly weaker responses.
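The ROI-based group comparison can be illustrated with a pooled-variance two-sample t test and Cohen's d. The sketch below assumes d is computed with the pooled standard deviation, which the text does not state; the function name and the per cent signal change values are hypothetical.

```python
import math
from statistics import mean, variance


def pooled_t_and_d(a, b):
    """Return (t, df, d) for two independent samples using pooled variance."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pooled variance weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    d = (mean(a) - mean(b)) / math.sqrt(pooled_var)
    # With a pooled SD, t and d differ only by a sample-size factor.
    t = d / math.sqrt(1 / n1 + 1 / n2)
    return t, df, d


# Hypothetical per cent signal change values (NOT data from the study).
td_group = [0.5, 0.75, 0.25, 0.5]
asd_group = [0.25, 0.0, 0.25, 0.5]

t, df, d = pooled_t_and_d(td_group, asd_group)
print(f"t({df}) = {t:.2f}, Cohen's d = {d:.2f}")
```

Under this definition, t = d / sqrt(1/n1 + 1/n2), so a given effect size yields a larger t as the groups grow.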

Fig. 2: Reduced language-related brain activation in toddlers with ASD as compared to TD toddlers.

a, Whole-brain activation to speech paradigms with varying levels of prosody in TD toddlers and toddlers with ASD. b, Per cent signal change extracted from two ROIs (left and right temporal regions) within the Neurosynth ‘language’ meta-analysis map (https://neurosynth.org/) across three language paradigms. Barplots show average per cent signal change, and error bars show s.e.m. Asterisks indicate significant results of two-sample two-tailed t tests between groups (TD versus ASD). d, Cohen’s d. *P < 0.05, **P < 0.01.

Correlations between fMRI and social/communication ability

We further investigated correlations between a toddler’s neural response to affective speech and their social and communication abilities. The results of mixed-effects models showed significant correlations between fMRI activation and Vineland socialization and communication scores across individuals and language paradigms (left temporal ROI: Vineland communication scores, t(48) = 2.4, P = 0.02, marginal R² = 0.068; left temporal ROI: Vineland socialization scores, t(50) = 2.73, P = 0.009, marginal R² = 0.08; right temporal ROI: Vineland communication scores, t(50) = 2.58, P = 0.013, marginal R² = 0.094; right temporal ROI: Vineland socialization scores, t(51) = 3.23, P = 0.002, marginal R² = 0.13) (Fig. 3 and Supplementary Table 2).

Fig. 3: Scatterplots showing correlations between brain activation to language and social communication abilities in toddlers.

Black lines in each scatterplot represent regression model fit across all individuals (toddlers with ASD and TD toddlers) and language paradigms (mild affect speech, moderate affect speech and motherese). The r values indicate Pearson’s correlation coefficients (for estimated coefficients from mixed-effects models, see Supplementary Table 2). **P < 0.01, ***P < 0.001.

Neural correlates of behavioural motherese preference

Given the reduced neural activation to motherese and other affective stimuli among toddlers with ASD, we next examined a behavioural measure of motherese preference and how it relates to motherese-related activation in toddlers with ASD. Gaze-contingent eye tracking during a visual preference paradigm including both motherese and non-motherese, non-social stimuli is a strong behavioural test of attention to and preference for age-relevant and compelling affective speech. As such, when visual attention preference was measured using this task in TD toddlers and toddlers with ASD, unlike TD toddlers, individuals with ASD showed substantially and significantly reduced percentage fixation towards motherese, preferring the non-motherese, non-social computer ‘techno’ sounds (Fig. 4a; TD versus ASD: t(52) = 3.25, P = 0.001, Cohen’s d = 0.83, 95% CI 0.25–1.4, two-sample one-tailed t test). Next, we examined whether fMRI activation to motherese was correlated with eye tracking-based attention to motherese in toddlers with ASD and TD toddlers using Pearson’s correlation (one-tailed test). As shown in Fig. 4b, the TD group exhibited a significant positive correlation in the left temporal ROI (r(21) = 0.407, P = 0.03), but no statistically significant correlations were observed for the right temporal ROI (r(21) = 0.186, P = 0.2). The ASD group showed no statistically significant evidence for either left (r(29) = 0.007, P = 0.49) or right temporal ROI (r(29) = −0.097, P = 0.7).
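As a reading aid for the correlations above, a Pearson r with its degrees of freedom (n − 2) can be converted to the t statistic on which its P value is based. The sketch uses only the r and df values reported in the text; the function name `r_to_t` and the dictionary labels are illustrative.

```python
import math


def r_to_t(r, df):
    """Convert a Pearson correlation with df = n - 2 into its t statistic."""
    return r * math.sqrt(df / (1 - r * r))


# (r, df) pairs as reported in the text for each group and ROI.
reported = {
    "TD left": (0.407, 21),
    "TD right": (0.186, 21),
    "ASD left": (0.007, 29),
    "ASD right": (-0.097, 29),
}

t_values = {label: r_to_t(r, df) for label, (r, df) in reported.items()}
for label, t in t_values.items():
    print(f"{label}: t = {t:.2f}")
```

For the TD left temporal ROI this gives t of about 2.04 on 21 degrees of freedom, the only one of the four large enough to reach significance in a one-tailed test.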

Fig. 4: Gaze-contingent eye tracking measures of preference for motherese and correlations with neural response to motherese in toddlers with ASD and TD toddlers.

a, TD toddlers showed significantly stronger preference for motherese over non-social, computer ‘techno’ images and sounds compared to toddlers with ASD (t(52) = 3.25, P = 0.001, Cohen’s d = 0.83, 95% CI 0.25–1.4; two-sample one-tailed t test). d, Cohen’s d; white cross, group mean. b, Correlations between visual attention preference for motherese and neural response to motherese in toddlers with ASD and TD toddlers. In the TD group, there was a significant positive correlation in the left temporal ROI (r(21) = 0.407, P = 0.03), but no statistically significant correlations were observed for the right temporal ROI (r(21) = 0.186, P = 0.2). In the ASD group, there was no statistically significant correlation in either left (r(29) = 0.007, P = 0.49) or right temporal ROI (r(29) = −0.097, P = 0.7). *P < 0.05, **P < 0.01.

Mapping motherese preference to SNF neural–clinical clusters

While toddlers with ASD exhibited reduced activation in response to language independent of affect levels, the significant, positive relationship between per cent signal change and percentage fixation to motherese among TD toddlers suggests that a subset of toddlers perhaps exhibit both reduced activation to and behavioural preference for motherese stimuli. As such, we next applied SNF to determine whether we could objectively identify neural–clinical subgroups that directly map onto eye tracking measures of preference for motherese. Using SNF, we identified four fMRI–clinical phenotypically distinct clusters of toddlers, including two predominantly TD and two completely ASD clusters (Fig. 5a). At the individual level, 100% of TD toddlers fell into two clusters: cluster 1 (12 TD, 2 ASD) and cluster 2 (8 TD, 3 ASD), and 83.3% of toddlers with ASD also fell into two clusters: cluster 3 (0 TD, 14 ASD) and cluster 4 (0 TD, 11 ASD).
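The cluster composition percentages follow directly from the reported counts. The short check below uses only those counts; the dictionary layout and the helper name `share` are illustrative.

```python
# Cluster membership counts as reported in the text.
clusters = {
    1: {"TD": 12, "ASD": 2},
    2: {"TD": 8, "ASD": 3},
    3: {"TD": 0, "ASD": 14},
    4: {"TD": 0, "ASD": 11},
}


def share(group, cluster_ids):
    """Percentage of a diagnostic group that falls in the given clusters."""
    total = sum(counts[group] for counts in clusters.values())
    inside = sum(clusters[i][group] for i in cluster_ids)
    return 100 * inside / total


td_share = share("TD", [1, 2])    # 20 of 20 TD toddlers
asd_share = share("ASD", [3, 4])  # 25 of 30 toddlers with ASD
print(f"TD in clusters 1-2: {td_share:.1f}%")
print(f"ASD in clusters 3-4: {asd_share:.1f}%")
```

The remaining five toddlers with ASD are those assigned to the predominantly TD clusters 1 and 2.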

Fig. 5: TD and ASD subgroups with distinct fMRI–clinical patterns and correlations with behavioural preference for motherese.

a, SNF and Louvain algorithm revealed four fMRI–clinical distinct subgroups (green, TD; red, ASD), including two clusters predominantly containing TD individuals (cluster 1: 12 TD, 2 ASD; cluster 2: 8 TD, 3 ASD) and two clusters entirely containing individuals with ASD (cluster 3: 0 TD, 14 ASD; cluster 4: 0 TD, 11 ASD). b, The left barplot displays average per cent signal change across language paradigms for each cluster. The right barplot shows differences in per cent signal change between motherese versus moderate affect speech in four clusters. The error bars represent the standard error of the mean. c, Social and language ability across clusters. Boxplots show interquartile range (IQR; first quartile, Q1; third quartile, Q3); the horizontal line inside the box represents the median; whiskers indicate Q1 – (1.5 × IQR) or Q3 + (1.5 × IQR). d, Toddlers in clusters 1 and 2 had significantly higher percentage fixation towards motherese versus non-social computer ‘techno’ images and sounds than toddlers with ASD in cluster 4 (cluster 1 versus 4: 79.28% versus 41%; t(19) = 3.95, P = 0.0009, Cohen’s d = 1.86, 95% CI 0.75–2.98; cluster 2 versus 4: 78.83% versus 41%; t(16) = 3.82, P = 0.001, Cohen’s d = 1.86, 95% CI 0.66–3.07). The white cross indicates group mean. The heatmap matrix shows standardized effect sizes (Cohen’s d) for each pairwise group comparison between clusters. Cohen’s d value is shown in each cell, and the standard effect size is also indicated by the colour of the cell. Asterisks indicate significant results of two-sample one-tailed t tests between clusters. ***P < 0.001. ABC, adaptive behaviour composite; EL, expressive language; RL, receptive language; ELC, early learning composite; SA, social affect; RRB, restricted and repetitive behaviour; L, left temporal ROI; R, right temporal ROI.

Visual inspection of Fig. 5b,c shows that cluster 1 toddlers had the highest neural activation and best clinical performances among clusters, while cluster 4 toddlers had low activation and very low clinical scores (for all clinical variables, see Supplementary Fig. 4). The composite activation and clinical scores of individuals in clusters 2 and 3 generally fell between cluster 1 and 4. In this way, SNF provides insight into the bio-dimensional, multi-modal subgroups underlying TD and ASD neural–clinical heterogeneity.

SNF also enables follow-up statistical analyses of different clusters for further characterization, interpretation and hypothesis generation. We statistically examined the main effects of cluster (clusters 1, 2, 3 and 4), paradigm (mild affect speech, moderate affect speech and motherese) and hemisphere (left and right temporal ROIs) as well as their interactions on fMRI activation using a three-way ANOVA. Results showed main effects of cluster (F(3,46) = 11.75, P < 0.001) and hemisphere (F(1,46) = 9.5, P = 0.003) as well as a significant cluster × language paradigm effect (F(6, 92) = 2.83, P = 0.014) and a significant cluster × hemisphere effect (F(3, 46) = 3.68, P = 0.019) on fMRI activation (Supplementary Table 3). Follow-up tests showed statistically significant cluster effects involving fMRI responses to motherese. Specifically, two-sample two-tailed t tests, after correcting for multiple comparisons using the false discovery rate correction, demonstrated that right temporal response to motherese in cluster 1 was significantly greater than other clusters (cluster 1 versus 2: t(23) = 3.39, P = 0.0025, Cohen’s d = 1.31, 95% CI 0.39–2.23; cluster 1 versus 3: t(26) = 3.45, P = 0.0022, Cohen’s d = 1.31, 95% CI 0.45–2.16; cluster 1 versus 4: t(23) = 4.7, P = 0.0001, Cohen’s d = 1.76, 95% CI 0.78–2.73). Cluster differences in right temporal activation and effect size were greatest between cluster 1 and 4, with cluster 4 having the least right temporal activation to motherese among the different clusters. These can be qualitatively seen in Supplementary Fig. 4b.
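A false discovery rate correction of the kind mentioned above can be sketched with the Benjamini–Hochberg step-up procedure. Two assumptions apply: the text does not name the specific FDR procedure, and it does not say whether the three P values quoted are raw or already adjusted, so they are used here purely as illustrative inputs.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the original order."""
    m = len(p_values)
    # Indices of the P values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative inputs: cluster 1 versus clusters 2, 3 and 4 (see caveat above).
p = [0.0025, 0.0022, 0.0001]
adj = benjamini_hochberg(p)
print([round(a, 4) for a in adj])
```

With only three comparisons the adjustment is mild, and all three values stay well below 0.05 under this assumption.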

We then tested the hypothesis that cluster 1 toddlers would have greater fMRI responses to motherese than to the moderately affective but non-motherese speech paradigm, while individuals with ASD in cluster 4 would have smaller differential fMRI responses to motherese. The statistical analysis was performed with non-parametric Mann–Whitney one-tailed tests because data in cluster 4 were not normally distributed as verified by the Shapiro–Wilk test (left temporal ROI, P = 0.045; right temporal ROI, P = 0.034). Indeed, in the right temporal ROI, cluster 1 toddlers exhibited a differential increase to motherese (that is, motherese versus moderate affect speech) which was significantly greater than that observed in toddlers with ASD in cluster 4 (z = 2.03, P = 0.022, effect size r = 0.41, 95% CI 0.05–0.71). In fact, cluster 1 had 22% greater and cluster 4 had 58% less fMRI activation to motherese than to moderate affect speech. At the individual level, 71% of cluster 1 toddlers had greater activation to motherese than to moderate affect speech, while 82% of cluster 4 toddlers with ASD had less activation to motherese than to moderate affect speech (χ²(1, 25) = 5.03; P = 0.025, Cramér’s V = 0.53, 95% CI 0.16–0.85). This differential activation pattern between cluster 1 versus 4 was absent for the left temporal ROI (z = 0.77, P = 0.23, effect size r = 0.15, 95% CI 0.01–0.51). In addition, cluster 4 toddlers with ASD showed a marginally statistically significant difference in right temporal activation to motherese relative to moderate affect speech when compared with toddlers with ASD in cluster 3, who had 38% greater activation to motherese than to moderate affect speech (z = 1.48, P = 0.075, effect size r = 0.3, 95% CI 0.02–0.66). Thus, toddlers with ASD in cluster 4 displayed a distinctive pattern of (1) less neural response to motherese as well as a within-individual decrease in the differential fMRI response to motherese relative to moderate affect speech; (2) very poor social, language and cognitive abilities; and (3) severe ASD symptoms (Fig. 5b,c and Supplementary Fig. 4b,c). Notably, cluster 1 toddlers showed a neural–clinical pattern opposite from those of cluster 4, namely robust activation to all three affective speech paradigms, increased differential motherese activation relative to moderate affect speech and the highest social, language and cognitive abilities.
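The effect sizes reported for the cluster 1 versus cluster 4 comparison can be approximately reconstructed from the figures in the text. The sketch rests on three assumptions that the text does not state. First, the Mann–Whitney effect size is taken as r = z / sqrt(N) with N = 25 (14 toddlers in cluster 1 plus 11 in cluster 4). Second, the 2 × 2 counts are inferred from the reported percentages: 71% of 14 is about 10, and 82% of 11 is about 9. Third, the chi-squared statistic is taken to include Yates' continuity correction, while Cramér's V is taken from the uncorrected statistic. Under these assumptions the reported values are recovered after rounding.

```python
import math


def effect_size_r(z, n_total):
    """Effect size r for a Mann-Whitney test, computed as z / sqrt(N)."""
    return z / math.sqrt(n_total)


def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-squared statistic for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Yates' continuity correction shrinks |ad - bc| by N / 2.
        diff = max(diff - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Counts INFERRED from reported percentages and cluster sizes (assumption).
greater_c1, less_c1 = 10, 4  # cluster 1: motherese > / < moderate affect
greater_c4, less_c4 = 2, 9   # cluster 4: motherese > / < moderate affect
n = greater_c1 + less_c1 + greater_c4 + less_c4

chi2_yates = chi_square_2x2(greater_c1, less_c1, greater_c4, less_c4, yates=True)
chi2_plain = chi_square_2x2(greater_c1, less_c1, greater_c4, less_c4)
# For a 2 x 2 table, Cramer's V reduces to sqrt(chi2 / N).
cramers_v = math.sqrt(chi2_plain / n)
r_right = effect_size_r(2.03, n)

print(f"chi2 (Yates) = {chi2_yates:.2f}, V = {cramers_v:.2f}, r = {r_right:.2f}")
```

The same r = z / sqrt(25) relation also reproduces the left temporal value (0.77 / 5 ≈ 0.15) and the cluster 4 versus cluster 3 value (1.48 / 5 ≈ 0.3).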

Finally, we compared eye tracking measures of attention to and preference for motherese (1 of the 12 motherese segments utilized in the fMRI paradigm) across the four neural–clinical clusters. As examined by two-sample one-tailed t tests (Fig. 5d), toddlers with ASD in cluster 4 had significantly lower percentage fixation towards motherese compared with those in clusters 1 and 2 (cluster 4 versus 1: 41% versus 79.28%, t(19) = −3.95, P = 0.0009, Cohen’s d = 1.86, 95% CI 0.75–2.98; cluster 4 versus 2: 41% versus 78.83%, t(16) = −3.82, P = 0.001, Cohen’s d = 1.86, 95% CI 0.66–3.07). There was a marginally statistically significant difference in percentage fixation between cluster 4 and cluster 3 (41% versus 61.76%, t(18) = −1.72, P = 0.051, Cohen’s d = 0.74, 95% CI −0.25 to 1.73).

Discussion

Our within-individual fMRI, clinical and eye-tracking design provides unique evidence in support of the long-standing behaviour-based theory that a toddler’s increased neural responses to motherese and other affective speech may increase behavioural responses to motherese utterances and lead to increased social and language abilities at early ages. We found in TD toddlers and toddlers with ASD that greater or lesser neural responses to affective speech in temporal cortex were associated with greater or lesser social and language abilities. Then, using a data-driven SNF and clustering approach, we disentangled TD and ASD neural and clinical heterogeneity into four subgroups and showed that cluster 1 (12 TD, 2 with ASD) toddlers with greater differential neural responses to motherese also have greater visual attention preference for motherese and better social and language skills than toddlers in cluster 4 (11 with ASD), who have lesser neural responsiveness to motherese affective speech, very poor social and language abilities, and reduced eye-tracking-related attention to motherese. A distinctive finding among toddlers in cluster 4 was that motherese stimuli evoked less of a neural response than did moderate affect speech, which is in direct contrast with cluster 1 toddlers, who exhibited increased neural responses to motherese. Clusters 2 and 3 contained individuals with somewhat intermediate neural–clinical phenotypes. Overall, these varying phenotypic characteristics across clusters indicate that social preference and language development are intertwined across a wide spectrum of social and language ability and disability. Moreover, these neural responsiveness effects were not confounded by volition, arousal, interest, motivation, attention, awareness, expectation or cooperation. Further, clustering results suggest that high neural response TD toddlers and low neural response toddlers with ASD stand at opposite ends of the neural–affective–response and social–language ability spectrum. Together, these results indicate that the biology and behaviour of both TD toddlers and toddlers with ASD are multi-dimensional.

Our study points to the early-age neural bases of the core social deficits and reduced responsiveness to parental affective speech that first emerge in infants and toddlers who develop ASD. As such, reduction of a normal neural response to affective language stimuli is already present at the early age of clinical onset in most toddlers with ASD. This may be a biomarker of foundational dysregulation of social–emotional neural development that could underpin the development of associated social, cognitive and behavioural functions. Indeed, one of the first early signs of ASD in babies and infants is a sharp reduction of behavioural response to mother’s affective speech25,26. The present study, conducted in sleeping toddlers with ASD, provides evidence for the possible early-age neural basis of reduced behavioural preferences for motherese utterances. In cluster 4, it is robustly evident that reduced neural responses to affective speech, including motherese, are associated with reduced volitional behavioural attention to motherese and poor social and language abilities.

Because this reduced neural response to affective speech, including motherese, is observed during natural sleep, it cannot be attributed to attention, arousal, momentary distractions or competing motivations. As compared with cluster 1 toddlers, toddlers with ASD in cluster 4 have a strikingly weak neural response in the superior temporal cortex that actually differentially diminishes to motherese. We hypothesize that, if such a weak response occurs in the awake baby or infant with ASD, it could effectively disconnect the foundational caregiver–infant loop in which affective speech such as motherese enhances the experience-expectant process of social, language, emotional and cognitive development and learning in infants27,28. Such a foundational dysfunction, we further hypothesize, might undermine not only the early caregiver–infant experience-expectant process but also later efforts through behavioural therapy to socially engage toddlers with ASD. Many early interventions that show success for some individuals hinge critically upon the idea of changing this attribute of early ASD development29,30,31. The hope is that early intervention will increase engagement between the child and the social world and enable experience-related neuroplasticity to divert a child towards more TD trajectories.

Motherese, found in many diverse cultures4,5,6, is an especially strong affective tool that may augment the human-special caregiver–infant interactive loop and is thought to have a genetic basis that emerged evolutionarily in humans. As such, very early neural, behavioural and experience-expectant responses and processes in the infant are likely to be driven genetically to some significant degree and evolutionarily emergent in human infants. Thus, since ASD is highly genetic with heritability of about 81%32,33, the marked deficits in neural and behavioural responses we identify in toddlers with ASD in cluster 4 might very well involve genetic effects, a hypothesis that should be pursued with brain-genetic studies. Also, ASD is a prenatal multi-stage, multi-process disorder that begins in the first trimester with disruption of proliferation and neurogenesis and continues throughout the second and third trimesters with disorder of neurite outgrowth, synaptogenesis and neural network function34,35. Social and language impairments are the consequences. Indeed, dysregulation of gene expression in ASD is highly correlated with ASD social symptoms in infants and toddlers36. We also identified gene co-expression networks in ASD that are associated with neural hypoactivation to language stimuli and with abnormal cortical growth in toddlers with ASD who had very poor language development24,37. Identified genes and networks include language-relevant, ASD-associated, human-specific and prenatal genes, as well as those involved in cell proliferation and excitatory neuron development. These several studies indicate that prenatal genetic dysregulation in ASD may lead to early-age impairment in social and language development and the reduced neural responses to affective speech we describe herein. These disruptions could act as important impediments to socially engaging with the caregiver. It is of major importance that future work directly test this theory.

Understanding early-age clinical and neural ASD heterogeneity is a major challenge. Heterogeneity among TD toddlers is equally important to address but is often overlooked, leading to weakened power to detect and characterize differences between toddlers with ASD and TD toddlers. Here, we demonstrate the power of the unbiased data-driven SNF/clustering method that was used for the first time in the ASD field to resolve the neural bases of early-age social and language heterogeneity in ASD. Our data identified an ASD subgroup with a distinctive neural–clinical pattern that may suggest a poor prognosis, and another ASD subgroup with somewhat better clinical, neural and behavioural characteristics that might suggest a better prognosis. Several individuals with ASD fell in clusters 1 and 2, opening the possibility that SNF and clustering may have identified those rare toddlers with ASD who later have optimal outcomes. Long-term follow-up studies on these individuals will be valuable to test these possibilities. Notably, this method allowed us to additionally resolve the heterogeneity among TD toddlers. The relevance of the neural–clinical subgroups to the behaviour of interest was then quantitatively measured and validated using gaze-contingent eye tracking assessments of toddlers volitionally choosing to view a female telling a story in motherese or computer ‘techno’ sounds and images. This gaze-contingent eye tracking assessment, as shown in a similar method38, is a tool to simulate social interaction behaviours in ASD. The present results show that social and language ability and behavioural preference for motherese are linked to how strongly temporal cortex responds to affective speech, including motherese, in toddlers across the neurodevelopmental spectrum from typical to socially and language impaired.

Lastly, another exciting finding is that our purely unbiased data-driven SNF/clustering method appears to have replicated, in a new sample of toddlers using new paradigms, the presence of two main ASD subgroups with good and poor language ability that we previously identified using a subjective stratification approach23,24. Previously, we arbitrarily stratified ASD based on a child being above (‘ASD good’) or below (‘ASD poor’) one standard deviation on the norm of Mullen expressive and receptive language scales. A similar pair of ASD subgroups emerged from the purely unbiased SNF/clustering method while also resolving TD toddler heterogeneity. Therefore, these may be reliable diagnostic and prognostic ASD subgroups that are also aetiologically and biologically meaningful. As such, future work should investigate how they may open early-age diagnostic, prognostic and/or treatment avenues for biomarker discovery.

The current study has three possible limitations that are worth discussion. The first limitation is that we collected fMRI data during natural sleep without electroencephalography as doing both simultaneously is an extremely challenging dual procedure in non-sedated toddlers39. Thus, sleep stages were not monitored directly. Research has shown that, as compared with TD children, children with ASD have a longer latency before falling asleep (for example, 5 min (ref. 40)), lower rapid eye movement (REM, a sleep phase characterized by random rapid movements of the eyes) sleep percentage (for example, 14.5% versus 22.6%40), and greater non-REM sleep percentage as well as a slight difference in anterior–posterior distribution of non-REM sleep40,41,42,43. In our study, fMRI scans started after each child was sound asleep at the beginning of the night, and by waiting about 20–30 min after sleep onset, data were collected during non-REM sleep. So, potential sleep differences between ASD and TD would not be expected to differentially account for ASD versus TD brain activation differences.

Another limitation involves the three affective language paradigms used in the present study. By definition and design, the motherese vignettes were distinct from the mild and moderate affect speech vignettes by being higher in affect, high-pitched, lyrical and sing-songy in intonational quality. The mild and moderate affect speech vignettes did not have these more vivid motherese qualities and had comparatively lower levels of positive affect, as demonstrated by our affect testing procedure. Another difference was that, while motherese and moderate affect speech had only forward speech, mild affect speech had both forward and backward speech. However, by including these different affective speech designs, we demonstrated several important points. First, inclusion of the mild affect speech (with its backward speech segment) enabled us to demonstrate that our previous results from toddlers with ASD versus TD toddlers23,24,44,45 could be replicated in an entirely different cohort of toddlers using the identical speech stimuli. This highly substantial and successful replication shows the robustness of our fMRI approach and findings in toddlers with ASD. Second, inclusion of three different levels of affective speech enabled us to demonstrate the strong generalizability of affective language effects across paradigms in toddlers with ASD, wherein ASD had significantly reduced temporal neural responses in all three language paradigms. Third, despite these stimulus differences across paradigms, we also observed that cluster 1 individuals showed the predicted pattern of increased right temporal activation with higher levels of affect. This, in turn, revealed that the largest neural activation differences were seen in motherese (strong affective speech) between cluster 1 and cluster 4. Fourth, again despite differences in stimuli between paradigms, there were no TD toddler–adult differences in activation patterns evoked by affective speech, including motherese.

A third limitation worth noting is the possibility that auditory sensory deficits in ASD might account for the different neural response to the stimuli during sleep. Figure 5b shows that higher-ability TD and lower-ability ASD are at the opposite ends, with lower-ability TD and higher-ability ASD in the middle. While the hypothesis that deficits in lower-level sensory processing account for these differences was not directly tested herein, it seems implausible that brain activation patterns across clusters are matched by abilities in low-level auditory sensory processing across TD toddlers as well as toddlers with ASD. In addition, we previously analysed the question of general auditory processing in mild affect speech by specifically examining activation responses within the primary auditory cortex, but did not find any significant differences between ASD and TD groups23.

In conclusion, in a one-of-a-kind fMRI study of ASD, we resolve ASD and TD neural–social–language–symptom heterogeneity into four discrete subgroups and show robust and systematic differences in how the brain of toddlers with ASD and TD toddlers responds to varying levels of affective speech, including motherese. We show that these neural differences are associated with differences in volitional behavioural preference for social–emotional motherese utterances and relate to social and communication developmental differences. Our findings support the long-standing behaviour-based theory that neural activity elicited by affective speech such as motherese may be important in driving infants to engage with caregivers in social and language learning. We speculate that enhanced neural responsiveness leads to such learning, while weaker neural responsiveness may impede or preclude it. This hypothesis predicts that neural and behavioural deficits together may be a biomarker of foundational dysregulation of social–emotional neural development and learning. As such, different ASD neural–clinical–behavioural subgroups were identified that may benefit from different treatment approaches.

Methods

This study was approved by the University of California, San Diego Institutional Review Board. Informed consent was obtained from parents or guardians of toddlers and from adult participants. Families were compensated up to $850 upon study completion. In the present study, data collection and analysis were not performed blind to the conditions of the experiments.

Participants

Toddlers were recruited through community referral and a population-based screening method in collaboration with paediatricians via the Get SET Early Approach46, formerly known as the 1-Year Well-Baby Check-Up Approach16,18. All toddlers participated in clinical assessments, including the Autism Diagnostic Observation Schedule (ADOS)47, Mullen Scales of Early Learning48 and Vineland Adaptive Behavior Scales49. Toddlers who received their initial diagnostic and clinical evaluations at <36 months were invited to return for repeat evaluations until they reached 48 months. Clinical scores at the most recent visit were used as a best estimate of a child’s abilities (Table 1). Clinical testing occurred at the University of California, San Diego Autism Center of Excellence. Adult participants were recruited by word of mouth.

Clinical scores and fMRI scans were collected from 71 toddlers (41 with ASD, 30 TD; 53 male, 18 female, 14–55 months old). The distribution of intervals between the age at fMRI and clinical data collection is shown in Supplementary Fig. 5a. Toddlers were considered TD if their diagnosis at outcome was TD and their Mullen Early Learning Composite scores fell within two standard deviations of the group mean. This allows us to examine activation patterns along a continuum of language and cognitive abilities in TD children. A subset of toddlers (4 with ASD, 6 TD) had test–retest fMRI scans collected at intervals ranging from 1 to 15 months after the initial scan. fMRI scans were also obtained from 14 TD adults (6 male, 8 female, 20–37 years old). No statistical methods were used to pre-determine sample size, but our toddler sample sizes are similar to those reported in prior publications23,45.

Sleep fMRI

Scans of toddlers were conducted during natural sleep, which has been proven to yield robust activation in toddlers with ASD and TD toddlers44,45,50,51. For the sleep fMRI, on the day of fMRI scan, parents were instructed to eliminate naps from their child’s typical routine, keep their child awake while at home and arrive at the scanner 1 h past their child’s normal bedtime. In an attempt to standardize stages of sleep during scanning, babies were placed on the scanner bed approximately 20 min after sleep onset. Previous research has shown that sleep fMRI acquisition success is confined to non-REM stage 3 sleep39, further leading to homogeneity of sleep state among successfully acquired scans.

Language paradigms

We presented three paradigms using female voices, one with high levels of positive affect (motherese paradigm, referred to as motherese), a second with comparatively moderate positive affect (Karen language, referred to as moderate affect speech) and a third with comparatively milder positive affect (story language, referred to as mild affect speech) (Affect level testing). Average peak frequency was 354 Hz (s.d. 67 Hz, range 258–469 Hz) for motherese, 236 Hz (s.d. 41 Hz, range 211–375 Hz) for moderate affect speech and 275 Hz (s.d. 35 Hz, range 258–328 Hz) for mild affect speech. Average beats per minute were 59 (s.d. 21, range 20–93) for motherese, 77 (s.d. 27, range 20–119) for moderate affect speech and 60 (s.d. 21, range 44–88) for mild affect speech. These three paradigms were presented in a block design (20 s stimulus, 20 s rest), and each speech vignette served as a stimulus. The order of paradigm presentation varied across participants.
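The block timing described above can be illustrated with a minimal sketch of an on/off regressor sampled once per volume. It assumes the 2.5 s repetition time reported under fMRI data acquisition and, as a labelled assumption not stated in the text, that each cycle starts with the stimulus; the function name `block_regressor` is hypothetical, and convolution with a haemodynamic response function is omitted.

```python
def block_regressor(n_blocks, stim_s=20.0, rest_s=20.0, tr_s=2.5, stimulus_first=True):
    """Return a 0/1 list with one entry per fMRI volume (1 = speech on).

    Assumption: every block is exactly stim_s of speech plus rest_s of rest,
    and block order within a cycle is stimulus then rest unless told otherwise.
    """
    stim_vols = round(stim_s / tr_s)   # 20 s / 2.5 s = 8 volumes
    rest_vols = round(rest_s / tr_s)   # 20 s / 2.5 s = 8 volumes
    on, off = [1] * stim_vols, [0] * rest_vols
    cycle = on + off if stimulus_first else off + on
    return cycle * n_blocks


# Example: the 12 motherese vignettes
motherese_design = block_regressor(n_blocks=12)
```

Under these assumptions, each vignette and each rest period spans 8 volumes.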

The motherese paradigm consisted of 12 high-affect, age-appropriate vignettes (8 min 5 s), each spoken by a different female using high-pitched, intonational, lyrical and sing-songy speech characteristic of motherese3,9. Thus, by definition and design, motherese differed from moderate and mild affect speech by having higher pitch and affect. The moderate affect speech consisted of 18 different age-appropriate nursery story vignettes, each spoken by different females (12 min 5 s) with moderate levels of affect.

The mild affect speech has two vignettes spoken by a single female with comparatively milder positive affect and largely absent motherese attributes. While moderate affect speech and motherese had only forward speech, mild affect speech had both forward and backward speech. As brain activation to forward and backward speech stimuli did not differ in the present study, forward and backward speech stimuli were combined, with all speech versus rest as our main contrast. The mild affect speech has been previously described and used in our earlier ASD fMRI results23,24,44,45 with two entirely different cohorts of toddlers, neither one overlapping with the present individuals. In those previous studies, we used it to develop predictors of language outcome among toddlers with ASD22 and to identify gene expression patterns associated with fMRI language hypoactivation in ASD with poor language developmental outcomes23. There were several reasons we included the mild affect speech stimuli. First, it enabled us to test whether our previous large-sample fMRI results from toddlers with ASD versus TD toddlers23,24,44,45 could be replicated in an entirely different cohort of toddlers using the identical speech stimuli. Successful replication of effects would be a strong indicator of the robustness of our fMRI language activation approach and findings in TD as well as toddlers with ASD. Additionally, including this paradigm along with new paradigms enabled us to demonstrate the generalizability of language effects across paradigms in toddlers with ASD. 
Second, by including several stimuli with variable levels of affect, we could examine whether: (a) ASD had significantly reduced temporal neural responses in all three types of language affect paradigms, irrespective of low-level basic stimulus features; (b) there were similarities between TD toddlers and adults in cortical regions activated by the three language affect paradigms, irrespective of low-level basic stimulus features; and (c) TD toddlers would show a predicted pattern of increased right temporal activation with enhanced affective prosody.

Affect level testing

Two computer-based surveys were administered to TD adults to test affect levels of language paradigms.

Each fMRI paradigm consists of unique language segments, that is, 2 mild affect speech, 18 moderate affect speech and 12 motherese segments. For survey 1, each unique segment was presented in random order (same order for each participant). TD adults (n = 19) were instructed to listen to each segment and respond using a Likert scale of 1–5, with a rating of 1 indicating the least amount of affect and a rating of 5 indicating the most.

Survey 2 consisted of 18 trials, each containing a mild affect speech segment, a moderate affect speech segment and a motherese segment. Presenting all three stimulus types allowed for evaluation of differences in affect level across all language paradigms. TD adults (n = 15) then rated each segment using a Likert scale of 1–3, with a rating of 1 indicating the least amount of affect, 2 indicating some affect and 3 indicating very strong affect.

fMRI data acquisition

All fMRI data were collected in a 3 T GE scanner at the University of California, San Diego Center for Functional MRI. Functional images were acquired with a multi-echo echo planar imaging protocol (echo time (TE) 15 ms, 28 ms, 42 ms, 56 ms; repetition time (TR) 2,500 ms; flip angle 78°; matrix size 64 × 64; slice thickness 4 mm; field of view 256 mm; 34 slices). Structural images were acquired using a T1-weighted 3D magnetization-prepared rapid gradient-echo sequence (field of view 256 mm; TE 3.172 ms; TR 8.142 ms; flip angle 12°).

Imaging data preprocessing

Functional data were preprocessed using the multi-echo independent component analysis pipeline ‘meica.py’52,53 implemented in AFNI54 and Python. First, the first four volumes of each run were discarded to allow for magnetization to reach steady state. Next, motion correction parameters were calculated based on the first TE images (TE 15 ms) using a rigid-body alignment procedure. Slice timing correction was implemented for functional images of each TE, which were then normalized to an age-matched infant template55. The time series of four TEs were combined into a single time series56. Both principal and independent component analyses were applied to denoise the data through isolation of thermal (that is, random) noise from structured signals (that is, blood oxygenation level dependent (BOLD, an indirect measure of neural activity via the relative levels of oxyhaemoglobin and deoxyhaemoglobin) and non-BOLD signals) and separation of BOLD and non-BOLD signals. Only the BOLD-like components were retained in the preprocessed images, which were then spatially smoothed with an 8 mm full width at half maximum Gaussian kernel.

Head motion was quantified via framewise displacement (FD)57. For adults and sleeping toddlers, head motion was minimal (mean FD <0.12 mm). There were no group differences either between ASD and TD groups or between adults and toddler groups (Supplementary Table 4).
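As an illustration of how a mean FD value can be derived from rigid-body motion parameters, the following is a minimal sketch of one widely used FD definition: the sum of absolute volume-to-volume changes in the six parameters, with rotations converted to millimetres on a sphere. The 50 mm radius, the parameter ordering and the function names are assumptions for illustration and are not taken from the present study.

```python
def framewise_displacement(motion, radius_mm=50.0):
    """Per-volume FD from rigid-body parameters.

    motion: list of [tx, ty, tz, rx, ry, rz] per volume, with translations
    in mm and rotations in radians (assumed ordering and units).
    The first volume has no predecessor, so its FD is set to 0.
    """
    fd = [0.0]
    for prev, cur in zip(motion, motion[1:]):
        trans = sum(abs(c - p) for c, p in zip(cur[:3], prev[:3]))
        # Rotational change expressed as arc length on a sphere of radius_mm
        rot = sum(abs(c - p) * radius_mm for c, p in zip(cur[3:], prev[3:]))
        fd.append(trans + rot)
    return fd


def mean_fd(motion, radius_mm=50.0):
    """Average FD over all volumes (including the leading 0)."""
    fd = framewise_displacement(motion, radius_mm)
    return sum(fd) / len(fd)
```

A run-level mean FD computed this way could then be compared against a threshold such as the 0.12 mm value mentioned above.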

Whole-brain analyses

First-level and second-level whole-brain activation analyses were conducted with the general linear model in SPM12 (https://www.fil.ion.ucl.ac.uk/spm/software/spm12/). Events in first-level models were based on the canonical haemodynamic response function and its temporal derivative. To take into account the repeated measurements, that is, retest scans, we ran second-level whole-brain analyses with mixed-effects models using the 3dMVM program58 in AFNI:

$$\begin{array}{l}{\mathrm{brain}}\,{\mathrm{activation}}\\ = \beta _0 + \beta _1 \times {\mathrm{group}} + \beta _2 \times {\mathrm{age}} + \beta _3 \times {\mathrm{gender}} + \beta _4 \times {\mathrm{meanFD}} + \varepsilon.\end{array}$$

In mixed-effects models, brain activation to each language paradigm (that is, speech versus rest contrast) served as a dependent variable. Individuals were treated as a random effect, which allows for fixed effects (that is, age, gender and mean FD) to vary for each individual.
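For readers who prefer the random effect written out, one possible explicit form of the model is given below. It assumes a random intercept per individual only; the subscripts (individual $i$, scan $j$), the term $u_i$ and its variance $\sigma_u^2$ are notation introduced here for illustration and do not appear in the original equation.

$$\begin{array}{l}{\mathrm{brain}}\,{\mathrm{activation}}_{ij}\\ = \beta _0 + \beta _1 \times {\mathrm{group}}_i + \beta _2 \times {\mathrm{age}}_{ij} + \beta _3 \times {\mathrm{gender}}_i + \beta _4 \times {\mathrm{meanFD}}_{ij} + u_i + \varepsilon _{ij},\quad u_i \sim N(0,\sigma _u^2).\end{array}$$

Under this reading, $u_i$ captures the dependence between test and retest scans of the same toddler.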

Using a similar approach, we conducted whole-brain analyses with adult data for each language paradigm. However, only within-group tests were performed as all adult participants had typical development.

Resulting activation maps were corrected for multiple comparisons with the family-wise error approach using the 3dClustSim program in AFNI (voxel wise P = 0.005, cluster size >138 voxels for adults and >186 voxels for toddlers). This spatial cluster correction took into account spatial autocorrelation by using the ‘–acf’ option in 3dClustSim.

Calculation of per cent signal changes in temporal regions

Two language-relevant ROIs from the meta-analytic activation map in Neurosynth (https://neurosynth.org/) with the term ‘language’, including left and right temporal regions, were used for ROI analysis (Fig. 2b). These ROIs were identical to those used in previous papers23,24. Given that a toddler template was used for toddler samples, ROIs were co-registered to the toddler template using FSL’s ‘flirt’ function59,60. For each language paradigm, per cent signal changes were calculated with first-level models in speech versus rest contrast for all toddlers and adults.

Stability and validation of fMRI activation in toddlers

Given the challenges relating to implementing sleep imaging with toddlers, test–retest is rarely examined, but it is essential for determining the rigour of this approach. Additional key questions surround the degree to which functional activation patterns vary along the dimension of sleep and awake states, and developmental periods such as between toddlers and adults or in individuals across time. Here, we took steps towards filling these gaps and tested: (1) whether brain activation to language stimuli in sleeping TD toddlers is similar to that in passively listening awake adults and (2) whether brain activation patterns are stable and reproducible in TD individuals and individuals with ASD across time.

These questions were addressed by comparing brain activation patterns between TD adults and TD toddlers, and by computing intraclass correlation coefficients of brain activation within toddlers who were scanned multiple times at intervals of 1–15 months, respectively.
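The text does not specify which form of the intraclass correlation coefficient was computed. As a minimal sketch, the following assumes the two-way, single-measurement consistency form, often written ICC(3,1), applied to a subjects × sessions table of activation values; the function name `icc_3_1` and the table layout are hypothetical.

```python
def icc_3_1(scores):
    """ICC(3,1): two-way mixed, single measurement, consistency.

    scores: list of rows, one per individual, each with k session values
    (for example, [test, retest] per cent signal change).
    Assumption: this ICC form is chosen for illustration only.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between individuals
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between sessions
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols                    # residual

    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)
```

In this form, a constant shift between sessions does not lower the coefficient, because only the rank-preserving consistency of individuals across scans is assessed.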

Activation differences between ASD versus TD and TD versus adults

Next, we compared per cent signal change values (extracted from prior temporal ROIs relevant to language processing) between toddlers with ASD versus TD toddlers as well as between sleeping TD toddlers versus awake adults using two-sample two-tailed t tests, excluding repeated time points. Data distribution was assumed to be normal, but this was not formally tested.
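The group comparison above can be sketched as follows. This minimal example assumes the pooled-variance (Student) form of the two-sample t test, which the text does not specify, and returns only the t statistic and degrees of freedom; the function name `pooled_t` is hypothetical, and obtaining the two-tailed P value requires the t distribution (available, for example, through `scipy.stats.ttest_ind`).

```python
import math
import statistics


def pooled_t(group_a, group_b):
    """Two-sample t statistic with pooled variance (equal-variance assumption).

    group_a, group_b: per cent signal change values, one per individual.
    Returns (t, degrees_of_freedom).
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_a - mean_b) / se, df
```

If group variances differ markedly, an unequal-variance (Welch) variant would be the more cautious choice.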

Brain–behaviour correlation analysis

Using mixed-effects models similar to those described above, but including signal change values for all three language paradigms and with Vineland socialization or communication scores as the predictor of interest (age, gender and mean FD as control variables/fixed effects; both language paradigms and individuals as random effects), we investigated the relevance of brain activation to a child’s social and communication abilities assessed by the Vineland Adaptive Behavior Scales49. The Vineland socialization and communication scores and scan data from all 71 individuals, including test and retest scan data, were used in the mixed-effects models.

Motherese eye tracking paradigm

We used a novel eye tracking motherese paradigm that utilized gaze-contingent technology wherein a toddler’s gaze activates what he/she sees and hears. In this gaze-contingent paradigm, motherese and techno movies were presented side by side on the screen, and toddlers could choose to watch a movie depicting an actress telling a story using motherese or computer ‘techno’ sounds and images. The motherese vignette used in this eye tracking task was 1 of the 12 motherese stimuli included in fMRI experiments. Gaze-contingent paradigms provide strong evidence of volitional preference and attention, as opposed to passive looking. Our gaze-contingent design is a variant of other gaze-contingent eye tracking designs that simulate social interactions38.

Eye tracking was conducted using Tobii software (Tobii Studio; Tobii Pro Lab), and fixation data were collected using a velocity threshold of 0.42 pixels per ms (Tobii Studio Tobii Fixation Filter) or 0.03 degrees per ms (Tobii Pro Lab Tobii IV-T Fixation Filter). Of the 71 toddlers (41 with ASD, 30 TD; 53 male, 18 female, 14–55 months old) who participated in the eye tracking assessment, 54 toddlers (31 with ASD, 23 TD; 41 male, 13 female, 12–42 months) had moderate or good eye tracking performance and total looking time > 50% and were therefore included in the analysis. For 37 toddlers, eye tracking data were collected prior to the MRI scan, while 17 toddlers completed the task after the MRI scan. The distribution of intervals between the age at fMRI and eye-tracking data collection is displayed in Supplementary Fig. 5b. Percentage fixation to motherese was calculated as a ratio of total fixation time to motherese and total fixation time to both motherese and techno displays; the sum of percentage fixation to motherese and techno was 100. Preference for motherese was characterized by per cent fixation to motherese, which was compared between individuals with ASD and TD individuals.
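The preference measure defined above reduces to a simple ratio, sketched here for concreteness. The function name and the millisecond units are hypothetical; only the ratio itself comes from the text.

```python
def percent_fixation_to_motherese(motherese_ms, techno_ms):
    """Per cent of total fixation time spent on the motherese display.

    motherese_ms, techno_ms: summed fixation durations on each display
    (units are an assumption; any consistent unit gives the same ratio).
    """
    total = motherese_ms + techno_ms
    if total <= 0:
        raise ValueError("no fixation time recorded on either display")
    return 100.0 * motherese_ms / total


# Example with made-up durations: 3 s on motherese, 1 s on techno
preference = percent_fixation_to_motherese(3000, 1000)
```

By construction, the corresponding per cent fixation to techno is 100 minus this value.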

Clustering analysis using SNF

SNF is a novel approach for capturing heterogeneity in multiple types of patient data and forming clusters or subgroups. The method reduces noise by aggregating across multiple types of data, detects common and complementary signals from different types of data and reveals the importance of each data type to patient similarity. SNF disentangles neural and clinical heterogeneity by grouping together individuals with the greatest similarity of patterns derived from combining normalized data from all data types. To do so, SNF normalizes all modality values and integrates patterns of values into a single fused network score per individual. As such, it assigns each toddler to clusters based on score similarity. It does not depend on or return conventional parametric statistics. Toddlers in each cluster have maximal similarity to each other in composite patterns of neural and clinical features, and maximal differences from toddlers in other clusters.

We used SNF21 to integrate fMRI brain activation in three language paradigms and clinical measures and then used the Louvain algorithm61 to detect clusters of the similarity network. For this analysis, we included a subset of 50 of the 71 toddlers (30 with ASD, 20 TD) who had successful scans of all three language paradigms. The analysis was performed with six ROI variables (left and right temporal activation for each of the three language paradigms) and 14 clinical variables (three ADOS variables: ADOS social affect, ADOS restricted and repetitive behaviour, and ADOS total; six Vineland variables: Vineland communication, Vineland socialization, Vineland daily living skill, Vineland motor, Vineland adaptive behaviour and Vineland domain total; five Mullen variables: Mullen fine motor, Mullen receptive language, Mullen visual reception, Mullen expressive language and Mullen early learning composite) in R with the SNFtool package. We included both subscale and composite or total variables as they provide different but complementary information. First, ROI and clinical data were normalized separately. Next, pairwise distance matrices between individuals were calculated for ROI or clinical data. Affinity matrices (networks) were computed based on distance matrices. Each affinity matrix is equivalent to a similarity network where nodes are samples (for example, individuals), and weighted edges represent pairwise sample similarities. Network fusion that iteratively updates every network was then performed, making two networks more similar to each other with every iteration. After a few iterations, two networks converged to a single network. Specifically, the similarity matrices were computed by setting the nearest-neighbour parameter to 20 and setting the hyperparameter to 0.5. The iteration for convergence was set as 20. All these parameters were set following the original SNF paper21. 
We constructed the network with the strongest 15% connecting partners of each individual and ran the clustering analysis with the Louvain community algorithm. The clusters were visualized with Cytoscape62.

Further, SNF clustering results were examined for how identified clusters compare on an independent measure of interest. In the original SNF paper, identified genomic patient clusters were compared for outcome differences in cancer survival times of patients. Here, the measure of interest regarding identified clusters was behavioural response to the socially compelling motherese vignette, that is, a test of each child’s current social preference for motherese.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.