INTRODUCTION

Animals and humans live in ever-changing and complex environments. They need to continuously track and learn about the changing properties of their environment and use them to behave adaptively. The neural mechanisms engaged in learning about the environment depend on the type of information being tracked. Nevertheless, the plasticity underpinning several of these forms of learning has been argued to depend on the same molecular mechanism, ie, the N-methyl-D-aspartate (NMDA) subtype of glutamate receptors. Accordingly, a number of studies have investigated whether NMDA receptor manipulations affect learning and memory (Bannerman et al, 2012; Kuriyama et al, 2011; Bohn et al, 2003).

Most of the studies looking at the role of NMDA receptors in reward learning have focused on simple tasks, such as a single association between a stimulus and an outcome or its reversal. However, in more ecological, and thus complex, scenarios, learning and decision-making often involve consideration of positive and negative aspects of potential outcomes as well as the integration of learnt information with information that is explicitly cued. Importantly, it has recently been shown that integration of information conveyed by different dimensions during decision-making is an active process recruiting particular neural mechanisms (Burke et al, 2013; Stein and Stanford, 2008). Thus, to understand the full role of NMDA receptors in reward learning and decision-making, it might not be sufficient to only study a single simple component in isolation.

Interestingly, there is some, albeit indirect, evidence that NMDA receptors are not only important for learning, but also for integration of information: NMDA receptor blockade has been reported to affect the integration of multisensory information in cat superior colliculus (Binns and Salt, 1996) or the integration of reward and delay in rats (Floresco et al, 2008). However, to our knowledge, it has not been investigated whether NMDA receptors play a role in human value-based decision-making and learning.

To examine the influence of changes in NMDA receptor activity on complex learning and decision-making behaviors in humans, we used the partial NMDA agonist d-cycloserine. D-cycloserine binds to the glycine site of the NMDA receptor. Glycine is a co-agonist of the NMDA receptor, meaning that NMDA receptors only open when both glycine and glutamate bind. D-cycloserine can thus increase the probability of glutamate release opening NMDA receptors, which in turn enhances NMDA receptor-mediated activation. We designed a multi-attribute decision-making task in which participants learnt changing probabilities of gains and losses of two options. They made choices between options by integrating those learnt probabilities with explicitly cued information about gain and loss magnitudes. To assess potential effects of d-cycloserine on learning and decision-making, we used reinforcement-learning models.

Surprisingly, we did not find any evidence for a change in the rate at which participants learned about reward or punishment outcomes. However, we found that d-cycloserine improved decision-making. D-cycloserine led to a more optimal integration of the learnt probability information with the explicitly cued magnitude information.

METHODS

Participants

The study was approved by the local ethics committee. In total, 52 healthy volunteers (age 18–30) took part in the study (inclusion details in Supplementary Methods). The groups were well-matched on sociodemographic and personality parameters (Table 1). Five participants were excluded (Supplementary Methods). There remained 20 participants in the d-cycloserine and 27 participants in the placebo group.

Table 1 Sociodemographic and Questionnaire Measurements

Procedure

In a double-blind, placebo-controlled design, participants were randomly allocated to a single dose of d-cycloserine (250 mg) or matching placebo capsule. They fasted 2 h before the testing visit. The 250 mg dose was chosen in agreement with recent studies (Klumpers et al, 2012; Onur et al, 2010). Participants were tested 3 h after drug administration. According to product information (King’s Pharmaceutical), plasma peak levels are reached within 3–4 h; other studies (van Berckel et al, 1997, 1998; Patel et al, 2011) found that peak levels are reached within 1 h. However, given d-cycloserine’s half-life of 8–12 h (product information) or 15 h (Patel et al, 2011), plasma levels would have been close to peak levels during testing under either time-to-peak estimate. To assess potential subjective changes following d-cycloserine, participants completed questionnaires (Table 1) before capsule intake and before testing.
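The timing argument can be checked with a rough calculation. The sketch below assumes simple first-order (single-exponential) elimination after the peak, which is a simplification introduced here and not something stated in the product information or the cited studies; the function name is illustrative only.

```python
def fraction_of_peak(hours_since_peak: float, half_life_h: float) -> float:
    """Fraction of the peak plasma level remaining after a given time.

    Assumes first-order elimination (an illustrative simplification).
    """
    return 0.5 ** (hours_since_peak / half_life_h)


# Least favorable case for testing at 3 h: peak already reached at 1 h,
# so the drug has been eliminated for 2 h by the time of testing.
for half_life in (8.0, 12.0, 15.0):
    remaining = fraction_of_peak(2.0, half_life)
    print(f"half-life {half_life:>4} h: {remaining:.2f} of peak")
```

Under these assumptions, even with an early peak at 1 h and the shortest quoted half-life, more than four fifths of the peak level would remain at the time of testing.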

Probabilistic Instrumental Learning Task

Participants performed a probabilistic learning task with monetary wins and losses (Figure 1a). Participants made repeated choices between two options with the aim to maximize their monetary pay-off.

Figure 1

(a) At the beginning of a trial two options appeared on the left (pink square) and right side (yellow square) of the screen. Throughout the experiment, the pink square was always on the left side and the yellow square was always on the right side. Reward (bars at the top) and loss magnitudes (circles at the bottom) were presented overlaid on the option symbols. After 500 ms, a question mark appeared after which the participants chose an option. After participants made their selection (there was no time-out), the outcomes of the gambles were shown (b), first for the chosen option (duration: 2.5 s—left option in the example), then also for the unchosen option (duration: 2.5 s). If the gamble outcomes of the chosen option led to a reward, the reward bar was shown; otherwise the reward bar was not shown. Similarly, if the gamble outcome of the chosen option led to a loss, the loss circle was shown; otherwise, it was not shown. The sum of the reward and the loss incurred for the chosen option in a trial was added to a status bar at the bottom of the screen, allowing participants to keep track of their overall gains. Subsequently, the participants were shown the outcomes for the unchosen option in the same way, except that no points were added to the status bar. Importantly, presenting the outcomes of the chosen and the unchosen option ensured that participants had an equal chance to learn the probabilities of the chosen and the unchosen option. After an inter-trial interval of 1.5 s, the next trial started. (c) Example reward probabilities for the two options over the course of the experiment. The probabilities were either stable at 20% or 80%, or they drifted between 20 and 80%, taking between five and eight trials per drift. (d) Example reward probability for one of the options (solid line), together with the probability estimates from the Bayesian learner used (dotted line).

On each trial, participants had a choice between two options. Each option had four independent attributes: a reward magnitude, a loss magnitude, a reward probability, and a loss probability. The magnitudes determined how many points could be won (and lost) on this trial, while the probabilities determined how likely winning and losing were, respectively. After participants selected one of the options, they were shown the outcomes for both options. However, only the option they had chosen contributed to the participants’ earnings. In trials where the chosen option incurred both a win and a loss, the participants’ earnings in that trial were the sum of both. Therefore, to maximize the overall gains, participants should choose on each trial the option with the highest reward utility (reward magnitude × reward probability) and the lowest loss utility (loss magnitude × loss probability).
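As a concrete illustration of this choice rule, the following sketch computes the expected net points for each option when the true probabilities are known. The function name and the example numbers are hypothetical, and subtracting the loss utility from the reward utility with equal weight is an assumption made for illustration.

```python
def expected_net_points(reward_mag, reward_prob, loss_mag, loss_prob):
    """Expected points for one option: reward utility minus loss utility."""
    reward_utility = reward_mag * reward_prob
    loss_utility = loss_mag * loss_prob
    return reward_utility - loss_utility


# Hypothetical trial: the left option has a smaller but more likely reward.
left = expected_net_points(reward_mag=60, reward_prob=0.8, loss_mag=40, loss_prob=0.2)
right = expected_net_points(reward_mag=90, reward_prob=0.2, loss_mag=10, loss_prob=0.8)
best_option = "left" if left > right else "right"
print(best_option)
```

In this example the right option shows the larger reward magnitude, yet the left option has the higher expected value, which is why magnitudes alone are not a sufficient guide to choice.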

Reward and loss magnitudes were explicitly cued at the time of choice and were randomly drawn from a flat distribution between 1 and 100 points. In contrast, the probabilities were not explicitly shown and had to be learnt across trials by observing the outcomes. The outcomes for one option could either be a win and a loss, only a win, only a loss, or neither win nor loss. The independent reward and loss probabilities determined the probability of these outcomes. The probabilities varied over the course of the experiment between 20 and 80%, with only one of the four probabilities varying at any given time (Figure 1c).
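The four possible outcomes follow directly from drawing the win and the loss independently. The sketch below shows one way such outcomes could be generated; it is not the task code used in the study, and the function name, the seed, and the fixed probabilities are assumptions for illustration.

```python
import random


def sample_outcome(reward_prob, loss_prob, rng):
    """Draw independent win and loss events for one option on one trial."""
    win = rng.random() < reward_prob
    loss = rng.random() < loss_prob
    return win, loss


rng = random.Random(0)  # fixed seed so the example is reproducible
outcomes = [sample_outcome(0.8, 0.2, rng) for _ in range(1000)]

# Proportion of trials with a win; should be close to the reward probability.
win_rate = sum(win for win, _ in outcomes) / len(outcomes)
# All four combinations (win and loss, only win, only loss, neither) can occur.
distinct_outcomes = set(outcomes)
```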

Before the experiment, participants were instructed about the task (see Supplementary Methods). Participants first performed 30 training trials, followed by 381 test trials, which were included in the analysis. Each participant was tested using the same task schedule to allow for better group comparisons. At the start of the task, participants were given £4 (400 points) to ensure that they had sufficient funds to sustain losses incurred even at the beginning of the experiment.

Analysis

The behavioral analysis compared the effects of d-cycloserine vs placebo on (a) learning of reward and loss probabilities, and (b) integration of learnt (probabilities) and explicitly cued (magnitudes) information for guiding complex decisions. All analyses were performed in Matlab and SPSS.

Logistic regression analysis

To ensure that participants learnt the probabilities, we first assessed the impact of past outcomes (reward and loss) and the explicitly cued magnitudes (reward and loss) on choice, using a logistic regression analysis and normalized regressor estimates. We included regressors for the last five trial differences in reward and loss outcomes between the two options, as well as the differences in the explicitly cued magnitudes.

To investigate whether the groups differed in their learning speeds, the resulting regression weights for the past outcomes were entered into an ANOVA with group (d-cycloserine vs placebo) as a between-subject factor and time (1, 2, 3, 4, or 5 trials in the past) and valence (reward or loss) as within-subject factors.
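The regressors for such an analysis can be assembled as one row per trial. The sketch below is a minimal illustration, assuming that all differences are coded as left minus right and that trials without a full five-trial history are dropped; neither detail is specified in the text, and the function name is hypothetical. The normalization of regressors and the logistic fit itself are omitted.

```python
def lagged_design(reward_diff, loss_diff, reward_mag_diff, loss_mag_diff, n_lags=5):
    """Build regressor rows for trials that have a full outcome history.

    reward_diff[t], loss_diff[t]: left-minus-right outcome difference on trial t.
    reward_mag_diff[t], loss_mag_diff[t]: left-minus-right cued magnitude difference.
    Row layout: [reward magnitude, loss magnitude, reward lags 1..n, loss lags 1..n].
    """
    rows = []
    for t in range(n_lags, len(reward_diff)):
        past_rewards = [reward_diff[t - k] for k in range(1, n_lags + 1)]
        past_losses = [loss_diff[t - k] for k in range(1, n_lags + 1)]
        rows.append([reward_mag_diff[t], loss_mag_diff[t]] + past_rewards + past_losses)
    return rows


# Hypothetical eight-trial history.
design = lagged_design(
    reward_diff=[1, 0, -1, 1, 0, 1, -1, 0],
    loss_diff=[0, 1, 0, -1, 1, 0, 0, 1],
    reward_mag_diff=[10, -5, 3, 0, 7, -2, 4, 1],
    loss_mag_diff=[-1, 2, 0, 5, -3, 6, 1, -4],
)
```

Each row then has 12 entries (2 magnitude differences plus 5 reward and 5 loss lags), so the fitted weights for the lagged columns can be entered into the group × time × valence ANOVA.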

Modeling

To look at the learning effects more specifically and assess participants’ strategies for the integration of information to make decisions, we used reinforcement-learning models to fit each participant’s trial-by-trial behavior.

Each model consisted of three main components. First, the model had estimates about the probabilities underlying the outcomes of both options. These were updated on every trial using a reinforcement-learning algorithm. Second, the probability estimates were integrated with the explicitly cued magnitudes to calculate how valuable each of the two options was (ie, their utility). Third, these two utilities were compared to predict participants’ choices. To determine the best parameter estimates for every participant, we used a standard log-likelihood maximization procedure.
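The fitting step can be illustrated with a deliberately reduced example. The sketch below fits only a single soft-max parameter by grid search over a negative log-likelihood; the grid search, the function names, and the toy data are assumptions made for brevity, whereas the actual models had several free parameters and the optimizer used is not specified here.

```python
import math


def negative_log_likelihood(beta, utility_diffs, chose_left):
    """Negative log-likelihood of observed choices under a soft-max rule.

    utility_diffs[t]: utility of left minus utility of right on trial t.
    chose_left[t]: True if the left option was chosen on trial t.
    """
    nll = 0.0
    for diff, left in zip(utility_diffs, chose_left):
        p_left = 1.0 / (1.0 + math.exp(-beta * diff))
        p_choice = p_left if left else 1.0 - p_left
        nll -= math.log(max(p_choice, 1e-12))  # guard against log(0)
    return nll


def fit_beta(utility_diffs, chose_left, grid):
    """Return the grid value of beta that maximizes the likelihood."""
    return min(grid, key=lambda b: negative_log_likelihood(b, utility_diffs, chose_left))


# Toy data in which every choice follows the sign of the utility difference.
best_beta = fit_beta([0.5, -0.3, 0.8, -0.6], [True, False, True, False], [0.1, 1.0, 5.0])
```

Because every toy choice favors the higher-utility option, the largest value on the grid gives the best fit; with real, noisier choices the maximum would typically lie at an intermediate value.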

When calculating how valuable each option is, participants might use different decision strategies for integrating learnt probabilities with explicit magnitudes. They could use a mathematically optimal strategy (utility as probability × magnitude). However, as this optimal strategy may be quite cognitively taxing, they could resort to a heuristic strategy (utility as a weighted sum of probability and magnitude). To test which decision strategy participants used, we fitted different models to the data. To test for differences in strategy, we then compared how well each of these models explained the groups’ behavior. In addition, we fitted a third model, which directly estimated to what degree they used either decision strategy.

Optimal model

This model assumed that participants integrated the learnt probabilities optimally with the explicit magnitudes (magnitude × probability). The learning of probabilities was modeled using a standard reinforcement-learning rule. On each trial, the estimated probability of an attribute was updated based on the trial’s outcome, as a function of the prediction error (PE):

$$p_{t+1} = p_t + \alpha \cdot \mathrm{PE}_t$$

with

$$\mathrm{PE}_t = o_t - p_t$$

where α is the learning rate, $p_t$ is the probability estimate on trial $t$, and $o_t$ is the observed outcome on that trial (1 if the event occurred, 0 otherwise). Thus, the learning rate is a measure of how much participants updated their probability estimate for an attribute (eg, the probability that the left option yields a reward) when the outcome differed from their expectation. Separate learning rates were used for learning about wins and losses.
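The prediction-error update can be sketched in a few lines. This is a minimal illustration, assuming outcomes are coded as 1 (event occurred) and 0 (event did not occur) and that estimates start at 0.5; both are assumptions for the example, and the function name is hypothetical.

```python
def update_estimate(estimate, outcome, alpha):
    """One prediction-error update of a probability estimate."""
    prediction_error = outcome - estimate
    return estimate + alpha * prediction_error


# Track the reward probability of one option over a short outcome sequence.
estimate = 0.5  # assumed starting estimate
for outcome in [1, 1, 0, 1]:
    estimate = update_estimate(estimate, outcome, alpha=0.3)
```

With a learning rate of 0 the estimate never changes, and with a learning rate of 1 it simply equals the most recent outcome; intermediate values average over recent trials with exponentially decaying weights.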

These predictions were combined optimally with the shown magnitudes:

The loss utility was computed in the same way and combined with the reward utility:

where λ determines how much participants weighted the prospect of rewards vs losses.

A standard soft-max decision rule was used to predict the probability of choosing an option.

where β reflects a participant’s ability to pick the option with higher utility.
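The optimal model's utility and choice rule can be sketched as follows. The exact equations are not reproduced in this text, so the forms used here are assumptions: λ weights the reward utility and (1 − λ) the loss utility, magnitudes are rescaled to the range 0 to 1, and the soft-max is written as a logistic function of the utility difference scaled by β. Function names are hypothetical.

```python
import math


def optimal_utility(p_reward, m_reward, p_loss, m_loss, lam):
    """Multiplicative (probability x magnitude) utility of one option."""
    reward_utility = p_reward * m_reward
    loss_utility = p_loss * m_loss
    # Assumed form of the reward/loss weighting by lambda.
    return lam * reward_utility - (1.0 - lam) * loss_utility


def choice_probability(utility_a, utility_b, beta):
    """Soft-max probability of choosing option A over option B."""
    return 1.0 / (1.0 + math.exp(-beta * (utility_a - utility_b)))


# Hypothetical trial with magnitudes rescaled to 0..1.
u_left = optimal_utility(p_reward=0.8, m_reward=0.6, p_loss=0.2, m_loss=0.4, lam=0.5)
u_right = optimal_utility(p_reward=0.2, m_reward=0.9, p_loss=0.8, m_loss=0.1, lam=0.5)
p_left = choice_probability(u_left, u_right, beta=5.0)
```

A larger β makes choices more deterministic in favor of the higher-utility option, while β = 0 yields random choice regardless of utility.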

To assess the effect of d-cycloserine on learning about wins and losses, we compared their respective learning rates between groups.

Heuristic model

This model differed from the optimal model only in the decision strategy for integrating learnt probabilities with explicit magnitudes. Instead of being a product of probability and magnitude, utility was computed as a weighted sum.

where μ is the probability weighting factor, describing the relative importance of the learnt probability compared to the explicit magnitude. The loss utility was computed in the same way, sharing the same μ.

Again, we compared the groups’ learning rates. Additionally, because a change in learning could also manifest as a changed reliance on learnt compared to explicit information, we also compared the probability weighting factors between the groups.
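The additive rule can be sketched as a convex combination. The form below, with weight μ on the probability and (1 − μ) on the magnitude, is an assumption consistent with the description of μ as a relative weighting; magnitudes are again assumed to be rescaled to the range 0 to 1, and the function name is hypothetical.

```python
def heuristic_utility(probability, magnitude, mu):
    """Additive utility: weighted sum of learnt probability and cued magnitude."""
    return mu * probability + (1.0 - mu) * magnitude


# The same option evaluated under different probability weights.
mostly_probability = heuristic_utility(probability=0.8, magnitude=0.3, mu=0.9)
mostly_magnitude = heuristic_utility(probability=0.8, magnitude=0.3, mu=0.1)
```

Unlike the multiplicative rule, this form assigns a sizeable utility to an option with a large magnitude even when its probability is close to zero, which is what makes it a heuristic rather than an optimal strategy.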

Hybrid model

We used the hybrid model to examine whether d-cycloserine affected how participants integrated information, shifting them towards a more optimal, less heuristic decision strategy. The hybrid model computed utility as a weighted sum of the utilities from the optimal and the heuristic model:

where ω is the heuristic weight factor, determining how much the overall utility is like the heuristic or the optimal utility. The higher ω (between 0 and 1), the more a participant relied on a heuristic decision rule.
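A sketch of the hybrid utility for a single valence is given here. It assumes that ω weights the additive utility and (1 − ω) the multiplicative utility, in line with the statement that higher ω means more heuristic behavior; the precise equation is not reproduced in this text, and the function name and the rescaling of magnitudes to 0 to 1 are assumptions.

```python
def hybrid_utility(probability, magnitude, mu, omega):
    """Mixture of additive (heuristic) and multiplicative (optimal) utility."""
    heuristic = mu * probability + (1.0 - mu) * magnitude
    optimal = probability * magnitude
    return omega * heuristic + (1.0 - omega) * optimal


# omega = 0 recovers the optimal rule, omega = 1 the heuristic rule.
purely_optimal = hybrid_utility(0.8, 0.5, mu=0.5, omega=0.0)
purely_heuristic = hybrid_utility(0.8, 0.5, mu=0.5, omega=1.0)
```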

Model Comparison

If the groups differ in their decision strategy, this should also be reflected in how well the models incorporating the different strategies can explain behavior. To assess this, we compared the model fits using the Akaike information criterion (AIC). We calculated for each participant the AIC differences between the optimal and the heuristic model and also between the heuristic and the hybrid model. This allowed us to compare whether the d-cycloserine group differed from the placebo group in how well their behavior was explained by the heuristic relative to the optimal model and by the heuristic relative to the hybrid model.
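The per-participant comparison rests on the standard definition AIC = 2k − 2 ln L, where k is the number of free parameters and L the maximized likelihood. In the sketch below the log-likelihoods are made-up numbers, and the parameter counts (4 for the optimal model, 5 for the heuristic model) are inferred from the model descriptions rather than stated explicitly, so they should be treated as assumptions.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion; lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


# Hypothetical maximized log-likelihoods for one participant.
aic_optimal = aic(log_likelihood=-210.0, n_params=4)
aic_heuristic = aic(log_likelihood=-200.0, n_params=5)

# Positive difference: the heuristic model fits better despite its extra parameter.
aic_difference = aic_optimal - aic_heuristic
```

Difference scores of this kind, computed for every participant, can then be compared between the two groups with a t-test.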

To confirm the results from the modeling analysis using a different method, we performed an additional regression analysis with regressors analogous to the components of the hybrid model. We included regressors for the explicit magnitude differences, for the learnt probability estimate differences, as well as for the difference in optimal utilities (magnitude × probability). The probability estimates were obtained using a Bayesian learner, like the one described in Behrens et al, 2007; also see Supplementary Methods. These Bayesian probability estimates are the most accurate estimates a participant could have given the past outcomes (Figure 1d). If participants’ decision strategy is more heuristic, the main effects should have a larger impact on behavior. Conversely, if the decision strategy is more optimal, the interaction term (magnitude × probability) should have a higher impact.

Learning About the Unchosen Option

There is some evidence (Boorman et al, 2011) that different brain areas are used to learn about the chosen and the unchosen option; they could therefore be affected differently by d-cycloserine. We found that d-cycloserine did not affect the usage of the unchosen option’s outcomes for decision-making or learning (see Supplementary Methods and Results).

RESULTS

General Performance

In the task, participants had to constantly track the independent win and loss probabilities of two options and integrate these with explicitly cued reward and loss magnitudes. The groups did not differ in the overall earnings or in the mean points won/lost (Table 2a).

Table 2a Results of a General Behavioral Analysis

Logistic Regression

As a measure of learning, we assessed the impact of past reward/loss outcomes on choice, using a logistic regression analysis. We also included the currently displayed reward/loss magnitudes. Participants were more likely to pick options with higher reward (t(46)=18.6, P=5 × 10^−23) and lower loss magnitudes (t(46)=−14.9, P=3 × 10^−19). They also chose options more frequently when they were associated with more past wins and fewer losses (Figure 2a), thus suggesting that they were able to learn from past outcomes.

Figure 2

(a) Decision weights (beta) for placebo (white) and d-cycloserine (gray), showing the decision impact of current magnitude differences (left–right) and past gamble outcome differences (left–right), for one to five trials in the past, on choice. (b) Decision weights (beta) based on a regression using shown magnitude differences, probability prediction differences (estimated using a Bayesian model), and their interaction for both groups. (c) AIC difference scores comparing the relative fit of the hybrid to the heuristic and of the heuristic to the optimal model in both groups. Error bars indicate SE. +P=0.056, *P≤0.05, **P<0.001, ****P<10^−4.

To test whether learning differed between the groups, and whether any difference depended on reward and loss valence, we ran a 2 (group) × 5 (time point) × 2 (valence) ANOVA on the regression weights of the past outcomes (Figure 2a). That participants learnt the reward/loss probabilities over time was evidenced by the fact that recent reward/loss outcomes influenced choices more than outcomes further in the past (main effect of time, F(4,180)=80.6, P<10^−6). This effect of time was stronger for losses than rewards (interaction effect, time × valence, F(4,180)=6.2, P=3.8 × 10^−4), suggesting that loss probabilities were learnt more quickly.

Importantly, the groups differed neither in overall learning speed (time × group: F(4,180)=0.9, P=0.42) nor in their relative learning speeds for wins and losses (time × valence × group, F(4,180)=0.3, P=0.89).

Modeling

The aims of the modeling analyses were to assess whether the groups differed in their learning rate (all three models), as well as whether they differed in their decision strategies (model three). Each model used a standard prediction-error learning rule for learning about reward/loss probabilities. The models only differed in how learnt probabilities were integrated with explicit reward/loss magnitudes.

Optimal model

This model integrated magnitudes and probabilities optimally using a multiplicative rule. For assessing differences in learning, we compared the learning rate parameters between the groups. Surprisingly, the groups did not differ in their learning rates (reward: t(45)=−0.2, P=0.8, loss: t(45)=0.7, P=0.5), nor did they differ in any other model parameter (Table 2b).

Table 2b Parameter Estimates (Mean and Standard Error) for both Groups and the Results of Significance Tests for the Optimal Decision Model

Heuristic model

Participants might use a more heuristic and less optimal decision strategy to reduce task complexity. This was accounted for in the heuristic model, in which the decision utility was calculated as a weighted sum of probability and magnitude. Here, we could detect a change in learning either as a changed learning rate or as a change in how much impact the learnt probabilities have on decisions relative to the explicit magnitudes. Again, we did not find any group differences for the two learning rates (reward: t(45)=1.3, P=0.20, loss: t(45)=0.7, P=0.5) or for the probability weighting factor (t(45)=−0.36, P=0.72). The groups did not differ either in any other model parameter (Table 2c).

Table 2c The Heuristic Decision Model

Hybrid model

This model combined the optimal and heuristic model’s decision strategy using an additional weighting factor (ω, heuristic weight); ω indexed the degree to which participants used the heuristic relative to the optimal strategy. Again, the two groups did not differ in their learning rates (Table 2d). However, there was a strong difference in the heuristic weight between the groups, with the d-cycloserine group relying more on an optimal rather than heuristic decision strategy compared to the placebo group (t(45)=4.6, P=3.6 × 10^−5).

Table 2d The Hybrid Decision Model

Model comparison

To investigate this shift in decision strategy further we compared how well each model explained the behavior of each group, using the AIC (Figure 2c). We hypothesized that if participants in the d-cycloserine group used a more optimal decision strategy, models incorporating the optimal decision rule should better explain their behavior compared to the placebo group.

First, we compared the optimal model relative to the heuristic model. In both groups, the heuristic model provided a better fit to the data than the optimal model. However, this effect was weaker in the d-cycloserine group than in the placebo group, as shown by a significant group difference (t(45)=2.1, P=0.046). This suggests that relative to the placebo, the d-cycloserine group used a more optimal strategy. We also compared the fits of the heuristic to the hybrid model. Similarly, we found that the difference in scores varied significantly between the groups (t(45)=3.6, P=8.0 × 10^−4). For the d-cycloserine group, the hybrid model provided the best fit (t(19)=3.9, P=0.01). However, for the placebo group, both models explained their behavior equally well (t(26)=1.6, P=0.11). These findings are in line with those from the modeling analysis above and suggest that d-cycloserine shifted participants’ behavior to rely more on an optimal decision strategy.

To investigate this further, we ran a logistic regression analysis, analogous to the hybrid model (Figure 2b). It included an interaction term for magnitude and probability (analogous to the optimal model), as well as each factor separately (analogous to the heuristic model). Similar to our hybrid model findings, the groups did not differ in the heuristic aspect: there was no difference between the main effects for magnitude and probability (magnitude: F(1,45)=2.2, P=0.14, probability: F(1,45)=0.98, P=0.33). In contrast, the groups differed on the optimal aspect of the model: a two-by-two ANOVA analysis on the interaction regressors of magnitude and probability for reward and loss revealed that those regressors had a stronger impact on the decisions in the d-cycloserine group (F(1,45)=10.8, P=0.002). Post-hoc tests revealed that this effect was significant for loss (t(45)=−2.7, P=0.01) and was marginally significant for reward (t(32.4)=−2.0, P=0.056).

DISCUSSION

Summary of the Results

We studied the effect of changes in NMDA activity on human reward-guided learning and decision-making using d-cycloserine, a partial NMDA agonist. Participants had to learn changing probabilities of positive and negative outcomes over time and integrate them with explicitly cued reward and loss magnitudes in order to make decisions. Although participants were able to track the changing probabilities, this learning was not modulated by d-cycloserine. If there had been an effect on learning, we would have expected either a change in learning rates or alternatively a changed reliance on the learnt compared to the explicit information; neither was the case.

However, d-cycloserine had a significant effect on decision-making. The drug shifted decision-making away from a heuristic approach towards a more optimal one. In the current context, more optimal behavior meant that participants multiplied the learnt probability and the displayed magnitude information rather than adding them (heuristic behavior). Note, however, that such multiplication might be more demanding especially since participants tracked several pieces of information at a time, making heuristic strategies an appealing way of reducing task complexity.

Absence of a Learning Effect

To our knowledge, there are no studies in humans that have looked at the effects of d-cycloserine on reward-guided learning and decision-making. Animal studies suggest that d-cycloserine improves fear extinction (Walker et al, 2002), drug use extinction (Flavell et al, 2011), and appetitive conditioning (Golden and Houpt, 2007). However, our task is most similar to instrumental reward reversal learning paradigms, and it is still unknown whether this neurally distinct form of learning is also improved by d-cycloserine; Golden and Houpt, 2007 found that at least simple reversal learning might not be.

Additionally, evidence for learning effects in human studies analogous to the animal studies above has been more inconsistent. Although there is evidence from clinical trials that d-cycloserine may accelerate the therapy of different anxiety disorders (reviewed in Hofmann et al, 2013), some experimental studies have not been able to replicate the basic fear extinction effects in humans (Guastella et al, 2007; Klumpers et al, 2012). Furthermore, clinical trials for alcohol or for cocaine use extinction have failed to find an enhancement by d-cycloserine (Kamboj et al, 2011; Price et al, 2013). This discrepancy between human and animal studies could be due to the strong dose dependency of d-cycloserine’s effects (Walker et al, 2002): in general, animal studies have used a higher dose (by about a factor of three) and injection rather than oral administration of the drug.

Changes in Evidence Integration for Decision-Making

We found that d-cycloserine led to a very selective shift in participants’ decision strategy, making them integrate different sources of information in a more optimal manner, compared to the placebo group. More specifically, participants integrated the learnt probability and the explicitly cued magnitude more multiplicatively (ie, non-linearly) and thus less additively (ie, linearly).

Recent work has suggested functions for NMDA receptors beyond traditional long-term consolidation in learning. To our knowledge, nobody has yet tested whether changes in NMDA-mediated neural processing would affect evidence integration in a complex value-based decision-making task in humans. There is, however, some evidence that NMDA receptors can modulate higher cognitive functions, like working memory (Wang et al, 2013), attention (Herrero et al, 2013; Meuwese et al, 2013; Self et al, 2012), or delay discounting (Floresco et al, 2008). It is important to note that the decision-making effects in our task are unlikely to be a working-memory effect in disguise. Collins and Frank, 2012 have shown that in a standard human learning paradigm (similar to the one used here), learning rate captures a mixture of cortico-striatal learning systems as well as prefrontal working memory. Therefore, if the observed effect were driven by a working-memory mechanism, we would have expected to see a change in at least one of the different measures we used to probe for learning effects. We did not see any such effects.

There are at least two theoretical frameworks that are, however, in line with our results. First, NMDA receptors have been proposed as crucial for the integration of information over time during decision-making in a biophysical model (Wang, 2002). Although this model was originally based on animal recordings during perceptual decision-making tasks (reviewed in Wang, 2008), recent human brain imaging studies have shown that it is also relevant for human reward-based decision-making (Chau et al, 2014; Hunt et al, 2012; Jocham et al, 2012). Simulation studies have suggested that modulation of the NMDA receptor activity should enhance or impair decision-making, depending on the baseline activation of NMDA receptors (Standage and Paré, 2011). Following these findings, the decision-making improvement in our study could be explained as shifting the NMDA receptor activation into a more favorable range for decision-making.

Second, there is evidence that NMDA receptors are important for neural computations when more than one input is integrated non-linearly. For example, Binns and Salt, 1996 have shown that NMDA receptors are crucial for multisensory integration. More specifically, they found that the neuronal super-additivity between inputs of different modalities in the cat superior colliculus was strongly reduced by NMDA blockade. Furthermore, Self et al, 2012 have shown that NMDA receptors are crucial for the integration of top-down and bottom-up information during figure–ground separation. In the context of our task, the optimal way to integrate the probabilities and the magnitudes presented is non-linear (ie, multiplicative) rather than linear (additive). Therefore, d-cycloserine may have led to more optimal integration by biasing processing towards more non-linear neural interactions.

One possible explanation for the occurrence of an effect on decision-making in the absence of a learning effect is that, while both are supported by NMDA receptors, the optimal range of NMDA activation differs between the two functions (Standage and Paré, 2011). In addition, the strength of the effect of d-cycloserine might vary across brain areas, due to different NMDA receptor subunit compositions or different local levels of glycine (see Yang and Svensson, 2008 for a review). Therefore, the behavioral dissociation could be explained by different d-cycloserine sensitivities in brain areas related to reward-based decision-making compared to those related to reward learning.

Future Directions

Future experiments will have to establish whether our results generalize to the integration of dimensions other than probability and magnitude, and to situations where all information to be integrated is explicitly shown rather than learnt.

It would also be interesting to manipulate the predicted choices for linear vs non-linear integration more widely to produce more distinct changes in choices and outcome between the groups, as well as testing whether other forms of non-linear integration or conditional reasoning (model-based) would also be affected.

Lastly, future decision-task designs could explicitly contrast which theoretical NMDA integration mechanism mentioned above drives the behavioral effect: while the biophysical model explanation would predict decision enhancements in situations where attributes are integrated linearly or non-linearly, the super-additivity mechanism would predict enhancements only where attributes need to be integrated non-linearly.

CONCLUSION

We have found that enhancing the activation of NMDA receptors changed how humans integrated evidence in a multi-attribute decision-making and learning task in the absence of more direct effects on learning. This suggests a role for NMDA receptors beyond their well-studied role in long-term consolidation and plasticity. Although we cannot clearly dissociate between different candidate mechanisms for our observed effect, either biophysical attractor networks or basic super-additivity models could plausibly explain our data.

FUNDING AND DISCLOSURE

This research was funded by the Oxfordshire Health Services Research Committee (research grant awarded to AR, OHSRC 1021) and by the Wellcome Trust (JS and NK were awarded Four-Year PhD Studentships, 092759/Z/10/Z and 089280/Z/09/Z and MFSR holds a Senior Investigator Award, WT100973AIA). CJH has valueless shares in p1vital and serves on their advisory panel. She has received consultancy payments from p1vital, Servier, Eli Lilly, AstraZeneca, and Lundbeck and is a director of Oxford Psychologists. The remaining authors declare no conflict of interest.