Abstract
The aggregation of many independent estimates can outperform the most accurate individual judgement1,2,3. This century-old finding1,2, popularly known as the 'wisdom of crowds'3, has been applied to problems ranging from the diagnosis of cancer4 to financial forecasting5. It is widely believed that social influence undermines collective wisdom by reducing the diversity of opinions within the crowd. Here, we show that if a large crowd is structured in small independent groups, deliberation and social influence within groups improve the crowd's collective accuracy. We asked a live crowd (N = 5,180) to respond to general-knowledge questions (for example, "What is the height of the Eiffel Tower?"). Participants first answered individually, then deliberated and made consensus decisions in groups of five, and finally provided revised individual estimates. We found that averaging consensus decisions was substantially more accurate than aggregating the initial independent opinions. Remarkably, combining as few as four consensus choices outperformed the wisdom of thousands of individuals.
Main
Understanding the conditions under which humans benefit from collective decision-making has puzzled mankind since the origin of political thought6. Theoretically, aggregating the opinions of many unbiased and independent agents can outperform the best single judgement1, which is why crowds are sometimes wiser than their individual members2,3. This principle has been applied to many problems, including predicting national elections7, reverse-engineering the smell of molecules8 and boosting medical diagnoses4. The idea of wise crowds, however, is at odds with the pervasiveness of poor collective judgement9. Human crowds may fail for two reasons. First, human choices are frequently plagued by systematic biases10. Second, opinions in a crowd are rarely independent. Social interactions often cause informational cascades, which correlate opinions, aligning and exaggerating the individual biases11. This imitative behaviour may lead to 'herding'9, a phenomenon thought to be the cause of financial bubbles12, rich-get-richer dynamics13,14 and zealotry15. Empirical research has shown that even weak social influence can undermine the wisdom of crowds16, and that collectives are less biased when their individuals resist peer influence17. Extensive evidence suggests that the key to collective intelligence is to protect the independence of opinions within a group.
However, in many of those previous works, social interaction was operationalized by participants observing others' choices without discussing them. These reductionist implementations of social influence may have left unexplored the contribution of deliberation in creating wise crowds. For example, allowing individuals to discuss their opinions in an online chat room results in more accurate estimates18,19. Even in face-to-face interactions, human groups can communicate their uncertainty and make joint decisions that reflect the reliability of each group member20,21. During peer discussion, people also exchange shareable arguments22,23, which promote the understanding of a problem24. Groups can reach consensuses that are outside the span of their individual decisions24,25, even if a minority26 or no one24 knew the correct answer before interaction. These findings lead to the following questions: would crowds be wiser if they debated their choices, or should their members instead be kept as independent as possible, aggregating only their uninfluenced individual opinions? We addressed these questions by performing an experiment on a large live crowd (Fig. 1a, see also Supplementary Video 1).
We asked a large crowd (N = 5,180 (2,468 female), aged 30.1 ± 11.6 yr (mean ± s.d.)) attending a popular event to answer eight questions involving approximate estimates of general-knowledge quantities (for example, "What is the height in metres of the Eiffel Tower?"; cf. Methods). Each participant was provided with a pen and an answer sheet linked to their seat number. The event's speaker (author M.S.) conducted the experiment from the stage (Fig. 1a). In the first stage of the experiment, the speaker asked eight questions (Supplementary Table 1) and gave participants 20 s to respond to each of them (stage i1, left panel in Fig. 1a). Then, participants were instructed to organize into groups of five based on a numerical code in their answer sheet (see Methods). The speaker repeated four of the eight questions and gave each group one minute to reach a consensus (stage c, middle panel in Fig. 1a). Finally, the eight questions were presented again from the stage and participants had 20 s to write down their individual estimate, which gave them a chance to revise their opinions and change their minds (stage i2, right panel in Fig. 1a). Participants also reported their confidence in their individual responses on a scale from 0 to 10.
Responses to different questions were distributed differently. To pool the data across questions, we used a non-parametric normalization method that is also used for rejecting outliers27 (see Methods). Normalizing allowed us to visualize the grouped data parsimoniously, but all our main findings are independent of this step (Supplementary Fig. 1). As expected, averaging the initial estimates from n participants led to a significant decrease in collective error as n increased (F(4,999) = 477.3, P ≈ 0; blue lines in Fig. 1b), replicating the classic wisdom-of-crowds effect2. The average of all initial opinions in the auditorium (N = 5,180) led to a 52% error reduction compared with the individual estimates (Wilcoxon signed-rank test, z = 61.79, P ≈ 0).
We then focused on the effect of debate on the wisdom of crowds, and studied whether social interaction and peer discussion impaired16,17 or promoted23,24 collective wisdom. To disentangle these two main alternative hypotheses, we looked at the consensus estimates. We randomly sampled m groups and compared the wisdom of m consensus estimates (stage c) against the wisdom of n initial opinions (stage i1, n = 5m as there were 5 participants in each group). This analysis is based on the 280 groups (1,400 participants) that had valid data from all of their members (see Methods). We observed that the average of as few as three collective estimates was more accurate than the mean of the 15 independent initial estimates (blue line at n = 15 versus black line at m = 3 in Fig. 1b, z = 13.25, P = 10−40). The effect was even clearer when comparing 4 collective choices against the 20 individual decisions comprising the same 4 groups (blue line at n = 20 versus black line at m = 4 in Fig. 1b, z = 20.79, P = 10−96). Most notably, the average of 4 collective estimates was even more accurate (by a 49.2% reduction in error) than the average of the 1,400 initial individual estimates (blue data point at n = 1,400 versus black line at m = 4 in Fig. 1b, z = 13.92, P = 10−44). In principle, this could simply result from participants having a second chance to think about these questions, and providing more accurate individual estimates to the group discussion than the ones initially reported. However, our data rule out this possibility, as one or two collective estimates were not better than five or ten independent initial estimates, respectively (z = 1.02, P = 0.31). In other words, this is the result of a 'crowd of crowds' (Fig. 1c).
Participants used the chance to change their minds after interaction and this reduced their individual error (mean error reduction of 31%, z = 19.16, P = 10−82). More importantly, revised estimates gave rise to greater wisdom of crowds compared with initial estimates (blue line versus red line in Fig. 1b, F(1,999) = 4,458.6, P ≈ 0). When compared with collective choices, the average of n revised decisions was overall more accurate than the average of m group decisions (black line versus red line in Fig. 1b, F(1,999) = 2,510.4, P ≈ 0), although this depended on the specific question asked (interaction F(3,999) = 834.7, P ≈ 0; see Supplementary Fig. 1). Taken together, these findings demonstrate that face-to-face social interaction brings remarkable benefits in accuracy and efficiency to the wisdom of crowds. These results raise the question of how social interaction, which is expected to instigate herding, could have improved collective estimates.
To answer this question, we analysed how the bias and the variance of the distribution of estimates were affected by debates (Fig. 2). Figure 2a shows a graphical representation of how deliberation and social influence affected the distribution of responses in two exemplary groups. We found that the consensus decisions were less biased than the average of initial estimates (Fig. 2b, z = 2.15, P = 0.03, see also Supplementary Fig. 2). This indicates that deliberation led to a better consensus than what a simple averaging procedure (with uniform weights) could achieve. When participants changed their mind, they approached the (less biased) consensus: revised opinions became closer to the consensus than to the average of initial answers (Fig. 2c, z = 27.15, P = 10−162). Moreover, in line with previous reports that social influence reduces the diversity of opinions16,17, we found that, within each group, revised responses converged towards each other: the variance of revised estimates within each group was smaller than the variance of the initial estimates (Fig. 2d, Wilcoxon signed-rank test of the variance of responses on each group before versus after interaction, z = 18.33, P = 10−75). However, interaction actually increased the variance of responses between groups (Fig. 2e): the distribution of the average of initial estimates (obtained by averaging stage i1 estimates on each group) had less variance than the average of revised estimates (obtained by averaging stage i2 estimates on each group, squared rank test for homogeneity, P < 0.01). Previous research in social psychology also found a similar effect; consensus decisions are typically more extreme than the average individual choice, a phenomenon known as ‘group polarization’28.
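As an illustration of these two variance measures, the following minimal simulation sketch (toy numbers, not the study's data; the convergence parameters are invented) shows how pulling revised opinions towards a group consensus can shrink within-group variance while inflating between-group variance:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical normalized estimates: 280 groups of 5, before (i1) and
# after (i2) deliberation. The pull towards consensus (0.7) and the
# consensus noise (0.5) are arbitrary illustration values.
n_groups, group_size = 280, 5
i1 = rng.normal(0, 1, size=(n_groups, group_size))
consensus = i1.mean(axis=1, keepdims=True) + rng.normal(0, 0.5, size=(n_groups, 1))
i2 = 0.3 * i1 + 0.7 * consensus  # revised opinions converge towards the consensus

within_i1 = i1.var(axis=1).mean()   # mean variance inside each group (stage i1)
within_i2 = i2.var(axis=1).mean()   # shrinks: members converge on each other
between_i1 = i1.mean(axis=1).var()  # variance of per-group averages (stage i1)
between_i2 = i2.mean(axis=1).var()  # grows: consensus noise separates groups
```

Only the qualitative pattern (within-group variance down, between-group variance up), not the specific numbers, mirrors the effects reported in Fig. 2d,e.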
Previous studies have proposed that a fundamental condition to elicit the wisdom-of-crowds effect is the diversity of opinions3,29. Because we saw that interaction decreased the variance of estimates within groups but increased the variance between groups, we reasoned that sampling opinions from different groups might bring even larger benefits to the crowd. To test this idea, we sampled our population in two ways and assessed the impact of within- and between-group variance on the wisdom of crowds (Fig. 2e). In the within-groups condition, we sampled n individuals coming from m = n/5 different groups. This was the same sampling procedure that we used in Fig. 1b. In the between-groups sampling, we selected n individuals, each coming from a different group. Because different groups were randomly placed in different locations in the auditorium, we expected that sampling between groups would break the effect of local correlations, and decrease the collective error.
Consistent with our predictions, we found that breaking the local correlations by between-group sampling led to a large error reduction (red solid line versus red dashed line in Fig. 2f, 26% error reduction on average, F(1,999) = 25,824.1, P ≈ 0). In fact, averaging only five revised estimates coming from five different groups outperformed the aggregation of all initial independent decisions in the auditorium (z = 25.91, P = 10−148). This finding is consistent with previous studies showing that averaging approximately five members of 'select crowds' leads to substantial increases in accuracy30,31. In our case, adding more decisions using this sampling procedure led to a significant decrease in error (F(4,999) = 249.34, P ≈ 0). Aggregating revised estimates from different randomly sampled groups was a highly effective strategy to improve collective accuracy and efficiency, even with a very small number of samples.
We then asked whether deliberation was necessary to observe an increase in the wisdom of crowds. One could argue that the difference between wisdom of crowds obtained by aggregating the first (i1) versus the second (i2) opinions may have simply resulted from having a second chance to produce an estimate. Indeed, previous research32,33,34 has shown consistent improvements drawn from repeatedly considering the same problem in decision-making. To evaluate this possibility, we compared wisdom of crowds obtained from the answers to the discussed versus the undiscussed questions (see Methods). Figure 3a shows the error reduction when comparing the average of n revised estimates (i2) with the average of n initial estimates (i1), that is, the ratio of red line to blue line in Fig. 1b. We observed that the error reduction in the absence of deliberation (Fig. 3a, grey line) was below 3% for all crowd sizes. With deliberation (Fig. 3a, green line), in contrast, error reduction was significantly larger and increased with increasing number of aggregated opinions (F(1,999) = 3,963.6, P ≈ 0, comparing with versus without deliberation). This result demonstrated that merely having the chance to produce a second estimate was not sufficient, and that deliberation was needed to increase the wisdom of crowds.
While we found that deliberation increased collective accuracy, the results presented so far do not shed light on the specific deliberative procedure implemented by our crowd. In principle, collective estimates could have been the output of a simple aggregation rule different to the mean17,35. Alternatively, participants could have used the deliberative stage to share arguments and arrive at a new collective estimate through reasoning22,23. This dichotomy between ‘aggregating numbers’ versus ‘sharing reasons’ has been discussed in several studies about collective intelligence36,37. It has been argued that the normative strategy in predictive tasks is to share and aggregate numbers. Instead, problem-solving contexts require authentic deliberation and sharing of arguments and reasons37. Which kind of deliberative procedure did the groups implement in our experiment?
To answer this question, we first compared the accuracy of our consensus estimates with seven different aggregation rules for how to combine the initial estimates (see Methods). Three of these rules were based on the idea of robust averaging38, namely that groups may underweight outlying estimates (that is, the median rule, the soft median rule and the robust averaging rule; see Methods for details). Three other rules were inspired by previous studies showing that one individual may dominate the discussion and exert greater influence in the collective decision21 (that is, the expert rule, the confidence-weighted average rule and the resistance-to-social-influence rule; see Methods for details). As a benchmark, we also compared these rules with the simple average rule. Figure 3b shows the expected error if our crowd implemented each of these rules (blue bars in Fig. 3b). The empirically obtained consensus estimates (black bar in Fig. 3b) were significantly more accurate than all seven aggregation rules (z > 3.99, P < 10−5 for all pairwise comparisons between the observed data and all simulated rules). The deliberation procedures implemented by our crowd could not be parsimoniously explained by the application of any of these simple rules.
The above analysis rejects the simplest models of consensus. This evidence, however, is not exhaustive, and does not by itself show that our crowd shared arguments during deliberation. To directly test this hypothesis, we ran a second experiment in the lab (experiment 2, N = 100, see Methods and Supplementary Fig. 3). Groups of five people went through the same experimental procedure (Fig. 1a). After the end of stage i2 (cf. Fig. 1a), all participants completed a debriefing questionnaire in which they rated (on a Likert scale from 0 to 10) the extent to which different deliberation procedures contributed to reaching consensus (see Table 1 and Supplementary Fig. 4). The procedure with highest endorsement was "We shared arguments and reasoned together" (mean rating ± s.e.m. across all questions: 7.7 ± 0.2, mode rating: 10; z > 7.05, P < 10−12 for all comparisons, Table 1). Participants could endorse more than one procedure or even describe a different procedure not appearing in our list; this latter option was selected less than 5% of the time (4.2 ± 2.0%). Overall, our control analyses and new experiment suggest that (1) without deliberation there is no substantial increase in collective accuracy (Fig. 3a), (2) the most salient simple aggregation rules previously proposed in the literature did not explain our findings (Fig. 3b), and (3) participants reported sharing arguments and reasoning together during deliberation (Table 1).
Experiment 2 also allowed us to probe the wisdom of deliberative crowds 'by design'. Since the materials (questions) and procedures were identical between the two experiments, we could formally test whether aggregating the consensus estimates (stage i2) drawn from four groups of five people collected in the lab could predictably and consistently outperform the aggregate of all independent opinions (stage i1) in the crowd. We found that the average of four group estimates collected in experiment 2 was significantly more accurate than the average of all 5,180 initial estimates collected from the crowd (z = 6.55, P < 10−11, Fig. 3c). It is difficult to overstate the importance of these findings as they call for re-thinking the importance of the deliberation structure in joint decision-making processes. This study opens up clear avenues for optimizing decision processes through reducing the number of required opinions to be aggregated.
Our results are in contrast to an extensive literature on herding11 and dysfunctional group behaviour39, which exhorts us to remain as independent as possible. Instead, our findings are consistent with research in collaborative learning showing that 'think–pair–share' strategies40 and peer discussion24 can increase the understanding of conceptual problems. However, these findings offer a key insight largely overlooked in the literature on aggregation of opinions: pooling together collective estimates made by independent, small groups that interacted within themselves increases the wisdom-of-crowds effect. The potential applications of this approach are numerous and range from improving structured communication methods that explicitly avoid face-to-face interactions41, to the aggregation of political and economic forecasts42 and the design of wiser public policies43. Our findings thus provide further support to the idea that combining statistics with behavioural interventions leads to better collective judgements18. While our aim was to study a real interacting crowd, face-to-face deliberation may not be needed to observe an increase in collective accuracy. In fact, previous research has shown that social influence in virtual chatrooms could also increase collective intelligence19,20.
The first study on the wisdom of crowds was regarded as an empirical demonstration that democratic aggregation rules can be trustworthy and efficient2. Since then, attempts to increase collective wisdom have been based on the idea that some opinions have more merit than others and set out to find those more accurate opinions by pursuing some ideal non-uniform weighting algorithm17,31,35. For example, previous studies proposed to select ‘surprisingly popular’ minority answers35 or to average the responses of ‘select crowds’ defined by higher expertise31 or by resistance to social influence17. Although these methods lead to substantial improvements in performance, implementing simple majority rules may still be preferred for other reasons, which may include sharing responsibility44, promoting social inclusion39, and avoiding elitism or inequality45,46. Here, we showed that the wisdom of crowds can be increased by simple face-to-face discussion within groups coupled with between-group sampling. Our simple-yet-powerful idea is that pooling knowledge from individuals who participated in independent debates reduces collective error. Critically, this is achieved without compromising the democratic principle of ‘one vote, one value’47. This builds on the political notion of deliberative polls as a practical mechanism to resolve the conundrum between equality and deliberation. Reconciling the two is difficult because, as more voices are included in a decision, mass deliberation becomes impractical48,49. Here, we demonstrated that in questions of general knowledge, where it is easy to judge the correctness of the group choice and in the absence of strategic voting behaviour50, aggregating consensus choices made in small groups increases the wisdom of crowds. This result supports political theories postulating that authentic deliberation, and not simply voting, can lead to better democratic decisions51.
Methods
Context
The experiment was performed during a TEDx event in Buenos Aires, Argentina (http://www.tedxriodelaplata.org/) on 24 September 2015. This was the third edition of an initiative called TEDxperiments (http://www.tedxriodelaplata.org/tedxperiments), aimed at constructing knowledge on human communication by performing behavioural experiments on large TEDx audiences. The first two editions studied the cost of interruptions on human interaction52, and the use of a competition bias in a 'zero-sum fallacy' game53.
Materials
Research assistants handed one pen and one A4 paper to each participant. The A4 paper was folded on the long edge and had four pages. On page 1, participants were informed about their group number and their role in the group. The three stages of the experiment (Fig. 1a) could be completed in pages 2, 3 and 4, respectively. On page 4, participants could also complete information about their age and gender.
Experimental procedure
The speaker (author M.S.) announced that his section would consist of a behavioural experiment. Participants were informed that their participation was completely voluntary and they could simply choose not to participate or withdraw their participation at any time. A total of 5,180 participants (2,468 female, mean age 30.1 yr, s.d. 11.6 yr) performed the experiment. All data were completely anonymous. This experimental procedure was approved by the ethics committee of CEMIC (Centro de Educación Médica e Investigaciones Clínicas Norberto Quirno). A video of the experiment is available in Supplementary Video 1.
Stage i1: individual decisions
The speaker announced that, in the first part of the experiment, participants would make individual decisions. Subjects answered eight general knowledge questions that involved the estimation of an uncertain number (for example, "What is the height in metres of the Eiffel Tower?"). Each question (Supplementary Table 1) had one code (e.g. EIFFEL) and two boxes. Participants were instructed to fill the first box with their estimate, and the second box with their confidence on a scale from 0 to 10. Before the beginning of stage i1, the speaker completed one example question on the screen, and then read the eight questions. Participants were given 20 s to answer each question.
Stage c: collective decisions
In the second part (stage c), we asked participants to make collective decisions. First, they were instructed to find other members in their group according to a numerical code found on page 1. Each group had six members, and all participants were seated next to each other in two consecutive rows. The speaker announced that there were two possible roles in the group: player or moderator. Each group had five players and one moderator. Each participant could find their assigned role on page 1 (for example, “You are the moderator in group 765” or “You are a player in group 391”). Players were instructed to reach a consensus and report it to the moderator in a maximum of 60 s. Moderators were given verbal and written instructions to not participate or intercede in the decisions made by the players. The role of the moderators was simply to write down the collective decisions made by the players in their group. Moderators were also instructed to write down an ‘X’ if there was lack of consensus among the group. Groups were asked to answer four of the eight questions from stage i1 (see Supplementary Table 1). The speaker read the four questions again, and announced when time was up.
Stage i2: revised decisions
Finally, participants were allowed to revise all of their individual decisions and confidence, including the ones that remained undiscussed. The speaker emphasized that this part was individual, and read all eight questions of stage i1 again.
Data collection and digitization
At the end of the talk, we collected the papers as participants exited the auditorium. Over the week following the event, five data-entry research assistants digitized these data using a keyboard. We collected 5,180 papers: 4,232 players and 946 moderators. Many of these 946 potential groups had incomplete data due to at least one missing player; overall, we collected 280 complete groups. All data reported in Fig. 1 are based on those 280 complete groups (1,400 players). For the comparison between individual, collective and revised estimates, we focus on the four questions answered at stage c.
Non-parametric normalization
The distributions of responses were spread around different values on each question (Supplementary Fig. 1). To normalize these distributions, we used a non-parametric approach inspired by the outlier-detection literature27. We calculated the deviance of each data point x i from the median, and normalized this value by the median absolute deviance:

$$z_i = \frac{x_i - \mathrm{median}(x)}{\mathrm{median}\left(\left|x - \mathrm{median}(x)\right|\right)}$$

where x is the distribution of responses and the index i runs over subjects from 1 to N. This procedure can be regarded as a non-parametric z scoring of the data.
The rationale for normalizing our data was twofold. First, we used this procedure to reject outliers in the distribution of responses. Following previous studies27, we discarded all responses that deviated from the median by more than 15 times the median absolute deviance. The second purpose of normalization was to average our results across different questions. This helps the visualization of our data, but our findings can be replicated on each question separately without any normalization (Supplementary Fig. 1).
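A minimal sketch of this normalization and outlier-rejection step follows; the threshold of 15 is taken from the Methods, while the function name and example values are our own illustration:

```python
import numpy as np

def mad_normalize(x, reject_threshold=15):
    """Non-parametric z-scoring: deviation from the median divided by the
    median absolute deviance (MAD). Responses deviating from the median by
    more than `reject_threshold` times the MAD are dropped as outliers."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    z = (x - med) / mad
    return z[np.abs(z) <= reject_threshold]

# Invented example: estimates of the Eiffel Tower's height in metres,
# where 9000 is an implausible outlier that gets rejected.
answers = [250, 300, 320, 280, 310, 9000]
normalized = mad_normalize(answers)
```

Because both the centre and the spread are medians, a single extreme answer barely shifts the normalization, which is what makes the subsequent threshold rule robust.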
Data analysis
To compute all our curves in Fig. 1, we subsampled our crowd in two different ways: either by choosing n individuals that interacted in m = n/5 different groups (within-groups sampling) or by choosing n individuals from n different groups (between-groups sampling). All curves in Fig. 1b and the solid line in Fig. 2e were based on the within-groups sampling condition; the dashed line in Fig. 2e is from the between-groups sampling condition. For a fair comparison between conditions, we computed the errors using exactly the same subsamples in our crowd. For each value of n, we considered 1,000 iterations of this subsampling procedure.
In the case of n = 5, each iteration randomly selected 5 of our 280 complete groups (Fig. 2e sketches one example iteration). In the within-groups condition, we computed the crowd error of each of the five groups (the error of the average response in stages i1 and i2, and the error of the collective response in stage c) respecting the identity of each group. Finally, we averaged the five crowd errors and stored their mean value as the within-groups error for this iteration. In the between-groups sampling, we combined responses from individuals coming from different groups. We computed the error for 1,000 random combinations contingent on the restriction that all individuals belonged to different groups. Finally, we averaged all crowd errors and stored this value as the between-groups error for this iteration.
The same procedure was extended for n > 5. We randomly selected n of our 280 groups on each of our 1,000 iterations. In the within-groups condition, we selected all possible combinations of n individuals coming from m groups, and computed their crowd error. We averaged the crowd error for all possible combinations and stored this value as the within-groups error for this iteration. In the between-groups condition, we randomly selected 1,000 combinations of n individuals coming from n different groups, and computed their crowd error. We averaged all of these crowd errors and stored this value as the between-groups error for this iteration.
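The two subsampling schemes can be sketched as follows. The toy data, function names and shared-bias model are our own illustration of why between-group sampling decorrelates opinions, not a reproduction of the study's analysis code:

```python
import numpy as np

rng = np.random.default_rng(1)

def crowd_error(estimates, truth=0.0):
    """Absolute error of the average of a set of estimates."""
    return abs(np.mean(estimates) - truth)

def mean_sampled_error(responses, n, between, n_iter=1000):
    """Average crowd error over n_iter random subsamples of size n.
    responses: (n_groups, 5) array of estimates.
    within-groups (between=False): all 5 members of n // 5 random groups.
    between-groups (between=True): one random member from each of n groups."""
    n_groups, group_size = responses.shape
    errors = []
    for _ in range(n_iter):
        if between:
            groups = rng.choice(n_groups, size=n, replace=False)
            members = rng.integers(0, group_size, size=n)
            sample = responses[groups, members]
        else:
            groups = rng.choice(n_groups, size=n // 5, replace=False)
            sample = responses[groups].ravel()
        errors.append(crowd_error(sample))
    return float(np.mean(errors))

# Toy data with group-correlated errors: members of a group share a bias,
# so between-group sampling averages over more independent biases.
group_bias = rng.normal(0, 1, size=(100, 1))
responses = group_bias + rng.normal(0, 0.3, size=(100, 5))
e_within = mean_sampled_error(responses, n=10, between=False)
e_between = mean_sampled_error(responses, n=10, between=True)
```

With n = 10, the within-groups sample averages only 2 shared biases while the between-groups sample averages 10, so the latter's error is markedly lower in this toy model.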
All error bars in Figs. 1 and 2 depict the normalized mean ± s.e.m. of the crowd error across iterations. Pairwise comparisons were performed through non-parametric paired tests (Wilcoxon signed-rank tests). To test the general tendency that error decreases for larger crowds, we used two-way repeated-measures analysis of variance with the factors 'question' and 'crowd size n', and iteration as repeated measure.
Aggregation rules
We evaluated whether collective estimates could result from seven simple aggregation rules (Fig. 3b). All of these rules predict that the collective estimate is constructed as a weighted average of the initial estimates x i with weights w i :

$$\hat{x} = \sum_{i=1}^{5} w_i\, x_i, \qquad \sum_{i=1}^{5} w_i = 1$$
In Fig. 3b, the seven rules were sorted by accuracy. Rule 1 is an average weighted by resistance to social influence. This procedure simulates that, during deliberation, the group follows the individuals who were least willing to change their minds, presumably because they had better information17. Resistance to social influence was quantified as the inverse absolute distance between the initial (x i ) and revised (r i ) estimates. This quantity was used to compute the weights:

$$w_i = \frac{\left(\left|x_i - r_i\right| + \epsilon\right)^{-1}}{\sum_{j}\left(\left|x_j - r_j\right| + \epsilon\right)^{-1}}$$

where ε is a constant to prevent divergence when x i = r i . We simulated this rule using different values of ε ranging from 0.1 to 1,000, and used the value with highest accuracy (ε = 1). In rule 2 (the ‘confidence-weighted average rule’), the group uses the initial confidence ratings as weights in the collective decision, $w_i = c_i / \sum_j c_j$. In rule 3, which we call the ‘expert rule’, the group selects the estimate of the most confident individual in the group. This rule is defined by w i = 1 for i = argmax(c), and w k = 0 for k ≠ i, where c is a vector with the five initial confidence ratings in the group.
Rule 4 consists of simply taking the median of the initial estimates, which is equivalent to giving a weight w i = 1 to the third-largest estimate in the group, and w k = 0 to all other estimates. Rule 5 is the simple mean, namely w i = 0.2 for all i. Rule 6, which we call 'soft median', is a rule that gives a weight of w i = 0.5 to the third-largest estimate, weights of w k = 0.25 to the second- and fourth-largest estimates, and w k = 0 to the smallest and largest estimates in the group. Finally, rule 7 is a robust average: this rule gives a weight w i = 0 to all estimates in the group that differ from the mean by more than k orders of magnitude, and equal weights to all other estimates. We simulated this rule using different values of k ranging from 1 to 10, and used the value with highest accuracy (k = 4).
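Several of these aggregation rules admit a compact sketch. The example estimates and confidence ratings below are invented for illustration; only the weighting schemes follow the definitions above:

```python
import numpy as np

def median_rule(x, c):
    """Rule 4: weight 1 on the third-largest (middle) estimate."""
    return float(np.median(x))

def soft_median_rule(x, c):
    """Rule 6: weights 0.25 / 0.5 / 0.25 on the middle three sorted
    estimates, 0 on the smallest and largest."""
    s = np.sort(x)
    return 0.25 * s[1] + 0.5 * s[2] + 0.25 * s[3]

def expert_rule(x, c):
    """Rule 3: take the estimate of the most confident member."""
    return float(x[np.argmax(c)])

def confidence_weighted_rule(x, c):
    """Rule 2: weights proportional to initial confidence, w_i = c_i / sum(c)."""
    c = np.asarray(c, dtype=float)
    return float(np.dot(x, c / c.sum()))

# Invented example group: five initial estimates (metres) for the
# Eiffel Tower question, with confidence ratings on a 0-10 scale.
x = np.array([250.0, 300.0, 320.0, 280.0, 500.0])
c = np.array([3, 8, 6, 5, 2])
```

Note how the median-like rules ignore the outlying 500 m answer entirely, whereas the confidence-weighted average is pulled by every estimate in proportion to its reporter's confidence.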
Experiment 2
A total of N = 100 naïve participants (56 female, mean age 19.9 yr, s.d. 1.3 yr) volunteered to participate in our study. Participants were undergraduate students at Universidad Torcuato Di Tella, and were tested as 20 groups of 5. The instructions and procedures were identical to the main task described above. At the end of the experiment, all individuals completed a questionnaire about the deliberation procedure implemented during the task. We asked them to rate (on a Likert scale from 0 to 10) the extent to which different deliberation procedures contributed to reaching consensus for each question. They rated six different procedures (see Table 1 and Supplementary Fig. 4), which appeared in a randomized order. We also gave them the possibility to choose 'other' and describe that procedure.
Life Sciences Reporting Summary
Further information on experimental design is available in the Life Sciences Reporting Summary.
Code availability
The code that supports the findings of this study is available from the corresponding author upon request.
Data availability
The data that support the findings of this study are available from the corresponding author upon request.
References
Condorcet, M. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix (L'Imprimerie Royale, Paris, 1785).
Galton, F. Vox populi. Nature 75, 450–451 (1907).
Surowiecki, J. The Wisdom of Crowds (Little, Brown, London, 2004).
Kurvers, R. H. et al. Boosting medical diagnostics by pooling independent judgments. Proc. Natl Acad. Sci. USA 113, 8777–8782 (2016).
Ray, R. Prediction markets and the financial "wisdom of crowds". J. Behav. Financ. 7, 2–4 (2006).
Jowett, B. The Republic of Plato (Clarendon Press, Oxford, 1888).
Forsythe, R., Nelson, F., Neumann, G. R. & Wright, J. Anatomy of an experimental political stock market. Am. Econ. Rev. 82, 1142–1161 (1992).
Keller, A. et al. Predicting human olfactory perception from chemical features of odor molecules. Science 355, 820–826 (2017).
MacKay, C. Extraordinary Popular Delusions and the Madness of Crowds (Wordsworth Editions Limited, Ware, 1841).
Tversky, A. & Kahneman, D. Judgment under uncertainty: heuristics and biases. Science 185, 1124–1131 (1974).
Raafat, R. M., Chater, N. & Frith, C. Herding in humans. Trends Cogn. Sci. 13, 420–428 (2009).
Chari, V. V. & Kehoe, P. J. Financial crises as herds: overturning the critiques. J. Econ. Theory 119, 128–150 (2004).
Salganik, M. J., Dodds, P. S. & Watts, D. J. Experimental study of inequality and unpredictability in an artificial cultural market. Science 311, 854–856 (2006).
Muchnik, L., Aral, S. & Taylor, S. J. Social influence bias: a randomized experiment. Science 341, 647–651 (2013).
Festinger, L., Riecken, H. W. & Schachter, S. When Prophecy Fails: A Social and Psychological Study of a Modern Group that Predicted the End of the World (Harper-Torchbooks, New York, NY, 1956).
Lorenz, J., Rauhut, H., Schweitzer, F. & Helbing, D. How social influence can undermine the wisdom of crowd effect. Proc. Natl Acad. Sci. USA 108, 9020–9025 (2011).
Madirolas, G. & de Polavieja, G. G. Improving collective estimations using resistance to social influence. PLoS Comput. Biol. 11, e1004594 (2015).
Mellers, B. et al. Psychological strategies for winning a geopolitical forecasting tournament. Psychol. Sci. 25, 1106–1115 (2014).
Gürçay, B., Mellers, B. A. & Baron, J. The power of social influence on estimation accuracy. J. Behav. Decis. Mak. 28, 250–261 (2015).
Bahrami, B. et al. Optimally interacting minds. Science 329, 1081–1085 (2010).
Juni, M. Z. & Eckstein, M. P. Flexible human collective wisdom. J. Exp. Psychol. Hum. Percept. Perform. 41, 1588–1611 (2015).
Mercier, H. & Sperber, D. Why do humans reason? Arguments for an argumentative theory. Behav. Brain. Sci. 34, 57–74 (2011).
Mercier, H. & Sperber, D. “Two heads are better” stands to reason. Science 336, 979 (2012).
Smith, M. K. et al. Why peer discussion improves student performance on in-class concept questions. Science 323, 122–124 (2009).
Laughlin, P. R., Bonner, B. L. & Miner, A. G. Groups perform better than the best individuals on letters-to-numbers problems. Organ. Behav. Hum. Decis. Process. 88, 605–620 (2002).
Moshman, D. & Geil, M. Collaborative reasoning: evidence for collective rationality. Think. Reason. 4, 231–248 (1998).
Leys, C., Ley, C., Klein, O., Bernard, P. & Licata, L. Detecting outliers: do not use standard deviation around the mean, use absolute deviation around the median. J. Exp. Soc. Psychol. 49, 764–766 (2013).
Myers, D. G. & Lamm, H. The group polarization phenomenon. Psychol. Bull. 83, 602–627 (1976).
Hong, L. & Page, S. E. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proc. Natl Acad. Sci. USA 101, 16385–16389 (2004).
Goldstein, D. G., McAfee, R. P. & Suri, S. The wisdom of smaller, smarter crowds. In Proc. Fifteenth ACM Conference on Economics and Computation 471–488 (ACM, Palo Alto, CA, 2014).
Mannes, A. E., Soll, J. B. & Larrick, R. P. The wisdom of select crowds. J. Pers. Soc. Psychol. 107, 276–299 (2014).
Vul, E. & Pashler, H. Measuring the crowd within: probabilistic representations within individuals. Psychol. Sci. 19, 645–647 (2008).
Herzog, S. M. & Hertwig, R. The wisdom of many in one mind: improving individual judgments with dialectical bootstrapping. Psychol. Sci. 20, 231–237 (2009).
Ariely, D. et al. The effects of averaging subjective probability estimates between and within judges. J. Exp. Psychol. Appl. 6, 130–146 (2000).
Prelec, D., Seung, H. S. & McCoy, J. A solution to the single-question crowd wisdom problem. Nature 541, 532–535 (2017).
Lorenz, J., Rauhut, H. & Kittel, B. Majoritarian democracy undermines truth-finding in deliberative committees. Res. Polit. 2, 1–10 (2015).
Landemore, H. & Page, S. E. Deliberation and disagreement: problem solving, prediction, and positive dissensus. J. Pol. Philos. Econ. 14, 229–254 (2015).
Li, V., Herce Castañón, S., Solomon, J. A., Vandormael, H. & Summerfield, C. Robust averaging protects decisions from noise in neural computations. PLoS Comput. Biol. 13, e1005723 (2017).
Asch, S. E. Opinions and social pressure. Sci. Am. 193, 31–35 (1955).
Lyman, F. T. in The Responsive Classroom Discussion: The Inclusion of All Students (ed. Anderson, A. S.) 113 (Univ. Maryland Press, Potomac, MD, 1981).
Dalkey, N. & Helmer, O. An experimental application of the Delphi method to the use of experts. Manag. Sci. 9, 458–467 (1963).
Tetlock, P. Expert Political Judgment: How Good Is It? How Can We Know? (Princeton Univ. Press, Princeton, NJ, 2005).
Sunstein, C. R. Infotopia: How Many Minds Produce Knowledge (Oxford Univ. Press, Oxford, 2006).
Harvey, N. & Fischer, I. Taking advice: accepting help, improving judgment, and sharing responsibility. Organ. Behav. Hum. Decis. Process. 70, 117–133 (1997).
Eisenberger, N. I., Lieberman, M. D. & Williams, K. D. Does rejection hurt? An FMRI study of social exclusion. Science 302, 290–292 (2003).
Mahmoodi, A. et al. Equality bias impairs collective decision-making across cultures. Proc. Natl Acad. Sci. USA 112, 3835–3840 (2015).
Galton, F. One vote, one value. Nature 75, 414 (1907).
Mill, J. S. On Liberty (John W. Parker and Son, London, 1859).
Fishkin, J. S. & Luskin, R. C. Experimenting with a democratic ideal: deliberative polling and public opinion. Acta Polit. 40, 284–298 (2005).
Austen-Smith, D. & Banks, J. S. Information aggregation, rationality, and the Condorcet jury theorem. Am. Political Sci. Rev. 90, 34–45 (1996).
Cohen, J. in Deliberative Democracy: Essays on Reason and Politics (eds Bohman, J. & Rehg, W.) Ch. 3 (MIT Press, Boston, MA, 1997).
Lopez-Rosenfeld, M. et al. Neglect in human communication: quantifying the cost of cell-phone interruptions in face to face dialogs. PLoS ONE 10, e0125772 (2015).
Niella, T., Stier-Moses, N. & Sigman, M. Nudging cooperation in a crowd experiment. PLoS ONE 11, e0147125 (2016).
Acknowledgements
J.N. and B.B. were supported by the European Research Council StG (NEUROCODEC, #309865); M.S. was supported by the James McDonnell Foundation 21st Century Science Initiative in Understanding Human Cognition—Scholar Award (Grant #220020334), and by Agencia Nacional de Promoción Científica y Tecnológica (Argentina)—Préstamo BID PICT (Grant #2013-1653). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. We thank M. Sartorio for assistance in data collection.
Author information
Authors and Affiliations
Contributions
J.N., T.N., G.G. and M.S. designed and conducted the experiments. J.N., B.B. and M.S. developed the analysis approach. J.N. analysed the data. J.N., B.B. and M.S. wrote the paper.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figures 1–4, Supplementary Table 1
Rights and permissions
About this article
Cite this article
Navajas, J., Niella, T., Garbulsky, G. et al. Aggregated knowledge from a small number of debates outperforms the wisdom of large crowds. Nat Hum Behav 2, 126–132 (2018). https://doi.org/10.1038/s41562-017-0273-4