Unlike the computer-generated nonsense papers in some peer-reviewed subscription services (see Nature http://doi.org/r3n; 2014), the 500 or so preprints received daily by the automated repository arXiv are not pre-screened by humans. But sometimes automated assessment can be better than human diligence at enforcing standards.
The automated screens for outliers in arXiv include analysis of the probability distributions of words and their combinations, ensuring that they fall into patterns that are consistent with existing subject classes. This serves as a check of the subject categorizations provided by submitters, and helps to detect non-research content.
Fake papers generated by SCIgen software, for example, have a 'native dialect' that can be picked up by simple stylometric analysis (see J. N. G. Binongo Chance 16, 9–17; 2003). The most frequent words used in English text (stop words such as 'the', 'of', 'and') encode stylistic features that are independent of content. On average, these words follow a power-law distribution that is evident in even relatively small amounts of text; significant deviations signal outliers.
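The rank-frequency power law described above can be checked with a few lines of code. This is a minimal sketch (not arXiv's actual screening pipeline, whose details are not public): it ranks the most frequent words in a text and fits a slope to the log-log rank-frequency curve by ordinary least squares. Natural English text typically yields a slope near -1 (Zipf's law); large deviations can flag outliers.

```python
# Hypothetical illustration: estimate the power-law exponent of a text's
# rank-frequency distribution. A slope far from -1 suggests the text does
# not follow the usual Zipfian pattern of natural language.
import math
import re
from collections import Counter

def rank_frequency_slope(text, top_n=50):
    """Fit log(frequency) ~ slope * log(rank) over the top_n words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words).most_common(top_n)
    if len(counts) < 2:
        return 0.0
    xs = [math.log(rank) for rank in range(1, len(counts) + 1)]
    ys = [math.log(freq) for _, freq in counts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

On a sample of ordinary prose the fitted slope sits close to -1; text with an artificially flat or skewed vocabulary distribution moves the slope well away from that value.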
The effect can be seen in principal-component analysis plots (see 'Counterfeit clusters'). Computer-generated articles form tight clusters that are well separated from human-authored articles.
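A toy version of such a plot can be built from stop-word frequency vectors. This sketch uses assumed data rather than the article's experiment: each text is reduced to the relative frequencies of a small stop-word list, and the vectors are projected onto their first two principal components, where texts sharing a stylistic "dialect" land near one another.

```python
# Hypothetical illustration: stop-word frequency vectors projected onto the
# top two principal components. The stop-word list and sample texts are
# invented for demonstration.
import numpy as np

STOP_WORDS = ["the", "of", "and", "to", "in", "a", "is", "that"]

def stopword_vector(text):
    """Relative frequency of each stop word in the text."""
    words = text.lower().split()
    total = max(len(words), 1)
    return np.array([words.count(w) / total for w in STOP_WORDS])

def pca_2d(vectors):
    """Project frequency vectors onto their first two principal components."""
    X = np.vstack(vectors)
    X = X - X.mean(axis=0)                      # center the data
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    top2 = eigvecs[:, np.argsort(eigvals)[::-1][:2]]  # two largest components
    return X @ top2
```

With real corpora, human-written articles scatter in one region of the projection while machine-generated text, with its distinctive stop-word statistics, clusters tightly elsewhere.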
Ginsparg, P. ArXiv screens spot fake papers. Nature 508, 44 (2014). https://doi.org/10.1038/508044a