Abstract
Although approximately one-quarter of the roughly 4,000 genetically inherited diseases currently recorded in the relevant databases (LocusLink [1], OMIM [2]) are already linked to a region of the human genome, about 450 have no known associated gene. Finding disease-related genes requires laborious examination of hundreds of possible candidate genes, some of which are not even annotated (see, for example, refs 3, 4). The public availability of the human genome draft sequence [5] has fostered new strategies to map molecular functional features of gene products to complex phenotypic descriptions, such as those of genetically inherited diseases. Building on recent progress in the systematic annotation of genes using controlled vocabularies [6], we have developed a scoring system for the possible functional relationships of human genes to 455 genetically inherited diseases that have been mapped to chromosomal regions without assignment of a particular gene. In a benchmark of the system with 100 known disease-associated genes, the disease-associated gene was among the 8 best-scoring genes with a 25% chance, and among the 30 best-scoring genes with a 50% chance, showing that there is a relationship between the score of a gene and its likelihood of being associated with a particular disease. The scoring also indicates that for some diseases, the chance of identifying the underlying gene is higher than for others.
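The benchmark figures quoted in the abstract are top-k recovery rates: the share of benchmark diseases whose known gene ranks within the k best-scoring candidates. A minimal sketch of that arithmetic follows. The function name `top_k_recovery_rate` and the rank values are hypothetical and chosen only for illustration; they are not the authors' code or data, and the scoring system that would produce the ranks is not reproduced here.

```python
from typing import Sequence


def top_k_recovery_rate(true_gene_ranks: Sequence[int], k: int) -> float:
    """Return the fraction of benchmark cases whose known gene ranks within the top k.

    Ranks are 1-based: rank 1 is the best-scoring candidate gene.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not true_gene_ranks:
        raise ValueError("need at least one benchmark case")
    hits = sum(1 for rank in true_gene_ranks if rank <= k)
    return hits / len(true_gene_ranks)


# Hypothetical ranks of the known gene for eight benchmark diseases
# (illustrative values only, not results from the study).
toy_ranks = [3, 12, 45, 7, 120, 28, 1, 64]

rate_top8 = top_k_recovery_rate(toy_ranks, 8)
rate_top30 = top_k_recovery_rate(toy_ranks, 30)
print(f"top 8: {rate_top8:.3f}, top 30: {rate_top30:.3f}")
```

Under this reading, the reported result corresponds to a rate of 0.25 at k = 8 and 0.50 at k = 30 over the 100 benchmark genes.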
Change history
05 June 2002
New versions of the three pieces of supplementary info were placed on the site. These new versions did not contain any new information - the changes were strictly cosmetic.
References
Pruitt, K.D. & Maglott, D.R. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29, 137–140 (2001).
Hamosh, A., Scott, A.F., Amberger, J., Valle, D. & McKusick, V.A. Online Mendelian Inheritance in Man (OMIM). Hum. Mutat. 15, 57–61 (2000).
Garcia, C.K. et al. Autosomal recessive hypercholesterolemia caused by mutations in a putative LDL receptor adaptor protein. Science 292, 1394–1398 (2001).
Zhou, B., Westaway, S.K., Levinson, B., Johnson, M.A., Gitschier, J. & Hayflick, S.J. A novel pantothenate kinase gene (PANK2) is defective in Hallervorden-Spatz syndrome. Nature Genet. 28, 345–349 (2001).
Lander, E.S. et al. International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nature Genet. 25, 25–29 (2000).
Zimmermann, H.J. Fuzzy Set Theory and its Applications 3rd edn (Kluwer Academic, Boston, 1996).
Hogenesch, J.B. et al. A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 106, 413–415 (2001).
Plaitakis, A., Flessas, P., Natsiou, A.B. & Shashidharan, P. Glutamate dehydrogenase deficiency in cerebellar degenerations: clinical, biochemical and molecular genetic aspects. Can. J. Neurol. Sci. 20, S109–S116 (1993).
Cargill, M. et al. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nature Genet. 22, 231–238 (1999).
Altschul, S.F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).
Acknowledgements
We thank Y.P. Yuan, J. Reina, D. Torrents, M. Suyama and other members of our group for helpful discussions. We are grateful to the US National Library of Medicine for kind licensing of MEDLINE, to NLM annotators for their extensive work in annotating MEDLINE papers with MeSH terms, and to the developers of RefSeq, LocusLink and Gene Ontology.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
About this article
Cite this article
Perez-Iratxeta, C., Bork, P. & Andrade, M. Association of genes to genetically inherited diseases using data mining. Nat Genet 31, 316–319 (2002). https://doi.org/10.1038/ng895
DOI: https://doi.org/10.1038/ng895