Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization

Jagadeesh, Karthik A.; Birgmeier, Johannes; Guturu, Harendra; Deisseroth, Cole A.; Wenger, Aaron M.; Bernstein, Jonathan A.; Bejerano, Gill

doi:10.1038/s41436-018-0072-y

Article
Published: 12 July 2018

Genetics in Medicine volume 21, pages 464–470 (2019)

Abstract

Purpose

Exome sequencing and diagnosis is beginning to spread across the medical establishment. The most time-consuming part of genome-based diagnosis is the manual step of matching the potentially long list of patient candidate genes to patient phenotypes to identify the causative disease.

Methods

We introduce Phrank (for phenotype ranking), an information theory–inspired method that utilizes a Bayesian network to prioritize candidate diseases or genes, as a stand-alone module that can be run with any underlying knowledgebase and any variant filtering scheme.

Results

Phrank outperforms existing methods at ranking the causative disease or gene when applied to 169 real patient exomes with Mendelian diagnoses. Phrank’s greatest improvement is in disease space, where across all 169 patients it ranks only 3 diseases on average ahead of the true diagnosis, whereas Phenomizer ranks 32 diseases ahead of the causal one.

Conclusions

Using Phrank to rank all patient candidate genes or diseases at the start of a new case will save the busy clinician much time in deriving a genetic diagnosis.

Introduction

Genomic data is becoming increasingly utilized in medical genetics clinical practice.^1,2,3 Exome sequencing has allowed for the identification of the genetic basis of thousands of different Mendelian disorders.^4,5,6 In most clinical sequencing cases, only the proband’s exome is sequenced. Such a typical “singleton” patient exome contains 100–300 variants of uncertain significance (VUS), of which only one or two may adversely affect a single gene of interest.

Clinicians ultimately diagnose patients by identifying the disease that best explains the patient’s phenotypes.⁷ To do this, they mentally calculate a “fuzzy” match between the patient phenotype and all diseases they are aware of to try and find a disease that best fits the phenotypes observed in the patient. Clinicians also try to read up on the different patient candidate genes, to see whether any gene has been previously implicated in causing phenotypes similar to those in their patient. This is no easy task, with several thousands of well-characterized Mendelian diseases, caused by well over 3000 known Mendelian disease genes,^8,9,10 and hundreds of novel gene–disease associations discovered annually.⁷ On average, clinicians analyzing a patient with a Mendelian disorder have estimated spending 54 min to examine a single variant’s pathogenicity.¹¹ In some cases clinicians can immediately recognize the genetic basis of the disease, but in general, identifying a patient’s causative gene has been estimated to consume on average a workweek of expert time.^12,13,14

With 7 million (5.4%) babies born each year worldwide with serious inherited genetic disorders,¹⁵ and genetic testing workflows capable of sequencing and generating variant call files for hundreds of individuals per day, busy clinicians are in great need of computational approaches to aid them in making a rapid, accurate diagnosis. Such automated methods attempt to rank all of a patient’s possibly pathogenic variants by their known ability to explain the patient’s set of phenotypes. The goal is to bring the causal gene to the top of the patient’s ranked list, such that when the busy clinician first lays eyes on a sequenced case, their attention is drawn to the most likely diagnostic hypotheses as early as possible. Different computational tools attempt to achieve this goal by combining computable measures of similarity between a patient’s set of phenotypes and the set of phenotypes associated with any gene on the patient’s list, observed variant frequency in the general population, and measures of predicted variant pathogenicity.^16,17,18,19

Most automated phenotype similarity–based ranking methods such as Phevor²⁰ and PhenIX²¹ rank genes, and leave the clinician with the final step of identifying the causative disease. Phenomizer²² is one of few disease-ranking tools, making it popular among clinicians. We introduce a novel information theory–inspired ranking method, Phrank, which greatly improves on Phenomizer for identifying the causative disease. For maximum utility, Phrank is offered both through the AMELIE web interface where it is deployed with a particular knowledgebase, as well as in the form of a code module that can be combined with any variant ranking scheme and any underlying knowledgebase.

Materials and methods

Patient variant prioritization is ultimately determined via a combination of multiple variant and host gene properties. Phrank isolates and optimizes one important feature: scoring each disease caused by a patient candidate gene for its ability to explain the set of patient phenotypes.

General Phrank inputs: phenotype ontology and a gene–disease–phenotype knowledgebase

Phrank assumes access to a phenotype ontology representing relationships between phenotypes as a rooted directed acyclic graph (DAG), such as the Human Phenotype Ontology¹⁰ (HPO), and a knowledgebase of gene–disease–phenotype relationships, such as the Human Phenotype Ontology Annotations (HPO-A), where each entry consists of a diagnostic gene, a disease caused by mutations in the gene, and a disease-associated phenotype. By definition, whenever a gene g is annotated to cause phenotype ϕ, it is considered to also cause (a particular instance of) all phenotypes that are ancestral to ϕ, denoted as the set anc(ϕ). For example, if ϕ is elbow hypertrichosis, anc(ϕ) will contain elbow hypertrichosis itself, as well as hypertrichosis, abnormal hair quantity, abnormality of the hair, etc. up to the root of the phenotype DAG. For a set of phenotypes Φ, anc(Φ) is the union of anc(ϕ_i) for every ϕ_i in Φ.
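The ancestral closure described above can be sketched in a few lines of Python. This is a minimal illustration on a toy ontology, not the authors' implementation: the `PARENTS` mapping, the function names `anc` and `anc_set`, and the term labels (plain strings rather than real HPO identifiers) are all assumptions made for the example.

```python
# Toy ontology, child -> list of parents. Labels are illustrative strings,
# not real HPO identifiers, and "root" stands in for the DAG root.
PARENTS = {
    "elbow hypertrichosis": ["hypertrichosis"],
    "hypertrichosis": ["abnormal hair quantity"],
    "abnormal hair quantity": ["abnormality of the hair"],
    "abnormality of the hair": ["root"],
    "root": [],
}


def anc(phenotype, parents=PARENTS):
    """Return the phenotype itself plus all of its ancestors up to the root."""
    closure = set()
    stack = [phenotype]
    while stack:
        term = stack.pop()
        if term not in closure:
            closure.add(term)
            stack.extend(parents.get(term, []))
    return closure


def anc_set(phenotypes, parents=PARENTS):
    """Union of anc(phi) over every phi in the given set of phenotypes."""
    result = set()
    for phi in phenotypes:
        result |= anc(phi, parents)
    return result
```

Because terms can have multiple parents, the traversal tracks visited terms in a set so that shared ancestors are collected only once.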

Computing the conditional probability of a phenotype in a DAG

Using the DAG and knowledgebase, we define |G_ϕ| to be the number of genes associated with phenotype ϕ, and |G_pa(ϕ)| to be the number of genes associated with any parent phenotype of ϕ in the graph. We overlay a Boolean Bayesian network²³ on the HPO DAG. In a Bayesian network, each node is assigned a probability of being observed conditioned on its parent nodes being observed. Here, we define P(root) = 1, and the conditional probability of observing phenotype ϕ given its parent phenotypes pa(ϕ), P(ϕ|pa(ϕ)) = |G_ϕ|/|G_pa(ϕ)| if all parent phenotypes pa(ϕ) are observed and let the conditional probability P(ϕ|pa(ϕ)) = 0 otherwise.
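As a worked illustration with made-up counts (they are not taken from HPO-A), suppose a phenotype ϕ is associated with 5 genes and its parent phenotypes are together associated with 40 genes. Assuming all parents are observed, the conditional probability and its negative base-2 logarithm would be:

$$P(\phi \mid pa(\phi)) = \frac{|G_\phi|}{|G_{pa(\phi)}|} = \frac{5}{40} = 0.125, \qquad -\log_2(0.125) = 3$$

Under this reading, a phenotype caused by few genes relative to its parents carries more bits than one that nearly every parent-associated gene also causes.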

Phrank ranks candidate genes and diseases using a novel phenotype similarity measurement

Per patient, Phrank takes as input a list of patient phenotypes and a list of patient candidate genes. Based on user preference, Phrank will then output a score for each gene in the candidate gene list, or for each disease that according to the knowledgebase may be caused by one or more of the genes on the patient candidate gene list. The higher the score, the better the gene or disease is thought to explain the provided set of patient phenotypes.

Phrank computes a score measuring the similarity between a patient’s phenotypes and each gene/disease–phenotype set. Each phenotype’s contribution to the Phrank score is a function of the number of genes known to cause the phenotype. Intuitively, the fewer genes are known to cause a phenotype, the more impressive is a candidate gene’s ability to explain this patient phenotype, and the higher the score derived from such a match (see Fig. 1 and below for details).

**Fig. 1: The Phrank information content based score.**

Defining a Phrank score between any two sets of phenotypes

Given two sets of phenotypes, Φ_A and Φ_B, we define the Phrank score between them as the information content of the intersection of the ancestral closure of the two sets (see Fig. 1). In Supplemental Note 1 we show that this quantity can be computed using

$$\mathrm{Phrank}(\Phi_A, \Phi_B) = \sum_{\phi \in \mathrm{anc}(\Phi_A) \cap \mathrm{anc}(\Phi_B)} -\log_2\left(\frac{|G_\phi|}{|G_{pa(\phi)}|}\right)$$
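The formula can be sketched in Python on a toy DAG. This is only an illustration under stated assumptions, not the released Phrank code: the four-term ontology in `PARENTS`, the gene annotations in `DIRECT`, and the function names `anc`, `information`, and `phrank` are hypothetical, and the root is taken to contribute 0 bits because P(root) = 1.

```python
from math import log2

# Toy DAG (child -> parents) and direct gene annotations; all names are made up.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "C": ["A", "B"]}
DIRECT = {"g1": {"C"}, "g2": {"A"}, "g3": {"B"}, "g4": {"B"}}


def anc(terms):
    """Ancestral closure: the input terms plus all of their ancestors."""
    closure, stack = set(), list(terms)
    while stack:
        term = stack.pop()
        if term not in closure:
            closure.add(term)
            stack.extend(PARENTS[term])
    return closure


# G[phi]: genes annotated to phi directly or through any descendant of phi.
G = {term: set() for term in PARENTS}
for gene, terms in DIRECT.items():
    for term in anc(terms):
        G[term].add(gene)


def information(term):
    """-log2(|G_phi| / |G_pa(phi)|); the root contributes 0 bits."""
    if not PARENTS[term]:
        return 0.0
    parent_genes = set().union(*(G[p] for p in PARENTS[term]))
    # Terms in the closure of a knowledgebase phenotype set always have at
    # least one annotated gene, so the ratio is positive for scored terms.
    return -log2(len(G[term]) / len(parent_genes))


def phrank(phenotypes_a, phenotypes_b):
    """Sum the information of every term shared by the two closures."""
    shared = anc(phenotypes_a) & anc(phenotypes_b)
    return sum(information(term) for term in shared)
```

In this toy example, `phrank({"C"}, {"A"})` sums over the shared terms `A` and `root`, and `phrank({"A"}, {"B"})` sums over `root` alone, so sibling phenotypes with no shared informative ancestor score 0.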

The (patient-specific) Phrank score of genes and diseases

Given the above definition of the Phrank score between any two sets of phenotypes, the patient-specific Phrank score of a disease is simply the Phrank score between the set of phenotypes associated with the disease in our knowledgebase and the input set of patient phenotypes. The Phrank score of a gene is defined as the maximal Phrank score for a disease that can be caused by the gene, according to the knowledgebase.
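The step from disease scores to gene scores is a maximum over each gene's diseases. The sketch below assumes disease scores have already been computed; the gene and disease names and the score values are invented for illustration, and `gene_scores` and `rank_by_score` are hypothetical helper names.

```python
# Made-up patient-specific disease scores and a made-up gene -> diseases map.
DISEASE_SCORES = {"disease_1": 12.4, "disease_2": 3.1, "disease_3": 7.8}
GENE_TO_DISEASES = {
    "GENE_A": ["disease_1", "disease_2"],
    "GENE_B": ["disease_3"],
}


def gene_scores(gene_to_diseases, disease_scores):
    """Score each gene by the highest-scoring disease it can cause."""
    return {
        gene: max(disease_scores[d] for d in diseases)
        for gene, diseases in gene_to_diseases.items()
        if diseases  # genes without any known disease are left unscored
    }


def rank_by_score(scores):
    """Return names sorted from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)
```

With the toy values, `GENE_A` inherits the score of `disease_1` and is ranked ahead of `GENE_B`.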

Human Phenotype Ontology (HPO) and HPO Annotations (HPO-A)

To test Phrank’s performance we use the popular Human Phenotype Ontology¹⁰ (HPO). HPO is a rooted directed acyclic graph (DAG) that includes a hierarchical comprehensive description of human phenotypes. The phenotype DAG root node is labeled “Phenotypic Abnormality” (HP:0000118). Parent–child “is a” relationships that are part of the graph consist of a more general parent term and a more specific child term. For example, the term “Hypertrichosis” is the parent of the term “Elbow hypertrichosis.” A term can have multiple children, and multiple parents.

The Human Phenotype Ontology project also curates gene–phenotype–disease relationships¹⁰ based on the Online Mendelian Inheritance in Man (OMIM)⁸ manually curated knowledgebase. We refer to these as HPO-A (HPO Annotations), and have downloaded them from http://compbio.charite.de/jenkins/job/hpo.annotations.monthly/. HPO build 127 provides a rooted DAG over 11,156 distinct phenotypes, and HPO-A provides direct mappings between 3406 genes, 3995 diseases, and 5640 HPO phenotypes.

Benchmark set of 169 real diagnosed patients

A dataset of diagnosed patients was downloaded from the Deciphering Developmental Disorders (DDD) study²⁴ in the European Genome-Phenome Archive (EGA)²⁵ study EGAS00001000775. The DDD study recruited patients satisfying a wide range of neurodevelopmental disorders and congenital anomalies.²⁶ The dataset contains patient variant call files (VCFs), a list of HPO phenotypes, and the causative gene for each patient. From this set we removed cases with a sibling already in the set, cases offering a novel causative gene hypothesis, and cases where the disease was caused by large structural or mosaic variants. A board-certified clinical geneticist on our team independently reviewed these patients (without the use of a gene- or disease-ranking tool) and identified OMIM disease IDs that best explain each patient’s condition and causal gene. A total of 169 patients suffering from nearly a hundred different diseases were found to have high confidence diagnoses (see Supplementary Table 1).

Variant annotation

ExAC¹³ v0.3 and the 1000 Genomes Project¹² (KGP) were used to annotate variants with observed control population frequencies. ANNOVAR²⁷ v527 was used to annotate variants with predicted effect on protein-coding genes using gene isoforms from the Ensembl gene set^28,29 version 75 for the hg19/GRCh37 assembly of the human genome.

Variant filtering

We observed on average a total of 98,815 variants per individual. Patient variants are filtered to only keep those predicted by ANNOVAR to be nonsynonymous, stopgain, stoploss, splice affecting, frameshift indel, or nonframeshift indel mutations.⁷ Further, variants are filtered to a set of candidate disease-causing mutations based on allele frequency in the control population. Genetic variants are filtered to only keep those with an allele frequency of 0.1% or less in any ExAC or KGP population if they are heterozygous and do not co-occur with at least one other variant in the same transcript. Similarly, variants are filtered to only keep those with an allele frequency of 0.5% or less if they are homozygous or co-occur with at least one other variant in the same transcript. There are on average 281 variants per individual after following this common filtration strategy.³⁰ Variants are considered to be likely benign if they do not satisfy these criteria. The causative variants in all 169 patients satisfy these criteria and are properly retained in the respective candidate variant/gene list.
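The filtering rules above can be sketched as a two-pass filter. This is one possible reading rather than the authors' pipeline: it assumes each variant carries the highest allele frequency seen in any ExAC or KGP population (`max_af`), and that co-occurrence is counted among variants that already pass the effect filter and the 0.5% threshold. The `Variant` class, the effect labels, and the example records are hypothetical and do not reproduce ANNOVAR's output vocabulary.

```python
from collections import Counter
from dataclasses import dataclass

# Illustrative effect labels, not ANNOVAR's exact output strings.
KEPT_EFFECTS = {
    "nonsynonymous", "stopgain", "stoploss",
    "splicing", "frameshift indel", "nonframeshift indel",
}
HET_ALONE_MAX_AF = 0.001        # 0.1% for a lone heterozygous variant
HOM_OR_COMPOUND_MAX_AF = 0.005  # 0.5% for homozygous or co-occurring variants


@dataclass(frozen=True)
class Variant:
    transcript: str
    effect: str
    zygosity: str   # "het" or "hom"
    max_af: float   # assumed: highest frequency across control populations


def filter_variants(variants):
    """Keep candidate disease-causing variants under the two thresholds."""
    # Pass 1: effect filter plus the looser 0.5% frequency threshold.
    rare_coding = [
        v for v in variants
        if v.effect in KEPT_EFFECTS and v.max_af <= HOM_OR_COMPOUND_MAX_AF
    ]
    per_transcript = Counter(v.transcript for v in rare_coding)
    # Pass 2: a heterozygous variant alone in its transcript must meet 0.1%.
    kept = []
    for v in rare_coding:
        alone_het = v.zygosity == "het" and per_transcript[v.transcript] == 1
        if alone_het and v.max_af > HET_ALONE_MAX_AF:
            continue
        kept.append(v)
    return kept


EXAMPLE_VARIANTS = [
    Variant("T1", "nonsynonymous", "het", 0.0005),
    Variant("T2", "nonsynonymous", "het", 0.003),
    Variant("T3", "stopgain", "het", 0.003),
    Variant("T3", "nonsynonymous", "het", 0.004),
    Variant("T4", "synonymous", "hom", 0.0),
    Variant("T5", "frameshift indel", "hom", 0.004),
    Variant("T6", "nonsynonymous", "hom", 0.01),
]
```

On the example records, the lone heterozygous variant in `T2` and the common homozygous variant in `T6` are dropped, while the two co-occurring heterozygous variants in `T3` are both retained.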

Phenomizer, Phevor, and PhenIX gene/disease rankings

Phenomizer²² disease similarity scores were obtained for each patient by entering the patient’s HPO phenotypes into the Phenomizer website (http://compbio.charite.de/phenomizer/) and subsetting the output ranked list of all diseases to just those associated with a patient’s candidate genes containing possibly pathogenic variants.

Gene rankings by Phevor²⁰ were obtained for each patient by entering the patient’s HPO phenotypes into Phevor’s website (http://weatherby.genetics.utah.edu/phevor2/index.html) using an input file containing all 20,745 protein-coding genes in Ensembl build 75 and then subsetting the returned ranked list of all input protein-coding genes to the list of patient’s candidate genes.

Gene rankings by PhenIX²¹ were obtained for each patient by running Exomiser,³¹ an integrated variant filtering and prioritization tool, with no filters, on each patient’s filtered VCF file containing the patient’s possibly pathogenic variants.
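The subsetting steps described above for Phenomizer and Phevor amount to filtering a full ranked list down to a patient's candidates while preserving order. A minimal sketch follows; the helper names `subset_ranking` and `diseases_for_genes`, and the toy gene and disease labels in the usage note, are assumptions for illustration and not part of either tool.

```python
def subset_ranking(ranked_items, keep):
    """Keep only the items in `keep`, preserving the original rank order."""
    keep = set(keep)
    return [item for item in ranked_items if item in keep]


def diseases_for_genes(candidate_genes, disease_to_genes):
    """Diseases that can be caused by at least one candidate gene."""
    candidates = set(candidate_genes)
    return {
        disease
        for disease, genes in disease_to_genes.items()
        if candidates & set(genes)
    }
```

For example, given a full disease ranking `["d3", "d1", "d2"]`, a disease-to-gene map `{"d1": ["GENE_A"], "d2": ["GENE_B"], "d3": ["GENE_C"]}`, and candidate genes `["GENE_A", "GENE_B"]`, the subset ranking keeps `d1` ahead of `d2` and drops `d3`.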

Using AMELIE as the underlying knowledgebase

AMELIE (for Automatic Mendelian Literature Evaluation) is an alternative Mendelian gene–phenotype association knowledgebase. It is populated entirely using a natural language processing and machine learning approach, directly from the full-text primary literature itself. AMELIE links 12,295 genes to 6,669 HPO phenotypes. AMELIE does not extract disease names. Instead, it effectively trades the notion of a disease with that of a full-text paper. Every causal gene it extracts from a paper is linked to a set of HPO phenotypes extracted with it from the same paper. To compare Phrank on HPO-A with Phrank on AMELIE, we used the Phrank HPO-A based gene ranking as above. For AMELIE, paper replaced disease. In other words, the patient-specific AMELIE-based Phrank score of a gene was defined as the maximal Phrank score of any paper about the gene according to the AMELIE knowledgebase.

Phrank code availability

The Phrank source code will be available at https://bitbucket.org/bejerano/phrank.

Results

Patient benchmark set

Previous methods largely used simulated patients with artificially assigned phenotypes for performance evaluation.^20,21,22 On these synthetic sets, some of the methods we compare with ranked the causative gene at the top in 90% of cases. Taking advantage of the growing availability of real patient data, we curated a set of 169 patients from the Deciphering Developmental Disorders (DDD) study,²⁴ as described above (Methods and Supplementary Table 1). On average, each patient in the final set is characterized by 7.5 phenotypes, and carries a candidate list of 278.8 genes. As we show next, this real dataset poses a much bigger challenge to all tested methods.

Phrank greatly improves disease ranking

We gave Phrank as input each patient’s candidate gene list, along with the respective set of phenotypes. Phenomizer²² only takes as input the set of patient phenotypes, to rank all possible diseases. Both Phrank and Phenomizer provided as output a list of diseases ranked for their ability to explain the patient set of phenotypes. To compare the two methods, the Phenomizer output was subset only to diseases that according to OMIM can be caused by genes on the patient candidate list. For both methods, the rank of the correct disease diagnosis was noted. Phenomizer ranked the causative disease at the top in 3.6% of patients, in the top 5 in 17.2% of patients, and in the top 10 in 31.4% of patients. Across all 169 cases, Phenomizer ranks an average of 32 diseases ahead of the patient’s actual disease. In comparison, Phrank ranked the causative disease at the top in 26% of patients, in the top 5 in 55% of patients, and in the top 10 in 64.5% of patients (Fig. 2a), outperforming Phenomizer at all ranking thresholds. On average, Phrank ranks only 3 diseases before the patient’s actual disease. Phrank significantly outperforms Phenomizer (p ≤ 2.2 × 10⁻¹⁶, Wilcoxon signed rank test). Assuming equal time to evaluate each disease hypothesis, Phenomizer reduces clinician average time spent per patient by 25% compared with a randomly shuffled disease list baseline while Phrank reduces it by 90.9%, thus greatly accelerating diagnosis.

**Fig. 2: Phrank performance on a set of 169 diagnosed patients.**

Phrank modestly improves gene ranking

While ranking diseases enables the most natural way to convey recommendations to the attending clinician, some tools like Phevor²⁰ and PhenIX²¹ rank genes and not diseases. The Phrank score can be easily converted to this task by assigning to each gene the score of the highest scoring disease it is known to cause (Fig. 1 and Methods). Phrank, PhenIX, and Phevor results for each patient set of phenotypes were subset to the patient candidate gene list, and the rank of each known causative gene was collected. PhenIX and Phevor ranked the causative gene first in 15.4% and 21.3% of patients, and in the top 5 in 54.4% and 52.7% of patients, respectively. In comparison, Phrank ranked the causative gene at the top in 27.8% of patients, and in the top 5 in 56.8% of patients (Fig. 2b). Over all cases Phrank gives the causative gene an average rank of 9.5, while Phevor and PhenIX only achieve an average rank of 21 and 15.5, respectively.
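Summary statistics of the kind reported above (fraction of patients with the causative item in the top k, and the average number of candidates ranked ahead of it) can be computed from per-patient ranks as sketched below. The ranks in `TOY_RANKS` are invented and are not the study's data, the function names are hypothetical, and ranks are assumed to be 1-based with ties already resolved, since the text does not specify tie handling.

```python
def top_k_fraction(ranks, k):
    """Fraction of patients whose causative item is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_ranked_ahead(ranks):
    """Average number of candidates ranked ahead of the causative item."""
    return sum(r - 1 for r in ranks) / len(ranks)


# Invented 1-based ranks of the causative item for eight toy patients.
TOY_RANKS = [1, 3, 12, 5, 1, 40, 2, 7]

summary = {
    "top1": top_k_fraction(TOY_RANKS, 1),
    "top5": top_k_fraction(TOY_RANKS, 5),
    "top10": top_k_fraction(TOY_RANKS, 10),
    "mean_ahead": mean_ranked_ahead(TOY_RANKS),
}
```

Note that an average rank of r corresponds to r − 1 candidates ranked ahead on average, which is why the two quantities are computed separately here.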

Phrank works with any underlying knowledgebase

Some existing tools combine their ranking algorithm and underlying knowledgebase such that tool users can only use the algorithm with the provided knowledgebase. Phrank completely decouples the two, and can be used with any appropriately populated knowledgebase. For example, HPO-A¹⁰ is mostly based on manually curated gene–disease–phenotype associations, whereas AMELIE uses machine learning to extract Mendelian gene–phenotype associations directly from the primary literature itself (see Methods). HPO-A and AMELIE have been compared elsewhere. Our goal here is only to show that Phrank can easily be used interchangeably with both knowledgebases to reveal pros and cons of each approach as they reflect in their ranking of the different cases. In particular, using Phrank, AMELIE outperforms HPO-A at all thresholds over our patient set (Fig. 3).

**Fig. 3: Phrank can be easily used with different knowledgebases.**

Discussion

Phrank aspires to enable clinicians to accelerate diagnosis of Mendelian diseases by ranking more appealing disease or gene matches to each patient higher. Here, we compare Phrank with Phenomizer, a disease-ranking method, and with Phevor and PhenIX, two gene-ranking methods. Phrank measures the similarity of phenotype sets using a natural information theoretic formulation that quantifies how informative a phenotype is for diagnosis. It uses the number of genes known to cause a phenotype as a proxy for the likelihood of observing the phenotype itself. Computing anc(Φ) paired with the conditional probabilities allows neighboring phenotypes (e.g., from imprecise/overprecise phenotype annotation) to be included in the final similarity score without double counting the contribution from shared ancestors.

All three methods we compare with were previously benchmarked on simulated patients, likely due to the paucity of real patient cases. To evaluate these existing methods, phenotypes for simulated patients were drawn directly from gene–phenotype knowledgebases with some noise perturbation. The same knowledgebases are used by the algorithm for gene or disease ranking, likely resulting in somewhat overly optimistic performance. Indeed, on simulated data, Phenomizer reported that the causative disease is ranked at the top in over 75% of patients²² and Phevor²⁰ and PhenIX²¹ reported that the causative gene is ranked at the top in over 90% of patients. As shown in our study (Fig. 2), when using these phenotype-based ranking methods on real patients with clinician-noted phenotypes, performance drops significantly. We recommend that future disease- and gene-ranking methods use real patient sets, such as our Supplementary Table 1 EGA set, or other, to test their approaches.

Phrank is the first published method we are aware of that explicitly decouples its ranking method from both the underlying knowledgebase and the variant filtering scheme. Phrank is offered via both the ready-to-use AMELIE portal, as well as via a simple-to-use code package, offering clinicians maximum flexibility in incorporating Phrank into their preferred workflow.

We have focused our comparisons on Phevor, PhenIX, and Phenomizer because these methods for disease and gene ranking use minimal data beyond the phenotype DAG and annotations. Other tools such as Phenolyzer,³² eXtasy,³³ and Exomiser³¹ implement full variant prioritization methods incorporating a phenotype similarity measure, frequency-based variant filters, pathogenicity scores, and other helpful information. Phrank can effectively be used as the phenotype similarity measure and be incorporated into any such variant prioritization method.

While Phrank improves gene- and particularly disease-ranking methods, our set of real patients makes it clear there is still much room for improvement. Ideally, the causative gene or disease should be in the top 5 if not at the top of the ranked list of candidates with over 90% confidence. This improvement may come from improvements to the phenotype similarity algorithm, the underlying phenotype ontology’s relationship structure, and/or the knowledgebase of gene–disease–phenotype associations. Such improvements will be critical in handling the ever-growing flow of genomic data produced in the quest to better patient lives.

References

1. Iglesias A, Anyane-Yeboa K, Wynn J, et al. The usefulness of whole-exome sequencing in routine clinical practice. Genet Med. 2014;16:922–31. https://doi.org/10.1038/gim.2014.58
2. Yang Y, Muzny DM, Reid JG, et al. Clinical whole-exome sequencing for the diagnosis of Mendelian disorders. N Engl J Med. 2013;369:1502–11. https://doi.org/10.1056/NEJMoa1306555
3. Lee H, Deignan JL, Dorrani N, et al. Clinical exome sequencing for genetic identification of rare Mendelian disorders. JAMA. 2014;312:1880–7. https://doi.org/10.1001/jama.2014.14604
4. Ng SB, Bigham AW, Buckingham KJ, et al. Exome sequencing identifies MLL2 mutations as a cause of Kabuki syndrome. Nat Genet. 2010;42:790–3. https://doi.org/10.1038/ng.646
5. Ng SB, Turner EH, Robertson PD, et al. Targeted capture and massively parallel sequencing of 12 human exomes. Nature. 2009;461:272–6. https://doi.org/10.1038/nature08250
6. Ng SB, Buckingham KJ, Lee C, et al. Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010;42:30–35. https://doi.org/10.1038/ng.499
7. Wenger AM, Guturu H, Bernstein JA, Bejerano G. Systematic reanalysis of clinical exome data yields additional diagnoses: implications for providers. Genet Med. 2017;19:209–14. https://doi.org/10.1038/gim.2016.88
8. Amberger J, Bocchini C, Hamosh A. A new face and new challenges for Online Mendelian Inheritance in Man (OMIM®). Hum Mutat. 2011;32:564–7. https://doi.org/10.1002/humu.21466
9. Rath A, Olry A, Dhombres F, et al. Representation of rare diseases in health information systems: the Orphanet approach to serve a wide range of end users. Hum Mutat. 2012;33:803–8. https://doi.org/10.1002/humu.22078
10. Köhler S, Doelken SC, Mungall CJ, et al. The Human Phenotype Ontology Project: linking molecular biology and disease through phenotype data. Nucleic Acids Res. 2014;42(Database issue):D966–74.
11. Dewey FE, Grove ME, Pan C, et al. Clinical interpretation and implications of whole-genome sequencing. JAMA. 2014;311:1035–45. https://doi.org/10.1001/jama.2014.1717
12. 1000 Genomes Project Consortium. An integrated map of genetic variation from 1,092 human genomes. Nature. 2012;491:56–65. https://doi.org/10.1038/nature11632
13. Lek M, Karczewski KJ, Minikel EV, et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature. 2016;536:285–91. https://doi.org/10.1038/nature19057
14. Taylor JC, Martin HC, Lise S, et al. Factors influencing success of clinical genome sequencing across a broad spectrum of disorders. Nat Genet. 2015;47:717–26. https://doi.org/10.1038/ng.3304
15. Church G. Compelling reasons for repairing human germlines. N Engl J Med. 2017;377:1909–11. https://doi.org/10.1056/NEJMp1710370
16. Ng PC, Henikoff S. SIFT: predicting amino acid changes that affect protein function. Nucleic Acids Res. 2003;31:3812–4.
17. Adzhubei I, Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet. 2013;Chapter 7:Unit7.20. https://doi.org/10.1002/0471142905.hg0720s76
18. Kircher M, Witten DM, Jain P, et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet. 2014;46:310–5. https://doi.org/10.1038/ng.2892
19. Jagadeesh KA, Wenger AM, Berger MJ, et al. M-CAP eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat Genet. 2016;48:1581–6. https://doi.org/10.1038/ng.3703
20. Singleton MV, Guthery SL, Voelkerding KV, et al. Phevor combines multiple biomedical ontologies for accurate identification of disease-causing alleles in single individuals and small nuclear families. Am J Hum Genet. 2014;94:599–610. https://doi.org/10.1016/j.ajhg.2014.03.010
21. Zemojtel T, Köhler S, Mackenroth L, et al. Effective diagnosis of genetic disease by computational phenotype analysis of the disease-associated genome. Sci Transl Med. 2014;6:252ra123. https://doi.org/10.1126/scitranslmed.3009262
22. Köhler S, Schulz MH, Krawitz P, et al. Clinical diagnostics in human genetics with semantic similarity searches in ontologies. Am J Hum Genet. 2009;85:457–64. https://doi.org/10.1016/j.ajhg.2009.09.003
23. Koller D, Friedman N. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. Cambridge, MA: The MIT Press; 2009.
24. Deciphering Developmental Disorders Study. Large-scale discovery of novel genetic causes of developmental disorders. Nature. 2015;519:223–8. https://doi.org/10.1038/nature14135
25. Lappalainen I, Almeida-King J, Kumanduri V, et al. The European Genome-Phenome Archive of human data consented for biomedical research. Nat Genet. 2015;47:692–5. https://doi.org/10.1038/ng.3312
26. Wright CF, Fitzgerald TW, Jones WD, et al. Genetic diagnosis of developmental disorders in the DDD study: a scalable analysis of genome-wide research data. Lancet. 2015;385:1305–14. https://doi.org/10.1016/S0140-6736(14)61705-0
27. Wang K, Li M, Hakonarson H. ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res. 2010;38:e164. https://doi.org/10.1093/nar/gkq603
28. Aken BL, Ayling S, Barrell D, et al. The Ensembl gene annotation system. Database. 2016;baw093. https://doi.org/10.1093/database/baw093
29. Flicek P, Amode MR, Barrell D, et al. Ensembl 2014. Nucleic Acids Res. 2014;42(D1):D749–55. https://doi.org/10.1093/nar/gkt1196
30. Jagadeesh KA, Wu DJ, Birgmeier JA, et al. Deriving genomic diagnoses without revealing patient genomes. Science. 2017;357:692–5. https://doi.org/10.1126/science.aam9710
31. Smedley D, Jacobsen JOB, Jäger M, et al. Next-generation diagnostics and disease-gene discovery with the Exomiser. Nat Protoc. 2015;10:2004–15. https://doi.org/10.1038/nprot.2015.124
32. Yang H, Robinson PN, Wang K. Phenolyzer: phenotype-based prioritization of candidate genes for human diseases. Nat Methods. 2015;12:841–3. https://doi.org/10.1038/nmeth.3484
33. Sifrim A, Popovic D, Tranchevent L-C, et al. eXtasy: variant prioritization by genomic data fusion. Nat Methods. 2013;10:1083–4. https://doi.org/10.1038/nmeth.2656

Acknowledgements

We thank Yosuke Tanigawa, Ethan Dyer, Golan Yona, and all other members of the Bejerano Lab for valuable discussions and project feedback. We would also like to thank the European Genome-Phenome Archive (EGA) and the Deciphering Developmental Disorders (DDD) project. The DDD study presents independent research commissioned by the Health Innovation Challenge Fund (grant number HICF-1009-003), a parallel funding partnership between the Wellcome Trust and the Department of Health, and the Wellcome Trust Sanger Institute (grant number WT098051). The views expressed in this publication are those of the author(s) and not necessarily those of the Wellcome Trust or the Department of Health. The study has UK Research Ethics Committee approval (10/H0305/83, granted by the Cambridge South REC, and GEN/284/12 granted by the Republic of Ireland REC). The research team acknowledges the support of the National Institute for Health Research, through the Comprehensive Clinical Research Network, as well as the patients and professionals involved in the Deciphering Developmental Disorders (DDD) study deposited in the European Genome-Phenome Archive (EGA). This work was funded in part by the Stanford Graduate Fellowship and CEHG Fellowship to K.A.J., a Bio-X Stanford Interdisciplinary Graduate Fellowship to J.B., the Stanford Pediatrics Department, DARPA, a Packard Foundation Fellowship, and a Microsoft Faculty Fellowship to G.B.

Author information

These authors contributed equally: Karthik A. Jagadeesh, Johannes Birgmeier

Authors and Affiliations

Department of Computer Science, Stanford University, Stanford, California, 94305, USA
Karthik A. Jagadeesh MSc, Johannes Birgmeier MSc, Cole A. Deisseroth & Gill Bejerano PhD
Department of Pediatrics, Stanford University, Stanford, California, 94305, USA
Harendra Guturu PhD, Aaron M. Wenger PhD, Jonathan A. Bernstein MD, PhD & Gill Bejerano PhD
Department of Developmental Biology, Stanford University, Stanford, California, 94305, USA
Gill Bejerano PhD

Corresponding author

Correspondence to Gill Bejerano PhD.

Ethics declarations

Disclosure

The authors declare no conflicts of interest.

Electronic supplementary material

Supplementary Information

Supplementary Table 1

About this article

Cite this article

Jagadeesh, K.A., Birgmeier, J., Guturu, H. et al. Phrank measures phenotype sets similarity to greatly improve Mendelian diagnostic disease prioritization. Genet Med 21, 464–470 (2019). https://doi.org/10.1038/s41436-018-0072-y

Received: 27 November 2017
Accepted: 15 May 2018
Published: 12 July 2018
Issue Date: February 2019
DOI: https://doi.org/10.1038/s41436-018-0072-y
