Abstract
We present SplashRNA, a sequential classifier to predict potent microRNA-based short hairpin RNAs (shRNAs). Trained on published and novel data sets, SplashRNA outperforms previous algorithms and reliably predicts the most efficient shRNAs for a given gene. Combined with an optimized miR-E backbone, >90% of high-scoring SplashRNA predictions trigger >85% protein knockdown when expressed from a single genomic integration. SplashRNA can significantly improve the accuracy of loss-of-function genetics studies and facilitates the generation of compact shRNA libraries.
Main
Experimental RNA interference (RNAi) acts by providing exogenous sources of double-stranded RNA that mimic endogenous triggers and enable reversible, transcript-specific gene knockdown1. Whereas short interfering RNAs (siRNAs) allow for rapid gene knockdown, they are not suitable for many long-term and in vivo studies due to their transient nature. Stem-loop shRNAs can be used as a continuous source of RNAi triggers when expressed from suitable vectors, but suffer from various technical limitations including inaccurate processing2 and off-target effects through saturation of the endogenous microRNA machinery3,4,5. State-of-the-art microRNA-based shRNA vectors can overcome these limitations by providing a natural substrate of the RNAi pathway that is accurately and efficiently processed6,7,8,9, resulting in minimal or no off-target effects when expressed from a single genomic integration (single-copy)10,11. Still, our limited understanding of RNAi processing requirements and the lack of robust algorithms for the design of microRNA-based shRNAs with high potency and low off-target activity has hampered the utility of RNAi tools.
To understand the sequence requirements of potent RNAi and identify efficient microRNA-based shRNAs for any gene, we previously developed a functional high-throughput “Sensor” assay that enables biological assessment of tens of thousands of shRNAs in parallel (Supplementary Fig. 1a)10. We used this assay to generate focused and genome-wide shRNA libraries11,12. Furthermore, to increase the potency of all shRNAs, especially when expressed at single-copy, we established miR-E7, an optimized microRNA backbone that boosts processing efficiency7,13 and leads to stronger target knockdown when compared to standard miR-30 designs7.
To build an accurate miR-E shRNA predictor, we developed SplashRNA, a sequential learning algorithm combining two support vector machine (SVM) classifiers trained on judiciously integrated data sets (Supplementary Table 1). SplashRNA models the sequential advances in shRNA technology to enable efficient learning on unbiased and biased data (Fig. 1a,b). To train the algorithm, we generated a large-scale miR-30 data set (referred to as M1; Supplementary Fig. 1b–f) and a miR-E data set (referred to as miR-E; Supplementary Fig. 1g) using our RNAi Sensor and reporter assays, respectively (Supplementary Table 2)7,10. We also used the previously published TILE10 and UltramiR12 sets. TILE is unbiased as it was generated by complete tiling of nine genes. By contrast, M1, miR-E and UltramiR are based on preselected input libraries showing biased coverage of the sequence space and divergence in the nucleotide composition of potent shRNAs (Supplementary Fig. 1h). Yet, together these data sets comprehensively sample the distributions of features of non-functional and functional shRNAs. Effective integration of all sets is thus crucial for efficient miR-E shRNA prediction.
Combining diverse data sets presents a machine-learning challenge. Our approach of using a sequential classifier stems from classification strategies used in face detection14,15, where a first classifier evaluates simple face-like features to reject obvious non-faces and a second classifier evaluates refined features on retained potential faces. Similarly, SplashRNA contains a sequence of two SVM classifiers trained on miR-30 and miR-E data. The miR-30 classifier evaluates shRNA sequence features to reject obvious non-functional shRNAs, whereas the miR-E classifier evaluates refined sequence features for retained, potentially potent shRNAs (Fig. 1b and Supplementary Fig. 2a). Each classifier is composed of a combination of k-mer feature representations16,17. To capture AU content and position-specific k-mer features10, we represented an shRNA as a sum of a spectrum kernel on sequence positions 1–15, a spectrum kernel on sequence positions 16–22 and a weighted degree kernel on the entire sequence (Supplementary Fig. 2b). We found that this kernel combination yields the best performance (Supplementary Fig. 2c,d).
Initially, we trained the miR-30 classifier on the combined positives and negatives from the TILE and M1 sets (Supplementary Table 1). This yielded a classifier that scored well in validation tests but was outperformed by one trained on TILE alone (Supplementary Fig. 2e,f). The M1 negatives degraded the performance due to their biased selection and lowered the relative importance of the unbiased TILE negatives. Consequently, our best miR-30 classifier (SplashmiR-30) was obtained by training on a combined data set of TILE and M1 positives (Supplementary Fig. 2f–h). The miR-E classifier (SplashmiR-E) was trained on the miR-E + UltramiR data sets using the same kernel combination. For the final SplashRNA predictor, SplashmiR-30 and SplashmiR-E were combined by tuning the two hyperparameters theta (above which predictions are passed to the second classifier) and alpha (the relative weighting of the scores from the two classifiers; Fig. 1b). We calculated the precision-recall trade-off between the two classifiers and chose a theta and alpha that maintained the high performance of the first classifier while also predicting well on miR-E data. This sequential classification strategy outperformed linear convex classifiers on our data sets (Supplementary Fig. 3a–c).
When tested on miR-30 (Fig. 2a and Supplementary Fig. 4a–c) and miR-E (Fig. 2b and Supplementary Fig. 4d) data sets, SplashRNA clearly outperformed DSIR18, the current reference algorithm in the field (originally developed for siRNA design). SplashRNA also outperformed the miR-30-based shERWOOD algorithm on the UltramiR set (Supplementary Fig. 4e), compared to its published maximum performance12. Additionally, SplashRNA consistently showed the highest predictive performance on independent data sets when benchmarked against DSIR and two other shRNA prediction tools, sequence score19 (seqScore) and miR_Scan20.
We also observed the high performance of SplashRNA in two large-scale biological RNAi screens19,21, run with shRNAs functionally equivalent to miR-E (Supplementary Fig. 4f,g)22, which tested ∼25 preselected shRNAs per gene. In both cases, SplashRNA was able to retrospectively predict which shRNAs were potent and thus were enriched or depleted in the positive or negative selection screen, respectively. SplashRNA achieved the most significant difference in potency between its top five and bottom five predictions per targeted gene and was the only algorithm to reach significance in both screens (P < 0.01, one-sided Wilcoxon rank sum test). Top SplashRNA predictions also showed equally good or better accuracy compared to larger sets of preselected shRNAs when tested on a subset of the negative-selection screen that included only a previously established set of 'gold-standard' essential genes21,23. The top ten SplashRNA predictions identified true positives significantly better than the bottom ten (P < 0.001, empirical permutation test), minimizing off-target hit identification (Fig. 2c).
Robust shRNA prediction starts with the selection of the right transcript region. Analyses of unbiased TILE data showed that efficient shRNAs are more prevalent in 3′ UTRs compared to coding sequences and 5′ UTRs (Supplementary Fig. 5a), likely due to the shared high AU content (Supplementary Fig. 5b–d)10. Whereas 3′ UTRs often present ample design space because of their lengths, when validating top predictions in mouse fibroblasts, many shRNAs targeting the distal end of Pten resulted in minimal or no protein knockdown (Supplementary Fig. 5e and Supplementary Table 2). Inspection of the Pten mRNA (NCBI, NM_008960) revealed that all these shRNAs target regions past alternative cleavage and polyadenylation (ApA) signals, which lead to shorter transcript variants24 lacking the respective target sites (Supplementary Fig. 5f). Hence, to eliminate ApA as a source of non-functional shRNAs, we used ApA atlases25,26 to annotate the human and mouse reference transcriptomes (NCBI) and discard 3′ UTR portions that may be absent due to ApA. Similarly, we report predictions only on the intersection of all transcript variants for each gene and filter multi-matching sequences.
Testing an extensive set of individual de novo predictions targeting Pten, Bap1, Pbrm1, Rela, Bcl2l11, Axin1, NF2 and Cd9 (Supplementary Table 2) under single-copy conditions7 by conventional western blot analysis (Fig. 2d,e and Supplementary Fig. 6a–f) or flow-cytometry-based immunofluorescence of surface proteins (Supplementary Fig. 6g), we found that protein knockdown levels were very high: 91% of predictions (41/45) with a SplashRNA score of >1 showed >85% protein knockdown (Supplementary Fig. 6h). Even in the case of human NF2, a gene with nine annotated transcript variants that share only 198 nucleotides (excluding the 5′ UTR, Supplementary Fig. 6e), the top eight SplashRNA predictions triggered 77–96% (median 89%) protein suppression under single-copy conditions (Supplementary Fig. 6f). Additionally, Cd9 knockdown analyses in mouse fibroblasts showed that SplashRNA clearly outperforms DSIR in de novo prediction and achieves near knockout levels comparable to CRISPR–Cas9 (Supplementary Fig. 6g). Potent microRNA-based shRNAs have an equally low chance of off-target effects as non-functional sequences when expressed at single-copy11.
Extrapolating beyond the tested shRNAs, we calculated the proportion of genes for which SplashRNA would find at least five shRNAs above a given threshold (Fig. 2f). After shortening of transcripts due to ApA and considering only the intersection of all transcript variants per gene, we found that 87% of mouse genes and 81% of human genes have at least five shRNAs with SplashRNA scores above 1, corresponding to an 80% probability (e.g., four out of five shRNAs) of more than 85% knockdown at single-copy (Supplementary Fig. 6h).
Building on our Sensor assay and the optimized miR-E backbone, here we have established a robust algorithm to predict ultra-potent microRNA-based shRNAs targeting nearly any gene. SplashRNA is able to accurately predict the potency of independently validated and novel shRNAs and outperforms existing algorithms. Our sequential predictor approach facilitates the integration of biased and unbiased data sets and can serve as a blueprint for other prediction problems. An open source implementation of SplashRNA is accessible at http://splashrna.mskcc.org. Mouse and human genome-wide predictions are also provided separately (Supplementary Table 3).
Methods
MicroRNA-based shRNAs and minimization of off-target effects.
Though RNAi triggers can be expressed as simple stem-loop shRNAs from RNA polymerase III (Pol-III) promoters in mammalian cells, such strategies can lead to off-target effects associated with high shRNA expression levels3, likely due to saturation of the endogenous microRNA machinery27. Many Pol-III-based systems also suffer from inaccurate processing of precursor molecules2, yielding undesired mature small RNAs. In contrast, use of microRNA-embedded shRNAs expressed from RNA polymerase II (Pol-II) promoters results in accurate processing8,9 and can alleviate the toxic side effects4,5,28, especially when used at single genomic integration (single-copy)11. Notably, highly potent miR-30-based shRNAs expressed at single-copy show the same low levels or absence of off-target effects as analogous weak and non-functional sequences11. Hence, to develop an improved shRNA prediction algorithm, we focused on the optimized miR-E system that is based on the endogenous human MIR30A7.
Here, to determine the extent of sequence-based off-target effects, we applied the GESS algorithm29 to shRNAs validated by immunoblotting, and to previously reported Sensor assay and gene expression microarray results10,11. GESS analyzes 'genome-wide enrichment of seed sequence' matches. We tested whether potent shRNAs have more off-target effects than their weaker counterparts and whether any off-targets share common sequences.
First, to investigate sequence-based off-target effects, we analyzed RNA expression microarray data from Trp53−/− MEF cells infected at single or high copy with one of six Trp53 shRNAs11. Repetition of the published differential expression analysis found zero differentially expressed genes in the single-copy transfection setting relative to control experiments for either potent or weak shRNAs. In the high-copy transfection setting, 702 genes were upregulated and 326 genes were downregulated in the cells with potent shRNA with respect to control experiments (FDR < 0.05). Additionally, 2,437 genes were upregulated and 1,731 genes were downregulated in cells transfected with weak shRNA relative to their controls. Therefore, potent shRNAs in this setting did not induce more gene expression changes than weak shRNAs. Furthermore, both the potent and weak high-copy transfections resulted in near identical lists of differentially expressed genes: 702 genes were significantly upregulated in both lists and 324 genes were significantly downregulated in both lists. These intersections significantly overlapped (upregulated: P < 2.2 × 10^−16, downregulated: P < 2.2 × 10^−16, Fisher's exact test), indicating that the main changes in gene expression are similar regardless of potency or shRNA sequence composition.
Second, we applied the GESS algorithm29 to our validation shRNAs that were quantified by immunoblotting to determine potential sequence-based off-target effects in our current experiments. We attributed our shRNAs to three categories based on western blot knockdown: Low (less than 80% knockdown), Mid (between 80% and 95% knockdown), High (95% knockdown or greater). For each gene and potency-level group, we ran GESS and found the genes that were potentially targeted by all the shRNAs in the group. We found no statistically significant off-targeted genes by GESS (FDR < 0.1). We also tested if the level of potency is associated with the number of potential off-target genes as measured by the number of perfect 7-mer seed matches (nucleotides 2–8). Grouping shRNAs into three groups by percent knockdown, High: >95%, Medium: 90–95%, and Low: 80–90%, and testing for a significant difference in the number of gene seed matches found no statistically significant difference between any pair of groups (P = 0.74, 0.53, and 0.73 for Low vs. Medium, Low vs. High, and Medium vs. High, respectively).
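The seed-match count used above can be illustrated with a minimal Python sketch. It is not the GESS implementation: it assumes that a seed match is any occurrence of the reverse complement of guide nucleotides 2–8 anywhere in a transcript, and the names `seed_site` and `genes_with_seed_match` are hypothetical.

```python
# Complement table for RNA; DNA input is converted to RNA first.
_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_site(guide):
    """Return the transcript site complementary to guide positions 2-8."""
    seed = guide.upper().replace("T", "U")[1:8]  # 1-based positions 2..8
    return seed.translate(_COMPLEMENT)[::-1]     # reverse complement


def genes_with_seed_match(guide, transcripts):
    """List genes whose transcript contains a perfect 7-mer seed match.

    transcripts: dict mapping gene name -> transcript sequence
    (one sequence per gene is a simplifying assumption).
    """
    site = seed_site(guide)
    return sorted(
        gene
        for gene, seq in transcripts.items()
        if site in seq.upper().replace("T", "U")
    )
```

The length of the returned list is the per-shRNA quantity that would then be compared between potency groups.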
Third, we calculated all perfect 22-mer multi-mapping matches transcriptome-wide, since perfect matching of an shRNA to several genes would be highly undesirable. Consequently, we incorporated an additional feature into the SplashRNA algorithm and web site that alerts the user if a predicted hairpin perfectly matches multiple genes in the human or mouse transcriptomes (hg38, mm10).
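A multi-mapping check of this kind can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (one sequence per gene, exact string matching of the target site in sense orientation); the function name `multi_mapping_sites` is hypothetical and the snippet is not the code behind the web site.

```python
from collections import defaultdict


def multi_mapping_sites(sites, transcripts, k=22):
    """Return {site: [genes]} for target sites that occur in more than one gene.

    sites: iterable of k-mer target sites (sense orientation)
    transcripts: dict mapping gene name -> transcript sequence
    """
    index = defaultdict(set)
    for gene, seq in transcripts.items():
        # Record every k-mer of every transcript together with its gene.
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(gene)
    return {
        site: sorted(index[site])
        for site in sites
        if len(index.get(site, ())) > 1
    }
```

Any site returned by this function would trigger the alert described above.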
Sequence requirements of potent RNAi and prediction rules.
The initial rules of RNAi potency contained many non-sequence elements30,31,32, but later rules inferred from larger screens found that sequence-based features are more predictive18,33 and capture the other characteristics34. BIOPREDsi, a neural network approach, was trained on over 2,000 functionally tested siRNAs and set a new performance standard33. Using the same data set, DSIR improved prediction through the use of an L1 regularized linear model with a combination of position-specific nucleotide features and mono-, di-, and tri-nucleotide counts18,35. However, the rules governing siRNA potency differ from the ones dictating shRNA potency due to the additional biogenesis steps10,36, and siRNA-based algorithms perform relatively poorly in shRNA prediction tasks. Hence, we and others have previously used our large-scale data sets to generate miR-30-specific prediction algorithms12,20. Yet, with a shift toward the more efficiently processed miR-E backbone, these algorithms are no longer designed for the task at hand as key sequence requirements have changed (Fig. 1a).
TILE, mRas + hRAS, and shERWOOD data sets.
Over the years, a series of diverse shRNA potency data sets have been created, each having different characteristics and leveraging knowledge gained from previous studies. In the initial RNAi Sensor assay (referred to as “TILE”)10, we screened nearly 20,000 miR-30 based shRNAs that tiled nine mammalian genes in an unbiased manner to test all possible 22-mer sequences within these genes. This sampling strategy produces a low fraction of potent shRNAs. To reduce costs and increase the ratio of potent shRNAs, subsequent screens only assessed shRNAs that were predicted to be efficient by various in silico methods; these include the “mRas + hRAS”11 and “shERWOOD”12 data sets. These data sets contain a higher percentage of potent shRNAs (as assessed by immunoblotting and functional RNAi screens, data not shown; Supplementary Table 1), but also represent a biased sampling of the sequence space. Additionally, the recent shift toward the use of “miR-E type” backbones7,12,19 that contain a 5′-DCNNC-3′ motif in their 3′-flank for improved pri-miRNA processing7,13 has further increased the fraction of efficient shRNAs and altered the overall sequence requirements for potent RNAi by relaxing constraints of Drosha processing (Supplementary Fig. 1h).
Sensor assay and M1 data set generation.
The unbiased TILE data set includes a large and comprehensive representation of negatives, but has the drawback that it contains few positives (potent shRNAs). Using the Sensor assay10, we thus set out to establish a second large-scale miR-30 based data set containing a more comprehensive representation of positives (here referred to as M1; Supplementary Fig. 1a–f,h and Supplementary Table 2).
The Sensor assay evaluates pools of shRNAs under conditions of single-copy genomic integration (“single-copy”) for their ability to repress a cognate target sequence placed downstream of a fluorescent reporter expressed in cis. This surrogate system showed an 85–90% specificity in identifying potent shRNAs when compared to knockdown of the corresponding endogenous genes by immunoblotting10. Here, the Sensor assay was carried out as previously described10,11, with several improvements to enhance deep-sequencing library preparation and readout accuracy. To assemble the candidate list, 60 shRNAs per gene were selected using a combination of algorithmic predictions and “Sensor rules” requiring shRNA-specific features. Specifically, to generate the M1 shRNA Sensor library, a custom oligonucleotide array (Agilent Technologies) was designed containing 20,400 185-mer sequences (Supplementary Table 2). This included 19 standard Sensor control shRNAs spotted 5×, 325 performance control shRNAs that had been tested in previous Sensor assays spotted 1× (65 shRNAs per gene targeting mouse Bcl2, Kras, Mcl1, Myc and Trp53), and 19,980 shRNAs targeting 332 mouse genes and 1 rat gene (60 shRNAs per gene). For each of the 333 new genes, 300 primary predictions were generated by calculating the intersection of all transcript variants per gene (NCBI) and using DSIR18 supplemented with Sensor rules7,10 to further impose shRNA-specific sequence requirements. All shRNAs containing restriction sites used for cloning (XhoI, EcoRI, MluI, MfeI, BamHI) within the 60-nt target region encompassing the 22-nt guide sequence, as well as shRNAs closer than 15 nt to an artificial transcript junction (site where the common regions of transcript variants are joined), were eliminated. From the remaining set, the top 60 predictions per gene were selected, resulting in 20,324 unique sequences including the controls.
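The restriction-site filter and top-60 selection described above can be sketched as follows. The recognition sequences are the standard ones for these enzymes; the function names, the `(score, target_region)` tuple layout and the omission of the transcript-junction filter are assumptions made for illustration.

```python
# Recognition sites of the enzymes used for cloning (all palindromic,
# so scanning one strand is sufficient).
RESTRICTION_SITES = {
    "XhoI": "CTCGAG",
    "EcoRI": "GAATTC",
    "MluI": "ACGCGT",
    "MfeI": "CAATTG",
    "BamHI": "GGATCC",
}


def cloning_sites_in(region):
    """Return the names of cloning enzymes that cut within the target region."""
    region = region.upper().replace("U", "T")
    return [name for name, site in RESTRICTION_SITES.items() if site in region]


def select_candidates(candidates, per_gene=60):
    """Drop candidates containing a cloning site, then keep the top scorers.

    candidates: list of (score, target_region) tuples for one gene,
    where a higher score means a better prediction (an assumption).
    """
    clean = [c for c in candidates if not cloning_sites_in(c[1])]
    return sorted(clean, key=lambda c: c[0], reverse=True)[:per_gene]
```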
The vector libraries were constructed using the previously described two-step cloning procedure10,11. In step 1, oligonucleotides were amplified using the Sens5′Xho (5′-TACAATACTCGAGAAGGTATATTGCTGTTGACAGTGAGCG-3′, IDT) and Sens3′Mfe (5′-ATTCATCACAATTGTCCGCGTCGATCCTAGG-3′, IDT) primers, XhoI/MfeI (NEB) digested, and ligated into an XhoI/EcoRI (NEB) digested pTNL backbone vector. Ligation products were MfeI-HF (NEB) digested to reduce background noise. In step 2, the missing 3′ miR30-PGK-Venus fragment was cloned into the EcoRI/MluI sites, followed by BamHI-HF (NEB) digestion of the resulting ligation product to further reduce background noise. During each cloning step, a representation of at least 1,000-fold the complexity of the library was maintained. All cell culture and flow cytometry procedures of the Sensor assay, to gradually enrich for the most potent shRNAs, were conducted as previously described10,11.
High-throughput sequencing based quantification of library composition and analysis of changes in shRNA representation over sort cycles were carried out as previously described10,11, with several adaptations to enhance readout precision. In contrast to previous procedures, deep-sequencing template libraries were generated by PCR amplification of shRNA guide strands including adjacent 3′ flanking regions, from vector libraries or genomic DNA, leading to longer PCR products (361 nt). The forward primer binding to the shRNA loop, HiSeq_Loop (p7+loop, 5′-CAAGCAGAAGACGGCATACGAGATTAGTGAAGCCACAGATGT-3′, IDT), was shortened by one nucleotide in order for each PCR to start with the same base. To enable sequencing of pooled libraries, an index primer binding site and 6 nt indices were included in the reverse primers (HiSeq_Index-p5-N5, 5′-AATGATACGGCGACCACCGAGATCTGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNACTTGTGTAGCGCCAAGTGCCCAGC-3′, N = index, IDT). The indices used were (index, library): 5′-CGATGT-3′ for Vector 1, 5′-TTAGGC-3′ for Vector 2, 5′-TGACCA-3′ for Sort3-R1, 5′-ACAGTG-3′ for Sort3-R2, 5′-GCCAAT-3′ for Sort5-R1, 5′-CAGATC-3′ for Sort5-R2. All libraries were sequenced using the miR30EcoRIseq primer (5′-TAGCCCCTTGAATTCCGAGGCAGTAGGCA-3′, IDT) that reads reverse into the guide strand. Per library, 4 to 38 million initial sequencing reads were obtained (Illumina). For each shRNA and condition, the number of completely matching sequences was determined, normalized to the total reads per sample, and imported into a database for further analysis (Access 2007, Microsoft).
Deep sequencing after two-step cloning of the shRNA-Sensor libraries showed that >99.7% of all designed vectors were successfully constructed and detected in both replicates (Supplementary Table 2). Five iterative rounds of fluorescence-activated cell sorting, with gates set to progressively select for only the most functional shRNAs, enriched for potent shRNAs (Supplementary Fig. 1b,c), as previously shown10,11. While independent biological replicates correlated throughout the sorting procedure, correlation to the initial representation was progressively lost, showing that the assay specifically enriched potent shRNAs. The final Sensor score was uncorrelated to the initial representation (Supplementary Fig. 1d), and known controls behaved as expected and in high correlation with previous Sensor runs, even for non-functional shRNAs (Supplementary Fig. 1e,f). A Sensor score was computed as readout for shRNA potency (Supplementary Table 2). The Sensor score represents an integration of shRNA enrichment over all replicates: $\text{Sensor score} = \log_2\left(\text{eScoS3} \cdot \text{eScoS5}^{2} + 1\right)$, where $\text{eScoS3} = \text{geometric-mean}(S3)/\text{mean}(V)$ and $\text{eScoS5} = \text{geometric-mean}(S5)/\text{mean}(V)$. To avoid potential division by 0, the counts used for the calculations were reads (parts per million, p.p.m.) + 1. Potent shRNAs were identified for all genes, with a modest change in top score distribution across all assayed transcripts.
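As a worked illustration, the Sensor score can be computed from normalized read counts as in the Python sketch below. It assumes that the +1 pseudo-count is added to each p.p.m. value before any averaging, that mean(V) is an arithmetic mean over the vector replicates, and that the Sort 5 term enters squared; the function names are hypothetical.

```python
import math


def reads_to_ppm(count, total_reads):
    """Normalize a raw read count to parts per million of the sample."""
    return count / total_reads * 1e6


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def sensor_score(vector_ppm, sort3_ppm, sort5_ppm):
    """Sensor score = log2(eScoS3 * eScoS5**2 + 1) for one shRNA.

    Each argument is a list of p.p.m. values, one per replicate.
    """
    v = [x + 1 for x in vector_ppm]   # pseudo-count avoids division by 0
    s3 = [x + 1 for x in sort3_ppm]
    s5 = [x + 1 for x in sort5_ppm]
    mean_v = sum(v) / len(v)
    esco_s3 = geometric_mean(s3) / mean_v
    esco_s5 = geometric_mean(s5) / mean_v
    return math.log2(esco_s3 * esco_s5 ** 2 + 1)
```

Squaring the Sort 5 enrichment gives more weight to shRNAs that remain enriched in the final, most stringent sort.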
Reporter assay, miR-E data set generation and UltramiR data set.
We established a miR-E specific training data set (referred to as “miR-E”) by using a modified version of our Sensor assay specialized for high-accuracy one-by-one evaluation of shRNA potency7. This two-component RNAi reporter assay shows higher resolution in separating good shRNAs from the top candidates when compared to the pooled version. Using our neutral chicken reporter cell line10, we evaluated nearly 400 miR-E shRNAs targeting human and mouse genes in 42 individual batches (Supplementary Fig. 1g and Supplementary Table 2).
Candidate miR-E shRNAs were designed to target all transcript variants per gene (NCBI), and selected using DSIR18 supplemented with Sensor rules7,10. All candidate shRNAs were cloned into the LEPG vector for potency assessment, while double-stranded DNA gBlocks (IDT) were used to generate the target inserts of the respective TtNPT reporter vectors7. To produce stable reporter cell lines, ERC chicken reporter cells10 were infected with TtNPT viruses at high-copy, and selected in presence of doxycycline (0.5–1.0 μg/ml) and G418 (1,500–2,000 μg/ml). Experimental and control shRNAs were then transduced one-by-one, at single-copy (<20% infected cells), into the respective reporter cell lines. Reporter construct knockdown was quantified by flow cytometry 3–6 d after infection (LSR II, BD Biosciences), acquiring at least 1,000 and up to 5,000 live GFP+ cells for each sample (n > 1,000).
Since reporter transcript characteristics can affect relative knockdown performance in this assay, established controls (miR-E Ren.713, miR-30 Pten.1524, miR-E Pten.1523, miR-E Pten.1524) were used to monitor the performance of the assay, and scale the data for comparison across different batches and for training of the algorithm. All constructs were tested in 42 individual batches. After normalization and scaling, reference shRNAs and cell line controls showed tight potency distributions (Supplementary Fig. 1g), indicating robust assay performance. For training of the miR-E predictor, all gene-specific shRNAs were divided into a positive and negative class based on a threshold value of 80% reporter knockdown relative to controls, giving rise to two similarly sized populations.
To increase the size of the miR-E data set, we also used shRNA performance data from a pooled cell viability (negative selection) screen that was previously run using UltramiR shRNAs (referred to as “UltramiR”)12, which contain the same basic backbone structure as miR-E shRNAs. This screen quantified the depletion of cells expressing shRNAs targeting 78 essential genes, alongside negative controls. When taken together, the miR-E and UltramiR data established a robust set of examples representing miR-E specific processing requirements (Supplementary Table 1).
Assessing the potency of an shRNA for the TILE and M1 data sets.
A Sensor score was computed as readout for shRNA potency (Supplementary Table 2). The Sensor score represents an integration of shRNA enrichment over all replicates. The Sensor score for each shRNA sequence (x) was quantified as the log fold-change of the number of read counts (rho) between third sort (S3) and its respective vector library (v), averaged over replicates (r). Thus the potency score takes the form:
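Written out from the definitions above, the score can be sketched as follows. The base-2 logarithm, the symbol $R$ for the number of replicates and the explicit +1 pseudo-count on p.p.m. counts are assumptions introduced here for illustration:

```latex
\mathrm{score}(x) \;=\; \frac{1}{R} \sum_{r=1}^{R}
  \log_2 \frac{\rho_{S3,r}(x) + 1}{\rho_{v,r}(x) + 1}
```

Here $\rho_{S3,r}(x)$ and $\rho_{v,r}(x)$ denote the normalized read counts of shRNA $x$ in the third sort and in the corresponding vector library of replicate $r$.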
To avoid potential division by 0, the counts used for the calculations were reads (parts per million, p.p.m.) + 1. To distinguish positives from negatives and integrate the data sets, we defined score cutoffs based on the score distributions for each data set. The distribution of scores for the TILE data set gives a clear separation of positive and negative shRNAs (Supplementary Fig. 2c and Supplementary Table 1). Thus we selected a threshold at the minimum score density between the two modes. The M1 set was generated by selecting shRNAs that were likely to be potent, and therefore the score distributions of the negatives and positives are less distinct. To determine the label for different shRNAs in the M1 set, we fit each mode of the distribution with a Gaussian function. Using these two Gaussians we calculated two thresholds, one at a false-positive rate of 5% and one at a false-negative rate of 5% (Supplementary Fig. 2e, Supplementary Table 1) in order to define the positive and negative examples.
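The two-threshold labeling of the M1 set can be illustrated with Python's standard library. The sketch assumes that the Gaussian parameters of the two modes have already been fitted, that positives are scores above the 5% false-positive cutoff of the negative mode, and that negatives are scores below the 5% false-negative cutoff of the positive mode; scores in between stay unlabeled. The function names and this reading of the thresholds are assumptions.

```python
from statistics import NormalDist


def label_thresholds(neg_mean, neg_sd, pos_mean, pos_sd, fpr=0.05, fnr=0.05):
    """Return (negative_cutoff, positive_cutoff) from two fitted Gaussians."""
    negatives = NormalDist(neg_mean, neg_sd)
    positives = NormalDist(pos_mean, pos_sd)
    # Only a fraction `fpr` of the negative mode lies above this score.
    positive_cutoff = negatives.inv_cdf(1 - fpr)
    # Only a fraction `fnr` of the positive mode lies below this score.
    negative_cutoff = positives.inv_cdf(fnr)
    return negative_cutoff, positive_cutoff


def label(score, negative_cutoff, positive_cutoff):
    """Label a score; returns None for the ambiguous zone between the cutoffs."""
    if score >= positive_cutoff:
        return "positive"
    if score <= negative_cutoff:
        return "negative"
    return None
```

For overlapping modes the negative cutoff lies below the positive cutoff, so shRNAs with intermediate scores are excluded from training rather than forced into a class.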
Assessing the potency of an shRNA for the shERWOOD data set.
This data set was previously published12.
Assessing the potency of an shRNA for the miR-E data set.
The score for each shRNA in the miR-E set was calculated as the relative reporter knockdown level measured by flow cytometry, normalized to the knockdown level measured for miR-E Ren.713 and miR-30 Pten.1524 in the same batch. The data were scaled independently for each batch to set miR-E Ren.713 at 100% and miR-30 Pten.1524 at 60% relative knockdown. All shRNAs above 80% were classified as positive, while all shRNAs below 80% were classified as negative (Supplementary Figs. 1g, 4d and Supplementary Table 1).
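The per-batch scaling can be sketched as a two-point linear rescaling. That the scaling is linear is an assumption, as are the function names; the handling of a value of exactly 80% is not specified in the text and is treated as negative here.

```python
def scale_batch(raw, ren713_raw, pten1524_mir30_raw, high=100.0, low=60.0):
    """Rescale a raw knockdown value within one batch.

    The batch controls miR-E Ren.713 and miR-30 Pten.1524 are mapped to
    `high` (100%) and `low` (60%) relative knockdown, respectively.
    """
    slope = (high - low) / (ren713_raw - pten1524_mir30_raw)
    return low + (raw - pten1524_mir30_raw) * slope


def is_positive(scaled, threshold=80.0):
    """Positive class: scaled relative knockdown above the threshold."""
    return scaled > threshold
```

Anchoring every batch to the same two controls makes values from the 42 batches comparable before the 80% threshold is applied.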
Assessing the potency of an shRNA for the UltramiR data set.
The scores from the UltramiR cell viability screen were previously published (NCBI Gene Expression Omnibus, Series GSE62185)12. We limited our analysis to the shRNAs targeting 78 essential genes, as defined in the shERWOOD paper (Supplementary Table 2). UltramiR shRNAs were considered to be potent if they had a depletion score of less than −0.5 (Supplementary Fig. 4d).
Assessing the potency of an shRNA for the Essential genes data set.
This data set was previously published21. Phenotypes for each shRNA were calculated as the mean log2 fold-change for the two replicates. Gene-level scores were calculated as the mean phenotype for the five shRNAs with the most negative phenotypes for each gene.
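The two aggregation steps can be written compactly in Python. This is a minimal sketch with hypothetical function names, assuming that each shRNA has exactly two replicate log2 fold-changes.

```python
def shrna_phenotype(rep1_log2fc, rep2_log2fc):
    """Phenotype of one shRNA: mean log2 fold-change of the two replicates."""
    return (rep1_log2fc + rep2_log2fc) / 2


def gene_score(phenotypes, n=5):
    """Gene-level score: mean of the n most negative shRNA phenotypes."""
    most_negative = sorted(phenotypes)[:n]
    return sum(most_negative) / len(most_negative)
```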
Assessing the potency of an shRNA for the Sensitivity genes data set.
This data set was previously published19. Only shRNAs appearing in both replicates were used for the analyses. Hit genes were defined as those with a reported P-value less than 0.05. The top sensitivity genes were those with the most positive mean phenotypes of their top five targeting shRNAs. Phenotype is defined as log2 (toxin-treated/untreated).
Identifying gold-standard essential genes.
The set of gold-standard essential genes and gold-standard non-essential genes was previously published23. We reevaluated data from a published RNAi screen that used approximately 25 shRNAs per gene, or 4 sgRNAs per gene21, to assess the efficiency of SplashRNA predictions to identify hit genes. We ranked shRNAs according to their SplashRNA score and compared the mean cell depletion values for the top scoring shRNAs against the reported gene-level cell depletion values using the reported gold-standard genes. We found that a library made from the top ten SplashRNA predictions per gene performed at least as well as the full library when identifying the gold standard genes (Supplementary Fig. 2c). Additionally, a library created by selecting the ten lowest scoring SplashRNA predictions for each gene performed statistically worse than a library created by selecting the ten top scoring shRNAs per gene (P < 0.001, empirical permutation test). This shows that SplashRNA allows selecting superior shRNAs, which in turn decreases off-target effects by reducing the false-discovery rate. The need for fewer shRNAs per gene also enables minimizing the complexity of RNAi libraries for multiplexed screens.
Classifier kernel.
All SVMs were trained with the Shogun package37 using a weighted-degree kernel of order 22 and two spectrum kernels (k-mer length = 3). Each of our classifiers was constructed by the following kernel combination: ClassifierKernel = SpectrumKernel(pos1-15) + SpectrumKernel(pos16-22) + WeightedDegreeKernel(pos1-22) (Supplementary Fig. 2b,d).
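To make the kernel combination concrete, the pure-Python sketch below evaluates it for a pair of 22-nt sequences. It does not use Shogun and is not the authors' code: the weighted degree weights follow the commonly used form β_d = 2(D − d + 1)/(D(D + 1)), no kernel normalization is applied, and the function names are hypothetical.

```python
from collections import Counter


def spectrum_kernel(a, b, k=3):
    """Count shared k-mers (with multiplicity), ignoring their positions."""
    counts_a = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    counts_b = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return float(sum(n * counts_b[kmer] for kmer, n in counts_a.items()))


def weighted_degree_kernel(a, b, order=22):
    """Count position-specific matching substrings up to length `order`."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    length = len(a)
    max_d = min(order, length)
    total = 0.0
    for d in range(1, max_d + 1):
        beta = 2.0 * (max_d - d + 1) / (max_d * (max_d + 1))
        matches = sum(a[i:i + d] == b[i:i + d] for i in range(length - d + 1))
        total += beta * matches
    return total


def classifier_kernel(a, b):
    """Spectrum(pos 1-15) + Spectrum(pos 16-22) + WeightedDegree(pos 1-22)."""
    return (
        spectrum_kernel(a[:15], b[:15])
        + spectrum_kernel(a[15:22], b[15:22])
        + weighted_degree_kernel(a[:22], b[:22])
    )
```

The two spectrum terms capture position-independent composition within each region (for example AU content), whereas the weighted degree term rewards k-mers that match at the same position.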
Training the miR-30 classifier.
When fitting the regularization parameter C for our miR-30 SVM, we used leave-one-gene-out nested cross-validation. We grouped shRNAs from the TILE miR-30 data set by target gene into outer-folds. For each outer fold, we held out shRNAs targeting one gene and optimized the parameter C on the shRNAs targeting the remaining genes through tenfold cross-validation. The M1 positive set was added to all training sets but was not used for selection of C or for validation. Performance on the TILE set is reported on the outer held-out genes (Supplementary Fig. 2f). We trained our final classifier with the parameter setting C = 15 using all the TILE data and the M1 positive shRNAs. This classifier was used to predict on all other data sets.
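The nested scheme above can be sketched in Python as follows. The sketch only handles the index bookkeeping; `fit_and_score` is a hypothetical callback that would train an SVM with regularization parameter C on the training indices (adding any always-included examples such as the M1 positive set) and return a performance score on the held-out indices.

```python
import random


def leave_one_gene_out(genes):
    """Yield (held_out_gene, train_idx, test_idx) for each distinct gene."""
    for gene in sorted(set(genes)):
        test = [i for i, g in enumerate(genes) if g == gene]
        train = [i for i, g in enumerate(genes) if g != gene]
        yield gene, train, test


def k_fold(indices, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs from a shuffled k-way partition."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i, val in enumerate(folds):
        if not val:
            continue
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, val


def select_c(train_idx, candidates, fit_and_score, k=10, seed=0):
    """Return the candidate C with the best mean inner-fold score."""
    best_c, best_mean = None, float("-inf")
    for c in candidates:
        scores = [
            fit_and_score(tr, va, c) for tr, va in k_fold(train_idx, k, seed)
        ]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_mean:
            best_c, best_mean = c, mean_score
    return best_c


def nested_leave_one_gene_out(genes, candidates, fit_and_score):
    """For each held-out gene, tune C on the other genes, then score it."""
    results = {}
    for gene, train, test in leave_one_gene_out(genes):
        c = select_c(train, candidates, fit_and_score)
        results[gene] = (c, fit_and_score(train, test, c))
    return results
```

Because C is chosen without access to the held-out gene, the outer scores estimate performance on genes that were not seen during training or tuning.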
Training the miR-E classifier.
We used nested tenfold cross-validation to fit the C parameter for our miR-E SVM. We did not use leave-one-gene-out due to the lower number of shRNAs targeting each gene. The miR-E and UltramiR shRNAs were combined and split into ten outer folds. Within each of these folds, tenfold cross validation was performed to determine the optimal C parameter for that fold. Performance on the miR-E and UltramiR sets is reported on the outer held-out folds (Supplementary Fig. 3c). We trained our final classifier with the parameter setting C = 15 using all the miR-E and UltramiR data. This classifier was used to predict on all other data sets.
Calculating sequential predictor (SplashRNA) scores.
The potency scores for all shRNAs are first calculated using the miR-30 classifier. If the score does not exceed the threshold theta, this partial score is the final score for the shRNA. If the score does exceed the threshold, the final score is a weighted combination of the predicted scores from the miR-30 and miR-E classifiers:
SplashRNA(x) = SVMmiR30(x) if SVMmiR30(x) ≤ θ, and SplashRNA(x) = αSVMmiR30(x) + (1 − α)SVMmiRE(x) otherwise.
Here x is the sequence of the shRNA to be evaluated, alpha (α) is the mixing proportion between the two classifiers and theta (θ) is the threshold.
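The sequential rule can be sketched in a few lines of Python. The sketch assumes that α weights the miR-30 score, in line with the linear-combination expression given under "Optimizing the sequential predictor", and it uses the α and θ values reported there as defaults. The function and argument names are hypothetical.

```python
def splash_score(mir30_score, mir_e_score_fn, alpha=0.6, theta=1.1):
    """Sequential score for one shRNA.

    mir30_score: output of the miR-30 classifier for the sequence.
    mir_e_score_fn: zero-argument callable returning the miR-E
        classifier output; it is only called when the miR-30 score
        exceeds theta, mirroring the sequential evaluation.
    """
    if mir30_score <= theta:
        # Below or at the threshold: the partial score is final.
        return mir30_score
    # Above the threshold: mix both classifier outputs.
    return alpha * mir30_score + (1 - alpha) * mir_e_score_fn()
```

Passing the second classifier as a callable makes explicit that sequences rejected by the first classifier never reach the second one.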
Optimizing the sequential predictor.
After analysis of the precision-recall trade-off between the miR-30 and miR-E classifiers, we set alpha to 0.6 and theta to 1.1 to retain good performance on miR-30 classification. This performance is unattainable by a simple linear classifier αSVMmiR30 + (1 − α)SVMmiRE (Supplementary Fig. 3a–c).
Calculation of DSIR scores.
DSIR scores were calculated according to the published 21-nt linear model18,35.
Calculation of sequence score (seqScore) scores.
Scores were calculated as previously described19.
Calculation of miR_Scan scores.
Scores were calculated using software provided by the authors20.
Calculation of intersections of all transcript variants per gene.
Genomic regions and annotations for hg38 and mm10 were downloaded using the makeTranscriptDbFromUCSC function from the GenomicFeatures Bioconductor package38,39. Transcript variants were grouped by gene using their Entrez ID and regions shared across all RefSeq transcript variants were calculated in R using the BiocGenerics intersect function. Sequences for these intersections were then extracted using the BSgenome.Hsapiens.UCSC.hg38 and BSgenome.Mmusculus.UCSC.mm10 packages.
Primary data for hg38 was obtained from: Team TBD. BSgenome.Hsapiens.UCSC.hg38: Full genome sequences for Homo sapiens (UCSC version hg38). R package version 1.4.1.
Primary data for mm10 was obtained from: Team TBD. BSgenome.Mmusculus.UCSC.mm10: Full genome sequences for Mus musculus (UCSC version mm10). R package version 1.4.0.
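The intersection step was performed in R; the following Python sketch only illustrates the underlying interval logic for a single gene. It assumes that all transcript variants lie on the same chromosome and strand and that exons are given as half-open (start, end) pairs, whereas Bioconductor ranges are 1-based and closed. All names are hypothetical.

```python
def intersect_two(a, b):
    """Intersect two sorted lists of non-overlapping (start, end) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def shared_regions(transcripts):
    """Regions present in every transcript variant of one gene.

    transcripts: list of exon lists, one per transcript variant.
    """
    if not transcripts:
        return []
    shared = sorted(transcripts[0])
    for exons in transcripts[1:]:
        shared = intersect_two(shared, sorted(exons))
    return shared
```

A gene with many variants can end up with a very short shared region, which limits the sequence space available for shRNA design.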
Cell culture.
Phoenix HEK293T viral packaging cells were grown in DMEM supplemented with 10% fetal bovine serum (FBS), 100 U/ml penicillin and 100 μg/ml streptomycin (100-Pen-Strep). ERC chicken reporter cells were grown in DMEM supplemented with 10% FBS, 1 mM sodium pyruvate and 100-Pen-Strep, and frozen in 5% DMSO, 70% FBS and 25% culture medium. NIH/3T3 (ATCC) cells were maintained in DMEM with 10% bovine calf serum or 10% FBS containing 100-Pen-Strep and were tested for absence of mycoplasma contamination. A375 cells (kind gift from Neal Rosen, MSKCC) were maintained in DMEM with 10% FBS and 100-Pen-Strep. All cell cultures were maintained in a 37 °C incubator at 5% CO2.
Retroviral transduction.
Cells were transduced as previously described10. Transduction efficiency was assessed 48 h after infection by quantification of fluorescent reporters using flow cytometry (Guava EasyCyte, Millipore). Where a specific infection rate was desired, test infections were carried out at different dilutions and the ideal infection ratios were deduced. All shRNAs were assessed at single-copy genomic integration (“single-copy”) by infecting target cell populations at <20% of their maximal infection rate, guaranteeing <2% cells with multiple integrations10. Transduced cell populations were usually selected 48 h after infection, using 1.0–2.0 μg/ml puromycin (Sigma-Aldrich) or 500–2,000 μg/ml G418 (Geneticin, Gibco-Invitrogen).
Immunoblotting.
Cells were transduced at single-copy with the constitutive retroviral vector LEPG7 expressing the indicated miR-E shRNA constructs. NIH/3T3 or A375 cell pellets were lysed in Laemmli buffer (100 mM Tris-HCl pH 6.8, 5% glycerol, 2% SDS, 5% 2-mercaptoethanol). Equal amounts of protein were separated on SDS-polyacrylamide gels and transferred to PVDF membranes. The abundance of β-actin (ACTB, Actb) was monitored to ensure equal loading. Images were analyzed using the AlphaView software (ProteinSimple) and quantified by ImageJ. Immunoblotting was performed using antibodies for Pten (1:1,000, Cell Signaling Technology, #9188, https://media.cellsignal.com/pdf/9188.pdf), Bap1 (1:500, Bethyl Laboratories, #A302-243A, http://www.bethyl.com/product/pdf/A302-243A.pdf), Pbrm1 (1:500, Bethyl Laboratories, #A301-591A, https://www.bethyl.com/product/pdf/A301-591A.pdf), NF2 (1:1,000, Abcam, #ab109244, http://www.abcam.com/NF2-Merlin-antibody-EPR25732-ab109244.pdf), Axin1 (1:1,000, Cell Signaling Technology, #2087, https://media.cellsignal.com/pdf/2087.pdf), Bcl2l11 (a.k.a. Bim, 1:1,000, Cell Signaling Technology, #2933, https://media.cellsignal.com/pdf/2933.pdf), Rela (a.k.a. NFκB p65, 1:1,000, Santa Cruz, sc-372, https://datasheets.scbt.com/sc-372.pdf) and β-actin (1:10,000, Sigma-Aldrich, clone AC-15, http://www.sigmaaldrich.com/content/dam/sigma-aldrich/docs/Sigma/Datasheet/6/a5441dat.pdf).
Evaluation of shRNA and CRISPR-Cas9 based suppression of Cd9 in immortalized MEFs.
miR-E shRNAs targeting murine Cd9 were designed using SplashRNA or our previous design strategy involving DSIR18 predictions filtered by “Sensor rules”10,40. The six top predictions from each algorithm were cloned into RT3CEN (TRE3G-mCherry-miRE-PGK-Neo; generated based on RT3GEN7). sgRNAs were cloned into a retroviral vector (RU6sgC; pSIN.U6.sgRNA-EF1as-mCherry), which we constructed based on the pQCXIX backbone (Clontech). Parallel Tet-inducible shRNA and CRISPR-Cas9 based loss-of-function studies were performed in immortalized double-transgenic MEFs (CRT-MEFs) constitutively expressing Cas9 and rtTA-M2 from transgenic knock-in alleles at the Rosa26 loci. These MEFs were isolated from Rosa26.CAGGS-Cas9.P2A.GFP41; Rosa26.rtTA-M2 (ref. 42) double-transgenic embryos (using standard protocols) and immortalized through retroviral transduction of a potent shRNA targeting Trp53 (MSCV-shTrp53.814), followed by serial passaging. Retroviral shRNA/sgRNA expression vectors were packaged using standard calcium-phosphate based transfection into Platinum-E cells (Cellbiolabs) and transduced into CRT-MEFs and RRT-MEFs10 under strict single-copy conditions, as previously described10. Two days post-infection, shRNA expression was induced through addition of doxycycline (1 μg/ml); 6 d later cells were stained for surface Cd9 expression (Anti-mouse Cd9-APC, eBioscience, #17-0091-82). Cd9 expression was analyzed in mCherry+/shRNA-expressing cells and quantified by flow cytometry (LSR-II Fortessa, BD Biosciences). The sgRNA transduced cells were analyzed in the same way, quantifying Cd9 expression in mCherry+/sgRNA-expressing cells 8 d post-infection.
Statistical analysis.
Specific statistical tests used are indicated in all cases.
Code availability.
Source code that implements the main SplashRNA algorithm is provided (Supplementary Code).
Data availability.
Screen data from the M1 Sensor assay and the miR-E reporter assay are provided (Supplementary Table 2). UltramiR data is also provided (Supplementary Table 2). Data from the other screens used for SplashRNA training and validation (Supplementary Table 1) have been previously published as reported.
References
Fellmann, C. & Lowe, S.W. Nat. Cell Biol. 16, 10–18 (2014).
Guda, S. et al. Mol. Ther. 23, 1465–1474 (2015).
Grimm, D. et al. Nature 441, 537–541 (2006).
McBride, J.L. et al. Proc. Natl. Acad. Sci. USA 105, 5868–5873 (2008).
Baek, S.T. et al. Neuron 82, 1255–1262 (2014).
Zuber, J. et al. Nat. Biotechnol. 29, 79–83 (2011).
Fellmann, C. et al. Cell Rep. 5, 1704–1713 (2013).
Gu, S. et al. Cell 151, 900–911 (2012).
Watanabe, C., Cuellar, T.L. & Haley, B. RNA Biol. 13, 25–33 (2016).
Fellmann, C. et al. Mol. Cell 41, 733–746 (2011).
Yuan, T.L. et al. Cancer Discov. 4, 1182–1197 (2014).
Knott, S.R.V. et al. Mol. Cell 56, 796–807 (2014).
Auyeung, V.C., Ulitsky, I., McGeary, S.E. & Bartel, D.P. Cell 152, 844–858 (2013).
Viola, P. & Jones, M. Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. 1, 511–518 (2001).
Pelossof, R. Learning with Stochastic Focus of Attention PhD thesis, (Columbia Univ. 2011).
Leslie, C., Eskin, E. & Noble, W.S. Pac. Symp. Biocomput. 575, 564–575 (2002).
Sonnenburg, S., Rätsch, G. & Rieck, K. Large scale learning with string kernels. Large-scale Kernel Machines. (eds. Bottou, L., Chapelle, O., DeCoste, D. & Weston, J.) 73–104 (MIT Press, Cambridge, MA 2007).
Vert, J.P., Foveau, N., Lajaunie, C. & Vandenbrouck, Y. BMC Bioinformatics 7, 520 (2006).
Kampmann, M. et al. Proc. Natl. Acad. Sci. USA 112, E3384–E3391 (2015).
Matveeva, O.V., Nazipova, N.N., Ogurtsov, A.Y. & Shabalina, S.A. Front. Genet. 3, 163 (2012).
Morgens, D.W., Deans, R.M., Li, A. & Bassik, M.C. Nat. Biotechnol. 34, 634–636 (2016).
Kampmann, M., Bassik, M.C. & Weissman, J.S. Proc. Natl. Acad. Sci. USA 110, E2317–E2326 (2013).
Hart, T., Brown, K.R., Sircoulomb, F., Rottapel, R. & Moffat, J. Mol. Syst. Biol. 10, 733 (2014).
Spies, N., Burge, C.B. & Bartel, D.P. Genome Res. 23, 2078–2090 (2013).
Derti, A. et al. Genome Res. 22, 1173–1183 (2012).
Lianoglou, S., Garg, V., Yang, J.L., Leslie, C.S. & Mayr, C. Genes Dev. 27, 2380–2396 (2013).
Yi, R., Doehle, B.P., Qin, Y., Macara, I.G. & Cullen, B.R. RNA 11, 220–226 (2005).
Boudreau, R.L., Martins, I. & Davidson, B.L. Mol. Ther. 17, 169–175 (2009).
Sigoillot, F.D. et al. Nat. Methods 9, 363–366 (2012).
Khvorova, A., Reynolds, A. & Jayasena, S.D. Cell 115, 209–216 (2003).
Reynolds, A. et al. Nat. Biotechnol. 22, 326–330 (2004).
Schwarz, D.S. et al. Cell 115, 199–208 (2003).
Huesken, D. et al. Nat. Biotechnol. 23, 995–1001 (2005).
Saetrom, P. & Snøve, O. Biochem. Biophys. Res. Commun. 321, 247–253 (2004).
Filhol, O. et al. PLoS One 7, e48057 (2012).
Taxman, D.J. et al. BMC Biotechnol. 6, 7 (2006).
Sonnenburg, S. et al. J. Mach. Learn. Res. 11, 1799–1802 (2010).
Huber, W. et al. Nat. Methods 12, 115–121 (2015).
Lawrence, M. et al. PLoS Comput. Biol. http://dx.doi.org/10.1371/journal.pcbi.1003118 (2013).
Dow, L.E. et al. Nat. Protoc. 7, 374–393 (2012).
Platt, R.J. et al. Cell 159, 440–455 (2014).
Hochedlinger, K., Yamada, Y., Beard, C. & Jaenisch, R. Cell 121, 465–477 (2005).
Acknowledgements
We thank J.A. Doudna, G.J. Hannon, L.E. Dow and S.N. Floor for continuous support and valuable discussions. We gratefully acknowledge assistance and support from A. Banito, V. Sridhar, L. Faletti, C.C. Chen and S. Tian. C.F. was supported in part by a K99/R00 Pathway to Independence Award (K99GM118909) from the National Institutes of Health (NIH), National Institute of General Medical Sciences (NIGMS). C.F. is a founder of Mirimus Inc., a company that develops RNAi-based reagents and transgenic mice. This work was also supported in part by grant CA013106 (S.W.L.). S.W.L. is a founder and member of the scientific advisory board of Mirimus Inc., the Geoffrey Beene Chair of Cancer Biology at MSKCC and an investigator of the Howard Hughes Medical Institute. J.Z. is a member of the scientific advisory board, and P.K.P. is a founder and employee of Mirimus Inc. C.S.L. was supported in part by NHGRI U01 grants HG007033 and HG007893 and NCI U01 grant CA164190. A375 cells were a kind gift from Neal Rosen, MSKCC.
Author information
Contributions
R.P., L.F., C.S.L. and C.F. conceived and designed the study, and developed the data integration framework. R.P., L.F., and C.W. built the algorithm, and carried out the model training and computational validation. C.-H.H., N.S., D.-Y.L., Y.G., P.K.P., D.F.T., T.H., J.Z., S.W.L. and C.F. generated the biological data sets and validated knockdown potency. R.P., L.F., C.W. and V.T.S. built the web page. V.T. and G.R. assisted with study design and advised on algorithmic development. Q.X. and R.J.G. helped with validation of predictions. R.P., L.F., C.-H.H., T.H., J.Z., S.W.L., C.S.L. and C.F. analyzed data and wrote the manuscript.
Ethics declarations
Competing interests
C.F. is a founder of Mirimus Inc., a company that develops RNAi-based reagents and transgenic mice. S.W.L. is a founder and member of the scientific advisory board of Mirimus Inc. J.Z. is a member of the scientific advisory board of Mirimus Inc. P.K.P. is a founder and employee of Mirimus Inc. R.P. and L.F. have filed intellectual property on SplashRNA.
Integrated supplementary information
Supplementary Figure 1 Data set generation.
(a-f) Generation of the M1 (miR-30, 20,400 shRNAs) Sensor assay data set (Supplementary Table 2, Online Methods).
(a) Schematic of our previously published Sensor assay that enables large-scale functional assessment of shRNA potency (Online Methods).
(b) Library complexity over Sensor assay sort cycles. Shown are normalized read numbers (parts per million, ppm) in both duplicates for each shRNA represented within the initial libraries (Vector) and the pools after the indicated sorts (Sort 3, 5).
(c) Correlation of reads per shRNA between the two replicates before sorting (left panel), after Sort 5 (middle panel) and between the initial and endpoint population (right panel; shown for one representative replicate). r, Pearson correlation coefficient.
(d) Correlation of Sensor score and reads per shRNA in the vector libraries, showing that the score is independent of the initial shRNA representation. r, Pearson correlation coefficient.
(e) Enrichment or depletion of 17 control shRNAs after Sort 5. All controls have been used in previous Sensor assays (e.g. TILE, mRas + hRAS) and are classified into a strong, intermediate and weak class according to their knockdown potency assessed by immunoblotting.
(f) Rank correlation of 325 performance control shRNAs. 65 shRNAs per gene targeting mouse Bcl2, Kras, Mcl1, Myc and Trp53 that had previously been tested as part of the TILE data set were chosen as supplemental controls to assess Sensor assay performance for weak, intermediate and strong shRNAs. The individual shRNA ranks between TILE and M1 were highly correlated (325 shRNAs, Spearman rank correlation coefficient rho: 0.63; gene-specific correlation coefficients are also reported), even though the TILE and M1 data sets were generated several years apart, using mostly different equipment, reagents and operators.
(g) Generation of the miR-E reporter assay data set (Supplementary Table 2, Online Methods). Normalized reporter knockdown values of miR-E shRNAs assessed one-by-one in an RNAi reporter assay. The shRNAs were tested in 42 individual batches, each including several control shRNAs for data scaling (miR-E Ren.713, miR-30 Pten.1524) and quality control (miR-E Pten.1523, miR-E Pten.1524). Background fluorescence of the parental chicken cell line (ERC) and maximal fluorescence of the batch-specific reporter cell line (ERC cells expressing the shRNA target reporter) were also measured. All shRNAs were grouped into either a positive or negative class. A threshold value of 80 was chosen as a cutoff, based on the performance of miR-30 Pten.1524 and miR-E Ren.713.
(h) Nucleotide representation of positive shRNAs from the indicated data sets. Shown are nucleotides one to eight of the guide strand (starting in the center), including the entire seed region. Unbiased TILE (miR-30) set, showing a diversified nucleotide composition (left panel). Preselected M1 (miR-30, DSIR + Sensor rules selected) set, showing a biased nucleotide representation (middle panel). Preselected miR-E + UltramiR set, showing a different nucleotide bias due to the altered shRNA backbone (right panel). More shRNAs starting with a C were found to be potent (compared to TILE, p = 0.002, Fisher’s exact test), indicating less restrictive sequence requirements when using the miR-E backbone.
Supplementary Figure 2 Kernel selection and data integration.
(a) Schematic of the first support vector machine (SVM) classifier that serves to eliminate non-functional sequences and prioritize shRNAs that are likely to be potent.
(b) Schematic of the kernel representation used by SplashRNA. A weighted degree kernel is calculated across the entire guide sequence, while two spectrum kernels are calculated across nucleotides 1-15 and 16-22, respectively.
(c) TILE score distribution (Online Methods). We set a potency threshold separating the negative from the positive class at the minimal point between the two modes of the distribution (green line, for thresholds see Supplementary Table 1).
(d) Testing of multiple kernel combinations in a leave-one-gene-out nested cross-validation setting on the TILE data set found that the combination of a weighted degree kernel over positions 1-22 and two spectrum kernels at positions 1-15 and 16-22 (All_kernels) yields the best performance. Spec1 is a spectrum kernel over positions 1-15. Spec2 is a spectrum kernel over positions 16-22. Spec1_spec2 is a combination of spec1 and spec2. Wdk is a weighted degree kernel over positions 1-22. Wdk_spec1 is a combination of wdk and spec1. Wdk_spec2 is a combination of wdk and spec2. All_kernels is a combination of wdk, spec1 and spec2.
(e) M1 score distribution (Supplementary Table 1, Online Methods). Cutoffs (green lines) were calculated by fitting Gaussian distributions to the modes and setting thresholds at 5% false positive rate (FPR) and 5% false negative rate (FNR).
(f) Incorporation of M1 positives, negatives or both into the TILE training set was tested in a nested leave-one-gene-out cross-validation setting. Inclusion of M1 negatives deteriorated performance on the TILE data set, whereas inclusion of the M1 positives alone improved performance. Note: TILE+M1pos = SplashmiR-30, the miR-30 classifier.
(g) Score distribution for the shERWOOD miR-30 set (Supplementary Table 1, Online Methods). We set the threshold at an arbitrary cutoff of zero (green line).
(h) Incorporation of M1 positives into the TILE training set improved performance on the external shERWOOD data set. Note: TILE+M1pos = SplashmiR-30, the miR-30 classifier.
Supplementary Figure 3 Calibration of the sequential SVM classifier SplashRNA.
(a) Precision-recall trade-off between the two classifiers SplashmiR-30 and SplashmiR-E. Selection of alpha (α) and theta (θ) hyperparameters leads to varied performance (area under the precision-recall curve, auPR) on the TILE miR-30 (x-axis) and miR-E + UltramiR (y-axis) sets. Each line represents a setting of alpha; points on the line represent distinct theta values. The circle indicates the alpha and theta choices for the final sequential classifier (SplashRNA: α = 0.6, θ = 1.1). The dashed line represents the performance of the convex linear classifier without a threshold at every alpha. Note that the performance of a sequential classifier equals or exceeds that of a linear combination since one can set the threshold (θ) to a small enough value such that all examples are evaluated by both classifiers.
(b) Performance on the TILE set, varying the value for theta with alpha set to 0.6. The inset shows a zoom-in on the first 15% of the precision-recall curve.
(c) Performance on the miR-E + UltramiR set, varying the value for theta with alpha set to 0.6.
Supplementary Figure 4 Prediction performance of SplashRNA.
(a) Precision-recall curves on the TILE data set, comparing leave-one-gene-out nested cross-validation predictions from SplashRNA (auPR: 0.696) and SplashmiR-30 (auPR: 0.699) against the alternative prediction tools DSIR (auPR: 0.594), seqScore (auPR: 0.526) and miR_Scan (auPR: 0.449).
(b) Score distribution of the mRas + hRAS set (DSIR + Sensor rules selected). The green line indicates the threshold (Online Methods, Supplementary Table 1).
(c) Prediction performance comparison of the indicated algorithms on the external mRas + hRAS Sensor data set (Supplementary Table 1). SplashRNA outperformed the other algorithms.
(d) Score distributions of the miR-E and UltramiR data sets. For the miR-E set, the threshold was set to 80 (green line, Online Methods). The UltramiR set represents the distribution of log depletion scores of shRNAs tested in a cell-viability screen (Supplementary Table 1).
(e) SplashRNA and DSIR based re-ranking of shERWOOD selected UltramiR shRNAs targeting essential genes that were tested in a cell-viability screen. X-axis: mean SplashRNA or DSIR score for equally sized groups (purple and blue dots, 20 groups) of 39 shRNAs each. Y-axis: Percent of shRNAs in each group that were potent (Online Methods). SplashRNA and DSIR were compared against the published minimum (Min), median (Med) and maximum (Max) shERWOOD algorithm performance on the same data set (green-brown dots).
(f) Retrospective potency prediction of shRNAs from a large-scale essential genes RNAi screen. The biological screen used 20-25 miR-E-like shRNAs per gene to identify essential genes. shRNA potency was quantified by assessing their log fold changes (Online Methods). For each of the top 50 essential genes, all tested algorithms selected their top and bottom five sequences by prediction score. Log fold changes for all selected shRNAs across the 50 genes were compared. SplashRNA achieved the most significant discrimination between top and bottom predictions (p = 1.8e-11, one-sided Wilcoxon rank sum test). seqScore (p = 2.3e-5) was used to generate the initial library of approximately 25 shRNAs per gene.
(g) Retrospective potency prediction of shRNAs from a large-scale toxin resistance and sensitivity RNAi screen. The biological screen used 25 miR-E-like shRNAs per gene to identify resistance and sensitivity genes. shRNA potency was quantified by assessing their log fold changes (Online Methods). For each of the top 20 sensitivity genes, all tested algorithms selected their top and bottom five sequences by prediction score. Log fold changes for all selected shRNAs across the 20 genes were compared. SplashRNA was the only algorithm to achieve significant discrimination between the top and bottom predictions at p < 0.01 (p = 4.8e-4, one-sided Wilcoxon rank sum test). Of note, SplashRNA also outperformed the other algorithms when selecting smaller or larger numbers of top sensitivity genes from the biological screen (data not shown). seqScore was used to generate the initial library of approximately 25 shRNAs per gene.
Supplementary Figure 5 Transcript selection.
(a) Distribution of shRNA potency in functionally distinct transcript regions. Shown is the potency distribution of shRNAs in the unbiased TILE data set that target the 5’UTR, CDS or 3’UTR. Since these shRNAs were evaluated using the Sensor assay, their targets are not subject to alternative cleavage and polyadenylation (ApA) and/or splicing events.
(b) AU content of potent and weak miR-30 shRNAs from the unbiased TILE set. Potent shRNAs tend to have a higher proportion of A/U nucleotides (p < 2.2e-16, two-sided Kolmogorov-Smirnov test).
(c) AU content of functionally distinct transcript regions in the human genome. Shown are the AU densities in 5’UTR, CDS and 3’UTR.
(d) AU content in mouse transcripts.
(e) Alternative cleavage and polyadenylation (ApA) prevents potent shRNAs from inhibiting their putative target gene. Immunoblotting of Pten in NIH/3T3s transduced at single-copy with LEPG expressing the indicated shRNAs. Nine top predictions targeting the CDS or the 3’UTR after early ApA sites were compared alongside controls for their ability to suppress mouse Pten. Actb was used as loading control.
(f) Comparison of knockdown efficiency and annotation of ApA sites. Shown are potent Pten shRNA predictions and their position (start, end) on the mouse genome (mm9). KD indicates a qualitative degree of the knockdown observed in immunoblotting analyses of NIH/3T3s (e). ApA indicates previously published positions on the mouse genome (mm9) of ApA sites (alternative 3’ ends) identified in NIH/3T3 and mouse ES cells by 3P-Seq. 2P-Seq shows the quantification of transcript expression levels measured by 2P-Seq. All shRNAs and ApA sites are ordered according to their position along the mouse genome.
Supplementary Figure 6 Extensive validation of de novo SplashRNA predictions.
(a-f) Western blot validation of de novo SplashRNA predictions. All shRNAs were expressed using LEPG at single-copy conditions. β-Actin (Actb, ACTB) was used for normalization.
(a) Immunoblotting of Pbrm1 in NIH/3T3s (median KD: 97%, median SplashRNA score: 1.7).
(b) Immunoblotting of Rela in NIH/3T3s (median KD: 90%, median SplashRNA score: 1.1).
(c) Immunoblotting of Bcl2l11 in NIH/3T3s (median KD: 97%, median SplashRNA score: 0.7).
(d) Immunoblotting of Axin1 in NIH/3T3s (median KD: 95%, median SplashRNA score: 1.3).
(e) Schematic of the multiple human NF2 transcript variants. NF2 has nine variants with an intersection of only 198 nucleotides, excluding the 5’UTR, rendering the prediction task especially difficult due to limited sequence space.
(f) Predicting miR-E shRNAs for extremely short transcripts. Immunoblotting of NF2 in A375s transduced with the indicated shRNAs targeting all nine NF2 variants (median KD: 89%, median SplashRNA score: 0.6).
(g) Comparison of SplashRNA and DSIR predictions against CRISPR-Cas9 mediated suppression of Cd9 in mouse embryonic fibroblasts (MEFs). Shown are normalized (relative to the indicated controls) median anti-Cd9-APC fluorescence intensities of RRT-MEFs and CRT-MEFs expressing the indicated shRNAs or sgRNAs (Online Methods). The six top-scoring predictions from DSIR + Sensor rules (DSIR) or SplashRNA (ordered according to their respective scores) were compared to six sgRNA sequences (Supplementary Table 2). *, Cd9.1137 is the top prediction from both algorithms and was plotted twice for clarity. While DSIR predictions triggered Cd9 knockdown with variable efficacy, SplashRNA predictions consistently induced strong Cd9 suppression, closely approaching knockout conditions.
(h) Transfer function of SplashRNA score versus protein knockdown for all 62 de novo predicted shRNAs validated by immunofluorescence (Supplementary Table 2). Green triangles indicate the minimum knockdown for 80% of the predictions for a given SplashRNA score bin. Bins were defined to have a width of 0.5 with the leftmost bin starting at 0.25. For the bin centered on SplashRNA score = 1, 80% of predictions showed at least 86% protein knockdown. The expected knockdown for the top 80% of predictions (e.g. 4/5 shRNAs) increases with the SplashRNA score. Overall, 91% of predictions with a SplashRNA score >1 showed more than 85% protein knockdown.
(i) Uncropped images of Pten (Figure 2d) and Bap1 (Figure 2e) western blots, and their respective β-Actin controls. Pten predicted molecular weight (MW): 47 kDa; MW validated by Cell Signaling Technology: 54 kDa. Bap1 predicted MW: 80 kDa; MW validated by Bethyl Laboratories: 80-95 kDa. β-Actin MW validated by Sigma-Aldrich: 42 kDa.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–6 and Supplementary Table 1 (PDF 2720 kb)
Supplementary Table 2
Novel datasets and sequences of validated shRNAs (XLSX 4243 kb)
Supplementary Table 3
Genome-wide SplashRNA predictions for all human and mouse protein coding genes. (XLSX 25766 kb)
Supplementary Code
Source code that implements the main SplashRNA algorithm (ZIP 2201 kb)
Cite this article
Pelossof, R., Fairchild, L., Huang, CH. et al. Prediction of potent shRNAs with a sequential classification algorithm. Nat Biotechnol 35, 350–353 (2017). https://doi.org/10.1038/nbt.3807
DOI: https://doi.org/10.1038/nbt.3807