Main

Single-particle cryo-electron microscopy (cryo-EM) is a method capable of resolving high-resolution structures of proteins in near-native states. Cryo-EM projection images (micrographs) can contain hundreds or thousands of individual protein projections (particles). Given a sufficient number of particles, the three-dimensional (3D) structure of the protein can be determined1. However, owing to the low signal-to-noise ratio (SNR) of cryo-EM images, large numbers of observations are required for accurate reconstruction. Studies show a log–linear relationship between the number of particles included and the inverse resolution of the reconstruction2,3. The concentration of protein on EM grids, efficiency of data collection and completeness and accuracy of particle identification are factors determining the total number of particles available for downstream reconstruction and, hence, the achievable resolution. In particular, particle identification (particle picking) is a major bottleneck, often taking weeks or even months with current workflows for small or non-globular particles owing to variability in particle shapes and structured noise in micrographs.

A variety of methods have been developed for particle picking automation. The most widely used are difference of Gaussians (DoG) and template-based approaches4,5,6,7,8. However, these methods are unable to detect unusually shaped particles and suffer from high false-positive rates, necessitating post-picking curation. Most commonly, researchers use iterative 2D–3D classification and discard poor subsets by eye. These picking methods and downstream curation introduce bias into the final particle set, potentially removing rare particle views and conformations9,10,11. Newer methods based on convolutional neural networks (CNNs) have been proposed12,13,14. These use positive- and negative-labeled micrograph regions to train CNN classifiers, which then predict labels for the remaining regions. However, owing to factors like low SNR, structured background and the distribution of particle morphologies, researchers must label a large number of regions for training—a non-trivial and time-consuming task. Moreover, the diverse characteristics of negative data make it difficult to manually label a representative set of negative examples, and, hence, the number of labeled negatives must be an order of magnitude larger than the number of positives to achieve acceptable performance15. This has limited adoption by the cryo-EM community, and hand-labeling remains the gold standard.

To overcome the challenges inherent to current automatic particle-picking methods, we frame particle picking as a positive-unlabeled (PU) learning problem. In PU learning, we seek to learn a classifier of positives and negatives given a small number of labeled positive regions and the remaining unlabeled regions. This has proved to be an effective paradigm when working with partially labeled data in other domains (for example, document classification16, time series classification17 and anomaly detection18). Recent work has explored general-purpose PU learning for neural network models based on estimating the true positive–negative risk, but overfitting remains a challenge for PU learning19. Therefore, we instead approach PU learning as a constrained optimization problem in which we wish to find classifier parameters to minimize classification errors on the labeled data subject to a constraint on the expectation over the unlabeled data. By imposing this constraint softly with a novel generalized expectation (GE) criteria20, we are able to mitigate overfitting and train high-accuracy particle classifiers using very few labeled data points. Furthermore, by combining our PU learning method with autoencoder-based regularization, we can further reduce the amount of labeled data required for high performance.

Here we present Topaz, a pipeline for particle picking using CNNs with PU learning. Topaz retrieves many more particles than alternative methods, while maintaining a low false-positive rate. It substantially reduces the need for particle curation, removes systematic bias in particle picking introduced by conventional pickers and 2D–3D classification procedures and allows for robust and representative particle analysis and classification. Furthermore, Topaz is capable of reliably picking previously challenging particles (for example, small, non-globular and asymmetric particles) and avoiding aggregation, grid substrate and other background objects, while requiring minimal example particles.

We first demonstrate the capabilities of Topaz on a novel protein dataset for the Toll receptor—a ~105-kDa non-globular asymmetric particle. Despite aggregation and sparse labeling in the dataset, Topaz enables a 3.7-Å reconstruction and resolves secondary structures that could not be resolved with other methods. Topaz also decreases anisotropy by better detecting conventionally difficult particle views. Additionally, on three publicly available datasets, we find that by using Topaz with only 1,000 labeled training examples, we are able to retrieve many more real particles than were included in the published particle sets. This enables us to solve 3D structures of equal or greater quality to those found using the published particles. Remarkably, the Topaz results do not require any of the ad hoc postprocessing that is typically required for high-resolution structures; we feed Topaz particles directly into alignment and reconstruction. Finally, we compare our GE-based PU learning method against other off-the-shelf PU learning approaches and find that our method offers improvements over the current state-of-the-art methods when applied to training particle-detection models. Topaz was a critical component in determining the single particle behavior of an elongated clustered protocadherin21.

Topaz source code is freely available (https://github.com/tbepler/topaz) and can be installed through Anaconda, pip, Docker, Singularity and SBGrid22. Topaz is designed to be modular, has been integrated into Appion23, is being integrated into Relion24, cryoSPARC25, EMAN25, Scipion26 and Focus27, and can easily be integrated into other cryo-EM software suites in the future. Topaz runs efficiently on a single GPU computer and includes a standalone graphical user interface (GUI)28 to assist with particle labeling.

Results

Topaz pipeline

The Topaz particle-picking pipeline is composed of the following main steps (Fig. 1): (1) whole micrograph preprocessing with an optional mixture model newly designed to capture micrograph statistics (Methods; Supplementary Figs. 1, 2 and 3), (2) neural network classifier training with our PU learning framework, and (3) sliding window classification of micrographs and extraction of particle coordinates by non-maximum suppression.

Fig. 1: Topaz particle-picking pipeline using CNNs trained with positive and unlabeled data.
figure 1

a, Given a set of labeled particles, a CNN is trained to classify positive and negative regions using particle locations as positive regions and all other regions as unlabeled. Labeled particles from EMPIAR-10096 are indicated by blue circles and a few positive and unlabeled regions are depicted. b, Once the CNN classifier is trained, particles are predicted in two steps. First, the classifier is applied to each micrograph region to give per region predictions. Second, coordinates are extracted from the region predictions using non-maximum suppression. The left image shows a raw micrograph from EMPIAR-10096. The middle image depicts the micrograph with overlaid region predictions (blue indicates low confidence and red indicates high confidence). The right image indicates predicted particles after using non-maximum suppression on the region predictions.

Classifier training from positive and unlabeled data

We frame particle picking as a PU learning problem in which we seek to learn a classifier that discriminates between particle and non-particle micrograph regions given a small number of labeled particles and many unlabeled micrograph regions. CNN classifiers are trained using minibatched stochastic gradient descent with a novel objective function, GE-binomial (Methods), which explicitly models the sampling statistics of minibatches to regularize the posterior of the classifier over the unlabeled data. Combining this with an optional autoencoder module allows high-accuracy classifiers to be trained using very few positive examples. This approach allows us to overcome overfitting problems associated with recent PU learning methods developed for neural networks in domains other than cryo-EM analysis, and to effectively pick particles in challenging cryo-EM datasets.

Micrograph region classification and particle extraction

Given a trained CNN particle classifier, we extract predicted particle coordinates and their associated predicted probabilities. First, we calculate the per pixel predicted probabilities by applying the classifier to each micrograph region in a sliding window. Then, to extract coordinates from these dense predictions, we use the well-known non-maximum suppression algorithm which iteratively selects high-scoring pixels and removes their neighbors from consideration as particle centers. This yields a list of predicted particle coordinates and their associated model scores for each micrograph.
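
To make the inference step concrete, the sketch below scores micrograph regions with a trained classifier using an explicit sliding window. The window size and stride are placeholders, and Topaz itself applies the network convolutionally over the whole micrograph, which is equivalent but far faster.

```python
import numpy as np
import torch

def sliding_window_scores(classifier, micrograph, win=64, stride=8, device="cpu"):
    """Score micrograph regions with a trained region classifier (sketch).

    `micrograph` is a 2D float array and `classifier` is assumed to map a
    batch of crops (N x 1 x win x win) to one logit per crop. The window
    size and stride here are placeholders.
    """
    H, W = micrograph.shape
    ys = list(range(0, H - win + 1, stride))
    xs = list(range(0, W - win + 1, stride))
    scores = np.zeros((len(ys), len(xs)), dtype=np.float32)
    with torch.no_grad():
        for i, y in enumerate(ys):
            crops = np.stack([micrograph[y:y + win, x:x + win] for x in xs])
            batch = torch.from_numpy(crops).float().unsqueeze(1).to(device)
            scores[i] = classifier(batch).reshape(-1).cpu().numpy()
    # scores[i, j] is the logit for the window whose top-left corner is
    # (ys[i], xs[j]); non-maximum suppression is then run over this map
    return scores, ys, xs
```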

Topaz picks challenging particles and orientations

We explore the ability of Topaz to detect challenging particles of a small, asymmetric, non-globular and aggregated protein, a Toll receptor. To this end, we compared particles picked by Topaz (trained with 686 labeled particles) with particles picked using several other methods (DoG7 and template picking followed by 2D class averaging and manual filtering, and the CNN-based methods crYOLO29 and DeepPicker12) (Methods). The CNN-based methods were all trained following the software instructions with default settings and identical labeled particles.

After four rounds of 2D classification and filtering, DoG found 770,263 good particles from an initial stack of 1,599,638 and template picking found 627,533 good particles from an initial stack of 1,265,564. Using Topaz, after one round of 2D classification, we were left with 1,006,089 of an initial 1,010,937 particles, indicating that Topaz gives a remarkably low false-positive rate of only 0.5% on this data. We then compared the quality of the picked particles by taking each particle set through reconstruction (Fig. 2a–c). We found that particles picked using Topaz yield a structure with 0.731 sphericity at Fourier shell correlation (FSC)0.143 = 3.70 Å resolution, as compared to 0.706 sphericity at 3.92 Å for template-picked particles and 0.652 sphericity at 3.86 Å for particles picked using DoG. Furthermore, only the density map based on Topaz particles was of high enough quality to reliably resolve secondary structure (β-strands) and allow for model building. Other CNN-based picking methods, crYOLO and DeepPicker, were unable to find sufficient numbers of good particles for high-resolution reconstruction. crYOLO found 131,300 particles resulting in a 6.8 Å structure while DeepPicker failed to find any meaningful particles in this dataset (Supplementary Figs. 4–7).

Fig. 2: Reconstructions of the Toll receptor using particles picked by Topaz and DoG and template-based methods.
figure 2

Template and DoG particles were filtered through multiple rounds of 2D classification before analysis. Topaz particles were not filtered. a, Density map using particles picked with Topaz. The global resolution is 3.70 Å at FSC0.143 with a sphericity of 0.731. Scale bar, 5 nm. b, Density map using particles picked using template picking. The global resolution is 3.92 Å at FSC0.143 with a sphericity of 0.706. c, Density map using particles picked using DoG. The global resolution is 3.86 Å at FSC0.143 with a sphericity of 0.652. d, Quantification of picked particles for each protein view on the basis of 2D classification. e, Example micrograph (representative of >100 micrographs examined) showing Topaz picks (red circles) and protein aggregation (outlined in green).

We next quantified the ability of these methods to detect different particle views. The Toll receptor is strongly asymmetric and non-globular, thus it is important for picking methods to retrieve the full spectrum of view angles. By counting the number of particles assigned to each view in 2D class averages, we found that Topaz retrieved a much larger fraction of oblique, side and top views of the Toll receptor as compared to DoG and template-based methods (Fig. 2d). In addition, we note that these micrographs are challenging, containing junk and protein aggregation, yet Topaz is uniquely able to avoid these micrograph regions while picking only good particles (Fig. 2e and Supplementary Fig. 4).

Topaz enables high-resolution reconstruction with no postprocessing

We next evaluated the full Topaz particle-picking pipeline by generating reconstructions for three cryo-EM datasets containing T20S proteasome (EMPIAR-10025), 80S ribosome (EMPIAR-10028) and rabbit muscle aldolase (EMPIAR-10215). Each of these datasets already had a curated set of particles yielding high-quality reconstructions, which we compared with particles predicted by Topaz (trained with 1,000 positives on the basis of reconstruction quality; Methods). We standardized the reconstruction procedure by using cryoSPARC homogeneous refinement on the raw Topaz particle sets (that is, no postprocessing was applied) and published particle sets with identical settings for each dataset. By considering the reconstruction resolution at decreasing probability thresholds (increasing numbers of particles) predicted by Topaz, we selected the particle set that optimized the resolution for each dataset.

We found that Topaz was able to retrieve substantially more good particles than were present in the curated particle sets, finding 3.22, 1.72 and 3.68 times more particles in EMPIAR-10025, EMPIAR-10028 and EMPIAR-10215, respectively. Furthermore, reconstructions from the Topaz particle sets were of equal or higher quality as compared to those given by the curated particles (Fig. 3). Topaz maps reached roughly equivalent resolution to the published structures for the 80S ribosome and rabbit muscle aldolase while improving the resolution by ~0.15 Å over the published structures for the T20S proteasome. Remarkably, this was achieved using only 1,000 labeled examples and no filtering of the particle set (for example, particle filtering with 2D or 3D class averaging or iterative reconstructions removing poor particles). We note that even though these labeled training particles are extremely sparse, PU learning enabled Topaz to pick with high precision as seen in example micrographs (Supplementary Figs. 8, 9 and 10). We verified that the additional particles found by Topaz were good particles by performing reconstructions using only the newly picked particles and find nearly identical structures (Fig. 3). For aldolase, although Topaz found many more particles than were in the published dataset, the Topaz, curated and the Topaz minus curated particle sets achieved the same reconstruction resolution (2.63 Å at FSC0.143), suggesting that the ~200,000 particles in the published set is already sufficient to reach the resolution limit of the data given standard reconstruction methods.

Fig. 3: Single particle reconstructions from published particles, Topaz particles and Topaz particles with published particles removed.
figure 3

Published particles are on the left, Topaz particles in the middle, and Topaz particles with published particles removed are on the right. Below each reconstruction is the corresponding 3D FSC plot. a, T20S proteasome (EMPIAR-10025) using the provided aligned and dose-weighted micrographs. b, 80S ribosome (EMPIAR-10028). c, Rabbit muscle aldolase (EMPIAR-10215). Scale bars, 3 nm.

Topaz particle predictions are well-ranked and contain few false positives

We next quantified the quality of the particles predicted by Topaz over varying predicted probability thresholds by calculating the reconstruction resolution and estimating the number of false-positive particles on the basis of 2D class averaging. For each dataset, reconstructions were calculated using particles predicted by Topaz at decreasing probability cutoffs (Fig. 4a). The resolution of Topaz structures increased as we included more good particles and then dropped once the threshold became small and too many false positives were included, as demonstrated by the dip in resolution for the last threshold of EMPIAR-10025. Furthermore, we compared these curves with those obtained by randomly subsampling the published particle sets and found that Topaz particles quickly matched the resolution of the published particles for the proteasome and ribosome datasets. For the aldolase dataset, we saw that more Topaz particles were required to match and then exceed the resolution of the curated particle set. This could be because Topaz did not find enough side views of the particle until the probability was sufficiently lowered whereas the curated dataset had been filtered to be enriched for these views (Supplementary Fig. 11).

Fig. 4: Reconstruction resolution and 2D class averages for Topaz particles at decreasing log-likelihood ratio thresholds.
figure 4

a, Number of particles versus reconstruction resolution for Topaz particles (increasing number of particles corresponds to decreasing log-likelihood threshold) and randomly sampled subsets of the published particle set. Resolution is as reported by cryoSPARC. For the published particle sets the mean of three replicates is marked with s.d. shaded in gray. b, Stacked bar plots show the quantification of the number of true and false positives at each threshold on the basis of 2D class averages. The decreasing threshold corresponds to an increasing number of predicted particles. True and false positives are shown in blue and orange, respectively. c, 2D class averages obtained at each score threshold for the T20S proteasome (EMPIAR-10025). Number of particles (ptcls) and effective sample size (ess) for each class are reported by cryoSPARC. NaN is reported for classes without any particles assigned. Classes determined to be false positives are marked with orange boxes. Several classes which appear to be false positives at high score thresholds do not contain any particles and, therefore, are not highlighted.

We also classified the particle sets at each threshold into ten classes and manually examined the class averages to determine whether each class represented true particles or false positives. As expected, we found that as the probability threshold was decreased, the fraction of false positives increased (Fig. 4b), yet remained remarkably low even at relaxed thresholds. Furthermore, particles appear to be well-ranked, in that noisy or unusual particle classes only start to appear at low thresholds. For example, the T20S proteasome dataset was contaminated with gold particles, which appeared as dark spots in the micrographs. Particles in close proximity to gold were only selected when the probability threshold was decreased (Fig. 4). Similar trends can be observed in the ribosome (Supplementary Fig. 12) and aldolase (Supplementary Fig. 11) class averages. This can also be seen in the precision-recall curves for these datasets (Supplementary Fig. 13), in which Topaz maintains remarkably high precision even at high recall levels.

GE-criteria-based PU learning method outperforms other general-purpose PU learning approaches

Comparison of PU learning methods

We considered two GE-based approaches to PU learning, GE-KL and GE-binomial (Methods), and evaluated their effectiveness on two additional cryo-EM datasets by benchmarking against the recent non-negative risk estimator approach of Kiryo et al.19 (NNPU) and the naive approach in which unlabeled data are treated as negative for classifier training (hereafter referred to as PN). Using separate datasets keeps the development of our PU learning methods independent of the full Topaz evaluation above. The first dataset, EMPIAR-10096, is a publicly available dataset containing influenza hemagglutinin trimer particles, and the second, EMPIAR-10234 (clustered protocadherin), is a challenging dataset provided by the Shapiro lab containing a stick-like particle with low SNR (Supplementary Fig. 14). For purposes of comparison, we simulated positively labeled datasets of varying sizes by randomly subsampling the set of all positive examples within the training set of each dataset.

We found that across all experiments, classifiers trained with our GE-criteria-based objective functions dramatically outperformed those trained with the NNPU or PN methods. Generally, GE-binomial and GE-KL classifiers displayed similar performance with a few important exceptions where GE-binomial gave better results. For the dataset with more compact particles, EMPIAR-10096, GE-binomial gave significantly (P < 0.05, Student’s paired t test) better test set average-precision scores than GE-KL when the number of data points was tiny (ten positive examples; Fig. 5a). At larger numbers of positives, both methods were statistically equivalent. On the challenging EMPIAR-10234 dataset, GE-binomial significantly outperformed GE-KL at 1,000 labeled examples (P < 0.05) whereas GE-KL gave better results (P < 0.05) within the 50–250 range of labeled examples. These results indicate that our GE-based PU learning approaches dramatically outperform previous PU learning methods, enabling particle picking despite few labeled positives on the challenging EMPIAR-10234 dataset and substantially improving picking quality on the easier EMPIAR-10096 dataset. Although GE-binomial and GE-KL performed similarly in this experiment, we did find that GE-binomial outperformed GE-KL in the two important cases of ten easy particles and 1,000 difficult particles.

Fig. 5: Comparison of models trained using different objective functions with varying numbers of labeled positives on the EMPIAR-10096 and EMPIAR-10234 datasets.
figure 5

a, Mean ± s.d. of the average-precision score for predicting positive regions in the EMPIAR-10096 and EMPIAR-10234 test set micrographs for models trained using the naive PN, NNPU, GE-KL or GE-binomial objective functions. Each number of labeled positives was sampled ten times independently. Asterisks indicate experiments in which GE-binomial achieved higher average-precision than GE-KL with P < 0.05. Daggers indicate experiments in which GE-KL achieved higher average-precision than GE-binomial with P < 0.05 according to a two-sided dependent t test. b, Mean ± s.d. of the average-precision score for models trained jointly with autoencoders with different reconstruction loss weights (γ). γ = 0 corresponds to training the classifier without the autoencoder. γ = 10/N means the reconstruction loss is weighted by ten divided by the number of labeled positives used to train the model.

Augmentation with autoencoders

We next considered whether classifier performance could be improved when few labeled data points are available by introducing a generator network with a corresponding reconstruction error term in the objective to form a hybrid classifier with autoencoder network (Methods). We hypothesized that including this reconstruction component would improve the generalizability of the classifier in this regime by requiring that the feature vectors given by the encoder network be descriptive of the input, thereby acting as a form of regularization.

We evaluated this hypothesis by training classifiers on the EMPIAR-10096 and EMPIAR-10234 datasets with different settings of the autoencoder weight, γ, and varying numbers of labeled data points, N (Methods). We found that including the decoder network with a reconstruction error term in the objective (γ = 1 and \(\gamma = \frac{10}{N}\)) improved classifier performance in the regime with few labeled data points (Fig. 5b). As the number of data points increased, the benefit of using the autoencoder decreased and eventually hurt classifier performance owing to over-regularization. Our results from both datasets suggest that using the autoencoder with \(\gamma = \frac{10}{N}\) gives the best results when N ≤ 250.

Discussion

Since our work originally appeared in RECOMB 201830 and as an arXiv preprint, other works have followed on bioRxiv that propose alternative CNN-based particle-picking methods29,31. However, these methods follow the supervised learning paradigm (that is, some variant of PN learning) and are limited by the associated assumptions. In the future, it may also be possible to provide particle-detection models pretrained on many publicly available datasets; however, we note that fully labeled ground-truth datasets are presently unavailable and that these models are unlikely to generalize to new datasets with conventionally difficult particles, which we focus on here. While it may seem difficult to provide labeled data upfront, in practice we find that explicitly relaxing the requirement to completely label micrographs eases this burden, which is a major advantage of Topaz over other CNN-based methods. Users may also ‘bootstrap’ the labeling procedure using existing picking and curation methods, while remaining cautious against reintroducing bias. We note that there may be some difference between particles randomly sampled from a curated particle set and particles that would be labeled by a user. However, the Toll receptor and clustered protocadherin training sets were both provided by hand labeling and demonstrate that labeling a small representative set of particles is easily achievable even for conventionally difficult datasets.

Although we use a simple CNN architecture with reasonable default hyperparameters, and show that it performs well on these datasets, any model architecture that can be trained with gradient descent can use our GE criteria objective functions to learn from positive and unlabeled data. Furthermore, additional hyperparameter tuning, such as L2 or dropout regularization, can improve model performance. The only hyperparameters introduced by our objective function are the unknown positive class prior, π, and the constraint strength, λ. Although the positive class prior could also be chosen by cross validation, we observed that our results were relatively insensitive to its choice (Supplementary Fig. 15). Furthermore, we do not find that λ needs to be changed from the default setting. Our proposed GE-binomial PU learning method could also have widespread utility for object detection in other domains where positive labels are frequently incomplete, for example, in light microscopy or medical imaging. Additionally, although we proposed GE-binomial for positive-unlabeled learning, it is straightforward to extend it to the typical semisupervised case (where some labeled negative regions are provided) by taking the expectation of the loss over all labeled data in the first term.

Topaz particle probability thresholding allows particles to be included iteratively until the reconstruction resolution stops improving. In the future, reconstruction algorithms could explicitly take these probabilities into account when determining 3D structures.

Topaz requires researchers to label very few particles to achieve high quality predictions. It performs well independently of particle shape, opening automated picking to a wide selection of proteins previously too difficult to locate computationally. In addition, our pipeline is computationally efficient—training in a few hours on a single GPU and producing predictions for hundreds of micrographs in only minutes. Furthermore, once a model is trained for a specific particle, it can be applied to new imaging runs of the same particle. Topaz greatly expedites structure determination by cryo-EM, enabling particle picking for previously difficult datasets, reducing the manual effort required to achieve high-resolution structures, and thus increasing the efficiency of cryo-EM workflows and the completeness of particle analytics.

Methods

Dataset description

Aligned and summed micrographs and star files containing published particle sets were retrieved from the Electron Microscopy Public Image Archive (EMPIAR) for datasets EMPIAR-10025 (ref. 32), EMPIAR-10028 (ref. 33) and EMPIAR-10096 (ref. 34). Aligned and summed micrographs and hand-labeled particle coordinates were provided by the Shapiro lab for the EMPIAR-10234 dataset. Aligned and summed micrographs and a curated in-house particle set were provided by the New York Structural Biology Center for the EMPIAR-10215 dataset. Micrographs for each dataset were downsampled to the resolution specified in Table 1 and normalized as described in the following section. Each dataset was then split into training and test sets at the micrograph level. The number of micrographs and labeled particles in each split are also reported in Table 1. To demonstrate the utility of our Gaussian mixture model (GMM) normalization method, we also retrieved micrographs for EMPIAR-10261 (ref. 35) from EMPIAR.

Table 1 Summary of cryo-EM datasets and hyperparameters used for classifier training on each; each dataset was downsampled and split into training and test sets at the whole micrograph level

Micrograph normalization

Images were normalized using a per-image scaled two-component GMM. Given K images, each pixel is modeled as being drawn from a two-component GMM parameterized by ρ, the mixing parameter, and μ0, σ0, μ1 and σ1, the means and standard deviations of the Gaussian distributions, with a scalar multiplier for each image, α1…K. Let xi,j,k be the value of the pixel at position i,j in image k; it is distributed according to

$$z_{i,j,k} \sim {\mathrm{Bernoulli}}\left( \rho \right)$$
$$x_{i,j,k}|z_{i,j,k} \sim {\mathrm{Gaussian}}\left( {\alpha _k\mu _{z_{i,j,k}},\left( {\alpha _k\sigma _{z_{i,j,k}}} \right)^2} \right)$$

where \(z_{i,j,k}\) is a random variable denoting the component membership of the pixel. The maximum likelihood values of the parameters ρ, μ0, σ0, μ1, σ1 and α1…K are found by expectation-maximization for each dataset. Then, the pixels are normalized by first dividing by the image scaling factor and then standardizing to the dominant mixture component. Let μ′,σ′ be μ0,σ0 if ρ < 0.5 and μ1,σ1 otherwise, then the normalized pixel values \(x^{\prime}_{i,j,k}\) are given by

$$x^{\prime}_{i,j,k} = \frac{{\frac{{x_{i,j,k}}}{{\alpha _k}} - \mu ^{\prime} }}{{\sigma ^{\prime} }}$$

This normalization compares favorably with standard affine normalization of micrographs (Supplementary Figs. 1, 2 and 3). In affine normalization, micrographs are transformed by subtracting the mean and dividing by the s.d. of all pixel values in each micrograph.
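
For illustration, the following sketch fits a simplified version of this model by standard expectation-maximization and then standardizes pixels to the dominant component. The per-image scale factors αk are omitted here for brevity (equivalently, all αk = 1); Topaz's own implementation estimates them jointly as described above.

```python
import numpy as np
from scipy.stats import norm

def fit_two_component_gmm(x, n_iter=50):
    """Fit a two-component 1D Gaussian mixture to pixel values by EM (sketch).

    Simplified: the per-image scale factors alpha_k of the full model are
    omitted. x is a flat array of pixel values pooled over the dataset.
    """
    rho = 0.5                                     # mixing parameter for component 1
    mu = np.quantile(x, [0.25, 0.75])             # crude initialization
    sigma = np.array([x.std(), x.std()])
    for _ in range(n_iter):
        # E-step: posterior probability that each pixel came from component 1
        p0 = (1 - rho) * norm.pdf(x, mu[0], sigma[0])
        p1 = rho * norm.pdf(x, mu[1], sigma[1])
        r = p1 / (p0 + p1)
        # M-step: update mixing weight, means and standard deviations
        rho = r.mean()
        w = np.stack([1 - r, r])                  # 2 x n responsibilities
        mu = (w * x).sum(1) / w.sum(1)
        sigma = np.sqrt((w * (x - mu[:, None]) ** 2).sum(1) / w.sum(1))
    return rho, mu, sigma

def normalize_micrograph(mic, rho, mu, sigma):
    """Standardize pixels to the dominant mixture component."""
    c = 0 if rho < 0.5 else 1                     # dominant component
    return (mic - mu[c]) / sigma[c]
```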

PU learning baselines

Let P be the set of labeled positive micrograph regions (centered on a particle), and U be the set of unlabeled micrograph regions where π is the fraction of positive examples within U. Then, the task is to learn a classifier (g) that discriminates between positive and negative regions given P and U. When π is small, treating the unlabeled examples as negatives for the purposes of classifier training with the following standard loss minimization objective, given cost function L, can be effective (referred to as PN)

$$\pi E_{x \sim P}\left[ {L\left( {g\left( x \right),1} \right)} \right] + \left( {1 - \pi } \right)E_{x \sim U}\left[ {L\left( {g\left( x \right),0} \right)} \right]$$

However, in general, this approach suffers from overfitting owing to poor specification of the classification objective—it is minimized when positives are perfectly separated from unlabeled data points. To address this, Kiryo et al.19 recently proposed an unbiased estimator of the true positive–negative classification objective for positive and unlabeled data with known π, together with a non-negative estimator (NNPU), which is shown to reduce the overfitting still present in the unbiased estimator.
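
As a point of reference for the methods below, the displayed PN objective can be written directly as a minibatch loss. The sketch below assumes pre-sigmoid classifier outputs (logits) for the labeled-positive and unlabeled regions.

```python
import torch
import torch.nn.functional as F

def pn_loss(logits_labeled, logits_unlabeled, pi):
    """Naive PN objective: every unlabeled region is treated as a negative.

    logits_labeled and logits_unlabeled are pre-sigmoid classifier outputs
    for the labeled-positive and unlabeled regions in a minibatch.
    """
    pos = F.binary_cross_entropy_with_logits(
        logits_labeled, torch.ones_like(logits_labeled))
    neg = F.binary_cross_entropy_with_logits(
        logits_unlabeled, torch.zeros_like(logits_unlabeled))
    return pi * pos + (1 - pi) * neg
```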

PU learning with generalized expectation criteria

Here we adopt an alternative approach to positive-unlabeled learning that is not based on estimating the PN misclassification risk. Instead, we observe that unlabeled data with known π can be used to constrain a classifier such that it minimizes the classification loss on the labeled data and matches the expectation (π) over the unlabeled data. In other words, we wish to find the classifier, g, that minimizes Ex~P[L(g(x),1)] subject to the constraint Ex~U[g(x)] = π. This constraint can be imposed ‘softly’ through a regularization term in the objective function with weight λ (referred to as GE-KL)

$$E_{x \sim P}\left[ {L\left( {g\left( x \right),1} \right)} \right] + \lambda KL\left( {E_{x \sim U}\left[ {g\left( x \right)} \right]||\pi } \right)$$
(1)

In this objective function, we impose a constraint through the KL-divergence between the expectation of the classifier over the unlabeled data and the known fraction of positives, which is minimized when these terms are equal. This approach is an instance of a general class of posterior regularization called GE criteria, as specifically proposed by Mann and McCallum20. However, because we wish for our classifier to be a neural network and to optimize the objective using minibatched stochastic gradient descent, the gradient of the objective must be approximated using samples from the data. Estimates of the gradient of the GE-KL objective from samples are biased, which could cause stochastic gradient descent to find a suboptimal solution.
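
A minimal sketch of this objective for a single minibatch is given below. It interprets the KL term as the divergence between Bernoulli distributions with parameters E_U[g(x)] and π, and the default λ = 10 follows the setting used in our benchmarking (Methods); implementation details may differ from Topaz's released code.

```python
import torch
import torch.nn.functional as F

def ge_kl_loss(logits_labeled, logits_unlabeled, pi, lam=10.0, eps=1e-6):
    """GE-KL objective (equation (1)) for one minibatch (sketch).

    The constraint E_U[g(x)] = pi is imposed softly through the KL divergence
    between Bernoulli distributions with parameters E_U[g(x)] and pi. The
    minibatch estimate of E_U[g(x)] makes this gradient biased, which
    motivates GE-binomial below.
    """
    pos = F.binary_cross_entropy_with_logits(
        logits_labeled, torch.ones_like(logits_labeled))
    q = torch.sigmoid(logits_unlabeled).mean().clamp(eps, 1 - eps)
    kl = q * torch.log(q / pi) + (1 - q) * torch.log((1 - q) / (1 - pi))
    return pos + lam * kl
```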

To address this issue, we propose an alternative GE criteria, GE-binomial, defined so as to minimize the difference between the distribution over the number of positives in the minibatch and the binomial distribution parameterized by π. The number of positive data points, k, in a minibatch of N samples from U follows the binomial distribution with parameter π. Furthermore, the classifier g also describes a distribution over the number of positives in the minibatch as

$$q\left( k \right) = \mathop {\sum }\limits_{\textbf{y} \in Y\left( k \right)} \mathop {\prod }\limits_{i = 1}^N g\left( {x_i} \right)^{y_i}\left( {1 - g\left( {x_i} \right)} \right)^{1 - y_i}$$

where x is a micrograph region, y is an indicator vector (\(y_i \in \left\{ {0,1} \right\}\)) denoting which data points are positive (yi = 1) or negative (yi = 0) and Y(k) is the set of all such vectors summing to k. This allows us to define the new GE criteria as the cross entropy between these two distributions, \(- \mathop {\sum }\limits_{k = 1}^N q\left( k \right)\log p\left( k \right)\), where p(k) is the binomial probability mass function with parameters N and π, giving the full GE-binomial objective function

$$E_{x \sim P}\left[ {L\left( {g\left( x \right),1} \right)} \right] - \lambda \mathop {\sum }\limits_{k = 1}^N q\left( k \right)\log p\left( k \right)$$
(2)

In practice, because computing exact q(k) is slow, we make a Gaussian approximation with mean \(\mathop {\sum }\limits_{i = 1}^N g\left( {x_i} \right)\) and variance \(\mathop {\sum }\limits_{i = 1}^N g\left( {x_i} \right)\left( {1 - g\left( {x_i} \right)} \right)\) and substitute the Gaussian probability density function with these parameters for q in the above equation.
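
The following PyTorch sketch implements this criterion with the Gaussian approximation. Discretizing and renormalizing the approximation over k = 0, …, N is a simplification made here, and details may differ from the Topaz implementation.

```python
import math

import torch
import torch.nn.functional as F

def ge_binomial_penalty(logits_unlabeled, pi, eps=1e-8):
    """GE-binomial criterion for one unlabeled minibatch (sketch).

    q(k), the classifier's distribution over the number of positives in the
    minibatch, is approximated by a Gaussian with mean sum(g) and variance
    sum(g * (1 - g)); p(k) is the Binomial(N, pi) prior, where pi is the
    class prior (not math.pi). The penalty is the cross entropy between q
    and p. Renormalizing the discretized Gaussian over k = 0..N is an
    implementation choice made here for simplicity.
    """
    g = torch.sigmoid(logits_unlabeled).flatten()
    n = g.numel()
    mean = g.sum()
    var = (g * (1 - g)).sum().clamp_min(eps)

    k = torch.arange(n + 1, dtype=g.dtype, device=g.device)
    # discretized Gaussian approximation to q(k)
    log_q = -0.5 * (k - mean) ** 2 / var - 0.5 * torch.log(2 * math.pi * var)
    q = torch.softmax(log_q, dim=0)
    # log p(k) for the Binomial(N, pi) prior
    log_p = (math.lgamma(n + 1) - torch.lgamma(k + 1) - torch.lgamma(n - k + 1)
             + k * math.log(pi) + (n - k) * math.log(1 - pi))
    return -(q * log_p).sum()  # cross entropy between q and p

def ge_binomial_loss(logits_labeled, logits_unlabeled, pi, lam=1.0):
    """Full GE-binomial objective (equation (2)) for one minibatch."""
    pos = F.binary_cross_entropy_with_logits(
        logits_labeled, torch.ones_like(logits_labeled))
    return pos + lam * ge_binomial_penalty(logits_unlabeled, pi)
```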

Autoencoder-based classifier regularization

When including the autoencoder component, we break our classifier network into the following two components: an encoder network composed of all layers except the final linear layer, and the linear classifier layer. We denote these networks as f and c, respectively, with the full network, g, being given by g(x) = c(f(x)). Furthermore, we introduce a deconvolutional (also called transposed convolutional; see next section) decoder network, d, which takes the output of the feature extractor network and returns a reconstruction of the input image, x′ = d(f(x)). The objective function is then modified to include a term penalizing the expected reconstruction error over all images in the dataset, D, with weight γ

$$E_{x \sim P}\left[ {L\left( {c\left( {f\left( x \right)} \right),1} \right)} \right] - \lambda \mathop {\sum }\limits_{k = 1}^N q\left( k \right)\log p\left( k \right) + \gamma E_{x \sim D}\left[ {||x - d\left( {f\left( x \right)} \right)||_2^2} \right]$$

This forms the full GE-binomial objective function with autoencoder component used in Topaz.
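
Putting the pieces together, a sketch of this combined objective for one minibatch is shown below. It reuses the ge_binomial_penalty sketched above, takes the reconstruction term over the current minibatch rather than the full dataset D, and assumes the decoder output matches the input region size.

```python
import torch
import torch.nn.functional as F

def topaz_loss(encoder, classifier_head, decoder, x_labeled, x_unlabeled,
               pi, lam=1.0, gamma=1.0):
    """GE-binomial objective with autoencoder regularization (sketch).

    encoder, classifier_head and decoder correspond to f, c and d in the
    text; ge_binomial_penalty is the function sketched above. The
    reconstruction error is computed over the current minibatch as an
    approximation to the expectation over the full dataset D.
    """
    z_lab = encoder(x_labeled)
    z_unl = encoder(x_unlabeled)
    logits_lab = classifier_head(z_lab)
    logits_unl = classifier_head(z_unl)

    pos = F.binary_cross_entropy_with_logits(
        logits_lab, torch.ones_like(logits_lab))
    ge = ge_binomial_penalty(logits_unl, pi)

    # ||x - d(f(x))||^2, assuming the decoder output has the input's shape
    x_all = torch.cat([x_labeled, x_unlabeled])
    recon = decoder(torch.cat([z_lab, z_unl]))
    ae = F.mse_loss(recon, x_all)

    return pos + lam * ge + gamma * ae
```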

Classifier and autoencoder architectures and hyperparameters

We use a simple three-layer convolutional neural network with striding, batch normalization36 and parametric rectified linear units (PReLU) as the classifier in this work. The model is organized as 32 conv7×7 filters with batch normalization and PReLU, stride by 2, 64 conv5×5 filters with batch normalization and PReLU, stride by 2, 128 conv5×5 filters with batch normalization and PReLU, and a final fully connected layer with a single output. A sigmoid activation converts this output into the predicted probability of a region belonging to the positive class (that is, the raw output before the sigmoid is interpreted as the log-likelihood ratio between the positive and negative classes).
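
A PyTorch sketch of this architecture is shown below. The placement of the stride-2 operations and the pooling before the final fully connected layer are assumptions and may differ from the models shipped with Topaz.

```python
import torch.nn as nn

class RegionClassifier(nn.Module):
    """Three-layer CNN region classifier described above (sketch).

    The striding is implemented here as stride-2 in the second and third
    convolutions, and spatial dimensions are collapsed by adaptive average
    pooling before the final linear layer; both are implementation choices
    that may differ from Topaz's released models.
    """
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=7), nn.BatchNorm2d(32), nn.PReLU(32),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.BatchNorm2d(64), nn.PReLU(64),
            nn.Conv2d(64, 128, kernel_size=5, stride=2), nn.BatchNorm2d(128), nn.PReLU(128),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)   # collapse remaining spatial dims
        self.classifier = nn.Linear(128, 1)   # single logit output

    def forward(self, x):
        h = self.pool(self.features(x)).flatten(1)
        return self.classifier(h)             # logit; apply sigmoid for P(particle)
```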

When augmenting with an autoencoder, we use a decoder structure similar to that of DCGAN37. The d-dimensional representation that is output by the final convolutional layer of the classifier network is projected to a representation with small spatial dimensions but large feature dimensions. This is repeatedly projected into representations with larger spatial dimensions and smaller feature dimensions until the final output is of the original input image size. Specifically, this model is structured as repeated transpose convolutions with batch normalization and leaky ReLU activations. Let z be the representation output by the final convolutional layer of the classifier and X′ be the image reconstruction given by the decoder, the decoder structure is z → transpose conv4×4 128-d, batch normalization, leaky ReLU → transpose conv4×4 64-d, stride 2, batch normalization, leaky ReLU → transpose conv4×4 32-d, stride 2, batch normalization, leaky ReLU → transpose conv3×3 1-d, stride 2 → X′.
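
A corresponding sketch of the decoder is given below. The input channel count (128, matching the classifier's final feature dimension), padding and negative-slope choices are assumptions, and the output may require cropping or interpolation to exactly match the input region size.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """DCGAN-style decoder used for autoencoder regularization (sketch).

    Follows the layer list above; channel counts and padding are assumptions
    and the output may need cropping to match the input region size exactly.
    """
    def __init__(self, in_channels=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(in_channels, 128, kernel_size=4),
            nn.BatchNorm2d(128), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2),
            nn.BatchNorm2d(64), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2),
            nn.BatchNorm2d(32), nn.LeakyReLU(0.2, inplace=True),
            nn.ConvTranspose2d(32, 1, kernel_size=3, stride=2),
        )

    def forward(self, z):
        return self.net(z)   # reconstructed region x'
```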

PU learning benchmarking

To compare classifiers trained with the different objective functions, we simulated hand-labeling with various amounts of effort by randomly sampling varying numbers of particles from the training sets to treat as the positive examples. All other particles were considered unlabeled. We used cross entropy loss for the labeled particles. The values of π used for training are specified in Table 1. For GE-KL we set the GE criteria weight, λ, to 10 as recommended by Mann and McCallum20. For GE-binomial, we set this parameter to 1. The classifier was then trained with those positives and evaluated by average-precision score (see next section for description of classifier evaluation) on the test set micrographs. This was repeated with ten independent samples of particles for each number of positives. Statistical significance of performance differences between methods at each number of labeled positive examples was assessed using a two-sided t test.
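
For concreteness, a sketch of this training procedure is shown below, assuming the RegionClassifier and ge_binomial_loss sketched above and PyTorch DataLoaders that yield labeled-positive and unlabeled region crops; the learning rate and epoch count are placeholders rather than the settings used here.

```python
import itertools

import torch

def train_pu_classifier(model, labeled_loader, unlabeled_loader, pi,
                        lam=1.0, lr=1e-4, epochs=10, device="cpu"):
    """Train a region classifier with the GE-binomial objective (sketch).

    labeled_loader yields minibatches of labeled-positive crops and
    unlabeled_loader yields minibatches of unlabeled crops; pi is the class
    prior listed in Table 1 for the dataset. lr and epochs are placeholders.
    """
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        # cycle the small labeled set so that every unlabeled minibatch is
        # paired with a minibatch of labeled positives
        for x_unl, x_pos in zip(unlabeled_loader, itertools.cycle(labeled_loader)):
            loss = ge_binomial_loss(model(x_pos.to(device)),
                                    model(x_unl.to(device)), pi, lam=lam)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```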

We also evaluated classifiers trained with autoencoder components and input reconstruction weight, γ, and varying numbers of labeled data points, N. We compared models trained with γ = 0 (no autoencoder), γ = 1 and \(\gamma = \frac{10}{N}\). For each setting of γ and N, we trained ten models with different sets of N randomly sampled positives and calculated the average-precision score for each model on the test split of each dataset.

Classifier evaluation

Classifiers were evaluated by average-precision score. This score is a measure of how well ranked the micrograph regions were when ordered by the predicted probability of containing a particle, and corresponds to the area under the precision-recall curve. It is calculated as the sum over the ranked micrograph regions of the precision at k elements times the change in recall

$$\mathop {\sum }\limits_{k = 1}^n {\mathrm{Pr}}\left( k \right)\left( {{\mathrm{Re}}\left( k \right) - {\mathrm{Re}}\left( {k - 1} \right)} \right)$$

where precision (Pr) is the fraction of predictions that are correct and recall (Re) is the fraction of labeled particles that are retrieved in the top k predictions. Let TP(k) be the number of true positives in the top k predictions, then Pr and Re are given by

$$\begin{array}{rcl}{\mathrm{TP}}\left( k \right) &=& \mathop {\sum }\limits_{i = 1}^k y_i\\{\mathrm{Pr}}\left( k \right) &=& \frac{{{\mathrm{TP}}\left( k \right)}}{k}\\{\mathrm{Re}}\left( k \right) &=& \frac{{{\mathrm{TP}}\left( k \right)}}{{\mathop {\sum }\nolimits_{i = 1}^n y_i}}\end{array}$$

This measure is commonly used in information retrieval.
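
A direct implementation of this score is sketched below; it assumes a binary label per micrograph region and the corresponding classifier scores.

```python
import numpy as np

def average_precision(y_true, scores):
    """Average-precision score as defined above (sketch).

    y_true is a binary array marking which regions contain a labeled particle
    and scores are the classifier outputs used to rank the regions.
    """
    order = np.argsort(-np.asarray(scores))     # rank regions by score
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                           # TP(k) for k = 1..n
    k = np.arange(1, len(y) + 1)
    precision = tp / k
    recall = tp / max(y.sum(), 1)
    delta_recall = np.diff(np.concatenate(([0.0], recall)))
    return float((precision * delta_recall).sum())
```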

Non-maximum suppression algorithm for extracting particle coordinates

Non-maximum suppression greedily selects coordinates, and their corresponding predicted particle probabilities, starting from the highest-scoring region. To prevent nearby pixels from also being considered particle candidates, all pixels within a second user-defined radius of a selected coordinate are excluded from further consideration. We set this radius to half the major-axis length of the particle; however, smaller radii may give better results for closely packed, irregularly shaped particles.
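
A straightforward (unoptimized) implementation of this procedure is sketched below; Topaz's implementation is more efficient, but the logic is the same.

```python
import numpy as np

def non_maximum_suppression(scores, radius):
    """Greedy NMS over a per-pixel score map, as described above (sketch).

    scores is a 2D array of predicted scores (log-likelihood ratios) and
    radius is the exclusion radius in pixels. Returns (y, x, score) triples
    ordered from highest to lowest score.
    """
    radius = int(round(radius))
    suppressed = np.zeros(scores.shape, dtype=bool)
    order = np.argsort(scores, axis=None)[::-1]          # highest score first
    coords = np.column_stack(np.unravel_index(order, scores.shape))
    picks = []
    for y, x in coords:
        if suppressed[y, x]:
            continue
        picks.append((int(y), int(x), float(scores[y, x])))
        # exclude every pixel within `radius` of the selected coordinate
        y0, y1 = max(0, y - radius), min(scores.shape[0], y + radius + 1)
        x0, x1 = max(0, x - radius), min(scores.shape[1], x + radius + 1)
        yy, xx = np.ogrid[y0:y1, x0:x1]
        suppressed[y0:y1, x0:x1] |= (yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2
    return picks
```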

Micrograph preprocessing

For EMPIAR-10025 and EMPIAR-10096, the aligned and summed micrographs along with contrast transfer function (CTF) estimates were taken directly from the public data release on EMPIAR. For EMPIAR-10028 and EMPIAR-10261, frames were aligned and summed without dose compensation using MotionCor238. Whole micrograph CTF estimates provided with the public release were used for this dataset.

For the clustered protocadherin dataset (EMPIAR-10234), single particle micrographs were collected on a Titan Krios electron microscope (Thermo Fisher Scientific) equipped with a K2 counting camera (Gatan); the microscope was operated at 300 kV with a calibrated pixel size of 1.061 Å. Ten-second exposures were collected (40 frames per micrograph) for a total dose of 68 e Å−2 with a defocus range of 1–4 µm. A total of 896 micrographs were collected using Leginon39. Frames were aligned using MotionCor238. A total of 1,540 particles were picked manually using Appion Manual Picker23 from 87 micrographs and used as a training dataset for Topaz.

The rabbit muscle aldolase dataset (EMPIAR-10215) was collected on a Titan Krios electron microscope (Thermo Fisher Scientific) equipped with a K2 counting camera (Gatan) in super-resolution mode; the microscope was operated at 300 kV with a calibrated super-resolution pixel size of 0.416 Å. Six-second exposures were collected (30 frames per micrograph) for a total dose of 70.32 e Å−2 with a defocus range of 1–2 µm. A total of 1,052 micrographs were collected using Leginon39. Frames were aligned, Fourier binned by a factor of 2 and dose compensated using MotionCor238. Whole-image CTF estimation was performed using CTFFIND440.

The Toll receptor dataset was collected on a Titan Krios electron microscope (Thermo Fisher Scientific) equipped with a K2 counting camera (Gatan); the microscope was operated at 300 kV with a calibrated pixel size of 0.832 Å. Six-second exposures were collected (40 frames per micrograph) for a total dose of 73.48 e Å−2 with a defocus range of 1.5–2.0 µm. A total of 9,323 micrographs were collected using Leginon. Frames were aligned using MotionCor238. Whole-image CTF estimation was performed using CTFFIND440.

3D reconstruction procedure

Reconstruction was performed using cryoSPARC25. For each particle set, we first generated an ab initio structure with a single class. These structures were then refined using the homogeneous refinement option of cryoSPARC with symmetry specified depending on the dataset (T20S proteasome, D7; 80S ribosome, C1; and aldolase, D2). For the aldolase dataset, we used C2 symmetry for ab initio structure determination. Otherwise, all other parameters were left in the default setting. When evaluating the quality of Topaz particle sets for decreasing score thresholds, each particle set was selected by taking all particles predicted by the Topaz model with scores greater than or equal to the given threshold. Reconstructions were calculated for each of these sets independently as described above.

Removal of overlapping particles

To evaluate the quality of the extra particles predicted by Topaz, we removed particles from the Topaz particle set that were also included in the published particle set. This was done by removing all Topaz particles with centers within the particle radius of a particle center in the published particle set.
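
A sketch of this filtering step for a single micrograph is shown below, using a k-d tree for the nearest-neighbor search; the coordinate format and units are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def remove_published_overlaps(topaz_coords, published_coords, radius):
    """Drop Topaz picks whose centers fall within `radius` of a published pick.

    Both inputs are (N, 2) arrays of particle-center coordinates for one
    micrograph; radius is the particle radius in pixels.
    """
    topaz_coords = np.asarray(topaz_coords)
    published_coords = np.asarray(published_coords)
    if len(published_coords) == 0:
        return topaz_coords
    tree = cKDTree(published_coords)
    dists, _ = tree.query(topaz_coords, k=1)   # distance to nearest published pick
    return topaz_coords[dists > radius]
```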

2D class averages (EMPIAR-10025, EMPIAR-10028 and EMPIAR-10215)

Class averages were calculated using the cryoSPARC 2D classification option. All settings were left as default except the number of 2D classes, which was set to ten for every particle set.

3D structure analysis (EMPIAR-10025, EMPIAR-10028 and EMPIAR-10215)

The final 3D reconstructions were analyzed visually in UCSF Chimera41 and 3DFSC34. In Chimera, the previous 3D reconstruction was first loaded (with the fitted Protein Data Bank structure, if available), and the newly processed 3D reconstruction was then aligned to it. The structures were visually compared and representative areas were chosen for display in Fig. 4. The 3DFSC reconstructions were calculated using the public server (https://3dfsc.salk.edu), which compares Fourier shell correlations over several solid angles to determine the range of resolutions and the amount of anisotropy in the reconstruction.

Toll receptor particle picking

A total of 1,599,638 particles were picked using DoG Picker 2 (ref. 7) from 8,974 micrographs and imported into cryoSPARC for all subsequent processing. After particle curation using 2D classification described below, the particle picks from 44 micrographs were visually inspected. Picks in areas of obvious particle aggregation were removed, and lower SNR particles corresponding to views typically missed by DoG Picker were selected. The resulting 1,048 particles were split into 686 training and 362 testing particles at the micrograph level. Topaz was then trained on the training particles and applied with the default score threshold of 0 for particle prediction. The ‘oblique’, ‘side’, and ‘top’ 2D classes (Fig. 2d) were lowpass filtered to 15 Å and used for template correlation with FindEM42 implemented in the Appion23 software package.

The crYOLO29 network was trained on the complete set of 1,048 labeled particles with 20% held out for validation by default. Micrographs were filtered and training was performed as described in the crYOLO tutorial. Picking was performed at the default threshold of 0.3.

The DeepPicker12 network was also trained on the complete set of 1,048 particles. Though no micrograph processing is required in the DeepPicker tutorial, micrographs were binned in Fourier space and lowpass filtered to 10 Å using EMAN25. Even with a threshold of 0, no particles were predicted by DeepPicker.

Toll receptor 3D reconstruction

All reconstructions were performed using cryoSPARC25. For all particle-picking approaches, we performed 2D classification with default parameters and 100 2D classes, then removed obvious non-particles. For the DoG dataset, four rounds of 2D classification yielded 770,263 particles from an initial stack of 1,599,638. For the template dataset, four rounds of 2D classification yielded 627,533 particles from an initial stack of 1,265,564. For the Topaz dataset, one round of 2D classification yielded 1,006,089 particles from an initial stack of 1,010,937. For the crYOLO dataset, one round of 2D classification yielded 131,300 particles from an initial stack of 133,644. For all datasets, ab initio reconstruction was used to generate an initial model, and the structures were further refined using homogeneous refinement with C1 symmetry, followed by non-uniform refinement. All parameters were left in their default setting. Unfiltered half maps and masks were used to calculate 3DFSC reconstructions using the public server (https://3dfsc.salk.edu).

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.