Main

The development of powerful microscopy techniques that allow us to characterize biological structures with subcellular resolution and on large fields of view has tremendously increased the complexity of quantitative image analysis tasks1. The resulting images exhibit a wide range of structures that need to be identified, counted, precisely located and segmented. Expert knowledge is commonly required to achieve successful identification and segmentation of the structures of interest in microscopy images2,3. These tasks can be tedious and time-consuming, especially for large databanks or for the comparison of multiple biological conditions. It was recently demonstrated that deep convolutional neural networks (DNNs) are excellent feature extractors4. They were successfully applied to segmentation (for example, whole cells, nuclei and dendritic spines), enumeration (for example, cell counting) and classification (for example, the state of a cell) in microscopy images5,6,7,8,9,10,11. The most common deep learning (DL) approaches applied to microscopy and biomedical images are fully supervised and require precisely annotated datasets8,10,11. Hence, the annotation process is often a limiting step in the application of DL for quantitative analysis of biomedical imaging3,12. To alleviate this, weakly supervised DL methods were introduced12,13,14,15. Bounding box annotations are commonly used for weakly supervised segmentation tasks as they are simple, allow the task to be spatially constrained2,14,16,17,18 and have been shown to reduce the annotation time by 15-fold compared to the precise identification of structure boundaries19. Methods for training with binary, image-level targets, reducing even further the complexity and duration of the annotation task, have been proposed for cases in which multiple instances are displayed on a single image20. Unfortunately, when applied to microscopy and biomedical image analysis, such weakly supervised approaches using whole-image annotations resulted in lower segmentation precision compared to approaches using precisely identified structures21,22,23. Visualizing the spatial attention of the model through class activation maps is a powerful tool in assessing whether a classification is biologically meaningful24. The resultant attention maps were leveraged in the context of microscopy images to extract coarse segmentation masks of the underlying structure23,25.

In this Article we propose MICRA-Net (MICRoscopy Analysis Neural Network), a new approach relying only on image-level classification annotations for training a DNN to perform different types of microscopy image analysis tasks, such as semantic segmentation, cell counting and the detection of sparse features. MICRA-Net builds on latent learning26, which refers to a model retaining information (that is, a latent space) that is not required for the task at hand in order to learn new complementary tasks26. MICRA-Net uses binary, image-level classification targets to build a general representation of the specific dataset. It generates detailed feature maps from which specific tasks, such as semantic segmentation and detection, can be addressed without the need for a precisely annotated dataset for training.

Results

The generation of precisely annotated large datasets to train DNNs in a fully supervised manner remains a challenge in the field of microscopy and biomedical imaging. MICRA-Net, a convolutional neural network (CNN)-based method, addresses this challenge by using solely whole-image binary targets for training. This approach outperforms established DL baselines trained in a weakly supervised manner for the semantic segmentation of diverse biological structures. It is therefore of great interest for the automated quantitative analysis of microscopy datasets for which no fully supervised training dataset is available. In the following we first investigate the impacts of the annotation burden, before characterizing the performance of MICRA-Net on synthetic and real data for various tasks. We then evaluate how MICRA-Net can be fine-tuned to leverage information from a previously acquired, but different, dataset. Finally, we show how the proposed approach could be used to support experts in the annotation of sparse and small structures in large images.

Annotation task reduction analysis

MICRA-Net is trained on a simple multi-class classification task and therefore only requires the expert to identify class-specific positive and negative images with respect to the structures of interest. In contrast to the identification of structure boundaries using precise or bounding box contours, image-level annotations do not require specification of the positions of objects in the images (Fig. 1a).

Fig. 1: Various supervision levels can be employed for training a DL model to segment structures of interest in microscopy images.

a, Representative image from the CTC dataset7, overlaid with the corresponding fully and weakly supervised annotations. Annotated images are presented in decreasing spatial level of supervision and required annotation time (from left to right). b, The averaged inter-participant variability from the User-Study from six selected cell lines of the CTC using three levels of supervision (precise, bounding boxes (BBOX) and points). The inter-participant agreement was calculated using the F1 score as a function of IOU for precise (blue) and BBOX (orange) annotations in an all-versus-one manner27. The F1 score for points annotation (green) was calculated with a maximal distance of association of 30 pixels. Given that it is not possible to report the IOU between points annotations, we show the average F1 score as a constant line. Plotted are the bootstrapped mean (line) and 95% confidence interval (shading, 10,000 repetitions). c–e, Distribution of median scores from the inter-participant comparison calculated in an all-versus-one manner: distance between associated point markers (c); Hausdorff distance between the precise contours of the participants' annotations in the User-Study62 (d); average annotation time per object on different cell lines for each supervision level calculated from the User-Study (e). f, Evaluation of the annotation time that would be required to generate the training set for the four microscopy datasets used throughout the Article for the MICRA-Net (MN) and fully supervised (FS) approaches. Reported is the time in hours required for manual annotation. For MICRA-Net, an average of 2 s per annotated crop is required to generate the image-level binary label, regardless of the dataset. For fully supervised learning, the annotation time was evaluated for each dataset separately on a precisely annotated subset of images (Methods).

We quantified the time required to generate annotations with different levels of precision (precise contours, bounding boxes and points) by conducting a User-Study in which we asked microscopists to annotate testing images from the Cell Tracking Challenge (CTC) dataset7 (Methods). We chose this annotation task, as whole-cell detection and segmentation is a common task in microscopy, allowing us to recruit experienced participants in the User-Study. We analysed the inter-participant variability using a metric that combines both the level of association between objects (F1 score) and the quality of annotated objects27 (intersection over union (IOU); Fig. 1b and Supplementary Figs. 1–3). As a general tendency, simpler annotation tasks reduced the inter-participant variability (higher F1 score at a given IOU). We measured an error on the position of point annotations ranging from 4 to 10 pixels (median = 7 pixels; Fig. 1c), whereas for precise annotations the Hausdorff distance on the cell boundaries was between 5 and 24 pixels (median = 16 pixels) depending on the cell line (Fig. 1d). Several factors can reduce the precision of the annotations, such as the contrast (Fluo-N2DL-HeLa is high contrast whereas PhC-C2DL-PSC is low contrast) and the shape (Fluo-N2DH-GOWT1 is round whereas PhC-C2DH-U373 is irregular) (Fig. 1c,d and Supplementary Fig. 3).

The time required to annotate a single cell is increased by approximately twofold when going from points annotations to bounding boxes and from bounding boxes to precise annotations (Fig. 1e). Compared to the precise annotations required to train fully supervised DL approaches, the generation of whole image binary annotations reduces the required annotation duration by 6.7–157.6 h ([1.2, 19] fold change, median = 5.76, Supplementary Table 1) on the four datasets evaluated in this study (Fig. 1f).

MICRA-Net architecture and baselines

Figure 2a shows the architecture of MICRA-Net, which is designed around a CNN backbone composed of eight convolutional layers (L1 to L8), followed by a fully connected layer. For each dataset, the number of predicted classes was adjusted (Supplementary Notes 1–5). The gradient-weighted class activation maps (grad-CAMs; Methods) were extracted for each predicted class and at every layer of the network (L1–8; Fig. 2a–e and Extended Data Fig. 1) by backpropagating each activated class through the network. Thereafter, rectified linear unit (ReLU) activation and thresholding on the grad-CAM of the last convolutional layer (L8) were applied to generate a coarse class-specific feature map28. This yielded activation maps that provide a general overview of the structure24,25. To increase the information contained in the extracted feature map, we concatenated the local maps from layers L1–7, resulting in a class-specific seven-dimensional feature space (Fig. 2b,c). We retrieved the first principal component of every pixel using principal component analysis (PCA) decomposition on the feature space to generate a single feature map for each dataset (Fig. 2b,c and Methods).

Fig. 2: MICRA-Net architecture and experimental results on the modified MNIST dataset.

a, MICRA-Net architecture (detailed in the Methods). Each depth is composed of two sequential convolutional layers (Conv2D), batch normalization (BatchNorm2D) and ReLU. A 2 × 2 maxpooling (MaxPool2D) was employed to increase the richness of the representation from the model. A linear layer is used to project the globally pooled L8 layer (256 filters, Global Maxpool2D) to the specified number of classes. b, Concatenation of low- and high-level feature maps obtained from the grad-CAMs of every layer is performed to generate the multi-dimensional feature space for every predicted class. c, Feature maps generated from the calculated grad-CAMs for classes 0 and 6 on the modified MNIST dataset. d, Coarse and detailed segmentation maps of the digits of a representative image (256 × 256 pixel) and insets (right, taken from the dashed white boxes) from the modified MNIST dataset using MICRA-Net. The colour code corresponds to the digit class and the red arrow indicates a missed digit in the field of view. e, The segmentation performances of the coarse (using only the L8 grad-CAM) and precise (using a PCA on the L1–7 grad-CAMs) methods are compared using F1 score, IOU and SBD. The bar graph shows the average and s.d. over the ten classes (see Supplementary Fig. 5 for class-wise and density-wise performances). A significant increase is shown when using the information from L1–7 grad-CAMs in comparison to L8 only, for all calculated metrics (Supplementary Table 2). f, Mean performance over the ten classes obtained with the U-Net trained with and without dilation of the ground-truth contours. The segmentation maps are presented in Supplementary Fig. 6a. The MICRA-Net segmentation performance (colour-coded dashed lines) surpasses the U-Net trained with ten-pixel dilation and is not statistically different from the U-Net trained with five-pixel dilation on all measured metrics. Only fully supervised training outperforms MICRA-Net segmentation on all measured metrics. P values were calculated using resampling (Methods) and are reported in Supplementary Table 2 with *P < 0.05, **P < 0.01 and ***P < 0.001. Bar graphs show the mean values and s.d.

To characterize the performance of MICRA-Net, we compared the results obtained on different datasets with three established baselines in microscopy: (1) U-Net8, (2) Mask R-CNN9 and (3) Ilastik29. These baselines were chosen as they are widely used in the literature and allow semantic segmentation, detection or enumeration of biological structures in microscopy images with no or only simple modifications (Supplementary Notes 2 and 3).

Multi-class segmentation of synthetic images

To validate the classification and segmentation performance of MICRA-Net, we created a synthetic dataset containing N randomly sampled cluttered handwritten digits from the MNIST dataset30 (Modified MNIST dataset; Fig. 2c and Methods). The first step was to classify the digits appearing on each image to validate the representation capability of the network, which was confirmed by the obtained class-wise mean classification testing accuracy of 98.9(5)% (mean ± s.d.).

The information embedded in the grad-CAMs was used to generate class-specific segmentation of the digits. As described above, we concatenated the grad-CAMs of layers L1–7 to generate detailed segmentation maps (Fig. 2 and Supplementary Fig. 5). The semantic segmentation performance of MICRA-Net was compared to a U-Net31 trained with fully and weakly supervised training schemes. Fully supervised learning consisted of training with the binary digits contours from MNIST, whereas weak contours were generated by a dilation of the digits with a square of size {5, 10, 25} pixels as a structuring element (Supplementary Note 1). Figure 2f shows that MICRA-Net achieves similar or superior segmentation performance compared to all weakly supervised training instances of the U-Net and is only outperformed on all measured metrics (F1 score, IOU and symmetric boundary dice (SBD)) by fully supervised training (Supplementary Fig. 6 and Supplementary Table 2).

Semantic segmentation of super-resolution microscopy images

The next question that we addressed was the applicability of our approach to super-resolution microscopy image segmentation, for which precisely annotated datasets are rarely available. The specific task was the semantic segmentation of a publicly available weakly annotated dataset of stimulated emission depletion (STED) microscopy images of two F-actin nanostructures in neurons: (1) a periodical lattice structure and (2) longitudinal fibres (Fig. 3a,c)2. Image-level annotation reduced the annotation time by more than threefold compared to the polygonal bounding box annotations that were recently used for weakly supervised training of the U-Net architecture2.

Fig. 3: Semantic segmentation of F-actin nanostructures observed on super-resolution microscopy images.

a,c, Representative raw images from a dataset of STED microscopy images of two F-actin nanostructures in fixed cultured hippocampal neurons: periodical lattice (a) and longitudinal fibres (c). Arrows point towards the periodical lattice (green) and longitudinal fibres (magenta). Segmentation masks obtained from an expert, MICRA-Net, weakly supervised U-Net, weakly supervised Mask R-CNN and weakly supervised Ilastik are also reported for both structures as comparison. b, Performance evaluation of MICRA-Net and weakly supervised baselines segmentation on the precisely annotated testing dataset using custom metrics for the periodical lattice. The FFT metrics compare the frequency contents of the provided masks. The segmentation resulting from MICRA-Net is not significantly different from the expert annotations, whereas the other baselines are (U-Net, Mask R-CNN and Ilastik). d, Average difference between the number of low-intensity pixels found within the precise expert annotations and the DL-based segmentation approaches for the F-actin fibre nanostructures (Methods). The raw number of low-intensity pixels segmented by MICRA-Net is not significantly different for any low value of intensity pixel from the expert. This is not the case for all baselines (U-Net, Mask R-CNN and Ilastik), as these annotated a significantly different number of low-intensity pixels corresponding to the space between single fibres that was not annotated by the expert. P values were calculated using resampling (Methods) and are reported in Supplementary Tables 5 and 6 with *P < 0.05. Performance evaluation was performed within the dendritic mask (a,c; yellow lines). Scale bars (a,c), 1 μm.

On the classification task, MICRA-Net achieved accuracies of 75.2% and 83.7% on the testing dataset for the F-actin periodical lattice and longitudinal fibres, respectively. This is in line with a mean inter-participant classification accuracy of 80(5)% and 75(7)% for periodical lattice and longitudinal fibres, respectively (calculated for six participants using a leave-one-out scheme from 50 images; Supplementary Fig. 7). Detailed masks were obtained from MICRA-Net to solve the semantic segmentation task. We relied on a precisely annotated dataset consisting of 25 images of each structure (Extended Data Fig. 2) to evaluate the performance of all trained models: (1) MICRA-Net, (2) multi-participant polygonal bounding box annotations (six participants on 25 images of each structure: User-Study), (3) U-Net and (4) Mask R-CNN trained with polygonal bounding boxes2, and (5) Ilastik trained using scribbles (Supplementary Note 2). MICRA-Net achieved equivalent or superior segmentation performance on the precisely annotated dataset in comparison to both the User-Study and all baselines when comparing the common segmentation metrics (Extended Data Fig. 2, Supplementary Figs. 8 and 9 and Supplementary Tables 3 and 4). Thus, even if trained with weak image-level annotations, MICRA-Net can extract the necessary structural information to generate segmentation maps for both nanostructures.

A qualitative visual inspection of the segmentation masks suggested that MICRA-Net produced a finer detailed mask of the boundaries of the F-actin nanostructures compared to the weakly supervised baselines2 (especially for fibres, for which it provided detailed segmented contours of single fibre strands). Custom performance metrics adapted to the F-actin nanostructures were required to better characterize this observation. For the F-actin periodical lattice, we measured the fast Fourier transform (FFT) of the segmented areas for frequencies corresponding to the periodicity of the lattice (180–190 nm, ref. 32; Fig. 3b and Methods). The FFT metric calculated on the areas segmented with MICRA-Net is not significantly different from the one obtained from the precisely annotated dataset (Fig. 3b), whereas all other baselines showed a significant increase (Supplementary Tables 4 and 5). Similarly, a custom metric based on the pixel intensity distribution of the segmented areas was developed to evaluate the approaches on the fibre segmentation task (Methods). We inspected the reduction in the number of low-intensity pixels between single fibre strands that is observed in the predicted masks compared to the precisely annotated dataset. Only the segmentation masks obtained with MICRA-Net do not show a significantly different pixel intensity distribution to the one obtained from the precisely annotated dataset (Fig. 3d and Supplementary Table 6). This supports the ability of MICRA-Net to precisely identify the contours of individual fibres and the boundaries of the periodical lattice regions in STED images using solely image-level annotations for training.

Single-cell semantic segmentation

Cell counting and segmentation is a common challenge in high-throughput analysis of optical microscopy images7,8,11,33,34,35. Both fully and weakly supervised DL approaches have been shown to be very powerful for these tasks on multiple cell lines6,23. To characterize the performance of MICRA-Net for semantic instance segmentation, we used the images from six cell lines of the CTC dataset. Although this dataset is generally used to compare DL approaches on the instance segmentation task, we tested the ability of MICRA-Net to perform semantic instance segmentation on those images. We chose the CTC as it is widely used across the microscopy community, it is associated with a precisely labelled multi-class training set and it could be integrated in a User-Study. We acknowledge that the more recent dataset CoNSeP (2019) could also have been considered35. We trained MICRA-Net on resampled images of the CTC and obtained a classification accuracy of 95.8(4)%. Despite this high classification accuracy, the semantic instance segmentation performance of MICRA-Net was strongly reduced when no negative samples were provided (Supplementary Fig. 10; DIC-C2DH-HeLa and Fluo-N2DH-GOWT1). The cell density on the images of DIC-C2DH-HeLa did not allow us to find a scaling factor that would generate negative and positive crops with entire cells. This cell line was therefore removed from the training set (Supplementary Fig. 11). Hence, we report the performance of all trained models on five selected cell lines from the CTC for the detection and semantic instance segmentation tasks.

We compared the semantic instance segmentation of MICRA-Net with fully and weakly supervised baselines: U-Net8, Mask R-CNN9 and Ilastik29 (Supplementary Note 3 and Supplementary Tables 7 and 8). The weak labels were generated by dilating/eroding each label of the fully supervised dataset by a value sampled from a normal distribution with 0 mean and s.d. in {2, 5, 10} (Altered-X or ALT-X), or by taking the bounding boxes of each object (Methods and Fig. 4a). We compared the variability in the annotation obtained from participants of our User-Study to that of the altered versions of the precise labels that were used to train fully supervised approaches (Fig. 4a,b). Figure 4b shows the distribution of IOU between associated objects (Object-IOU) of the User-Study and the altered versions of the precise labels (eight repetitions) compared to the original precisely annotated dataset. From Fig. 4b we can conclude that the distribution of the User-Study is similar to the distribution of ALT-5. Hence, training DL architectures with a training set obtained from multiple participants (for example, crowd-sourced) should result in similar baseline performance as when trained with ALT-5.
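To illustrate how such altered labels can be generated, the sketch below dilates or erodes each instance mask by an offset drawn from a zero-mean normal distribution. It is a minimal stand-in that assumes the instances are stored as an integer-labelled mask; the function name and the use of a disk structuring element are our assumptions, not the released implementation.

```python
import numpy as np
from skimage.morphology import binary_dilation, binary_erosion, disk

def make_alt_labels(instance_mask, sigma, rng=np.random.default_rng()):
    """ALT-X-style weak labels: each object is dilated or eroded by an offset
    sampled from N(0, sigma); sigma corresponds to X in {2, 5, 10}."""
    altered = np.zeros_like(instance_mask)
    for obj_id in np.unique(instance_mask):
        if obj_id == 0:  # background
            continue
        obj = instance_mask == obj_id
        offset = int(round(rng.normal(loc=0.0, scale=sigma)))
        if offset > 0:
            obj = binary_dilation(obj, footprint=disk(offset))
        elif offset < 0:
            obj = binary_erosion(obj, footprint=disk(-offset))
        altered[obj] = obj_id  # overlapping objects are overwritten by later ones
    return altered
```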

Fig. 4: Semantic instance segmentation on five selected cell lines of the CTC dataset.

a, Representative examples of the various levels of supervision used to train the selected baselines. b, Quantification of the IOU between associated objects for the User-Study and altered versions (ALT-X) of the testing set with the ground-truth objects for each cell line of the CTC. The precision of the participants is similar to an ALT-5 version of the testing set. c, Representative examples of MICRA-Net semantic instance segmentation. Each outline colour depicts a different segmented object. See Supplementary Figs. 12–14 for baseline examples. d, Left: comparison of the difference of the pooled area under the curve (AUC; F1 score versus IOU) of all cell lines for MICRA-Net over the baselines on the precisely annotated dataset. The raw curves and non-pooled data are provided in Supplementary Figs. 15–20. Higher and lower performances of MICRA-Net are reported in blue and red, respectively. MICRA-Net is only outperformed by U-Net trained using ALT-2 or fully supervised (FS) training. Right: comparison of the pooled AUC for all cell lines for the conducted User-Study using precise annotations and bounding boxes (BBOX). The precision of the segmentation masks generated with MICRA-Net is similar to the precise annotations and better than the bounding boxes obtained in the User-Study. Asterisks are used to highlight a significant change (Supplementary Tables 10 and 11). Scale bars, 25 μm.

For the semantic instance segmentation task, MICRA-Net was trained to predict both the presence of a cell and the contact between cells, the latter being subtracted from the former (Methods and Extended Data Fig. 3). We quantified the results by extracting the normalized area under the curve of the detection F1 score as a function of the IOU of associated objects27 (AUC; Methods). Figure 4d (left) reports the variation in AUC of MICRA-Net relative to the baselines trained with various levels of supervision when pooling data from all cell lines (Methods, Supplementary Figs. 12–20 and Supplementary Tables 9–16). As shown in Fig. 4d and Supplementary Fig. 20, the performance of the baselines that were developed for fully supervised datasets is affected when reducing the supervision level (Supplementary Figs. 12–14). MICRA-Net achieves similar performance to the baselines trained with weak labels (ALT-2 and ALT-5 datasets). On the ALT-10 dataset, MICRA-Net achieves superior performance to all considered baselines, which is also reflected when using the standard metrics (DET and SEG) from the CTC (Supplementary Tables 17 and 18). Notably, we observed that training MICRA-Net only on phase contrast (PhC) images in an instance segmentation task significantly improved performance (Supplementary Table 19). We also measured a similar or significantly increased testing AUC when using MICRA-Net compared to the precise contour annotations and bounding boxes that were obtained from the User-Study (Fig. 4d, right and Supplementary Figs. 2 and 3). Therefore, when no precisely annotated and proofed dataset is available, or when the manual annotation error may be high owing to the difficulty of the task, baseline architectures cannot be guaranteed to achieve superior semantic instance segmentation performance (Supplementary Fig. 20 and Supplementary Tables 12 and 13). Although the CTC is a precisely annotated and curated dataset that can be readily used for fully supervised approaches, using image-level binary labels for training would reduce the annotation time by more than ~157 h if the training set of the CTC needed to be generated from scratch (Figs. 1f and 4b).

Multi-device analysis

Although DL approaches can be very powerful when tackling tasks on very similar images, challenges are often encountered when the imaging conditions change over time (for example, due to a new device)36,37. We investigated how MICRA-Net could be fine-tuned on a dataset that contains similar structures but was acquired on a different device. We chose a brightfield microscopy dataset of Giemsa-stained Plasmodium vivax (malaria)-infected human blood smears (Fig. 5a), for which the training and testing datasets had very distinct intensity distributions (Fig. 5a,b)33,38. In the context of parasite detection and stage determination for malaria, the most important task, which we address in this publication, is the detection of infected smears; some related works also perform phenotype classification33,38,39,40.

Fig. 5: Detection of Giemsa-stained red blood cells from two different datasets of brightfield microscopy images from ref. 38.

a, Representative images from the training (two left) and testing (two right) datasets. The training dataset is composed of images taken from two different laboratories, and the testing images were acquired in a third laboratory. b, A change in the brightness and contrast is observed between the training and testing datasets. This results in a large difference in the mean pixel intensities (training, blue line; testing, orange line; with s.d. represented by the pale region) of the training and testing images. c, Left: a precision–recall graph quantifying the detection performance of MICRA-Net on the testing dataset. Without fine-tuning, the performance on the testing dataset (Naïve, grey ellipse) is characterized by a recall of 0.79 and a poor precision of 0.32. A variable number of images ({12, 24, 36}) from the testing dataset were used to adjust the detection threshold (Threshold, blue ellipse), which increased the precision but also reduced the recall by approximately twofold. Fine-tuning of the model on the sampled {12, 24, 36} images from the testing set with different settings: (1) allowing the linear layer (orange) and (2) different depths (depth 4, green; depths 3, 4, red; all, purple) to be updated (Extended Data Fig. 4 and Supplementary Note 4) resulted in a precision–recall above human agreement. Right: magnified region of the precision–recall performance of MICRA-Net. When the number of trainable parameters increases, the number of images required for a model with good generalization properties also increases. d, Detection efficiency (F1 score) of the various trained fine-tuned models. As a general tendency, increasing the number of images sampled from the testing set and allowing more layers to be updated resulted in better detection of infected red blood cells. The best detection accuracy of all trained models is highlighted in bold. See Supplementary Table 21 for the calculated P values.

We predicted the presence of infected smears in a 256 × 256-pixel image. A mean classification accuracy of 80(10)% (mean ± s.d., five different instances) was obtained. A classification accuracy over 87% was obtained when updating the threshold of the linear layer and over 88% for fine-tuned models trained on {12, 24, 36} sampled images from the original test set acquired on another device (Supplementary Note 4, Extended Data Fig. 3 and Supplementary Table 20).

With fine-tuning of at least the linear layer and depth 4 of MICRA-Net, the F1 score was significantly increased, surpassing the inter-expert agreement (0.64, averaged from infected cells reported in ref. 38). Additionally, increasing the number of images sampled from the testing set can significantly increase the detection accuracy (0.82(1)), as obtained by updating either Linear + 3, 4 or All layers (Supplementary Table 21).

We compared the segmentation results of MICRA-Net with expert precise annotations. Owing to the lack of a precisely annotated dataset in the original publication33, we asked an expert to manually segment all infected smears from the test set (303 smears), which were then proofed by a second expert in microscopy annotation processes. In contrast to the results obtained for the detection accuracy, updating more layers while fine-tuning (Linear + 3, 4 {12, 24, 36} and All {12, 24}) significantly reduced the IOU compared to only updating the linear layer (Supplementary Fig. 21 and Supplementary Table 22). Hence, a trade-off should be made by the users according to their specific needs. For instance, with these P. vivax datasets, the best trade-off to maximize both detection and segmentation efficiency requires the fine-tuning of at least the linear layer and depth 4.

Expert detection and segmentation assistance

The next step was to assess how MICRA-Net could be implemented as a tool to guide experts in the annotation of sparse and small structures in large images of an electron microscopy (EM) dataset. Our approach was tested on a dataset of scanning electron microscopy (SEM) images of ultrathin mouse brain sections in which axons were genetically labelled with a small engineered peroxidase APEX241 (axon DAB (3,3′-diaminobenzidine); Methods). In the SEM dataset, 1–10 small axonal regions (average size of 113 × 113 pixels) needed to be identified in images of around 10,000 × 10,000 pixels (Fig. 6a). Applied to this dataset, MICRA-Net was used to suggest regions containing the axon DAB marker.

Fig. 6: MICRA-Net is used as a tool to assist experts in the detection of sparse axon DAB markers in large SEM images of ultrathin mouse brain sections.

a, Schematic representation of the proposed approach. MICRA-Net is first swept over the entire field of view with a 75% overlap in both directions to output the probability of the presence of an axonal DAB marker. The probabilities of overlapping crops are then averaged to generate a probabilistic map of positions. The plausible positive regions are then viewed by the expert, who can accept or decline them. For each accepted region, the segmentation maps of the axon DAB generated by MICRA-Net can be inspected by the expert. b, The total percentage of annotated area is colour-coded as a function of the PU ratio for different recall values. Using MICRA-Net trained with a PU ratio of 1:5 as an assisting tool results in the validation of ~3% of an image, which would require less than 15 min for an expert to validate the complete testing set (44 images) and result in a recall of 0.9. The annotated area as a function of the recall for each PU ratio is shown in Supplementary Fig. 22. c, Total number of detections from the testing dataset with and without assistance from MICRA-Net. Using MICRA-Net, the expert could identify 57 new axon DAB positive regions, corresponding to an increase of 25% in the total number of detections. Scale bars: 5 μm, full field of view; 1 μm, extracted crops.

An expert identified axon DAB positive regions on the training (158 images) and testing (44 images) sets (Methods). Thereafter, all positive and negative crops (1,024 × 1,024 pixel, 5.12 × 5.12 μm2) were extracted from the training images without overlap to train MICRA-Net (Fig. 6a and Supplementary Note 5).

In the context of very sparse detections, positive-unlabelled (PU) learning can improve the performance of a given architecture42. On the classification task, an accuracy between 83% and 90% was obtained for all PU ratios between 2:1 and 1:16 (Fig. 6b and Supplementary Tables 23 and 24). With MICRA-Net, this sparse detection task resulted in an increase of the measured recall above the inter-expert agreement (0.791, Supplementary Fig. 22), while requiring an expert to proof only 3.13% of a newly acquired image. The detection recall obtained by MICRA-Net (>0.90) is similar to other methods applied in EM to detect small structures such as synapses (0.8843 and 0.8544) or extracellular vesicles (0.75, average over three datasets45). Using MICRA-Net, the area that would be required by an expert to inspect new images would be reduced by 30-fold. Additionally, MICRA-Net allowed the expert to detect 57 new axon DAB regions in the test set (representing 25% more detections) that had been missed during the initial image annotation process (Fig. 6c). In comparison, Ilastik trained with circle annotation of the axon DAB structures achieved a classification accuracy of 8%, requiring expert inspection of nearly all newly acquired images (Supplementary Fig. 23). MICRA-Net could also be used to generate coarse segmentation masks of the detected axon DAB regions (Supplementary Fig. 24a and Supplementary Table 25). Application of MICRA-Net to this EM annotation task was thus successful in reducing the burden of generating the training dataset, while also notably increasing the discovery of regions that were missed by the manual expert annotation.

Discussion

Although pixel-wise metrics and ground-truth annotations are well established in the field of DL and computer vision with natural images, retrieval of ground-truth annotations in biomedical imaging is a laborious process, requires highly trained experts, and annotation imprecision often occurs3,46 (Fig. 1). This highlights the need for weakly supervised DL approaches that do not rely on spatially precise annotations of the structure, but rather on annotations that are easier and faster to retrieve. MICRA-Net, a CNN-based method, relies on the information embedded in the latent space of a classification task to solve complementary tasks without the need to generate task-specific precisely annotated training sets. This is done by combining high- and low-level features of the hidden layers with grad-CAMs to generate detection or segmentation maps. We designed multiple experiments to challenge MICRA-Net at solving common tasks (semantic segmentation, detection or localization) on different modalities (STED, fluorescence, phase contrast, brightfield and EM) relevant to high-throughput microscopy image analysis3,8. Unlike multi-task learning47, MICRA-Net does not combine auxiliary tasks to increase the learning performance of a main task, nor does it require more annotations from the dataset for each task48,49. Hence, MICRA-Net should substantially reduce the burden of task-specific annotation of bioimaging datasets for which precise annotations are challenging to retrieve, thereby increasing the accessibility of such DL-based microscopy image analysis.

Although fully supervised DL approaches have the drawback of being costly to train, they can benefit from pre-training8,50,51, given that the image space is similar52, and have access to precise information about the structure boundaries. On the other hand, MICRA-Net extracts spatial features from the hidden layers of the network to generate detailed feature maps using solely easy-to-retrieve, binary image-level annotations for training. Considering the observed reduction of the inter-expert variability when diminishing the complexity of the annotations (Fig. 1), this will be an important aspect for future DL applications leveraging crowd-sourced annotations for training3,53.

Additionally, we have demonstrated that MICRA-Net could be fine-tuned, for example, when images are acquired on two different microscopes. This is of particular interest for large-scale studies, conducted on multiple sites, that require the analysis framework to be easily adaptable to new experimental conditions24,33,54. Future work on fine-tuning of such approaches to new structures of interest and analysis tasks will be an important step to increase their accessibility to a larger community of researchers.

Finally, MICRA-Net was used to assist an expert in the complex annotation task of detecting small sparse objects in EM images. Originally, this task was prone to identification errors and fatigue, limiting the performance of the experts and increasing inter-expert variability. When using the detections from MICRA-Net, experts could focus their attention on less than 5% of the image. Not only was the annotation time decreased by 30-fold, but with MICRA-Net the number of detected objects was also increased by 25%.

MICRA-Net provides similar performance on multiple tasks to the established weakly and fully supervised learning approaches, thus making it an unprecedented alternative to address bioimaging analysis challenges for which large and precisely annotated datasets are not available.

Methods

MICRA-Net

Architecture

Figure 2a presents a schematic representation of the MICRA-Net architecture. MICRA-Net is based on the encoder part of a U-Net31. The rationale is that U-Net is an established method for solving different analysis tasks (for example, segmentation, localization and detection) on biomedical datasets. Each depth of the network contains two blocks of convolutions (kernel size of 3) followed by batch normalization and ReLU activation. The number of filters in the convolutional layers is doubled after maxpooling (stride and kernel size of 2) to increase the richness of the representation. The number of filters for each depth is {32, 64, 128, 256}. Global maxpooling on the output layer allows a reduction of the dimensionality, and a fully connected layer (FCL) is used to provide a classification prediction. Dropout (probability of 0.5) is applied to the input features of the FCL.
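For concreteness, the following is a minimal PyTorch sketch of the encoder described above. Padding, the exact placement of the pooling operations and the input channel count are not specified in the text and are therefore assumptions; the released implementation may differ.

```python
import torch
import torch.nn as nn

class MicraNetSketch(nn.Module):
    """Sketch of the MICRA-Net encoder: four depths of two (Conv2D -> BatchNorm2D ->
    ReLU) blocks with {32, 64, 128, 256} filters, 2 x 2 max pooling between depths,
    global max pooling, dropout and a linear classifier."""

    def __init__(self, in_channels=1, num_classes=1):
        super().__init__()

        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.depths = nn.ModuleList([
            block(in_channels, 32),
            block(32, 64),
            block(64, 128),
            block(128, 256),
        ])
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.dropout = nn.Dropout(p=0.5)
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        for i, depth in enumerate(self.depths):
            x = depth(x)
            if i < len(self.depths) - 1:   # maxpool between depths
                x = self.pool(x)
        x = torch.amax(x, dim=(2, 3))      # global max pooling over H, W
        return self.classifier(self.dropout(x))
```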

At inference, MICRA-Net predicts a whole-image target from a given sample. Then, for each activated class c, a local map \(L^{l}\) is calculated from the weighted combination of the activation map \(A^{l,k}\) and the mean gradient \({\alpha }_{l,k}^{c}\) of each layer l (ref. 28). The mean gradient \({\alpha }_{l,k}^{c}\) is calculated from the backpropagated class activation \(y^{c}\) as

$${{\alpha}^{c}_{l, k}} = {\frac{1}{Z}}{ \sum\limits_i}{\sum\limits_j} \underbrace{\frac{\partial {y^c}}{\partial {{A}^{l, k}_{i, j}}}}_{{\text{gradients via backprop}}}.$$
(1)

The local map \(L^{l}\) is calculated as the linear combination of the activation map and the mean gradient of each layer of convolutions in the network

$${{L}^{l}}={\mathop{\sum}\limits_{k}}{{\alpha }_{l,k}^{c}}{{A}^{l,k}}.$$
(2)

Because MICRA-Net produces spatially reduced feature maps, the local maps are upsampled using nearest-neighbour interpolation to match the input image size of 256 × 256 pixels. These maps are then normalized in the range [0, 1] using min–max scaling. ReLU activation is applied on the last layer (L8) of the network, as in the seminal implementation of grad-CAM28, to be used for coarse segmentation. Local maps from layers L1–7 (Fig. 2a–c) are concatenated into a feature space, and the first principal component of every pixel is retrieved using PCA55 decomposition to retain the prominent information of the feature space. The network was built and trained with the PyTorch library56.
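A minimal sketch of this computation is given below: it evaluates equations (1) and (2) for each requested layer using forward hooks and automatic differentiation, then reduces the seven local maps to a single detailed feature map with PCA. The way the layers are exposed (a name-to-module dictionary) and the numerical details are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F
from sklearn.decomposition import PCA

def gradcam_local_map(activation, gradient, out_size):
    """L^l = sum_k alpha^c_{l,k} A^{l,k} (equations (1)-(2)), upsampled with
    nearest-neighbour interpolation and min-max scaled to [0, 1]."""
    alpha = gradient.mean(dim=(2, 3), keepdim=True)         # eq. (1): mean over i, j
    local = (alpha * activation).sum(dim=1, keepdim=True)   # eq. (2)
    local = F.interpolate(local, size=out_size, mode="nearest")
    local = local - local.min()
    return local / (local.max() + 1e-8)

def detailed_feature_map(model, image, layers, class_index):
    """Concatenate the L1-7 local maps of one activated class and keep the first
    principal component of every pixel. `layers` maps names to the seven modules
    whose outputs are the activation maps A^{l,k}; image is a 1 x C x H x W tensor."""
    acts = {}
    handles = [m.register_forward_hook(
        lambda mod, inp, out, n=name: acts.update({n: out}))
        for name, m in layers.items()]
    logits = model(image)
    grads = torch.autograd.grad(logits[0, class_index], list(acts.values()))
    for h in handles:
        h.remove()

    maps = [gradcam_local_map(a, g, image.shape[-2:])
            for a, g in zip(acts.values(), grads)]
    stack = torch.cat(maps, dim=1)[0].detach().cpu().numpy()   # 7 x H x W feature space
    pixels = stack.reshape(len(maps), -1).T                     # (H*W) x 7
    pc1 = PCA(n_components=1).fit_transform(pixels)
    return pc1.reshape(image.shape[-2], image.shape[-1])        # detailed feature map
```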

To facilitate the analysis of new images using MICRA-Net, a graphical user interface (GUI) is provided to qualitatively analyse the influence of each local map (Extended Data Fig. 1). Although the implementation of MICRA-Net uses layers L1–7 with a PCA decomposition of the resultant feature space, the GUI allows the user to arbitrarily combine different local maps of the MICRA-Net architecture and threshold the resultant detailed feature map.

Training procedure

The general training procedure for the MICRA-Net architecture is reported in this section. MICRA-Net was trained using the Adam optimizer with a learning rate specific to each dataset and otherwise default parameters57. A learning rate scheduler was used to reduce the learning rate of the optimizer, with a minimal possible learning rate of 1 × 10−5. The number of training epochs was adapted to the specific dataset (Supplementary Tables 26–30). Early stopping was used to reduce overfitting. Unless otherwise specified, we used the binary cross-entropy with logits loss. We kept the model with the best generalization properties on the validation set (calculated from the objective loss function). The training of MICRA-Net was repeated five times with different random seeds.
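The corresponding training setup can be sketched as follows; the scheduler type (ReduceLROnPlateau), its factor and patience are assumptions, whereas the minimal learning rate of 1 × 10−5 and the loss follow the text.

```python
import torch

def build_training(model, learning_rate):
    """Optimizer, scheduler and loss as described above (scheduler hyperparameters
    are assumed; only min_lr = 1e-5 is taken from the text)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=10, min_lr=1e-5)
    criterion = torch.nn.BCEWithLogitsLoss()
    return optimizer, scheduler, criterion

# After each validation pass: scheduler.step(val_loss); checkpoint the weights
# whenever val_loss improves (model with the best generalization properties) and
# stop early if it has not improved for a dataset-specific number of epochs.
```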

Data augmentation was used to increase the performance of the network. Supplementary Tables 26–30 provide the detailed data augmentation procedure for each dataset. All operations were applied in a random order, each with a probability of 50%.
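The random-order application of the augmentation operations can be sketched as below; the operation names in the usage example are placeholders for the dataset-specific operations listed in Supplementary Tables 26–30.

```python
import random

def augment(image, operations, p=0.5, rng=random):
    """Apply the augmentation operations in a random order, each with
    probability p (0.5 in the text)."""
    ops = list(operations)
    rng.shuffle(ops)
    for op in ops:
        if rng.random() < p:
            image = op(image)
    return image

# e.g. augment(crop, [horizontal_flip, vertical_flip, rotate_90, intensity_scale])
```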

Specific tasks

This section presents how MICRA-Net can be used to solve common tasks in microscopy images analysis.

Segmentation (and semantic segmentation) is solved by first extracting a detailed semantic feature map. The segmentation masks are obtained by thresholding the resultant feature map using a common thresholding algorithm (for example, Otsu or percentile thresholding). The dataset-specific thresholding is detailed in Supplementary Notes 1–5.
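As a sketch, and assuming the detailed feature map is available as a NumPy array, the thresholding step can be expressed as follows; the choice between Otsu and a percentile, and the percentile value, are dataset-specific and shown here only for illustration.

```python
import numpy as np
from skimage.filters import threshold_otsu

def segmentation_mask(feature_map, method="otsu", percentile=75):
    """Binarize the detailed feature map with a common thresholding algorithm."""
    if method == "otsu":
        threshold = threshold_otsu(feature_map)
    else:
        threshold = np.percentile(feature_map, percentile)
    return feature_map >= threshold
```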

The detection task on the P. vivax and EM microscopy datasets is solved by predicting the probability of the presence of an object on all extracted crops. The overlap between the crops is 75% in both directions. Overlapping crops are averaged and reassigned to an output feature map of the same shape as the image. The detection threshold is inferred from the validation set using a precision–recall curve.
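A possible implementation of this sweep is sketched below for a single-channel image; the wrapper `model_probability`, which returns the classifier's probability for one crop, is a placeholder, and the margins not covered by a full crop are ignored for brevity.

```python
import numpy as np

def detection_map(model_probability, image, crop=256, overlap=0.75):
    """Sweep the classifier over the image with 75% overlap in both directions,
    average the probabilities of overlapping crops and return a per-pixel map."""
    stride = int(crop * (1 - overlap))
    h, w = image.shape[:2]
    prob_sum = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    for y in range(0, max(h - crop, 0) + 1, stride):
        for x in range(0, max(w - crop, 0) + 1, stride):
            p = model_probability(image[y:y + crop, x:x + crop])
            prob_sum[y:y + crop, x:x + crop] += p
            counts[y:y + crop, x:x + crop] += 1
    return prob_sum / np.maximum(counts, 1)

# The detection threshold applied to this map is chosen from the validation
# precision-recall curve.
```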

Semantic instance segmentation is required for the CTC dataset. MICRA-Net is trained to predict (1) the presence of an object and (2) the contact between objects. The grad-CAMs of the activated classes are extracted from the architecture and combined using a PCA. If a contact is predicted on an image, the grad-CAM from L8, which contains the prominent information about the contact, is extracted. The contact feature map is subtracted from the object feature map, as in some fully supervised techniques27. An Otsu threshold is used to generate the semantic segmentation masks of the instances.
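Assuming the object and contact feature maps are available as arrays, the post-processing can be sketched as follows; the minimum object size and hole filling follow the evaluation procedure described later in the Methods.

```python
import numpy as np
from scipy.ndimage import binary_fill_holes
from skimage.filters import threshold_otsu
from skimage.measure import label
from skimage.morphology import remove_small_objects

def instances_from_maps(object_map, contact_map=None, min_size=20 * 20):
    """Subtract the cell-contact feature map from the cell feature map, threshold
    with Otsu, clean up small objects and holes, and label connected components."""
    combined = object_map - (contact_map if contact_map is not None else 0.0)
    mask = combined >= threshold_otsu(combined)
    mask = binary_fill_holes(remove_small_objects(mask, min_size=min_size))
    return label(mask)
```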

Datasets

The Modified MNIST dataset

We generated the modified MNIST training dataset by randomly sampling N digits from the original MNIST training dataset and randomly distributing them on a 256 × 256-pixel field of view. To avoid overlap between digits we used a random Poisson disc sampling algorithm with a radius size of 25 pixels58. The number of digits N was uniformly sampled from {1, 2, 3, 4, 5, 10, 15, 20, Max}, where Max corresponds to the maximum number of digits that can be placed without overlap. A rotation of ±30°, uniformly sampled, was applied to the digits before placement on the image. We applied, in a random order, a Gaussian blur with sigma uniformly sampled in [0, 2] and artificial normalized Poisson noise with \({\lambda }={\frac{\sqrt{255}}{2}}\) to mimic slight variations akin to those that may be observed in microscopy images. The resulting image intensities were clipped to lie in [0, 1]. Using this technique, we generated 2,000 and 1,000 images for training and validation, respectively.

The modified MNIST testing dataset consists of 1,000 images of handwritten digits sampled from the original MNIST testing dataset. As for the training dataset, we also applied, in random order, Gaussian blur and artificial normalized Poisson noise, sampled as before.

The F-actin dataset

The F-actin dataset was generated by using a sliding window (256 × 256 pixels) with a stride of 192 pixels over 260 complete images with an approximate size of 1,000 × 1,000 pixels. Because the super-resolution microscopy images used are mostly composed of background, we kept only the crops containing at least 10% dendritic area, thereby reducing the number of crops to identify. The dendritic mask was obtained from the foreground detection on the confocal image of the dendritic marker MAP2 using a global Otsu thresholding on the normalized Gaussian blurred image2,59. The sigma parameter of the Gaussian blur was set to 20 pixels as it provided suitable dendrite detection over a wide range of images. We next annotated each generated crop as being positive to the presence of the F-actin periodical lattice or longitudinal fibres. The resulting training dataset contained 3,832 crops (256 × 256 pixels, 897 images positive to the periodical lattice and 1,456 positive to the longitudinal fibres), the validation dataset contained 1,287 crops (405 positive to the periodical lattice and 377 positive to fibres) and the testing dataset contained 416 crops (83 positive to the periodical lattice and 132 positive to fibres). The images \(x\) were rescaled to \(x^{\prime}\), lying in the [0, 1] interval. The maximum value for scaling (max) was obtained by sampling the maximal value of each training image and calculating the median plus 3 s.d. of these values. The minimum value (min) was calculated as the median of the minima. To ensure a proper scaling of the images we also added a scaling factor of 0.8:

$${x^{\prime}} ={\frac{{x}-{\min }}{0.8({\max }-{\min })}}.$$
(3)

To evaluate the segmentation performance of the trained models, an expert precisely highlighted the contours of the structures in 50 images (25 images positive to the periodical lattice and 25 images positive to fibres) randomly sampled from the testing set. The contours of the structures were proofed by a second expert user. This small segmentation dataset only served to compare the segmentation performances from MICRA-Net, weakly supervised baselines (U-Net, Mask R-CNN, Ilastik) and User-Study.

The CTC dataset

We selected six cell-line datasets from the CTC7: the DIC-C2DH-HeLa dataset, which was acquired using differential interferometry contrast microscopy, three non-synthetic fluorescence microscopy datasets (Fluo-C2DL-MSC, Fluo-N2DH-GOWT1 and Fluo-N2DL-HeLa) and two phase contrast microscopy datasets (PhC-C2DH-U373 and PhC-C2DL-PSC). All original images were rescaled in the [0, 1] range using a per image min–max scale. We then resized each image and associated precise annotations according to the specific needs using bi-linear interpolation and nearest neighbours, respectively, with the Scikit-Image60 Python library (Supplementary Table 7 provides the scaling factors). We used a sliding window of size 128 × 128 pixels or 256 × 256 pixels with a 25% overlap between crops in both directions. Using this sliding window technique yielded a total of 27,106 positive crops and 3,364 negative crops for the 256 × 256-pixel crops resized to have an effective pixel size of 0.5 μm. The sliding window with size 128 × 128-pixel crops and resized to have single cells in the field of view yielded a total of 66,466 positive crops (20,724 positive to contact) and 88,722 negative crops for training and 17,621 positive crops (5,606 positive to contact) and 22,279 negative crops for validation. We simulated weak annotations from the precise contours of the cells provided in the original CTC dataset by identifying an image crop as positive if the corresponding annotated crop contained at least the size of the average annotated cell, and negative otherwise. To evaluate the segmentation and detection tasks, we manually segmented four images randomly sampled per cell line in the testing set. The images were annotated by one person and verified by a second person. Both were experts in optical microscopy and very well trained in image annotation processes.

The P. vivax dataset

We used image set BBBC041v1, available from the Broad Bioimage Benchmark Collection33. The complete dataset contained 1,327 three-channel images and was already split into training (1,207 images) and testing (120 images) sets. The dataset is composed of blood smears that were stained with Giemsa reagent and acquired on three different brightfield microscopes from three different laboratories. All blood smears (infected or uninfected) were annotated using bounding boxes. The blood smears were later classified as infected (gametocytes, rings, trophozoites and schizonts) or uninfected (red blood cells and leukocytes) by an expert. The task was to differentiate infected from uninfected blood smears. The dataset is highly unbalanced towards red blood cells, which compose over 95% of the annotated cells.

For training and testing we applied a whitening normalization (null mean and s.d. of 1) to each image (and channel) to minimize the impact of a very different intensity distribution. The binary targets for training were generated using the provided bounding boxes. A crop was considered positive if it contained at least 5% of overlap with an infected cell (otherwise negative). The crops were 256 × 256 pixels.

We manually extracted and precisely annotated all infected cells in the testing set, resulting in 303 small crops of size 256 × 256 pixels centred on the cell of interest. Each cell was annotated by a person and verified by a second person. Both were experts in optical microscopy and very well trained in image annotation processes.

The SEM dataset

The SEM dataset contained 92 images of 10,240 × 10,240 pixels for training, 66 for validation and 44 for testing. An expert annotated the images using positional markers to locate the axon DAB markers. On average, the large fields of view contained three small detections (113 × 113 pixels, between one and ten detections per image). This resulted in an annotation time of ~30 min per field of view. Training and inference were performed on 512 × 512-pixel crops. The dataset contained all positive crops (1,024 × 1,024 pixels, centred on the axon DAB markers) and all negative crops (without overlap). To manually annotate the images, the expert inverted the acquired images. Hence, we provided MICRA-Net with the inverted image to mimic the expert task. We rescaled the provided 8-bit depth images in the [0, 1] range by dividing by a scalar value of 255.

All axon DAB markers were extracted from the testing set (170 positive markers) and an expert carefully identified their contours.

Evaluation procedure

Classification

The classification accuracy of MICRA-Net was evaluated by performing inference on the testing images. To quantitatively assess performance, the classification accuracy was calculated for each trained model. We reported the mean ± s.d. of the trained models.

Detection

The centroid of each detected object was obtained from MICRA-Net by using the dataset-specific procedures detailed in Supplementary Notes 1–5. Each detected centroid was associated with the centroid of objects in the ground-truth mask using the Hungarian algorithm61 with a maximal distance of N pixels, where N is approximately the object radius. In this context, an associated detected object is considered as a true positive, a non-associated detected object is a false positive and a missed ground-truth object is a false negative. To evaluate the detection capability of MICRA-Net, we reported the F1 score. For a quantitative comparison, we repeated the evaluation for each trained model. We then bootstrapped the average of the trained models to show the bootstrapped mean and 95% confidence interval (CI) (10,000 repetitions).
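A minimal sketch of this association step, using SciPy's Hungarian solver, is given below; object centroids are assumed to be available as lists of (row, column) coordinates.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def detection_f1(pred_centroids, gt_centroids, max_distance):
    """Associate predicted and ground-truth centroids with the Hungarian algorithm,
    count pairs closer than the maximal association distance N as true positives
    and return the F1 score."""
    tp = 0
    if len(pred_centroids) and len(gt_centroids):
        cost = cdist(np.asarray(pred_centroids), np.asarray(gt_centroids))
        rows, cols = linear_sum_assignment(cost)
        tp = int(np.sum(cost[rows, cols] <= max_distance))
    fp = len(pred_centroids) - tp
    fn = len(gt_centroids) - tp
    return 2 * tp / max(2 * tp + fp + fn, 1)
```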

Segmentation

The segmentation performance of the trained models was evaluated using three common evaluation metrics: F1 score, IOU and SBD62. If multiple instances of a model were trained on the same task, we bootstrapped the average of the trained models to show the bootstrapped mean and 95% confidence interval (10,000 repetitions).
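The bootstrapping used throughout can be sketched as follows, assuming one aggregate score per trained model instance:

```python
import numpy as np

def bootstrap_mean_ci(model_scores, repetitions=10_000, rng=np.random.default_rng()):
    """Bootstrapped mean and 95% confidence interval over the average scores of the
    trained model instances (10,000 repetitions)."""
    scores = np.asarray(model_scores)
    means = np.array([rng.choice(scores, size=scores.size, replace=True).mean()
                      for _ in range(repetitions)])
    low, high = np.percentile(means, [2.5, 97.5])
    return scores.mean(), (low, high)
```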

Instance segmentation

Before evaluation we removed small objects (<20 × 20 pixels) from the segmentation masks and filled holes for all trained models. All segmentation masks were resized to the baseline scale (Supplementary Table 7) for proper comparison. The instance segmentation performance was evaluated using the method proposed in ref. 27 (Supplementary Figs. 15–18). Briefly, this method evaluates the detections and failures of the architecture as a function of the IOU. Reference 27 used a minimal IOU of 0.5 to avoid multiple predicted objects being associated with a ground-truth object. The goal is to maximize the F1 score versus IOU, while the failure modes should be minimized. We, on the other hand, solved the association between the ground-truth and predicted objects using the Hungarian algorithm61, allowing us to report the performance and failure modes across the entire range of IOU. Using a broader range of IOU allows the performance in both instance detection and segmentation to be reported. The normalized area under the resultant curves for each trained model is bootstrapped to obtain the mean and 95% confidence interval (10,000 repetitions) and is reported in Fig. 4.
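Given the IOU of every associated object pair (from the Hungarian matching described above), the F1-versus-IOU curve and its normalized AUC can be sketched as follows; the threshold grid is an assumption.

```python
import numpy as np

def normalized_auc(associated_ious, n_pred, n_gt, thresholds=np.linspace(0.0, 1.0, 101)):
    """F1 score of Hungarian-associated objects as a function of the IOU threshold,
    summarized by the normalized area under the curve."""
    ious = np.asarray(associated_ious)
    f1 = []
    for t in thresholds:
        tp = int(np.sum(ious >= t))
        fp, fn = n_pred - tp, n_gt - tp
        f1.append(2 * tp / max(2 * tp + fp + fn, 1))
    return np.trapz(f1, thresholds) / (thresholds[-1] - thresholds[0])
```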

Custom performance metrics

The F-actin periodical lattice is detected as an oscillating pattern between high- and low-intensity stripes with 180–190-nm periodicity32. We designed a metric that takes this periodicity into account to evaluate the MICRA-Net detailed segmentation performance. We computed, as a baseline, the Fourier transform (FT) of the original image (\({\rm{FT}}_{\rm{b}}\)) and the FT of the segmented regions for the expert (\({\rm{FT}}_{\rm{e}}\)) and for the predicted segmentation masks (\({\rm{FT}}_{\rm{pred}}\)). The variation from the baseline was computed as the difference in the FT spectrum, for spatial frequencies in the range [170, 200] nm, between \({\rm{FT}}_{\rm{e,pred}}\) and \({\rm{FT}}_{\rm{b}}\) over the sum of \({\rm{FT}}_{\rm{b}}\). A smaller absolute difference between the variation of the expert and the variation of the predicted mask implies more similar segmentation.
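The sketch below illustrates one way to compute such a band-limited spectral variation. It is a simplified, one-dimensional (row-wise) stand-in, since the windowing and directionality of the original implementation are not specified; the 20 nm pixel size follows the STED acquisition parameters.

```python
import numpy as np

def periodicity_variation(image, mask, pixel_size_nm=20.0, band_nm=(170.0, 200.0)):
    """Relative change, within the 170-200 nm spatial-frequency band, of the Fourier
    spectrum of the segmented region with respect to the whole image (simplified)."""
    def band_spectrum(img):
        spectrum = np.abs(np.fft.rfft(img, axis=1)).mean(axis=0)    # row-wise FT
        freqs = np.fft.rfftfreq(img.shape[1], d=pixel_size_nm)      # cycles / nm
        in_band = (freqs >= 1.0 / band_nm[1]) & (freqs <= 1.0 / band_nm[0])
        return spectrum[in_band]
    ft_baseline = band_spectrum(image)
    ft_segmented = band_spectrum(np.where(mask, image, 0.0))
    return np.sum(ft_segmented - ft_baseline) / np.sum(ft_baseline)
```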

Given that F-actin fibres are contiguous and have a high intensity on the dendrites, we designed a metric that uses the distribution of pixel intensities under a segmented mask. The rationale behind this metric is that the F-actin nanostructures on dendrites are composed of both high- and low-intensity pixels. Because F-actin fibres have high intensities, a detailed segmentation of fibres would include few annotated low-intensity pixels, while a coarse segmentation would include more low-intensity pixels. Hence, we considered a pixel within the segmentation mask as part of a fibre if its value was greater than a given threshold. We calculated this threshold by first measuring the 25th percentile of pixel intensities outside of the expert mask for all images. We then extracted the 90th percentile of these values from all images containing F-actin fibres. This resulted in a threshold between high- and low-intensity pixels within the dendritic mask of 9.
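The sketch below follows this description; our reading of which percentile is taken over which set of values is an interpretation, so the helper names and details are illustrative only.

```python
import numpy as np

def low_intensity_threshold(images, expert_masks):
    """Threshold between high- and low-intensity pixels: the 90th percentile of the
    per-image 25th percentiles of intensities outside the expert masks (a value of
    9 for the F-actin fibre dataset)."""
    background_q25 = [np.percentile(img[~mask], 25)
                      for img, mask in zip(images, expert_masks)]
    return np.percentile(background_q25, 90)

def low_intensity_pixel_difference(image, expert_mask, predicted_mask, threshold):
    """Difference in the number of low-intensity pixels included in the predicted
    mask versus the expert mask; large values indicate coarse fibre segmentation."""
    return int(np.sum(image[predicted_mask] < threshold)
               - np.sum(image[expert_mask] < threshold))
```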

User-Study

We conducted two different User-Studies in this Article, one for the F-actin nanostructure segmentation and one for instance segmentation on the CTC. All participants were trained microscopists who were familiar with biomedical images.

F-actin segmentation

We performed a User-Study in which six participants highlighted the contours of the F-actin periodical lattice and longitudinal fibres on a small dataset of 50 images using polygonal bounding boxes. We used polygonal bounding boxes as this annotation method reduces the time required by a participant by more than threefold compared to precisely identifying the boundaries of the structures (Supplementary Fig. 9). We used our own annotation application that was optimized for this type of task. Annotation of the full dataset required ~40 min for the participants. The averaged performance of the six participants was compared to MICRA-Net using F1 score, IOU and SBD.

CTC instance segmentation

A User-Study was conducted on the CTC to analyse the required time per cell and the achievable inter-participant annotation performance for such a task. The User-Study consisted of annotating the 24 testing images with different levels of supervision (precise, bounding boxes and points). For each level of supervision, the participants were asked to annotate a quarter of the testing image, which was the same for all participants. The image intensity scale was fixed at a constant value for all participants. The participants used the Fiji software to annotate the images. The medians of the participant scores on the testing set are reported, as well as the inter-participant scores. The time required by each participant to annotate each image was recorded, which allowed us to calculate the time per cell for each cell line. Representative examples of the participant annotations are provided in Supplementary Figs. 1–3 and the specific curves per cell line in Supplementary Fig. 4. The images annotated for the User-Study were not used in the testing phase of MICRA-Net and the other baselines.

In-house dataset acquisition

Sample preparation and STED imaging of F-actin

Before dissection of hippocampi, neonatal Sprague–Dawley rats were euthanized by decapitation, in accordance with procedures approved by the Animal Care Committee of Université Laval. Dissociated cells were plated on poly-d-lysine-coated glass coverslips, fixed and immunostained as described previously2. F-actin was stained with Phalloidin-STAR635 (Abberior) and dendrites with microtubule-associated-protein (MAP2)2. STED images of the F-actin nanostructures were acquired on a four-colour Abberior Expert-Line STED microscope (Abberior Instruments), equipped with a ×100 1.4 NA oil objective and using pulsed (40 MHz) excitation (640 nm) and depletion (775 nm) lasers. Fluorescence was detected with an avalanche photodiode (APD) and an ET685/70 (Chroma) fluorescence filter. Pixel size was set to 20 nm.

Animals and stereotaxic injections for the SEM dataset

This study was carried out on three-month-old mice, weighing 25–35 g. Animals were housed under a 12-h light–dark cycle with water and food ad libitum. All procedures were approved by the Comité de Protection des Animaux de l’Université Laval, in accordance with the Canadian Council on Animal Care’s Guide to the Care and Use of Experimental Animals (Ed2), and with the ARRIVE guidelines. Maximum efforts were made to minimize the number of animals used. Transgenic e-Pet Cre mice expressing Cre recombinase under the control of the Fev promoter, known to be specific for serotonin (5-HT) neurons63, were injected in the dorsal raphe nucleus (DRN) with 1 μl of AAV9-CAG-DIO-APEX2NES-WPRE. Stereotaxic injections were performed using a 30° angle along the frontal plane at anteroposterior, −4.78; mediolateral, +2.00; dorsoventral, −3.20. In these injected transgenic mice, the small engineered peroxidase APEX2 (ref. 41) is specifically expressed in the cytosol/cytoplasm of 5-HT-infected neurons of the DRN and is used, in the presence of hydrogen peroxide, to oxidize the DAB chromogen, which is readily visible at the light and electron microscope levels.

Tissue preparation for the SEM dataset

After a period of 21 days following stereotaxic injection, the mice were anaesthetized with a mixture of ketamine (100 mg kg−1) and xylazine (10 mg kg−1) and transcardially perfused with 50 ml of phosphate-buffered saline (PBS; 50 mM at pH 7.4) followed by 150 ml of 4% paraformaldehyde (PFA) and 1% glutaraldehyde diluted in phosphate buffer (PB; 100 mM at pH 7.4). Brains were dissected out, post-fixed for 24 h in the same fixative solution and cut with a vibratome (model VT1200; Leica) into 50-μm-thick frontal sections, which were serially collected in sodium PBS (100 mM, pH 7.4). Frontal brain sections at the level of the subthalamic nucleus (STN) were processed to reveal the presence of APEX2 in axons arising from DRN-infected neurons using DAB (cat. no. D5637, Sigma-Aldrich) as the chromogen. Briefly, selected 50-μm-thick sections were washed three times in PBS and then twice in Tris. Sections were then incubated for 1 h in 0.05% DAB solution diluted in Tris, then for 1 h in 0.05% DAB solution containing 0.015% hydrogen peroxide. The sections were then rinsed twice in Tris and three times in PBS, then temporarily mounted in PBS and coverslipped for light microscope examination. STN sections containing DAB-labelled axons were selected for further processing. These sections were washed three times in phosphate buffer (PB), then incubated for 1 h in 2% osmium tetroxide diluted in 1.5% potassium ferrocyanide solution. They were then washed three times in ddH2O, incubated for 20 min in 1% thiocarbohydrazide (TCH) solution and washed again three times in ddH2O. The sections were placed for 30 min in 2% osmium tetroxide and washed three times in ddH2O, then dehydrated in ethanol and propylene oxide and flat-embedded in Durcupan (Electron Microscopy Science). Areas of interest were cut from embedded sections and glued to the tip of resin blocks. The blocks were cut with an ultramicrotome (Leica EM UC7) into ultrathin sections (80 nm), which were serially collected on a silicon-coated 10 × 10-mm chip wafer (Ted Pella; #16006).

SEM

Serial sections were imaged in a SEM (Zeiss Gemini 540) using the ATLAS acquisition software. Images were acquired at a resolution of 5 nm per pixel, using an acceleration voltage of 1.4 kV and a current of 1.2 nA. Serial section acquisition produced a stack of 38 rectangular images of 25,370 × 25,633 pixels (126.850 × 128.165 μm) from the 38 ultrathin sections. In addition, a single large-section acquisition produced a trapezoidal image of 31,065 pixels for the small base (155.329 μm), 91,393 pixels for the large base (456.967 μm) and 53,161 pixels for the height (265.809 μm). All acquired images were subdivided into overlapping square tiles of 10,240 × 10,240 pixels (51.2 × 51.2 μm).
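For illustration only, tiling such large acquisitions into overlapping squares can be sketched as follows; the 1,024-pixel overlap is an assumed value (the overlap used here is not specified above), and border tiles are clipped to the image boundary:

```python
import numpy as np

def tile_image(image, tile=10240, overlap=1024):
    """Subdivide a large acquisition into overlapping square tiles.

    The tile size matches the 10,240 x 10,240 pixel tiles described above; the
    overlap is an illustrative value.
    """
    step = tile - overlap
    tiles = []
    for y in range(0, max(image.shape[0] - overlap, 1), step):
        for x in range(0, max(image.shape[1] - overlap, 1), step):
            tiles.append(image[y:y + tile, x:x + tile])
    return tiles
```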

Statistical assessment using resampling

Resampling was used as a statistical test to assess the difference between two groups64. Statistical analysis was performed using a randomization test, with the null hypothesis being that the two conditions (A, B) belong to the same distribution. The absolute difference between the mean values of A and B was calculated (Dgt = |μA − μB|). For the randomization test, each value belonging to A and B was randomly reassigned to A′ and B′, with the sizes of A′ and B′ being NA and NB, respectively. The absolute difference between the mean values of A′ and B′ was determined (Drand = |μA′ − μB′|) and the randomization test was repeated 10,000 times. The obtained distribution of Drand was compared with the absolute difference of the means of A and B (Dgt) to test the null hypothesis.
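A minimal sketch of this two-group randomization test is given below (illustrative function name; NumPy assumed):

```python
import numpy as np

def randomization_test(a, b, n_repeats=10000, rng=None):
    """Two-group randomization test on the absolute difference of means.

    Returns the P value, lower-bounded by 1/n_repeats as described in the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    d_gt = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_repeats):
        shuffled = rng.permutation(pooled)
        a_rand, b_rand = shuffled[:len(a)], shuffled[len(a):]
        count += abs(a_rand.mean() - b_rand.mean()) >= d_gt
    return max(count / n_repeats, 1.0 / n_repeats)
```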

When the number of groups was greater than two, an analogous resampling procedure based on the F-statistic was used. The F-statistic was calculated from all groups (A, B, C and so on) as a ground truth (Fgt). Each value was then randomly reassigned to new groups (A′, B′, C′ and so on), where group X′ had the same size as group X. The F-statistic of the newly formed groups (Frand) was calculated, and this process was repeated 10,000 times. We compared the distribution of Frand with Fgt to test the null hypothesis that all groups have the same mean. When the null hypothesis was rejected, that is, when at least one group had a different mean, we compared the groups in a one-to-one manner using the randomization test described above. In all cases, a confidence level of 0.05 was used to reject the null hypothesis. Because the precision of the calculated P value is limited to 1/N, where N is the number of repetitions, we report a P value of <1.0000 × 10−4 instead of 0.
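The multi-group version can be sketched in the same way, here using SciPy's one-way F-statistic (a sketch under the same assumptions, not the exact implementation):

```python
import numpy as np
from scipy.stats import f_oneway

def f_randomization_test(groups, n_repeats=10000, rng=None):
    """Multi-group randomization test on the one-way F-statistic."""
    rng = np.random.default_rng() if rng is None else rng
    groups = [np.asarray(g, dtype=float) for g in groups]
    f_gt = f_oneway(*groups).statistic
    pooled = np.concatenate(groups)
    sizes = [len(g) for g in groups]
    count = 0
    for _ in range(n_repeats):
        shuffled = rng.permutation(pooled)
        splits = np.split(shuffled, np.cumsum(sizes)[:-1])
        count += f_oneway(*splits).statistic >= f_gt
    return max(count / n_repeats, 1.0 / n_repeats)
```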

Evaluation of the time required for fully supervised training

For F-actin, the mean annotation time per crop (50) was calculated using the precisely annotated dataset. For CTC, the mean image annotation time of 900 s was obtained from the precise annotation of each image of the testing set. For the P. vivax dataset, the annotation time for fully supervised annotations was estimated at 2 min per image from the precise annotation of ten images. For EM, the required annotation time was calculated as the average time required by the expert per image (30 min per image, 156 images) to detect all axon DAB markers. We added 14 s (calculated from highlighting the contours of the axon DAB regions on the testing set) for each positive detection (537 detections) to account for precise annotation.
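As a back-of-the-envelope check of the EM estimate above, the numbers quoted work out to roughly 80 h of expert time:

```python
# Estimated fully supervised EM annotation time (illustrative calculation)
detection_time_s = 156 * 30 * 60   # 156 images at 30 min per image
contour_time_s = 537 * 14          # 14 s for each of the 537 positive detections
print((detection_time_s + contour_time_s) / 3600)  # ~80.1 h
```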

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this Article.