Introduction

Macaca fascicularis (Raffles 1821), also known as the ‘long-tailed macaque’ or the ‘cynomolgus macaque’, is a macaque species that is native to Southeast Asia and widely distributed in Malaysia, Thailand, Myanmar, Laos, Cambodia, Vietnam, Indonesia, Timor Leste, and the Philippines1. Although it is one of the predominant macaque species in Malaysia, the genetic structure of its wild populations remains unclear. In Malaysia, most studies have examined the species' distribution2, behaviour3, human-macaque conflict4,5 and association with zoonotic diseases6,7. Only a few genetic studies involving the phylogeography and population genetics of cynomolgus macaques have been conducted thus far. Most of these were based on the maternally inherited mtDNA marker8,9,10, and a few were based on Y-chromosome11 and genomic SSR markers12,13.

Simple sequence repeats (SSRs) are repetitive DNA sequences, generally with motifs 2–6 bp long, that are abundant in eukaryotic genomes. Their codominant and multi-allelic properties are highly valued by geneticists and evolutionary biologists, and SSRs are commonly used as DNA markers in genetic studies. Despite the recent rise of single nucleotide polymorphism (SNP) markers, SSR markers remain relevant in many applications14,15,16,17,18. SSRs can be broadly categorized into genomic SSRs and genic-SSRs, depending on their locations in the genome; those located in transcribed regions are generally known as genic-SSRs. As more SSRs associated with protein-coding genes are found, it is increasingly evident that sequences previously presumed to be junk DNA may play a crucial role in adaptive evolution19. While genic-SSRs are neither as abundant nor as polymorphic as genomic SSRs, they offer several advantages over genomic SSR markers: a higher probability of association with functional genes, a higher degree of transferability across related species, and a lower occurrence of null alleles20,21. Despite their lower level of polymorphism, genic-SSRs have been used successfully in population genetic and evolutionary studies of many species20,21,22,23.

In recent years, advances in sequencing technologies have made whole-genome and transcriptome sequencing of both model and non-model organisms feasible. The massive amount of transcriptome data obtained via RNA sequencing can be used in various applications, from gene identification to comparative functional analysis and differential gene expression. It also serves as an excellent sequence resource for marker development. Transcriptome sequencing coupled with established bioinformatic pipelines has been used effectively for high-throughput identification of genic-SSR markers in various organisms24,25,26. Tools used for SSR mining include the MicroSatellite identification tool (MISA)27, FullSSR28 and the Genome-wide Microsatellite analysing tool package (GMATA)29.

There are fewer reported studies on the development of SSR markers for the cynomolgus macaque compared to Macaca mulatta, another non-human primate model. Hitherto, development of genic-SSR markers from whole transcriptome sequencing data of cynomolgus macaque has yet to be attempted. Therefore, the present study aimed to develop genic-SSR markers from an in-house transcriptome dataset of the Malaysian cynomolgus macaque generated from our previous studies30,31. This study is the first comprehensive report on the development of genic-SSR markers from the transcriptome of cynomolgus macaque. Here, we mined sequences containing SSRs from the transcriptome dataset, designed primers flanking pure di- and trinucleotide SSRs, and identified their associations with functional genes. Some randomly selected markers were further validated. The genic-SSR markers reported in this study are useful for population, functional genomic and comparative mapping studies of cynomolgus macaque and other related species.

Results and Discussion

De novo assembly and functional annotation

De novo assembly of the transcriptome data generated a total of 597,457 contigs with an average length of 400 bp and minimum and maximum lengths of 178 bp and 21,411 bp, respectively. Of these, 356,560 contigs (~60%) had an average coverage of more than 10 reads, and annotation of these contigs revealed 73,880 (~21%) associated with functional genes. Of the 73,880 annotated contigs, 67,399 matched M. fascicularis (GCF_000364345.1) RNA sequences. Subsequent protein sequence similarity searches against the M. mulatta (GCF_000772875.2), Homo sapiens (GRCh38) and SwissProt databases annotated a further 1,461, 742 and 4,278 contigs, respectively.

Identification and classification of genic-SSRs

We identified a total of 14,751 genic-SSRs in this study, reflecting the effectiveness of SSR mining from the transcriptome dataset. Of these, 13,709 (92.94%) were perfect repeats, while complex and compound repeats constituted the remaining 7.07% (Fig. 1). Among the perfect SSRs, dinucleotide repeats were the most abundant (8,918; 65.05%), followed by tri- (2,817; 20.55%), tetra- (1,062; 7.75%), penta- (767; 5.59%) and hexanucleotide (145; 1.06%) repeats. Di- and trinucleotide repeats thus constituted the largest groups of repeat motifs in our dataset, in agreement with results reported for other animal species such as human32, chicken33 and fish34.

Figure 1

Classification of SSR types identified from the transcriptome sequences of M. fascicularis. SSR repeats were categorized into three groups: perfect, compound and complex SSRs.

Among the dinucleotide repeats, AC/GT accounted for the highest proportion (64.03%), while CG/CG repeats were the least frequent (0.29%). Among the ten types of trinucleotide repeats identified, AAC/GTT repeats were the most abundant (23.86%) and ACG/CGT repeats were the least common (~0.1%). The distributions of di- and trinucleotide SSRs according to motif are shown in Fig. 2. For tetra-, penta- and hexanucleotide repeats, the most common motifs were AAAC/GTTT (8.3%), AAAAC/GTTTT (15.5%) and AAAAAC/GTTTTT (8.3%), respectively. An analysis of SSR densities in the human genome found that the dinucleotide repeats AC/GT and AT/AT and the trinucleotide repeats AAC/GTT, AAT/ATT, AAG/CTT and AGG/CCT were the most common in humans32. AC/GT repeats have also been reported as the most common dinucleotide repeat in other organisms, including fish35 and sheep36. CG-rich SSR motifs are very rare in the transcriptome of M. fascicularis, occurring at a frequency of less than 1%, which corroborates the results reported for the genomes of humans32 and other primate species37. CG/CG dinucleotide repeats are rare in vertebrates because cytosine methylation favours the deamination of methylated cytosine to thymine38.

Figure 2

Frequency of di- and trinucleotide repeat motifs in the transcriptome of M. fascicularis.
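The motif classes reported here (for example AC/GT or AAC/GTT) pool a motif with its reverse complement and with all cyclic shifts of both, since these describe the same repeat read from a different start position or strand. The following Python sketch illustrates one way such grouping can be done; the function names `canonical_motif` and `motif_classes` are hypothetical and are not taken from MISA or from the pipeline used in this study.

```python
from itertools import product

# Translation table for complementing DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_motif(motif):
    """Return one representative for a motif class.

    The class contains every cyclic shift of the motif and of its reverse
    complement; the alphabetically smallest member is used as the label
    (an arbitrary but convenient convention assumed for this sketch).
    """
    motif = motif.upper()
    revcomp = motif.translate(_COMPLEMENT)[::-1]
    members = []
    for strand in (motif, revcomp):
        for i in range(len(strand)):
            members.append(strand[i:] + strand[:i])
    return min(members)


def motif_classes(size):
    """List the distinct motif classes for a given motif length."""
    classes = set()
    for letters in product("ACGT", repeat=size):
        motif = "".join(letters)
        # Skip motifs that are repeats of a shorter unit (e.g. "AA", "ACAC").
        if motif in (motif + motif)[1:-1]:
            continue
        classes.add(canonical_motif(motif))
    return sorted(classes)
```

Under this convention, `canonical_motif("GT")` and `canonical_motif("CA")` both map to `"AC"`, and enumerating all trinucleotides yields ten classes, which matches the ten types of trinucleotide repeats described in the text.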

Functional annotation of SSR loci, primer development and screening

Of the 300 SSR loci used for primer design in this study, 105 were associated with genes involved in specific biological processes, cellular components and/or molecular functions. The complete list of these 300 SSRs and their predicted functions is provided in Supplementary Table S1. Of the 30 SSR markers tested, 20 (66.67%) reproducibly produced clear amplicons of the expected sizes across all samples. Nineteen of these 20 markers (Table 1) were polymorphic; thus, more than 60% (19 out of 30) of the markers screened in this study were polymorphic. Alignment of the sequences obtained from the PCR amplicons with the contig sequences used to design the primers also verified successful amplification of the targeted DNA regions.

Table 1 Twenty validated genic-SSR markers for M. fascicularis.

Data analysis

Genetic diversity assessment was performed using the 19 polymorphic markers (Table 1) amplified across 26 M. fascicularis individuals, which were divided into West Coast and East Coast populations. Heterozygosity was assessed for each population (Supplementary Table S2); the West Coast and East Coast populations showed similar mean HE values of 0.481 and 0.484, respectively. The mean NA was 3.316 for the West Coast population and 2.684 for the East Coast population. Across all 26 individuals, NA, HO, HE and PIC ranged from 2 to 6, 0 to 1, 0.125 to 0.713, and 0.110 to 0.653, respectively (Supplementary Table S3). The overall mean NA, HO, HE and PIC were 3.630, 0.269, 0.495 and 0.431, respectively. F-statistics calculated from the 19 polymorphic loci gave a mean FST of 0.059. Three of the 19 loci (MF121, MF259 and MF272) were the most polymorphic, with six alleles each. Seven of the 19 polymorphic SSR loci had PIC values of >0.5 and were therefore considered highly informative39. Compared with previous population studies12,13 that used genomic SSR markers, the genic-SSR markers used in this study yielded lower NA, HO, HE and PIC values for the same species. Because the SSR markers developed in our study were derived from the transcriptome, their genetic diversity was expected to be lower than that of SSR markers derived from genomic DNA regions40. The lower values could also be attributable to the smaller number of samples (n = 26) and sampling sites in the current work compared with those studies12,13. Nonetheless, the identification of 19 polymorphic SSRs among 30 markers screened in 26 individuals is promising. We expect that higher NA, HO, HE and PIC values would be obtained with more samples.

The average PIC value of 0.431 for the 19 polymorphic loci validated in this study was comparable to those of genic-SSRs developed for the Korean quail (mean PIC value = 0.494)26 and crab (mean PIC value = 0.49)24. Although not all of the 19 genic-SSR markers showed high polymorphism and PIC values, all showed the reproducibility and specificity desired for PCR-based genotyping.

Very few studies have reported the development of SSR markers for the cynomolgus macaque. The first was reported in 2007 by Kikuchi et al.41, who cross-amplified 148 SSR markers selected from a human genome database and found 66 (44%) to be polymorphic in the cynomolgus macaque. Later, Higashino et al.42 identified an additional 499 polymorphic SSR markers from a BAC library of M. fascicularis and used them to analyse the genetic polymorphisms of cynomolgus macaques originating from Indonesia, the Philippines and Malaysia. In both studies, the SSR markers were derived from genomic DNA regions. The polymorphic genic-SSR markers identified in this study complement the existing SSR markers and provide additional markers for investigating the genetic structure of wild macaque populations.

Materials and Methods

Ethical clearance

The usage of M. fascicularis samples in this investigation complied with the animal care regulations and all relevant national laws of Malaysia. Sampling protocols were approved by the Institutional Animal Care and Use Committee (IACUC), University of California, Davis, USA as adopted by the PREDICT Programme in Malaysia, under which the Department of Wildlife and National Park (DWNP) Malaysia is working collaboratively with the EcoHealth Alliance, the Ministry of Health Malaysia, and the Veterinary Services Department, Malaysia.

De novo assembly and functional annotation

A transcriptome dataset was generated in a previous RNA sequencing project on M. fascicularis liver, kidney, lymph node, spleen and thymus tissues30,31. We subjected the raw sequencing reads to quality assessment using FASTQC v0.11.2. Sequences from the Illumina co-sequencing positive control (PhiX) were filtered out, and the cleaned reads were subjected to base quality checking (Q ≥ 30). De novo sequence assembly was performed using CLC Genomics Workbench version 8.5.1 (CLC Bio-Qiagen, Aarhus, Denmark). We annotated the assembled contigs (average coverage ≥ 10 reads) by sequence similarity searches with BLAST+ version 2.2.31+43, using Blastn against a database built from M. fascicularis (GCF_000364345.1) RNA sequences from the NCBI RefSeq database. Contigs with no match to M. fascicularis RNA sequences were then searched against a database built from Macaca mulatta (GCF_000772875.2) protein sequences using the Blastp program. Sequences with no match to M. mulatta protein sequences were then searched against a database built from Homo sapiens (GRCh38) protein sequences. Contigs that remained unmatched were further examined by protein similarity search against the SwissProt database.
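The tiered annotation described above, in which each contig is passed to the next database only if it had no match in the previous one, can be summarised with the following Python sketch. It assumes that the BLAST searches have already been run and reduced to sets of contig identifiers with hits; the function name `annotate_in_tiers` and the tier labels are hypothetical and serve only to illustrate the bookkeeping.

```python
def annotate_in_tiers(contig_ids, tiers):
    """Assign each contig to the first database tier that reports a hit.

    contig_ids: iterable of contig identifiers to annotate.
    tiers: ordered list of (label, hits) pairs, where `hits` is the set of
           contig identifiers that matched that database.
    Returns (assigned, unmatched): a dict mapping contig -> tier label and
    a list of contigs with no match in any tier.
    """
    assigned = {}
    remaining = list(contig_ids)
    for label, hits in tiers:
        still_unmatched = []
        for contig in remaining:
            if contig in hits:
                assigned[contig] = label
            else:
                # Only unmatched contigs are passed on to the next tier.
                still_unmatched.append(contig)
        remaining = still_unmatched
    return assigned, remaining
```

Because every contig is assigned to at most one tier, the per-tier counts are additive. This is consistent with the Results, where the 67,399, 1,461, 742 and 4,278 contigs annotated at the four tiers sum to the 73,880 annotated contigs.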

Identification and classification of genic-SSRs

Genic-SSR identification and classification were performed on the filtered contigs (average coverage ≥ 10 reads) using the MIcroSAtellite identification tool (MISA)44. The minimum numbers of repeats for di-, tri-, tetra-, penta- and hexanucleotides were set at six, five, five, four and four, respectively. SSRs were categorized as perfect, compound or complex as follows. Perfect: a single repeat of n units. Compound perfect: two or more alternate tandem repeats of n units each. Complex: repeats that vary in motif by a single unit, alternate repeat motifs interspersed within a single region, or two simple perfect motifs separated by nonrepeating sequences of variable length.
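To make the repeat thresholds concrete, the following Python sketch detects perfect SSRs with a regular expression using the minimum repeat numbers given above. It is a simplified illustration and not MISA itself: it does not classify compound or complex repeats, and the function names `is_primitive` and `find_perfect_ssrs` are hypothetical.

```python
import re

# Minimum repeat counts per motif length, as stated in the text
# (di: 6, tri: 5, tetra: 5, penta: 4, hexa: 4).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 4, 6: 4}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit."""
    return motif not in (motif + motif)[1:-1]


def find_perfect_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least (min_rep - 1)
        # further tandem copies of the same motif.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            # Skip homopolymers and motifs such as "ACAC", which would
            # otherwise report a dinucleotide repeat a second time.
            if not is_primitive(motif):
                continue
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)
```

For example, a sequence containing (AC)7 and (AAG)5 yields one dinucleotide and one trinucleotide hit, whereas (AC)5 alone is not reported because it falls below the six-repeat threshold for dinucleotides.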

Primer design

Contig sequences containing SSRs identified from the transcriptome dataset were used for primer development with Primer3 software45. For primer design, we focused on candidate sequences containing perfect di- or trinucleotide SSRs with repeat numbers ≥10 and only one SSR present in the contig. All contig sequences used for primer design were checked against genomic sequences to predict the locations of introns. Three hundred SSR primer pairs were designed. All contigs used for SSR primer design were checked for functional annotation using a cut-off value of E < 1e−15.
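The candidate selection criteria above (a single SSR per contig, perfect di- or trinucleotide motif, at least ten repeats) amount to a simple filter, sketched below in Python. The input layout, a dictionary mapping each contig to a list of `(ssr_type, motif, repeats)` tuples, and the function name `select_primer_candidates` are assumptions made for illustration; the intron check and the Primer3 step are not modelled.

```python
def select_primer_candidates(ssrs_by_contig, min_repeats=10):
    """Keep contigs that carry exactly one SSR meeting the design criteria.

    ssrs_by_contig: dict mapping contig id -> list of
                    (ssr_type, motif, repeats) tuples, where ssr_type is
                    "perfect", "compound" or "complex".
    Returns a dict mapping contig id -> (motif, repeats).
    """
    selected = {}
    for contig_id, ssrs in ssrs_by_contig.items():
        # Only contigs with a single SSR are considered.
        if len(ssrs) != 1:
            continue
        ssr_type, motif, repeats = ssrs[0]
        if (
            ssr_type == "perfect"
            and len(motif) in (2, 3)      # di- or trinucleotide motif
            and repeats >= min_repeats    # repeat number >= 10
        ):
            selected[contig_id] = (motif, repeats)
    return selected
```

Contigs passing this filter would then be handed to the primer design software; contigs with several SSRs, longer motifs, compound or complex repeats, or fewer than ten repeats are dropped.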

Sampling, DNA extraction, PCR amplification, and electrophoresis

Thirty genic-SSR primer pairs, selected randomly from the 300 pairs designed, were used for initial screening on DNA samples from 26 M. fascicularis individuals. Primers were selected randomly from among those with self- and cross-primer complementarity values of less than 3, a low tendency to form secondary structures and a 3′-complementarity value of less than 3. To test the robustness of the markers, samples were obtained from nine states in Peninsular Malaysia (Table 2) with the permission and collaboration of DWNP. Three samples were obtained from each state except Terengganu (two samples). Genomic DNA from 24 M. fascicularis individuals was provided as extracted DNA. The remaining two DNA samples were isolated from liver tissue provided by DWNP using the QIAamp DNA Mini Kit (Qiagen, Germany) according to the manufacturer’s protocol.

Table 2 GPS locations of the M. fascicularis samples used.

PCR was performed in 10 µl reaction volumes containing 10 ng of genomic DNA, 2.0 µM of each primer, 1× PCR buffer, 2.5 mM MgCl2, 0.2 mM dNTPs and 1 U Taq polymerase (Promega, USA) in a T100 thermal cycler (Bio-Rad, USA). A gradient PCR protocol was used under the following conditions: a single cycle of initial denaturation at 95 °C for 5 minutes; 35 cycles of denaturation at 95 °C for 1 minute, annealing at X °C for 30 seconds and extension at 72 °C for 5 minutes; and a final extension at 72 °C for 5 minutes. X denotes the annealing temperature (Ta) used for each primer pair (Table 3). Primer pairs that failed to amplify or produced multiple bands were further tested using touchdown PCR with 1 °C decrements starting from 60 °C. PCR products were separated on 2.0% agarose gels and 8.0% non-denaturing polyacrylamide gels stained with EtBr (0.5 µg/ml). To confirm the presence of the targeted SSRs in the amplified products, PCR products of the expected fragment sizes were sequenced on an ABI 3730 through services provided by First Base Sdn. Bhd. (Seri Kembangan, Malaysia). The sequences obtained were compared with the contig sequences from which the primers were designed, and the targeted SSR repeats were identified.

Table 3 The 30 genic-SSR primers used in preliminary screenings on 26 macaque DNA samples.

Genetic diversity analysis

The 26 macaque samples from the nine states in Peninsular Malaysia were arbitrarily divided into two populations, East Coast and West Coast, on the basis that the Titiwangsa Range is a potential geographical barrier to gene flow between populations on the two coasts. The East Coast population comprised eight individuals from three states (Kelantan, Terengganu and Pahang), while the West Coast population comprised 18 individuals from six states (Perlis, Kedah, Pulau Pinang, Perak, Selangor and Negeri Sembilan). SSR banding patterns were analysed with PopGene version 1.3246 and Cervus version 3.0.747 to calculate the number of alleles (NA), observed heterozygosity (HO), expected heterozygosity (HE), fixation index (FST) and polymorphic information content (PIC).
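For readers unfamiliar with these statistics, the Python sketch below computes NA, HO, HE and PIC for a single locus from diploid genotypes using their textbook definitions: HE = 1 − Σp_i², and PIC = 1 − Σp_i² − Σ_{i<j} 2p_i²p_j², where p_i is the frequency of allele i. This is an illustration only and not the PopGene or Cervus implementation; those programs may apply small-sample corrections, so values can differ slightly. The function name `locus_stats` and the genotype layout are assumptions.

```python
from collections import Counter


def locus_stats(genotypes):
    """Compute (NA, HO, HE, PIC) for one locus.

    genotypes: list of (allele_1, allele_2) tuples, one per individual,
               with individuals lacking data already removed.
    """
    n = len(genotypes)
    allele_counts = Counter(a for pair in genotypes for a in pair)
    freqs = [count / (2 * n) for count in allele_counts.values()]

    na = len(freqs)                                  # number of alleles
    ho = sum(1 for a, b in genotypes if a != b) / n  # observed heterozygosity
    sum_sq = sum(p * p for p in freqs)
    he = 1 - sum_sq                                  # expected heterozygosity
    # Cross term of the PIC formula, summed over allele pairs i < j.
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(na)
        for j in range(i + 1, na)
    )
    pic = 1 - sum_sq - cross
    return na, ho, he, pic
```

As a check on the formulas, a locus with two alleles at equal frequency gives HE = 0.5 and PIC = 0.375, the maximum PIC for a biallelic marker; loci can therefore exceed the PIC > 0.5 threshold for highly informative markers only if they carry three or more alleles.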