In silico EST-SSR Identification and Development through EST Sequences from Metroxylon sagu Rottb. for Genetic Diversity Analysis

The sago palm (Metroxylon sagu Rottb.) is one of the highest carbohydrate-producing plants in the world. Microsatellites, or simple sequence repeats (SSRs), play an important role in the genome and are used more extensively than other molecular markers. For the first time, we exploit expressed sequence tag (EST) data of the sago palm to identify and characterise markers in this species. Sago EST data were obtained from the EST database on the National Center for Biotechnology Information (NCBI) website. We obtained 458 Kb of data (412 contigs) with maximum and minimum lengths of 1,138 and 124 nucleotides, respectively. We identified 820 perfect SSRs using Phobos 3.3.12 software. The EST-SSRs were dominated by tri-nucleotides (36%; 294), followed by hexa-nucleotides (24%; 202), tetra-nucleotides (15%; 120), penta-nucleotides (13%; 108) and di-nucleotides (12%; 96). The most frequent di-, tri- and tetra-nucleotide motifs were AG, AAG and AAAG, respectively. Synteny analysis of the EST sequences with the online application Phytozome showed that the sequences were distributed across the 12 Oryza sativa chromosomes, with similarity between 63% and 100% and e-values between 0 and 0.094. We designed 19 primer pairs and validated 7 of them, all of which generated polymorphic alleles. To our knowledge, this report is the first identification and characterisation of EST-SSRs for sago, and these markers can be used for genetic diversity analysis, marker-assisted selection (MAS), cultivar identification, kinship analysis and genetic mapping analysis.


INTRODUCTION
Genetic diversity among individuals or populations is the basis of adaptation and evolution and thus plays a major role in dealing with different biotic and abiotic pressures. Diverse genetic resources provide better opportunities for plant breeders to create new, better cultivars with desired traits (Salgotra & Chauhan 2023). The study of genetic diversity in plant species is very useful for the development of breeding programmes and for conservation purposes. To assess genetic variation within and between populations, morphological characterisation, biochemical markers and molecular markers are often used (Mondini et al. 2009). Variation between populations is difficult to assess using morphological characters because morphology varies under different plant growth conditions (D'Imperio et al. 2011). DNA markers are critical for assessing genetic diversity between and within different plant species (Amiteye 2021; Hailu & Asfere 2020), because they highlight differences in nucleotide sequences between individuals and are not affected by environmental factors (Aslanbay Guler & Imamoglu 2023).
Simple sequence repeat (SSR) genetic markers are currently widely used for genetic diversity analysis, cultivar identification, pedigree analysis, genetic mapping analysis and marker-assisted selection (MAS). SSR markers, also known as microsatellite markers, are codominant and highly polymorphic, so they show a high level of allele diversity, and the assay is very efficient because it is based on the polymerase chain reaction (PCR) (Molla et al. 2010). Therefore, SSR markers can detect diversity among closely related plant accessions better than other molecular markers (Kumar et al. 2009).
SSR markers can be developed through genomic and genic approaches. Genomic SSR markers (g-SSR) are developed by identifying SSRs in genomic sequences, while genic SSR markers (EST-SSR) are developed from expressed sequence tag (EST) sequences (Jain et al. 2014). The development of SSR markers using traditional methods requires a lot of time, money and laboratory work (Kale et al. 2012). Developing SSR markers from genomic libraries is time-consuming and requires extensive laboratory infrastructure. An alternative and more effective approach is to search for SSRs in silico in EST databases published at NCBI. This method has been widely used in molecular studies on various plants (Purwoko et al. 2021; Priyanka et al. 2017; Vieira et al. 2016; Jain et al. 2014; Aberlenc-Bertossi et al. 2014; Duran et al. 2013).
Sago (Metroxylon sagu Rottb.) is one of the palm plants that produce starch. Sago palms can accumulate 200 kg to 220 kg of starch per tree in the trunk (Jong 1995) and are among the highest carbohydrate-producing plants in the world (Flach 1995). The utilisation of sago depends heavily on the potential of the available sago resources. Uncontrolled exploitation of sago to meet the continually increasing demand for food, industrial raw materials and energy threatens productive sago palm species with extinction. One way to protect Indonesian germplasm, especially sago palms, is to conduct an inventory and characterisation both phenotypically and genotypically. Genetic markers are known to have an important role in uncovering and studying plant diversity and population genetics, using techniques that detect genetic variability between individuals, populations and species. Knowledge of genetic variability is a prerequisite for studying the evolutionary history of a species and also for breeding programmes and conservation of plant genetic resources. Data on genetic diversity are needed to protect sago palms and their genetic components, which are thought to be native to Indonesia, from being exploited by other countries.
Recently, SSR markers have been developed using partial genome data to study the genetic diversity of sago palms (Purwoko et al. 2019), but the development of SSRs using EST data for sago palms has not been fully studied. Sago EST data have been generated and published in a publicly accessible database, offering the opportunity to create EST-SSR markers in silico. This approach can be used to design specific primers at specific loci that represent functional genes or coding regions. Developing SSR markers from EST sequences has several advantages over genomic sequences: EST-SSRs represent functional components of the genome, can be transferred between species, and can be used to search for and map genes. Identification of SSRs through both approaches has been widely carried out in palms such as oil palm and date palm. So far, the identification of SSRs in sago palm with the genic approach has not been reported. This is the first report of SSR analysis of sago palms using the genic approach (EST-SSR).

EST Sequence Source, SSR Analysis and Functional Characterisation
The EST sequences of sago palm were obtained from the NCBI database (http://ncbi.nlm.nih.gov/) under accession numbers JK731189-JK731600, which are ESTs from young leaves of the sago palm. The downloaded EST sequences were then uploaded to the EGassembler website (https://www.genome.jp/tools/egassembler/), which cleans sequences, removes vector contamination and assembles contig sequences (Nejad et al. 2006). Sequence processing was carried out using the standard parameters suggested by the site. The contig sequences resulting from assembly and processing were then downloaded in FASTA format for SSR analysis, synteny analysis and functional gene analysis.
SSR analysis was performed using Phobos 3.3.12 software (Mayer et al. 2010) to detect loci with di-, tri-, tetra-, penta- and hexa-nucleotide motifs. Synteny analysis was carried out in the Phytozome online application using the BLASTn programme. EST sequences in which SSRs were detected and selected (SSR length of at least 20 bp) were compared against Oryza sativa chromosome data. The EST sequences containing SSR motifs were then searched for putative genes by comparison with the non-redundant Arabidopsis protein database at The Arabidopsis Information Resource (TAIR) (http://www.arabidopsis.org/index.jsp) using the BLASTx programme with an e-value limit of 1e-3. Gene ontology (GO) mapping was used to retrieve annotations for the top BLAST hits (Gotz et al. 2008). GO mapping was followed by GO annotation, which assigns functional annotations to the query sequences (Gotz et al. 2008). The parameters used for GO annotation were an annotation cut-off of 55, a GO weight of 5, a Hit-filter e-value of 1.0e-6 and an HSP-Hit coverage cut-off of 0. The GO analysis results were visualised using the web-based REVIGO application (http://revigo.irb.hr/Results.aspx?jobid=738236493) (Supek et al. 2011).
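As a rough illustration of what a search for perfect SSRs involves, the following Python sketch scans a sequence for perfect di- to hexa-nucleotide repeats with a regular expression. It is not the Phobos algorithm; the function name `find_perfect_ssrs` and its parameters are hypothetical, and the 20 bp default is simply borrowed from the selection threshold mentioned above.

```python
import re


def find_perfect_ssrs(seq, min_unit=2, max_unit=6, min_length=20):
    """Return (start, unit, repeat_count) tuples for perfect SSRs in seq.

    Illustrative regex scan only; this is NOT how Phobos works internally.
    """
    seq = seq.upper()
    hits = []
    for unit_len in range(min_unit, max_unit + 1):
        # One unit of unit_len bases followed by one or more exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1+" % unit_len)
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are themselves repeats of a shorter motif
            # (for example AGAG is really the di-nucleotide AG).
            if any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            length = match.end() - match.start()
            if length >= min_length:
                hits.append((match.start(), unit, length // unit_len))
    return hits


if __name__ == "__main__":
    # Hypothetical test sequence: ten AG units flanked by unrelated bases.
    print(find_perfect_ssrs("TTT" + "AG" * 10 + "CCC"))
```

A dedicated tool remains preferable for real data, since it also handles imperfect repeats, overlapping motifs and reverse-complement motif classes, none of which this sketch attempts.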

Total DNA Isolation, Primer Design and Validation
After the motif and synteny analysis results were obtained, primers were designed using Primer3 1.1.4 software (Untergasser et al. 2012). The parameters used for primer design are presented in Table 1. Total DNA was isolated from sago leaf samples using a modified cetyltrimethylammonium bromide (CTAB) method for DNA isolation from palm leaves (Purwoko et al. 2019; Maskromo et al. 2016; Novero et al. 2012; Pesik et al. 2015; 2017; Tinche et al. 2014). DNA quality and concentration were checked by electrophoresis on a 1% agarose gel. Each PCR had a total volume of 25 µL, consisting of GoTaq Green Master Mix (12.5 µL), forward primer (2 µL), reverse primer (2 µL), DNA template (2 µL) and sterile H2O to a final volume of 25 µL. A Takara PCR Thermal Cycler Dice® (http://catalog.takarabio.co.jp/product/basic_info.php?unitid=U100004192) was used for amplification of the SSR markers. The PCR programme was as follows: pre-denaturation at 95°C for 3 min; 35 cycles of denaturation at 95°C for 30 s, annealing at Tm - 5°C for 30 s and extension at 72°C for 30 s; a final extension at 72°C for 60 s; and a hold at 4°C. The PCR products were evaluated by gel electrophoresis in 1% agarose and visualised with SYBR Safe dye (Invitrogen). Amplified products of the expected band size were further separated on an 8% MetaPhor agarose gel.
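The reaction composition and the annealing rule described above reduce to simple arithmetic. The sketch below uses hypothetical helper names (`pcr_mix`, `annealing_temp`) to compute the volume of sterile water needed to reach 25 µL and the annealing temperature as Tm minus 5°C; the 60°C Tm in the usage lines is only an example value.

```python
def pcr_mix(total_ul=25.0, master_mix_ul=12.5, fwd_ul=2.0, rev_ul=2.0, template_ul=2.0):
    """Return per-reaction volumes (µL), with water filling up to total_ul."""
    water_ul = total_ul - (master_mix_ul + fwd_ul + rev_ul + template_ul)
    if water_ul < 0:
        raise ValueError("components exceed the total reaction volume")
    return {
        "master_mix_ul": master_mix_ul,
        "forward_primer_ul": fwd_ul,
        "reverse_primer_ul": rev_ul,
        "template_ul": template_ul,
        "water_ul": water_ul,
    }


def annealing_temp(tm_c, offset_c=5.0):
    """Annealing temperature taken as primer Tm minus a fixed offset (°C)."""
    return tm_c - offset_c


if __name__ == "__main__":
    print(pcr_mix()["water_ul"])   # water per 25 µL reaction
    print(annealing_temp(60.0))    # example Tm of 60°C (assumed value)
```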

Genetic Analysis
A dissimilarity matrix was computed from the diploid allelic data using the simple matching dissimilarity index, with bootstrap analysis of 10,000 iterations. For the principal coordinate analysis (PCoA) of the dissimilarity matrix, the default axes selected by the software were retained rather than using the option to alter the 13 axes. Trees were constructed from the computed dissimilarity matrix using the unrooted weighted neighbour-joining method. The dissimilarity matrix, bootstrapping, PCoA and tree construction for the sago palm accessions were all carried out in Dissimilarity Analysis and Representation for WINDOWS (DARwin) software version 6.05 (Perrier & Jacquemoud-Collet 2006; http://darwin.cirad.fr/darwin).
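To make the dissimilarity index concrete, the sketch below implements one common formulation of simple matching for allelic data: one minus the mean, over loci, of the number of shared alleles divided by the ploidy. It is an assumption that this matches the exact option used in DARwin; the function name and the genotype encoding (one tuple of allele sizes per locus, `None` for missing data) are hypothetical.

```python
from collections import Counter


def simple_matching_dissimilarity(geno_a, geno_b, ploidy=2):
    """Simple matching dissimilarity between two multilocus genotypes.

    geno_a, geno_b: lists with one tuple of alleles per locus
    (None marks a missing locus, which is skipped).
    """
    if len(geno_a) != len(geno_b):
        raise ValueError("genotypes must cover the same loci")
    matched = 0.0
    n_loci = 0
    for alleles_a, alleles_b in zip(geno_a, geno_b):
        if alleles_a is None or alleles_b is None:
            continue
        # Multiset intersection counts alleles shared between the two individuals.
        shared = sum((Counter(alleles_a) & Counter(alleles_b)).values())
        matched += shared / ploidy
        n_loci += 1
    if n_loci == 0:
        raise ValueError("no loci scored in both genotypes")
    return 1.0 - matched / n_loci


if __name__ == "__main__":
    # Hypothetical allele sizes (bp) at two loci for two accessions.
    acc_1 = [(150, 150), (200, 204)]
    acc_2 = [(150, 154), (200, 204)]
    print(simple_matching_dissimilarity(acc_1, acc_2))
```

Applying the function to every pair of accessions gives the symmetric matrix on which the PCoA and neighbour-joining steps operate.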

RESULTS
The EST sequences had a total length of 412,716 bp (412 contigs), with maximum and minimum lengths of 1,138 and 124 nucleotides, respectively. Sequence processing with the web-based EGassembler application yielded 412 clean EST sequences free of vector contamination. The nucleotide frequencies of the EST sequences were A: 115,701 (28%), C: 102,089 (25%), G: 83,606 (20%) and T: 111,319 (27%), with a GC content of 185,695 (45%) (Fig. 2).
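The percentages above can be reproduced directly from the reported base counts. The short sketch below does so; the counts are taken from the text, while the function names are hypothetical.

```python
# Base counts as reported for the sago EST sequences.
COUNTS = {"A": 115_701, "C": 102_089, "G": 83_606, "T": 111_319}


def base_percentages(counts):
    """Percentage of each base relative to the sum of all counted bases."""
    total = sum(counts.values())
    return {base: 100.0 * n / total for base, n in counts.items()}


def gc_percent(counts):
    """GC content as a percentage of all counted bases."""
    total = sum(counts.values())
    return 100.0 * (counts["G"] + counts["C"]) / total


if __name__ == "__main__":
    for base, pct in base_percentages(COUNTS).items():
        print(f"{base}: {pct:.1f}%")
    print(f"GC: {gc_percent(COUNTS):.1f}%")
```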

Synteny Analysis
Synteny analysis was carried out in the Phytozome online application using the BLASTn programme. The EST sequences in which SSRs were detected and selected were compared against Oryza sativa data.
The analysis showed that the sago EST sequences were spread over the 12 rice chromosomes, with similarity between 63% and 100% and e-values ranging from 0 to 0.094 (Table 2). The 15 sequences for which primers were successfully designed were likewise spread over the 12 rice chromosomes, with similarity between 64% and 100% and e-values ranging from 2.00E-19 to 0.094 (Table 2). For one sequence, EJK731303, a primer was successfully synthesised although its synteny is not known.

Primer Design for EST-SSR Markers
Of the 412 EST sequences with perfect SSR motifs, 15 sequences were selected. A total of 20 SSR motifs from these 15 sequences allowed primer design, namely 1 di-nucleotide, 7 tri-nucleotides, 1 penta-nucleotide and 11 hexa-nucleotides. Primer pairs could be synthesised for only 19 of the 20 SSR motifs (Table 2). The primers were designed with the following criteria: optimum primer size 20 bp, melting temperature (Tm) 55°C-60°C and GC content 45%-60%.
The shortest designed product is 184 bp and the longest is 498 bp, with a Tm of about 60°C and a GC content of 45%-55%.
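The design criteria above can be checked for any candidate primer with a few lines of Python. In the sketch below, Tm is estimated with the Wallace rule, Tm = 2(A+T) + 4(G+C), which is a deliberate simplification and not the thermodynamic model used by Primer3; the function names and the example primer are hypothetical.

```python
def gc_content(primer):
    """GC content of a primer as a percentage."""
    primer = primer.upper()
    return 100.0 * sum(primer.count(b) for b in "GC") / len(primer)


def wallace_tm(primer):
    """Rough Tm estimate (°C) by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def meets_criteria(primer, tm_range=(55.0, 60.0), gc_range=(45.0, 60.0)):
    """True if the primer falls inside the Tm and GC windows given in the text."""
    tm_ok = tm_range[0] <= wallace_tm(primer) <= tm_range[1]
    gc_ok = gc_range[0] <= gc_content(primer) <= gc_range[1]
    return tm_ok and gc_ok


if __name__ == "__main__":
    candidate = "ATGCATGCATGCATGCATGC"  # hypothetical 20-mer, not a primer from Table 2
    print(gc_content(candidate), wallace_tm(candidate), meets_criteria(candidate))
```

Because the Wallace rule ignores salt concentration and nearest-neighbour effects, values from Primer3 itself will differ somewhat from this estimate.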

Sequence Annotations Containing SSR
The EST sequences from which primers were designed were then analysed using BLASTx with an e-value threshold of 0.00001 against the NCBI-nr database, followed by the TAIR database (Table 3). Of the 15 sequences analysed, 14 were assigned a gene ontology and successfully mapped; only 1 sequence had no BLASTx hit (Table 4). The 10 plant species with the highest hit frequency are shown in Fig. 5. These results show significant homology with seven monocots (Elaeis guineensis, Phoenix dactylifera, Oryza sativa japonica, Musa acuminata, Zea mays, O. brachyantha and Brachypodium distachyon) and three dicots (Medicago truncatula, Glycine max and Solanum pennellii).
The BLASTx programme on TAIR was used to search for gene annotations. The results of gene ontology annotation and the functional categories based on locus identification are shown in Fig. 6. Based on the sequence annotations, a total of 290 gene ontologies were determined and distributed into three categories: molecular functions (71), biological processes (210) and cellular components (53). Molecular function is dominated by the nucleotide-binding subcategory (about 19.7%), biological process by the metabolic process subcategory (about 62.7%) and cellular component by the membrane subcategory (about 62.7%).

Genetic Relationship and Cluster Analysis
In this study, 19 primer pairs were designed from 15 selected sequences containing SSR motifs. Of these, 7 class I primer pairs were synthesised (Table 2) and used for validation and polymorphism assessment among 2 accessions of M. sagu (B1 and C4); all 7 amplified and all 7 were polymorphic (Fig. 7). The 7 polymorphic SSR markers yielded 21 alleles in M. sagu, an average of 3 alleles per locus. PIC values ranged from 0.132 (primers EJK 731600-1 and EJK 731391-2) to 0.580 (EJK 731455), with a mean of 0.315. The highest (0.680) and lowest (0.148) expected heterozygosity (He) values were obtained with primer EJK 731455 and with primers EJK 731600-1 and EJK 731391-2, respectively, with a mean of 0.372. Observed heterozygosity (Ho) ranged from 0.154 to 0.769, with a mean of 0.341 (Table 5). An unrooted weighted neighbour-joining cluster analysis was performed in DARwin software to measure genetic diversity and relationships among the 13 accessions, which grouped into three large clusters.
Cluster II consists of four accessions: S1, SG09, C4 and B1. Cluster III consists of four accessions: SG04, SG05, SG06 and SG07 (Fig. 8). To the authors' knowledge, this is the first report of a dendrogram grouping accessions of M. sagu based on their response to EST-SSR markers. In a previous study, Purwoko et al. (2019) also succeeded in grouping accessions of M. sagu from various islands in Indonesia. The PCoA results (Fig. 9) present a two-dimensional graphical view of the genetic diversity of the 13 sago palm accessions originating from four regions in Indonesia. The results observed in the PCoA were in agreement with the cluster analysis.
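The PIC and expected heterozygosity statistics reported above are functions of the allele frequencies at each locus. The sketch below uses the standard definitions, He = 1 - Σp_i² and PIC = 1 - Σp_i² - Σ_{i<j} 2p_i²p_j²; it is an assumption that the values in Table 5 were computed with these exact formulas, and the function names and example frequencies are hypothetical.

```python
from itertools import combinations


def _check(freqs, tol=1e-9):
    """Raise if the allele frequencies do not sum to 1."""
    if abs(sum(freqs) - 1.0) > tol:
        raise ValueError("allele frequencies must sum to 1")


def expected_heterozygosity(freqs):
    """He = 1 - sum of squared allele frequencies."""
    _check(freqs)
    return 1.0 - sum(p * p for p in freqs)


def pic(freqs):
    """Polymorphism information content: He minus the sum of 2*p_i^2*p_j^2 over allele pairs."""
    _check(freqs)
    cross = sum(2.0 * (p * p) * (q * q) for p, q in combinations(freqs, 2))
    return expected_heterozygosity(freqs) - cross


if __name__ == "__main__":
    # Hypothetical locus with three alleles.
    freqs = [0.5, 0.3, 0.2]
    print(round(expected_heterozygosity(freqs), 3), round(pic(freqs), 3))
```

PIC is never larger than He for the same locus, which is consistent with the ordering of the mean values reported above.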

DISCUSSION
Publicly available EST data have proven useful in the identification and development of SSR molecular markers. EST sequences play a useful role in marker development, transcriptomic profiling, proteomic research and gene discovery (Haq et al. 2021). EST-SSR markers have the advantage that they lie in candidate genes and can therefore yield molecular markers associated with particular traits (Kalia et al. 2011). According to Haq et al. (2014) and Singh et al. (2019), the EST-SSR marker is itself a functional molecular marker that characterises "a presumed function or a particular gene encoding enzymatic activity" and aids numerous genomic applications in plants. According to Pashley et al. (2006), Ellis and Burke (2007) and Haq et al. (2014), SSRs in expressed regions, or ESTs, are more conserved, more informative and more transferable across taxonomic boundaries than anonymous SSRs. EST-SSR is the preferred molecular marker in numerous analyses of plant genomes, including the evaluation of genetic polymorphism, genetic diversity, population genetics, biodiversity, high-resolution genetic maps, gene mapping, quantitative trait loci, germplasm characterisation, cultivar identification, paternity analysis, marker-assisted breeding, taxonomy and comparative genomic studies (Kantety et al. 2002; Eujayl et al. 2004; Varshney et al. 2007; Ukoskit et al. 2018; Haq et al. 2021). EST markers are also known to originate from transcribed genomic regions that are conserved across a wider range of genomes than other markers (Pashley et al. 2006).
In this study, 820 SSRs were found in 412 EST sequences of M. sagu, an average of one SSR per 0.5 kb of EST sequence (roughly 2,000 SSRs/Mb). This density is much higher than that previously reported for the sago genome, namely 132.57/Mb (Purwoko et al. 2019). We found that tri-nucleotide repeats were the most frequent class (36%). A similar situation was previously reported in C. longa (Purwoko et al. 2021) and P. violascens (Cai et al. 2019), and in date palm ESTs, where tri-nucleotides also predominated over other motifs (Zhao et al. 2013). In contrast, di-nucleotides were reported to predominate in oil palm (Singh et al. 2008). The most common SSR motif was AG (51%), followed by AAG (24.5%) and AAAG (14.2%). This is similar to the M. sagu genome, in which the AG and AAG motifs are predominant (Purwoko et al. 2019). Some SSRs did not yield primers because of their unsuitable position at the beginning or end of the sequence. According to Kale et al. (2012), primer design fails when no suitable clamping sequence is found or the melting temperature constraint cannot be met. Primer validation is carried out to determine whether a designed primer can be amplified. The ability of a primer to produce amplification products is influenced by several characteristics, such as internal stability, melting temperature, secondary structure and competition between primers (Sint et al. 2012).
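The SSR density for the EST data follows from the SSR count and the total sequence length reported in the Results. The sketch below works through that arithmetic; only the two input numbers come from the text, and the function names are hypothetical.

```python
N_SSR = 820            # perfect SSRs identified in the EST sequences
TOTAL_BP = 412_716     # total EST sequence length in bp


def bp_per_ssr(n_ssr, total_bp):
    """Average number of base pairs of sequence per SSR."""
    return total_bp / n_ssr


def ssr_per_mb(n_ssr, total_bp):
    """SSR density expressed per megabase of sequence."""
    return n_ssr / total_bp * 1_000_000


if __name__ == "__main__":
    print(f"{bp_per_ssr(N_SSR, TOTAL_BP):.0f} bp per SSR")
    print(f"{ssr_per_mb(N_SSR, TOTAL_BP):.0f} SSRs per Mb")
```

Expressing both the EST and the genomic figures in the same unit makes the two studies directly comparable.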
For annotation analysis, the 15 EST sequences containing SSRs for which primers were designed were compared with the publicly available NCBI-nr and TAIR databases, resulting in annotations for 14 (93.33%) sequences. Interestingly, the highest numbers of hits were obtained for Elaeis guineensis and Phoenix dactylifera, both of which are palms, so the primers designed from sago ESTs could possibly also be used in these two species for transferability and genotyping studies across palm genera. Transferability of SSR markers indicates whether the markers are applicable to comparative mapping studies in plants (Endo et al. 2017). Comparative analysis with TAIR yielded 7,062 functional characteristic hits.
Our findings support the usefulness of EST-SSR markers for sago cultivar differentiation and for genetic diversity and grouping analysis. Additionally, we have demonstrated the value of the EST-SSR markers developed here in examining the genetic diversity of the sago plant. Given that gene function is frequently established (Parida et al. 2009), the use of DNA coding regions for the construction of SSRs is a further benefit in genetic association (Feingold et al. 2005) and linkage analysis. Recently constructed EST-SSR markers have been successfully used to study association mapping for traits of interest in various commodities such as Syringa oblata (Yang et al. 2020) and Hibiscus cannabinus (An et al. 2023).

CONCLUSION
The results of the current study demonstrate the successful identification and development of SSR markers in sago palms based on in silico EST data. A computational approach was used to identify and develop SSR markers from a publicly available EST database, and the markers were further validated in the wet lab. The development of markers from DNA coding regions has a great advantage because previously known gene functions can assist in exploiting markers for specific traits. The resulting EST-SSR markers were successfully used to evaluate the genetic diversity of sago palms. In the future, the EST-SSR markers will be useful for the conservation and breeding of this underutilised carbohydrate-producing plant.

Figure 1: Coordinate map of the West Borneo origin of the sago palm samples used to analyse genetic diversity with EST-SSR markers (Source: Google Earth Engine).

Figure 2: Nucleotide distribution in EST sequences of sago palms.

Figure 3: Distribution of SSR types in EST sequences of sago.

Figure 4: The four most frequent SSR motifs for di-, tri- and tetra-nucleotides.

Figure 5: Frequency of the 10 plants with the most hits.

Figure 6: Gene ontology (GO) classification annotated for the sequence containing SSR in the cellular component, molecular function and biological processes.

Figure 8: Unrooted weighted neighbour-joining cluster analysis of genetic dissimilarity as measured using amplified simple sequence repeat (SSR) markers. Blue: West Borneo accessions; Green: Java accession; Red: Sumatera and Maluku accessions.

Figure 9: Factorial analysis based on eigenvalues calculated from seven SSR markers.

Table 1: Parameters for primer design.

Table 2: Description of the primers that were successfully synthesised.

Table 3: Distribution of contig sequences from Blast2GO analysis.

Table 5: Summary of observed allele number (N), polymorphism information content (PIC), and observed and expected heterozygosity (Ho and He) for 13 sago palm accessions.