Genome assembly and quality validation
To evaluate the genome size and heterozygosity of M. officinalis, a total of 61.4 Gb short reads from the MGISEQ-2000 sequencing platform were subjected to K-mer analysis (Supplementary Table S1). The 17-mer frequency curve showed a bimodal distribution, with the highest peak occurring at a depth of 54 (Supplementary Fig. S1a). Based on the total number of K-mers, the genome size and heterozygosity of M. officinalis were estimated to be 485.4 Mb and 1.32%, respectively (Supplementary Fig. S1b). These results indicated that the genome of M. officinalis was small but highly heterozygous.
De novo assembly of the 62.92 Gb of single-molecule long reads from the Oxford Nanopore PromethION sequencing platform was performed with NextDenovo software (Supplementary Table S1). After removing redundant and contaminated sequences (nontarget classes, mitochondria, and chloroplasts), the final postcorrection genome size was 484.85 Mb, with a contig N50 of 4.21 Mb (Table 1). The genome size was similar to that estimated by the genome survey. Employing Hi-C technology, 398.8 million clean reads from the Illumina NovaSeq 6000 sequencing platform were used for chromosome construction to further refine the M. officinalis genome assembly. Moreover, using the agglomerative hierarchical clustering method in LACHESIS software, a total of 99.94% of the assembly was anchored to 11 pseudochromosomes; the size ranged from 33.06 Mb to 47.00 Mb with a contig N50 of 3.61 Mb and scaffold N50 of 40.97 Mb (Table 1 and Supplementary Table S2). Finally, the contig sequences were connected in the determined order and direction by adding 100 N to obtain the final chromosome-level genome sequence with a chromosome mount rate of 90.77% (Fig. 2 and Supplementary Table S2). A Hi-C interaction heatmap showed that the clustering, ordering, and orientation of the contigs was valid, providing the first high-quality chromosome-scale genome assembly for M. officinalis (Supplementary Fig. S2).
A variety of methods have been employed to evaluate the assembly quality of genomes. Here, the assembly completeness was evaluated using Benchmarking Universal Single-Copy Orthologs (BUSCOs). The BUSCO analysis identified 97.02% of the complete BUSCOs in the assembly (Supplementary Table S3). To assess assembly consistency, the short reads were mapped to the genome using bwa software. In total, ~99.26% of the clean data were mapped to the genome assembly (Supplementary Table S4). The mapping ratio of the RNA-seq reads from different tissues was in the range of ~93.94–96.08% (Supplementary Table S5). We further evaluated potential contamination in the genome by using GC depth analysis. The GC depth scatter plot showed that the GC content was distributed at 30–40%, and the sequencing depth was concentrated at 110–150×, indicating high quality without contamination in the data (Supplementary Fig. S3). All these results indicated high completeness and consistency of the M. officinalis genome assembly.
Genome annotation
Homology-based annotation and a de novo approach were applied to identify transposable elements (TEs) and tandem repeats in the M. officinalis genome. In total, we identified 281.41 Mb of nonredundant repetitive sequences, accounting for 58.04% of the assembled genome (Supplementary Table S6). Of these predicted repeats, TEs comprised the largest proportion (54.18%), including 42.92% Class I repeats and 11.26% Class II repeats. Long terminal repeat (LTR) retrotransposons accounted for 35.79% of the genome (Supplementary Table S6). Based on the assembled genome, a total of 209 rRNAs, 207 snRNAs, 78 miRNAs, and 644 tRNAs were predicted (Supplementary Table S7).
A total of 27,102 nonredundant protein-coding genes were predicted by using a combination of de novo, homolog-based, and transcriptome-based predictions (Table 2). The average gene length and coding sequence size were 3762.35 and 1169.11 bp, respectively, with an average of five exons per gene (Table 2). Then, we performed functional annotation of predicted genes by comparison with different databases. Overall, 24,769 (91.39%) genes were functionally annotated in at least one of the public databases, and 6153 (22.70%) genes could be annotated in all databases (Supplementary Fig. S4a). A total of 24,637 (90.90%) genes showed homologous genes in the NR database, while 20,712 (76.42%) genes were similar to proteins in the SwissProt database. Among the blastx top hits, species of the genus Coffea (Rubiaceae family) showed the highest proportion (82.77%) of homologous genes, including Coffea arabica (49.54%), Coffea eugenioides (20.44%), and C. canephora (12.79%) (Supplementary Fig. S4b). In addition, 14,998 (55.34%) genes were assigned to at least one GO term and classified into 41 GO functional subcategories (Supplementary Fig. S4c). To further understand the metabolic pathways of M. officinalis, 9737 (35.93%) genes were annotated in the KEGG pathway database (Supplementary Fig. S4d). Pathways associated with “biosynthesis of other secondary metabolites” (466 genes) and “metabolism of terpenoids and polyketides” (216 genes) can be used to explore the biosynthesis pathways of active ingredients in M. officinalis.
We conducted BUSCO analysis to verify the predicted genes. The results showed that approximately 96.80% of the complete BUSCOs could be identified in the annotated results, indicating the high reliability of the predicted results (Supplementary Table S8). The number of predicted genes and structural characteristics of the M. officinalis genome were consistent with those of related species, which indicated that the annotation results were acceptable (Supplementary Table S9).
Gene families and phylogenetic relationships
OrthoMCL software was used to identify the gene families of M. officinalis and nine other species, among which C. canephora is the most closely related species to M. officinalis and also belongs to the Rubiaceae family. A total of 22,750 (83.94%) genes were categorized into 14,124 gene families, 849 of which were unique to M. officinalis (Supplementary Table S10). In each species, 1576 genes were identified as single-copy orthologs. A total of 7230 genes in M. officinalis did not cluster with the genes of other species, indicating that these genes were M. officinalis specific (Supplementary Fig. S5a, b). Furthermore, we performed KEGG enrichment analysis to explore the metabolic pathways involved in these species-specific genes. Interestingly, we found that the pathways related to the synthesis of secondary metabolites were significantly enriched (q value < 0.05), including “indole alkaloid biosynthesis” (36 genes), “phenylpropanoid biosynthesis” (103 genes), and “stilbenoid, diarylheptanoid, and gingerol biosynthesis” (29 genes) (Supplementary Table S11). Moreover, we also identified 149 genes involved in “plant–pathogen interaction”. These data will help to reveal the molecular mechanisms underlying the interactions between M. officinalis and pathogens.
We used MAFFT software to perform multiple sequence alignments on the identified single-copy orthologous genes. The phylogenetic tree was constructed using the PROTGAMMAAUTO model of RAXML software, in which V. vinifera and Arabidopsis thaliana were outgroup species. As shown in Fig. 3a, M. officinalis was most closely related to C. canephora, and the divergence time of the two was 49.27 (31.55–65.27) million years ago (Mya). Comparative genomic analysis showed that 732 expanded and 308 contracted gene families were discovered in M. officinalis (Fig. 3b). Compared with the closely related species of C. canephora (383 expansion/370 contraction), M. officinalis showed more gene family expansion than contraction. KEGG enrichment analysis of the expanded genes suggested that they were mainly enriched in “ABC transporters” (30 genes), “starch and sucrose metabolism” (78 genes), “phenylpropanoid biosynthesis” (97 genes), “isoquinoline alkaloid biosynthesis” (19 genes) and so on, indicating that some of these might be related to the biosynthesis of active compounds (Supplementary Table S12). The contracted gene families were involved in “ether lipid metabolism” (8 genes), “endocytosis” (20 genes), “sesquiterpenoid and triterpenoid biosynthesis” (5 genes), “spliceosome” (18 genes), “protein processing in endoplasmic reticulum” (18 genes) and “plant–pathogen interaction” (25 genes), indicating that some of these families may be related to environmental adaptation (Supplementary Table S13). Notably, we found that a series of genes related to secondary metabolism and environmental adaptation exhibited significant expansion (Supplementary Table S14). These results provide valuable resources for understanding the biosynthesis of active ingredients and the interaction between M. officinalis and its growth environment.
Positive selection, WGD, and collinearity
The Ka/Ks ratios of the single-copy genes were used to evaluate the positive selection of genes in M. officinalis. A total of 101 candidate genes were strictly positively selected (p-value < 0.05) (Supplementary Table S15). GO enrichment analysis showed that these genes were enriched in “DNA repair”, “ATP-dependent helicase activity”, “chromosome”, and “DNA topological change”, indicating that these positively selected genes may improve DNA damage resistance in adverse environments. To estimate the potential WGD events of M. officinalis, synonymous nucleotide substitutions (Ks) were characterized in M. officinalis, C. canephora, Catharanthus roseus, and V. vinifera. Based on the Ks distribution between M. officinalis and the other three genomes, M. officinalis–C. canephora (0.608) divergence time was slower than that of M. officinalis–C. roseus (0.867) and M. officinalis–V. vinifera (1.238), which was consistent with the phylogenetic tree (Fig. 3c). The distribution of Ks showed one peak at ~2.1 to 2.3 in the genomes of M. officinalis, C. canephora, and V. vinifera. The results showed that they shared an ancient WGD event (the core eudicot γ triplication event) before their divergence and no recent independent WGD event (Fig. 3c). The comparative genome structure between M. officinalis and C. canephora showed high collinearity, indicating that there was no large-scale structural variation after the divergence between M. officinalis and C. canephora (Fig. 3d and Supplementary Fig. S6). For most collinear regions, one chromosome of M. officinalis corresponded to one chromosome of C. canephora; for example, MoChr2, MoChr5, MoChr6, MoChr7, MoChr8, MoChr9, and MoChr10 of M. officinalis corresponded to CcChr2, CcChr3, CcChr10, CcChr7, CcChr8, CcChr5, CcChr11 and CcChr4 of C. canephora, respectively. MoChr1 corresponded to CcChr1, CcChr2 and CcChr6; MoChr3 corresponded to CcChr2, CcChr6 and CcChr9; MoChr4 corresponded to CcChr1, CcChr3, and CcChr6 (Supplementary Fig. S6). These observations indicated that MoChr1, MoChr3, and MoChr4 might have formed by fragmentation and recombination of ancestral chromosomes. We further conducted intergenomic collinearity between M. officinalis and V. vinifera. The M. officinalis genome generally showed a one-to-one syntenic relationship with V. vinifera, which was consistent with the result that the M. officinalis genome did not undergo a recent WGD event (Fig. 3d). Interestingly, the collinear regions of MoChr5 mainly corresponded to VvChr5, and MoChr6 corresponded to VvChr 18, while other chromosomes, especially MoChr1, MoChr2, and MoChr3, did not exhibit any significant corresponding relationships between M. officinalis and V. vinifera (Supplementary Fig. S6).
Characteristic analysis of genes showing organ-specific expression
Based on gene expression levels, we identified 451, 254, 109, 165, and 219 genes expressed explicitly in stalks, leaves, one-year-old roots (AR), three-year-old roots (TR), and six-year-old roots (SR), respectively, and 17,578 genes were expressed in all tissues (Fig. 4a). To elucidate the similarities and differences of gene expression patterns in different tissues, we also performed a k-means cluster analysis. A total of 16,771 differentially expressed genes (DEGs) were divided into 10 clusters (Fig. 4b). We, therefore, focused our attention on the clusters that contained genes with tissue-specific expression. We found that the leaf-biased genes (cluster 1) were highly correlated with fundamental pathways, such as “photosynthesis”, “biosynthesis of secondary metabolites”, “photosynthesis-antenna proteins”, “carbon fixation in photosynthetic organisms” and “porphyrin and chlorophyll metabolism” (Supplementary Table S16). In contrast, stalk-biased genes (cluster 9) were significantly associated with defense responses, such as “MAPK signaling pathway-plant” and “plant–pathogen interaction” (Fig. 4c and Supplementary Table S16). Additionally, we also identified that genes in clusters 2, 4, and 8 were highly expressed at different developmental stages of roots, and some genes were continuously enhanced with root development (clusters 3 and 7). The functions of these genes were mainly enriched in “phenylpropanoid biosynthesis”, “amino sugar and nucleotide sugar metabolism”, “ABC transporters”, “MAPK signaling pathway-plant”, “plant–pathogen interaction” and “peroxisome”, indicating that they may be related to the synthesis, transport, and storage of active ingredients and defense responses (Supplementary Table S16). These results provide a basis for further analysis to reveal the gene expression regulatory network and regulate bioactive metabolite derivative production in M. officinalis.
Identification of genes related to the anthraquinone biosynthesis pathway
Anthraquinones in Rubiaceae plants are mainly synthesized through the shikimate/o-succinylbenzoic acid pathway, which mainly involves the shikimate pathway, TCA cycle, mevalonate (MVA) pathway, and methylerythritol phosphate (MEP) pathway (Fig. 5a). 1,4-Dihydroxy-2-naphthoate (DHNA) is formed by isochorismate and α-ketoglutarate under the catalysis of a series of enzymes. Therefore, DHNA combines with dimethylallyl pyrophosphate (DMAPP), derived from the MVA and MEP pathways, to form the core structure of anthraquinone. We identified 11 crucial gene families in the shikimate pathway (Supplementary Table S17). In addition, compared with C. canephora, the key gene DHQS in the shikimate pathway was expanded to enhance the ability to produce 3-dehydroquinate (DHQ). On the other hand, the chlP gene number was also expanded, which may contribute to terpenoid-quinone biosynthesis. We also identified 14 important gene families in the MVA and MEP pathways (Fig. 5a and Supplementary Table S17). Almost every node in the MEP pathway had only one gene copy, while the crucial genes ACAT, HMGR and PMK in the MVA pathway had two or three gene copies. Gene expression analysis showed that most MEP pathway genes had high expression in leaves, while genes in the MVA pathway were expressed in various tissues.
Functional gene evolution contributed to the formation of terpenoids
Terpenoid biosynthesis starts from the terpenoid backbone biosynthesis pathway. Isopentenyl pyrophosphate (IPP) and DMAPP are common precursors for terpenoid biosynthesis in plants and are mainly formed by the MVA and MEP pathways. Geranylgeranyl diphosphate (GGPP) is the direct precursor substrate of diterpene biosynthesis, which GGPPS catalyzes. In the M. officinalis genome, we found that GGPPS has ten gene copies, and tandem duplication was found on Chr6 (Fig. 6 and Supplementary Table S17). Gene expression analysis showed that GGPPS might have potential functional divergence; for example, evm.model.LG06.1002 was relatively highly expressed in AR, but evm.model.LG06.1003 showed no expression in any tissues (Supplementary Table S17).
Terpene synthase (TPS) is vital for terpenoid synthesis and uses geranyl diphosphate (GPP), GGPP, and farnesyl diphosphate (FPP) as direct precursors to synthesize monoterpenes, diterpenes, sesquiterpenes, and triterpenes. TPS proteins were identified by using the Pfam domain models PF03936 and PF01397 with an E-value cutoff of 1e−5. We identified 41 TPS family genes in the M. officinalis genome. Based on the phylogenetic tree generated by aligning the TPS proteins of M. officinalis, C. canephora, and A. thaliana, we divided the M. officinalis TPS genes into seven groups (Supplementary Fig. S7a and Supplementary Table S18). The TPS-a (14 genes) and TPS-b (14 genes) groups contained more genes, while the TPS-g group comprised only two genes. We also found that the TPS-a group had tandem duplication on Chr2 (10 genes), and the TPS-b group had tandem duplication on Chr2 (6 genes) and Chr6 (6 genes) (Supplementary Table S18). Thus, tandem duplication is responsible for TPS-Cin and TPS04 gene family expansion in M. officinalis after its split from C. canephora (Fig. 6). In addition, we noticed that the expression patterns of most genes in the same group were similar. Nevertheless, TPS-a was mainly expressed in AR, and TPS-b showed high expression in stalks, which may be related to the tissue-specific localization of substance synthesis (Supplementary Fig. S7b and Supplementary Table S18).
Iridoids are the major terpenoids in M. officinalis and are monoterpene analogs. In the iridoid biosynthesis pathway, one GES (TPS-g subfamily), fifteen G10Hs, six 10HGOs, three ISs, four 7-DLSs and twelve 7-DLGTs were identified (Fig. 5a and Supplementary Table S17). Based on the chromosome location, we found that these functional genes (G10H, 10HGO, IS, 7-DLS, and 7-DLGT) may have undergone tandem duplication in the M. officinalis genome. Interestingly, this duplication also existed in C. canephora, which suggested that the duplication of these gene families may have occurred before the speciation of M. officinalis and C. canephora.
Significant expansion of genes related to sugar metabolism
As shown in Fig. 5b, we deduced the synthesis pathway of polysaccharides in M. officinalis based on the enzymes involved in carbon metabolism in the pathways “starch and sucrose metabolism” and “amino and nucleotide sugar metabolism”. In M. officinalis, nine genes encoding sacA, four genes encoding malZ, six genes encoding scrK, and 12 genes encoding HK were identified, most of which were highly expressed in leaves and stalks (Fig. 5b and Supplementary Table S19). We also identified 44 genes encoding nucleotide-diphospho-sugar interconversion enzymes, most of which showed diverse expression patterns in different tissues, indicating the complexity of the regulation of polysaccharide biosynthesis (Fig. 5b and Supplementary Table S19). The sacA gene catalyzes the formation of D-fructose and Glc-6P as precursors. Two tandem repeat blocks were found on Chr1 and Chr3 (Fig. 6). UGDH is responsible for the biosynthesis of UDP-GlcA, which might be the restrictive precursor of other nucleotide-diphospho-sugars, such as UDP-Gal, UDP-d-Xyl and UDP-l-Ara. We found that the UGDH gene in M. officinalis was expanded compared to that in C. canephora, which might have contributed to the formation of UDP-GlcA (Fig. 6 and Supplementary Table S20).
We also paid particular attention to other expanded gene families related to the starch and sucrose metabolism pathways (Fig. 5c). Based on their function, these extended genes catalyze the formation of glucose from other sugars, which is one of the important substrates for glycolysis and polysaccharide synthesis. A significant tandem duplication event of BGL genes was identified in M. officinalis on Chr1, Chr2, Chr4, contig9, and contig 17 (Fig. 6). For the bglB genes, similar tandem repeats were found on Chr3, Chr9, and Chr11 (Fig. 6 and Supplementary Table S20). The AMY genes can decompose starch into maltose and then form glucose under the catalysis of malQ and malZ. Tandem duplication of AMY genes was also found on Chr6 (Fig. 6 and Supplementary Table S20). These results suggested that tandem duplication is responsible for the expansion of these genes after the separation between M. officinalis and C. canephora.