Many of the non-coding RNAs (ncRNAs) produced by eukaryotic cells have been shown to play crucial roles in biological processes. ncRNAs come in three forms: long non-coding RNAs (lncRNAs), small non-coding RNAs (sRNAs), and circular RNAs (circRNAs). ncRNAs are not translated into proteins; instead, they regulate gene expression at the transcriptional and post-transcriptional levels. RNA sequencing with next-generation sequencing technology reveals the presence and quantity of ncRNAs in a biological sample at a specific time and provides clues to their function. ncRNAs also serve as reliable biomarkers for diagnosis and have functions in epigenetics.
The bioinformatics analysis of ncRNA sequencing involves the following steps.
1. Quality control
First, raw data is generated and stored in FASTQ format, a text-based format for storing nucleotide sequences together with their quality scores. Each FASTQ record has four lines: the sequence ID, the read bases, a separator, and the per-base quality scores. Next comes the data filtering step, in which the fastp program takes raw reads as input and produces clean reads. This step reports three metrics: the error rate, the base content, and the proportion of raw reads retained as clean reads.
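To make the four-line FASTQ record and the idea of filtering more concrete, here is a minimal Python sketch. It is an illustration only and does not reproduce the behaviour of any specific filtering program. It assumes Phred+33 quality encoding, and the thresholds `min_mean_q` and `max_n_frac`, the function names, and the example reads are all hypothetical.

```python
from statistics import mean

# Three made-up reads: one good, one with many N bases, one with low quality.
EXAMPLE_FASTQ = """@read1
ACGTACGT
+
IIIIIIII
@read2
ACGTNNNN
+
IIIIIIII
@read3
ACGTACGT
+
########
"""

def parse_fastq(lines):
    """Yield (read_id, bases, phred_scores) for each four-line FASTQ record."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(lines) - 3, 4):
        header, bases, sep, qual = lines[i:i + 4]
        if not header.startswith("@") or not sep.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        # Assumption: Phred+33 encoding (quality = ASCII code - 33).
        yield header[1:], bases, [ord(c) - 33 for c in qual]

def filter_reads(records, min_mean_q=20, max_n_frac=0.1):
    """Keep reads with a high enough mean quality and few ambiguous (N) bases."""
    clean = []
    for read_id, bases, quals in records:
        n_frac = bases.upper().count("N") / len(bases)
        if mean(quals) >= min_mean_q and n_frac <= max_n_frac:
            clean.append((read_id, bases, quals))
    return clean

raw = list(parse_fastq(EXAMPLE_FASTQ.splitlines()))
clean = filter_reads(raw)
print(f"{len(clean)} of {len(raw)} raw reads kept as clean reads")
```

In this toy input only `read1` survives: `read2` is half N bases and `read3` has a mean quality of 2.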
2. Mapping
The clean reads are mapped with HISAT2 (Hierarchical Indexing for Spliced Alignment of Transcripts). HISAT2 indexes the reference genome with a graph-based method and builds on the Bowtie2 implementation for alignment, which makes the alignment fast, sensitive, and accurate. The alignments are stored as a BAM file, the binary form of a SAM (Sequence Alignment Map) file. These BAM files can be viewed in the Integrative Genomics Viewer (IGV) and compared with the reference genome to see where they differ.
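As a rough illustration of what the alignment output contains, the following Python sketch reads a few SAM-format text lines (the text equivalent of BAM) and computes the fraction of reads that mapped. It relies only on the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name and the example records are hypothetical, and a real pipeline would normally use a dedicated tool or library for this.

```python
# Made-up SAM content: two header lines followed by four alignment records.
EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t16\tchr1\t200\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r3\t256\tchr1\t500\t1\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]

UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800

def mapping_rate(sam_lines):
    """Return the fraction of primary records that are mapped."""
    total = mapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])  # second column is the bitwise FLAG
        if flag & (SECONDARY | SUPPLEMENTARY):
            continue  # count each read once, via its primary record
        total += 1
        if not flag & UNMAPPED:
            mapped += 1
    return mapped / total if total else 0.0

print(f"mapping rate: {mapping_rate(EXAMPLE_SAM):.2%}")
```

Here two of the three primary records are mapped; the secondary alignment of `r3` is ignored so that the read is not counted twice.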
3. Annotation
Once the BAM files are available, transcripts are assembled and annotated with StringTie. Annotation means identifying functional elements along the sequence of a genome. StringTie uses a network flow algorithm, together with an optional de novo assembly step, to assemble reads into known or novel gene models based on known gene annotations. The inputs are the BAM files and a reference annotation file; the output is a GTF file describing the assembled transcripts. The assembled transcripts are then merged to remove duplicate or redundant transcripts. Finally, a series of filters (exon number, transcript length, coding potential, etc.) is applied to identify and predict the ncRNA types.
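The exon number and transcript length filters can be sketched in a few lines of Python working directly on GTF text. This is a simplified illustration, not the actual filtering pipeline: the thresholds (at least 2 exons, at least 200 nt), the function names, and the example GTF records are assumptions, and the coding potential filter is left out because it requires a dedicated prediction tool.

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')

# Made-up GTF exon records for three transcripts (T1, T2, T3).
EXAMPLE_GTF = [
    'chr1\tStringTie\texon\t100\t199\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tStringTie\texon\t300\t449\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tStringTie\texon\t1000\t1499\t.\t-\t.\tgene_id "G2"; transcript_id "T2";',
    'chr1\tStringTie\texon\t2000\t2049\t.\t+\t.\tgene_id "G3"; transcript_id "T3";',
    'chr1\tStringTie\texon\t2100\t2159\t.\t+\t.\tgene_id "G3"; transcript_id "T3";',
]

def exon_stats(gtf_lines):
    """Map transcript_id -> (exon_count, total_exon_length)."""
    stats = defaultdict(lambda: [0, 0])
    for line in gtf_lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(cols[8])
        if match is None:
            continue
        entry = stats[match.group(1)]
        entry[0] += 1
        # GTF coordinates are 1-based and inclusive.
        entry[1] += int(cols[4]) - int(cols[3]) + 1
    return {tid: tuple(values) for tid, values in stats.items()}

def candidate_transcripts(stats, min_exons=2, min_length=200):
    """Apply the exon number and length filters (hypothetical thresholds)."""
    return sorted(
        tid for tid, (n_exons, length) in stats.items()
        if n_exons >= min_exons and length >= min_length
    )

print(candidate_transcripts(exon_stats(EXAMPLE_GTF)))
```

In the toy data, `T2` is dropped for having a single exon and `T3` for being only 110 nt long, leaving `T1` as the sole candidate.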
4. Quantitative analysis
The simplest way to quantify ncRNAs and coding genes is to count the number of reads that map to each transcript. However, two factors need to be taken into account. First, the read count of a transcript depends on the total number of reads sequenced for the sample. Second, it depends on the length of the gene or transcript. A normalization step is therefore essential to make expression values comparable both between and within samples.
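The text does not say which normalization is applied, so the Python sketch below shows two widely used options, FPKM and TPM, purely as an illustration. Both correct for sequencing depth and for transcript length. The counts and lengths are made up, and the total number of mapped reads is assumed to be the sum of the counts in the table.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())  # assumption: library size = sum of counts
    return {
        tid: counts[tid] * 1e9 / (lengths[tid] * total)
        for tid in counts
    }

def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rates = {tid: counts[tid] / lengths[tid] for tid in counts}
    scale = sum(rates.values())
    return {tid: rate / scale * 1e6 for tid, rate in rates.items()}

# Hypothetical read counts and transcript lengths (in nucleotides).
counts = {"lncA": 100, "lncB": 200, "geneC": 700}
lengths = {"lncA": 1000, "lncB": 2000, "geneC": 1000}

for tid in counts:
    print(tid, round(fpkm(counts, lengths)[tid], 1), round(tpm(counts, lengths)[tid], 1))
```

`lncB` has twice the reads of `lncA` but is also twice as long, so both normalizations assign the two transcripts the same expression value. TPM values always sum to one million within a sample, which makes proportions easier to compare across samples.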
5. Functional analysis
In functional analysis, biological meaning is assigned to a set of genes by testing whether any known biological functions, interactions, or pathways are enriched in it. This is done with clusterProfiler, a software package that implements methods to analyze and visualize functional profiles of genomic coordinates, genes, and gene clusters. Gene Ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) are the most frequently used databases for functional analysis. In addition to enrichment analysis, ncRNA target prediction can also be performed.
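Enrichment of a GO term or KEGG pathway in a gene list is commonly assessed with a hypergeometric (over-representation) test. The Python sketch below shows that calculation in isolation; it is not the clusterProfiler API, and the function name, parameter names, and example numbers are hypothetical. It needs Python 3.8 or later for `math.comb`.

```python
from math import comb

def enrichment_p_value(universe_size, term_size, list_size, overlap):
    """P(X >= overlap) for a hypergeometric draw.

    universe_size: number of annotated background genes
    term_size:     background genes annotated to the term or pathway
    list_size:     genes in the list of interest
    overlap:       genes in the list that are annotated to the term
    """
    upper = min(term_size, list_size)
    tail = sum(
        comb(term_size, i) * comb(universe_size - term_size, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, list_size)

# Toy example: 20 background genes, 5 in the pathway,
# 4 genes of interest, 3 of which fall in the pathway.
p = enrichment_p_value(universe_size=20, term_size=5, list_size=4, overlap=3)
print(f"p = {p:.4f}")
```

A small p-value suggests that the term is over-represented in the gene list. When many terms are tested, the p-values also need a multiple-testing correction, which this sketch omits.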
6. Qualitative analysis
In this step, variant discovery and alternative splicing (AS) analysis are performed with GATK and rMATS, respectively. GATK identifies where the aligned reads differ from the reference genome and writes the differences to a variant call format (VCF) file; BAM files are the input and VCF files are the output. rMATS is designed to detect differential AS in replicated RNA-seq data.
How Novogene Can Help
Novogene has accumulated extensive experience in non-coding RNA library preparation, sequencing, and bioinformatics analysis across numerous species. For lncRNA-seq and circRNA-seq, it prepares rRNA-depleted libraries that are strand-specific (directional) by default and sequences them with a PE150 strategy. Globin mRNA removal and exosome RNA options are also available. To address bacterial contamination, a dual rRNA depletion strategy is adopted. For sRNA-seq, rRNA removal and directional libraries are not needed because of the small size of sRNAs, and the SE50 strategy is adopted. Novogene also provides sequencing-only services for premade libraries. Novogene draws on its deep scientific knowledge, first-class customer service, and unsurpassed data quality to help clients realize their research goals in the rapidly evolving world of genomics. To request more information or a quote, please get in touch with Novogene.