In a typical RNA-seq experiment (Illumina paired-end reads, human Ensembl genome, Tuxedo mapping and HTSeq counting), we get a list of genes, each associated with a number of reads.
The frequency distribution of read counts per gene is typically bimodal:
- a peak of low-abundance genes, usually considered noise
- a peak of genes of biological (biochemical) interest, which is the one used for typical significance analysis, SNP analysis and so on.
What is the composition, and what is the meaning, of that first peak, which contains mainly genes without any reads but also a proportion of genes with, let's say, fewer than 10 reads?
The latter are 19% processed pseudogenes, 16% protein coding, 12% lincRNA and 10% antisense, which doesn't follow the biotype distribution in Ensembl.
Why do fewer than 10 reads map uniquely to these genes? Any suggestions?
E.g.: DNA contamination? Nascent RNA? RNA dark/junk matter? Does gene size matter?
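For anyone who wants to reproduce the biotype breakdown of the low-count genes, here is a minimal sketch of how it could be tallied. It assumes the counts are in the usual tab-separated htseq-count layout (gene ID, then count, with summary rows prefixed by `__`) and that a gene-to-biotype mapping has already been extracted from the Ensembl annotation. The function names, the toy gene IDs and the cutoff of 10 reads are illustrative assumptions, not part of any library.

```python
from collections import Counter


def parse_htseq_counts(lines):
    """Parse htseq-count style lines ("gene_id<TAB>count") into a dict.

    Summary rows such as "__no_feature" are skipped (assumed "__" prefix).
    """
    counts = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("__"):
            continue
        gene_id, value = line.split("\t")
        counts[gene_id] = int(value)
    return counts


def low_count_biotype_fractions(counts, biotypes, max_reads=10):
    """Fraction of each biotype among genes with 0 < reads < max_reads.

    `biotypes` is a hypothetical gene_id -> biotype mapping; genes missing
    from it are reported as "unknown".
    """
    low = [gene for gene, n in counts.items() if 0 < n < max_reads]
    if not low:
        return {}
    tally = Counter(biotypes.get(gene, "unknown") for gene in low)
    return {biotype: k / len(low) for biotype, k in tally.items()}


# Toy data, purely illustrative (not real gene IDs or measurements).
toy_lines = ["geneA\t0", "geneB\t3", "geneC\t9", "geneD\t250", "__no_feature\t1000"]
toy_biotypes = {"geneB": "processed_pseudogene", "geneC": "protein_coding"}

toy_counts = parse_htseq_counts(toy_lines)
fractions = low_count_biotype_fractions(toy_counts, toy_biotypes)
```

Comparing these fractions with the same tally over all annotated genes is one way to check whether the low-count peak really departs from the overall Ensembl biotype distribution.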