I was recently at a meeting about RNA-seq in general, and the topic of small RNA-seq came up, something with which I'm quite unfamiliar. The discussions were interesting, but seeing as I didn't know much about sRNA-seq (and I was the "RNA-seq"-guy at the meeting), they didn't get very far. I've since tried to learn a bit about it, and I wanted to ask some questions to clear up things I'm not sure about...
1) A general pipeline for sRNA-seq. As far as I understand it, the sequencing adapters are proportionally a much larger part of the reads than for normal RNA-seq. This would make adapter trimming more or less mandatory for any sRNA-seq analysis. Is this correct?
2) Seeing as sRNA is a lot smaller, would that mean that there are more duplicated reads in an sRNA-seq dataset? If so, would you remove them?
3) As far as alignment goes, I can't work out whether one should use one of the sRNA-specific aligners I find by googling, or one of the normal RNA-seq aligners (STAR, TopHat, etc.). I find information saying that you can use either...
4) Can you align to the normal human reference genome (such as GRCh38), or do you need to add some sRNA-specific database? I found miRBase, for example, which (as far as I can tell) is a database for miRNA sequences. I assume one could align to that, if one is only interested in miRNA? Or should those sequences be added to e.g. GRCh38 and then aligned to the collated reference?
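To make the reasoning behind question 1 concrete, here is a minimal sketch of what I mean. The 50 nt read length, the 22 nt insert, and both sequences are placeholders I made up for illustration (they are not taken from any dataset or kit manual), and the exact-match trimming is only a toy stand-in for what a real trimmer does:

```python
def adapter_fraction(insert_len: int, read_len: int) -> float:
    """Fraction of a read's bases that come from the 3' adapter (or padding)."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    # If the insert is at least as long as the read, no adapter is sequenced.
    adapter_bases = max(read_len - insert_len, 0)
    return adapter_bases / read_len


def trim_adapter(read: str, adapter: str, min_overlap: int = 8) -> str:
    """Toy 3' trimming: cut the read at the first exact match of the adapter's
    first `min_overlap` bases. Real tools also handle mismatches and partial
    adapters at the read end; this sketch does not."""
    idx = read.find(adapter[:min_overlap])
    return read if idx == -1 else read[:idx]


# Placeholder sequences, assumed for illustration only.
insert = "TAGCTTATCAGACTGATGTTGA"   # hypothetical 22 nt small RNA
adapter = "TGGAATTCTCGGGTGCCAAGG"   # hypothetical 3' adapter
read = (insert + adapter + "A" * 7)[:50]  # hypothetical 50 nt read

print(adapter_fraction(len(insert), len(read)))
print(trim_adapter(read, adapter) == insert)
```

With a typical mRNA fragment of a few hundred nt, the same calculation gives zero adapter bases for a 50 nt read, which is why I assume trimming matters so much more for sRNA-seq.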
Since I'm interested in this purely from a learning and knowledge perspective, I won't actually work with any sRNA-seq dataset. I did download a run from the SRA and put it through my standard alignment pipeline just to see what happened, though. I got around 80% ambiguously aligned reads and about 10% duplicated reads using just a very simple STAR 2-pass alignment to GRCh38, without any sRNA-specific sequences added and with no adapter/quality trimming. Do these numbers make sense for the non-optimised (from an sRNA perspective) pipeline used? What would be required to get a better alignment?