Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • HiSAT2 RNAseq alignment low + stranded data predicted to be unstranded

    Hi all,

    I have been receiving some confusing output with an RNA-seq dataset that I am trying to ultimately determine differential expression and fpkm values for.

    Background: Library prep was stranded TruSeq (dUTP method) of mouse metatranscriptome and was given assembled contigs of transcriptome (Trinity) across 18 samples as well as RNA seq fastq files, performed on 4 lanes and two mate pair files, for each of these samples.

    I ran HiSAT2 three times on one of the samples' pairs of reads with the following options (-fr, -rf and -ff) to make sure I am aligning using the correct strandedness (threw in -ff for funsies) but all gave the same output, with really only half maybe of the pairs aligning:

    Code:
    hisat2 -fr -x  ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fa
    stq -S fr.sam
    
    RESULTS:
    6185153 reads; of these:
      6185153 (100.00%) were paired; of these:
        5629996 (91.02%) aligned concordantly 0 times
        475008 (7.68%) aligned concordantly exactly 1 time
        80149 (1.30%) aligned concordantly >1 times
        ----
        5629996 pairs aligned concordantly 0 times; of these:
          1575716 (27.99%) aligned discordantly 1 time
        ----
        4054280 pairs aligned 0 times concordantly or discordantly; of these:
          8108560 mates make up the pairs; of these:
            3511037 (43.30%) aligned 0 times
            3273195 (40.37%) aligned exactly 1 time
            1324328 (16.33%) aligned >1 times
    71.62% overall alignment rate
    
    
    hisat2 -rf -x  ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fastq -S rf.sam
    
    6185153 reads; of these:
      6185153 (100.00%) were paired; of these:
        5629996 (91.02%) aligned concordantly 0 times
        475008 (7.68%) aligned concordantly exactly 1 time
        80149 (1.30%) aligned concordantly >1 times
        ----
        5629996 pairs aligned concordantly 0 times; of these:
          1575716 (27.99%) aligned discordantly 1 time
        ----
        4054280 pairs aligned 0 times concordantly or discordantly; of these:
          8108560 mates make up the pairs; of these:
            3511037 (43.30%) aligned 0 times
            3273195 (40.37%) aligned exactly 1 time
            1324328 (16.33%) aligned >1 times
    71.62% overall alignment rate
    
    
    hisat2 -ff -x  ../../contigs/Sample_120507/Sample_120507 -q -1 120507_CGATGT_S1_L001_R1_001.fastq -2 120507_CGATGT_S1_L001_R2_001.fastq -S ff.sam
    
    RESULTS:
    6185153 reads; of these:
      6185153 (100.00%) were paired; of these:
        5629996 (91.02%) aligned concordantly 0 times
        475008 (7.68%) aligned concordantly exactly 1 time
        80149 (1.30%) aligned concordantly >1 times
        ----
        5629996 pairs aligned concordantly 0 times; of these:
          1575716 (27.99%) aligned discordantly 1 time
        ----
        4054280 pairs aligned 0 times concordantly or discordantly; of these:
          8108560 mates make up the pairs; of these:
            3511037 (43.30%) aligned 0 times
            3273195 (40.37%) aligned exactly 1 time
            1324328 (16.33%) aligned >1 times
    71.62% overall alignment rate
    To further complicate things, I ran infer_experiment.py from RseQC to double check that the data is indeed stranded and all my results came back as suggesting unstranded paired reads:

    Code:
    $ infer_experiment.py -i ff.sam -r ../Sample_120507/fragGeneScan.gff.bed
    Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
    
    Loading SAM/BAM file ...  Total 200000 usable reads were sampled
    
    
    This is PairEnd Data
    Fraction of reads failed to determine: 0.0005
    Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
    Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764
    
    
    $ infer_experiment.py -i rf.sam -r ../Sample_120507/fragGeneScan.gff.bed
    Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
    Loading SAM/BAM file ...  Total 200000 usable reads were sampled
    
    
    This is PairEnd Data
    Fraction of reads failed to determine: 0.0005
    Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
    Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764
    
    
    $ infer_experiment.py -i fr.sam -r ../Sample_120507/fragGeneScan.gff.bed
    Reading reference gene model ../Sample_120507/fragGeneScan.gff.bed ... Done
    Loading SAM/BAM file ...  Total 200000 usable reads were sampled
    
    
    This is PairEnd Data
    Fraction of reads failed to determine: 0.0005
    Fraction of reads explained by "1++,1--,2+-,2-+": 0.5232
    Fraction of reads explained by "1+-,1-+,2++,2--": 0.4764
    But the sequencing core assured me that they were using a stranded protocol. I am thinking maybe the issues of low alignment as well as the [un]stranded issue may be related and due to something that I am missing here, or maybe there is not an issue here and the results are fine.

    Any thoughts/suggestions would be much appreciated!

  • #2
    Having seen a head to head (some time ago) between multiple 'stranded' RNA library kits, I can attest that there are degrees of how good they are, running all the way from 'not stranded at all' to 'really quite stranded'. I don't recall if this kit was one of the under-performers however.

    Although I might want to see some traceability from whoever did the sequencing to demonstrate that they actually used a stranded kit...

    Comment

    Latest Articles

    Collapse

    • seqadmin
      Non-Coding RNA Research and Technologies
      by seqadmin




      Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

      Nobel Prize for MicroRNA Discovery
      This week,...
      10-07-2024, 08:07 AM
    • seqadmin
      Recent Developments in Metagenomics
      by seqadmin





      Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
      09-23-2024, 06:35 AM

    ad_right_rmr

    Collapse

    News

    Collapse

    Topics Statistics Last Post
    Started by seqadmin, 10-02-2024, 04:51 AM
    0 responses
    102 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 10-01-2024, 07:10 AM
    0 responses
    111 views
    0 likes
    Last Post seqadmin  
    Started by seqadmin, 09-30-2024, 08:33 AM
    1 response
    114 views
    0 likes
    Last Post EmiTom
    by EmiTom
     
    Started by seqadmin, 09-26-2024, 12:57 PM
    0 responses
    20 views
    0 likes
    Last Post seqadmin  
    Working...
    X