Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Help with RNA-seq and strand specific mapping

    I don't have much of a background in bioinformatics so any help that you can give me in figuring this out would be much appreciated.
    I have paired end strand specific RNA seq data (sequenced with Illumina) from a non-model organism (Tasmanian Devil if you want to know), and I have a reference genome against which I can map the reads to. What I want to be able to do with this data is look for novel transcripts and non-coding RNAs, and visualise the data using IGV.

    This is what I've been doing thus far:

    I indexed the ref genome and used trimmomatic to trim the RNA reads, then I used tophat as follows:

    tophat2 -r 60 -p 12 -G /path/to/GFF_file/Fgenesh_EST_aligned.gff -o /path/to/output/ --library-type fr-firststrand /path/to/ref_genome/421K17 /path/to/Trimmed/OP1_1_L4.fastq /path/to/Trimmed/OP1_2_L4.fastq

    Then I used samtools to sort the accepted hits file and index the sorted file:

    samtools sort accepted_hits.bam accepted_hits.sort

    samtools index accepted_hits.sort.bam

    However when I look at the data on IGV this is what I see (http://imgur.com/mnzm5hn and http://imgur.com/FymviY7)

    (settings: view as pairs; color alignments by first of pair strand; sort by mapping quality)

    I have also tried using flags with samtools to try and get the strand specific transcripts as follows

    samtools view -b -f 99 accepted_hits.bam > acc_hits1.bam
    samtools view -b -f 147 accepted_hits.bam > acc_hits2.bam
    samtools view -b -f 83 accepted_hits.bam > acc_hits3.bam
    samtools view -b -f 163 accepted_hits.bam > acc_hits4.bam

    samtools merge file_1.bam acc_hits1.bam acc_hits2.bam
    samtools merge file_2.bam acc_hits3.bam acc_hits4.bam

    I converted these to the sam format, indexed it and viewed it with IGV and this is what i got (http://imgur.com/6bCWFGq). Essentially file_1.sam and file_2.sam look almost the same, just that one has more reads than the other, but they appear to me mapping to essentially the same place.

    Is there something obvious that I have missed, either in the way I have prepared the data, or perhaps the way in which I'm viewing it on IGV?? Any help that you can give me in this would be much appreciated.

    Thanks!!
    Last edited by weirdkid; 10-31-2015, 05:52 PM. Reason: images didn't show up

  • #2
    Hi, what protocol was used to prepare the library?

    Comment


    • #3
      Hi Jim,

      do you mean the library for the reference genome? If so this is what I used:

      bowtie2-build /path/to/421K17.fa /path/to/421K17/421K17

      thanks

      Comment


      • #4
        No, I mean the library prep for the RNA, e.g. dUTP.

        Comment


        • #5
          In that case I'm not sure. I got this data from another lab and I don't think they passed that information on to me.

          Comment


          • #6
            I ask because it doesn't appear to be strand preserving. I would verify that with the lab you received it from.

            Comment


            • #7
              I'll check this with the lab and come back to you on that then. But what makes you think that it is not strand preserving?

              Comment


              • #8
                The reads look more or less evenly distributed between strands, when colored by first-in-pair strand. Also, the junctions look more or less even between strands. It just has the look of an un-stranded library, if it is no amount of informatics will recover the strand information. I thought that was the gist of your original post, sorry if I misunderstood.

                Comment


                • #9
                  So I asked and from what I was told the samples were sequenced with Illumina's TrueSeq RNA kit (total RNA) and was strand specific.

                  Comment


                  • #10
                    What is the difference between the two subsets in your IGV example?
                    In terms of strand-preservation Illumina's TruSeq stranded achieves 90 % - 97 %; so if you see in the correct strand-orientation 950 reads and in the wrong orientation 50 your result is acceptable.
                    For a sample-wide anaylsis, you can use RSeQC's infer_experiment function on the complete data set (accepted_hits.sort).

                    Comment


                    • #11
                      I've tried using RseQC's infer experiment script, but I'm not sure what it means when it asks for the reference genome in bed format. My reference genome is in fasta format, and converting from fasta to bed doesn't make sense to me...

                      Comment


                      • #12
                        RSeQc needs gene/transcript information. Thus, you need to convert your gtf/gff (Fgenesh_EST_aligned.gff) into a 12-column bed-file. Either you script it yourself or use for instance a tool like this.

                        Comment


                        • #13
                          Finally managed to do it. I couldn't quite get the gff/gtf to convert to bed so I just did it manually.
                          Anyway, here is what I got:

                          This is PairEnd Data
                          Fraction of reads failed to determine: 0.0000
                          Fraction of reads explained by "1++,1--,2+-,2-+": 0.1667
                          Fraction of reads explained by "1+-,1-+,2++,2--": 0.8333

                          Therefore the data is strand specific paired end data. But I'm not quite sure how to interpret the differences between the two. If the majority of the reads are explained by ""1+-,1-+,2++,2--": 0.8333", does this mean that I should have analyzed the data in tophat with fr-secondstrand instead of fr-firststrand??

                          Comment


                          • #14
                            According to this post, you should use fr-firststrand. When I analysed TruSeq stranded libraries, I yielded the same orientation, but higher percentages. For non-model organism it might be different due to incomplete annotation.

                            Comment


                            • #15
                              Hmmm, well I used fr-firststrand so I guess that can't be it.

                              I'll just keep trying a few different things and see where it gets me.

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                Yesterday, 07:24 PM
                              • seqadmin
                                Choosing Between NGS and qPCR
                                by seqadmin



                                Next-generation sequencing (NGS) and quantitative polymerase chain reaction (qPCR) are essential techniques for investigating the genome, transcriptome, and epigenome. In many cases, choosing the appropriate technique is straightforward, but in others, it can be more challenging to determine the most effective option. A simple distinction is that smaller, more focused projects are typically better suited for qPCR, while larger, more complex datasets benefit from NGS. However,...
                                10-18-2024, 07:11 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 11-01-2024, 06:09 AM
                              0 responses
                              28 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-30-2024, 05:31 AM
                              0 responses
                              21 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-24-2024, 06:58 AM
                              0 responses
                              25 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 10-23-2024, 08:43 AM
                              0 responses
                              57 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X