
  • TopHat not aligning (pretty much) anything.

    Hey guys, first post (and thus a novice).

    I've used TopHat before on my lab's own data and it worked pretty well. Now I am using a data set from another lab and there is an issue: nothing (except for ~2100 reads) maps to the genome.

    BLASTing some of the sequences shows that the fastq file does contain Drosophila RNA. I've provided a bit of the fastq and my command-line input below; if you need more, let me know.

    I have a suspicion that it may be a problem with the formatting of the quality scores. Using --solexa1.3-quals instead throws an error at the prep-read stage.

    Thanks guys.

    .fastq sample:

    @SRR070266.18 HWUSI-EAS1720_3:3:1:0:442 length=216
    NGCCAAGCAAGGCGAATTTATTTATGCCACTAAGCGTGGTATTGTCCGACTACGGAATGACCATGAGATTACACTGGAGGATGTACTCTTTTGTAAGGAAGCTGCTGGCTTTTGTCAANNTCNATCGANANTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGG
    +SRR070266.18 HWUSI-EAS1720_3:3:1:0:442 length=216
    !(*)(-/,,'8::88FFFF;-5//3FF;FFFFFFF555-544333F;F5FF=;FF#####################################################5=@CC8B###!!##!#####!#!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!###


    Command line:

    tophat -i 40 -p 5 -G ./GFF/dmel-all-r5.53.gff -o ./out/ --solexa-quals ./Dmel_AllChr_Bowtie2indx ./Testis-RNA-seq-ModEncode.fastq

  • #2
    Could the problem be with the formatting of the files as well as the quality values?

    I notice that the read header lines say length=216. Are these paired-end reads that have been merged together, and is that why they don't align?



    • #3
      Here is a description from the ModEncode website.

      "RNA-seq analysis was performed on poly(A)+ RNA from various tissues dissected from various stages of D. melanogaster. Total RNA was isolated by the Peter Cherbas group. Isolation of poly(A)+ RNA and strand-specific library construction were performed in the Brenton Graveley lab. Libraries were distributed among 3 labs in the Drosophila Transcriptome group for sequencing (Celniker, Gingeras, and Graveley) for paired-end RNA sequencing (2x76+ nt) on the GAIIx and HiSeq platforms. Fastq files were generated using pipeline version 1.5."

      I'll admit I had just assumed it wasn't paired-end. I've worked with paired-end data before, and I thought you needed two .fastq files for that. My bad.



      • #4
        Paired-end doesn't need 2 fastq files; it can be interleaved. But as mastal said, it looks like that read was paired data that was merged.

        TopHat can't handle really low-quality data (which that read is). If most of your reads look like that, with mostly Ns and Q2 bases, it's not surprising almost nothing aligned. I suggest you go back to the original raw fastq files and do quality control on them: quality trimming and adapter trimming (in case the insert sizes are shorter than the read length), and then try again. It's possible, though, that the library is just a complete failure.

        If you download BBMap, you can do quality- and adapter-trimming with bbduk.sh and also try aligning with bbmap.sh, which is much more robust to errors than TopHat. Still, at the default settings, it would also not align that particular read unless it was at least trimmed to remove the Ns.
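For anyone wondering what the quality-trimming step amounts to, here is a minimal Python sketch of 3'-end trimming. It is not how bbduk.sh works internally; the function name `trim_3prime` and the `min_q=10` threshold are illustrative assumptions, and it assumes Phred+33 qualities.

```python
def trim_3prime(seq, qual, min_q=10, offset=33):
    """Drop bases from the 3' end while they are N or below min_q.

    Assumes Phred+33 quality characters; min_q=10 is an arbitrary
    illustrative threshold, not a recommended setting.
    """
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]
```

A read that ends in a long run of Ns with `!` or `#` qualities, like the one in the first post, would lose that whole tail under this rule.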



        • #5
          Is it this study?

          http://www.ncbi.nlm.nih.gov/sra/SRX029215

          Study summary: SRP003893 • D. melanogaster Cell Line Stranded RNASeq
          So it kind of looks like merged reads. And I think the quality is qual-64 judging from the '!'s for base N.



          • #6
            Originally posted by yueluo:
            "And I think the quality is qual-64 judging from the '!'s for base N."

            Nope, that's ASCII-33:
            ! = 33 (Q0)
            # = 35 (Q2)
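The arithmetic above can be checked with a short Python sketch. The function names are made up for illustration. The offset check relies on the fact that the +64 encodings (Phred+64 and Solexa+64) never use characters below ASCII 59 (`;`), so any lower character points to an offset of 33.

```python
def phred_scores(qual, offset=33):
    """Convert a FASTQ quality string to a list of integer quality scores."""
    return [ord(c) - offset for c in qual]


def looks_like_phred33(qual):
    """Return True if the string contains a character that only Phred+33 can produce.

    A False result is inconclusive: high-quality Phred+33 data can also
    consist entirely of characters at or above ASCII 59.
    """
    return any(ord(c) < 59 for c in qual)
```

Since the sample read contains `!` (ASCII 33) and `#` (ASCII 35), it can only be offset-33 data, which would also explain why the Solexa-style quality options did not help.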



            • #7
              Oops, sorry, my bad.



              • #8
                Hey guys,

                Thanks for the input. I kinda knew something was off about the fragment length, but I wouldn't have known it was paired-end. I still don't see the point in making a .fastq like that, where you can't discern where one read ends and the other begins.

                Anyway, I took the quick and dirty way out and trimmed the reads to ~50bp (thank you, fastx toolkit). Now >98% of reads map, and post-processing is going well so far.

                All the best,

                Gordon
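For readers without the fastx toolkit at hand, the fixed-length trim described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the toolkit's implementation; `trim_fastq` is a hypothetical name, and the sketch assumes strict 4-line FASTQ records with no wrapped sequence lines.

```python
def trim_fastq(lines, length=50):
    """Yield FASTQ lines with sequence and quality cut to `length` bases.

    Assumes each record is exactly 4 lines: header, sequence,
    separator, quality. Header lines are passed through unchanged.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        if i % 4 in (1, 3):  # sequence line and quality line
            line = line[:length]
        yield line
```

If the reads really are two mates joined end to end, keeping only the first ~50 bases keeps just the start of the first mate. That is presumably why the alignment rate recovered, though the second mate is discarded entirely.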



                • #9
                  Usually, PE reads are merged when you have a small-insert library. For example, if you have a ~300bp library but do PE250 on a MiSeq, the paired reads will overlap at their 3' ends. After trimming the low-quality bases from both reads, you can merge them into one using the overlapping bases.



                  • #10
                    I think in this case the reads were merged because the data came from the NCBI's repository (SRA), and that is the format that they use for paired reads.

                    I think there are tools to split the data back into R1 and R2; there have previously been discussions about that here on SEQanswers. I will try to find a link and post it.

                    See for example this thread on biostars



                    You need to use fastq-dump from the SRA-Toolkit.
                    Last edited by mastal; 03-07-2014, 02:54 AM.
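To illustrate what splitting the data back into R1 and R2 means, here is a rough Python sketch that cuts a concatenated read in half. It assumes both mates have the same length, which may not hold for every SRA record; `split_merged` is a hypothetical helper, not part of the SRA Toolkit. In practice, the splitting options of fastq-dump (for example `--split-files`) are the safer route because they use the read lengths stored in the archive.

```python
def split_merged(seq, qual):
    """Split a concatenated paired read into (mate1, mate2).

    Assumption: the two mates are of equal length and simply joined
    end to end, so the split point is the midpoint.
    """
    if len(seq) != len(qual) or len(seq) % 2:
        raise ValueError("expected equal-length seq/qual of even length")
    half = len(seq) // 2
    return (seq[:half], qual[:half]), (seq[half:], qual[half:])
```

Under that assumption, a 216-base record like the one in the first post would give two 108-base mates.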

