  • TopHat not aligning (pretty much) anything.

    Hey guys, first post (and thus a novice).

    I've used TopHat before on my lab's own data and it worked pretty well. Now I am using a data set from another lab and there is an issue: nothing (except for ~2100 reads) maps to the genome.

    BLASTing some of the sequences shows it's Drosophila RNA in the fastq file. I've provided a bit of the fastq and my command line input, but if you need more, let me know.

    I suspect it may be a problem with the formatting of the quality scores. Using --solexa1.3-quals instead throws an error at the prep-read stage.

    Thanks guys.

    .fastq sample:

    @SRR070266.18 HWUSI-EAS1720_3:3:1:0:442 length=216
    NGCCAAGCAAGGCGAATTTATTTATGCCACTAAGCGTGGTATTGTCCGACTACGGAATGACCATGAGATTACACTGGAGGATGTACTCTTTTGTAAGGAAGCTGCTGGCTTTTGTCAANNTCNATCGANANTNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGG
    +SRR070266.18 HWUSI-EAS1720_3:3:1:0:442 length=216
    !(*)(-/,,'8::88FFFF;-5//3FF;FFFFFFF555-544333F;F5FF=;FF#####################################################5=@CC8B###!!##!#####!#!#!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!###


    Command line:

    tophat -i 40 -p 5 -G ./GFF/dmel-all-r5.53.gff -o ./out/ --solexa-quals ./Dmel_AllChr_Bowtie2indx ./Testis-RNA-seq-ModEncode.fastq

  • #2
    Could the problem be with the formatting of the files as well as the quality values?

    I notice that the read header lines say length=216. Are these paired-end reads that have been merged together, and is that why they don't align?


    • #3
      Here is a description from the ModEncode website.

      "RNA-seq analysis was performed on poly(A)+ RNA from various tissues dissected from various stages of D. melanogaster. Total RNA was isolated by the Peter Cherbas group. Isolation of poly(A)+ RNA and strand-specific library construction were performed in the Brenton Graveley lab. Libraries were distributed among 3 labs in the Drosophila Transcriptome group for sequencing (Celniker, Gingeras, and Graveley) for paired-end RNA sequencing (2x76+ nt) on the GAIIx and HiSeq platforms. Fastq files were generated using pipeline version 1.5."

      I'll admit I had just assumed it wasn't paired-end. I've worked with paired-end data before, and I thought you needed two .fastq files for that. My bad.


      • #4
        Paired-end doesn't need 2 fastq files; it can be interleaved. But as mastal said, it looks like that read was paired data that was merged.

        TopHat can't handle really low-quality data (which that read is). If most of your reads look like that, with mostly Ns and Q2 bases, it's not surprising that almost nothing aligned. I suggest you go back to the original raw fastq files and do quality control on them: quality trimming and adapter trimming (in case the insert sizes are shorter than the read length). Then try again, but it's possible the library is just a complete failure.

        If you download BBMap, you can do quality- and adapter-trimming with bbduk.sh and also try aligning with bbmap.sh, which is much more robust to errors than TopHat. Still, at the default settings, it would also not align that particular read unless it was at least trimmed to remove the Ns.
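
To illustrate the kind of 3'-end trimming suggested above, here is a minimal Python sketch. It is not how bbduk.sh works internally; it simply strips trailing bases that are N or fall below a quality cutoff. The function name `trim_right`, the default cutoff of Q10, and the Phred+33 offset are assumptions made for this example.

```python
def trim_right(seq, qual, min_q=10, offset=33):
    """Trim bases from the 3' end while they are N or below min_q.

    seq and qual are the sequence and quality strings of one FASTQ record.
    offset=33 assumes Sanger / Phred+33 encoded qualities.
    """
    end = len(seq)
    while end > 0 and (seq[end - 1] == "N" or ord(qual[end - 1]) - offset < min_q):
        end -= 1
    return seq[:end], qual[:end]
```

A read whose tail is mostly N or Q2 bases, like the sample in the first post, would lose that tail; a read that is bad from end to end comes back empty and should be discarded before alignment.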


        • #5
          Is it this study?

          http://www.ncbi.nlm.nih.gov/sra/SRX029215

          Study summary: SRP003893 • D. melanogaster Cell Line Stranded RNASeq
          So it kind of looks like merged reads. And I think the quality is qual-64 judging from the '!'s for base N.


          • #6
            Originally posted by yueluo:
            "And I think the quality is qual-64 judging from the '!'s for base N."

            Nope, that's ASCII-33:
            ! = 33 (Q0)
            # = 35 (Q2)
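
The arithmetic above can be checked with a short Python sketch. The function names are made up for this example. The check in `looks_like_phred33` relies on the fact that the lowest character in the 64-based encodings is ';' (ASCII 59, Solexa Q-5), so any character below that can only come from a 33-based file.

```python
def phred_scores(qual, offset=33):
    """Convert a FASTQ quality string to a list of integer Q scores."""
    return [ord(c) - offset for c in qual]


def looks_like_phred33(qual):
    """Return True if the string contains a character that is impossible
    in a 64-based encoding (anything below ASCII 59, ';')."""
    return any(ord(c) < 59 for c in qual)
```

With the default offset, '!' decodes to Q0 and '#' to Q2, matching the values listed above. A quality string containing only characters from ';' upward is ambiguous, and this check alone cannot classify it.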


            • #7
              Oops, sorry, my bad.


              • #8
                Hey guys,

                Thanks for the input. I kinda knew something was off about the fragment length, but I wouldn't have known it was paired-end. I still don't see the point in making a .fastq like that where you can't discern where one read ends and the other begins.

                Anyway, I took the quick and dirty way out and trimmed the reads to ~50bp (thank you fastx toolkit), and achieved >98% of reads mapped and post-processing is going well so far.

                All the best,

                Gordon
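
For readers without the fastx toolkit, the fixed-length trim described in this post can be sketched in plain Python. This is not the toolkit's implementation, only a minimal stand-in that assumes strict 4-line FASTQ records; the function name `truncate_fastq` and the default of 50 bases are choices made for this example.

```python
import io


def truncate_fastq(in_handle, out_handle, keep=50):
    """Copy FASTQ records, keeping only the first `keep` bases of each read.

    Assumes every record is exactly 4 lines (header, sequence, '+', quality).
    Stops at end of file or at the first blank header line.
    """
    while True:
        header = in_handle.readline()
        if not header.strip():
            break
        seq = in_handle.readline().rstrip("\n")
        in_handle.readline()  # '+' separator line, rewritten below without the name
        qual = in_handle.readline().rstrip("\n")
        out_handle.write(header.rstrip("\n") + "\n")
        out_handle.write(seq[:keep] + "\n")
        out_handle.write("+\n")
        out_handle.write(qual[:keep] + "\n")


if __name__ == "__main__":
    # Tiny in-memory demonstration with a made-up record.
    demo_in = io.StringIO("@demo\nACGTACGTAC\n+demo\nFFFFFFFF##\n")
    demo_out = io.StringIO()
    truncate_fastq(demo_in, demo_out, keep=6)
    print(demo_out.getvalue(), end="")
```

If the 216-base records really are two mates joined end to end, keeping the first ~50 bases retains only the start of the first mate, so the result is effectively single-end data.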


                • #9
                  Usually, PE reads are merged when you have a small-insert library, e.g. you have a ~300bp library but do PE250 on a MiSeq, so the PE reads will overlap at their 3' ends. After trimming the low-quality bases from both reads, you can merge them into one using the overlapping bases.
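
The overlap merge described above can be sketched as follows. This is a deliberately naive version that requires an exact match in the overlap; real merging tools tolerate mismatches and use the quality scores. The function names and the default minimum overlap of 10 bases are assumptions made for this example.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge two overlapping paired-end reads into one fragment sequence.

    read2 is given as sequenced, so it is reverse-complemented first.
    Returns the merged sequence, or None if no exact overlap of at least
    min_overlap bases is found between the end of read1 and the start
    of the reverse-complemented read2.
    """
    mate2 = reverse_complement(read2)
    longest = min(len(read1), len(mate2))
    for overlap in range(longest, min_overlap - 1, -1):
        if read1[-overlap:] == mate2[:overlap]:
            return read1 + mate2[overlap:]
    return None
```

Trying the longest overlap first means a pair that overlaps almost completely is merged on that long overlap rather than on a shorter chance match.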


                  • #10
                    I think in this case the reads were merged because the data came from the NCBI's repository (SRA), and that is the format that they use for paired reads.

                    I think there are tools to split the data back into R1 and R2; there have previously been discussions about that here on SEQanswers. I will try to find a link and post it.

                    See for example this thread on Biostars

                    You need to use fastq-dump from the SRA-Toolkit.
                    Last edited by mastal; 03-07-2014, 02:54 AM.
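
When only the already-concatenated fastq is at hand, the split back into R1 and R2 can be sketched in Python. This is not what fastq-dump does internally (it reads the mate boundaries from the SRA record, for instance through its --split-files option). The sketch below assumes, without proof, that both mates have the same length, so a length=216 record would be cut into 108 + 108. The function name and the /1 and /2 name suffixes are choices made for this example.

```python
def split_concatenated(header, seq, qual, mate1_len=None):
    """Split one concatenated paired-end record into two mate records.

    If mate1_len is not given, the read is assumed to be two equal-length
    mates joined end to end (an assumption, not something the file states).
    Returns two (header, sequence, quality) tuples.
    """
    if mate1_len is None:
        if len(seq) % 2 != 0:
            raise ValueError("odd read length; pass mate1_len explicitly")
        mate1_len = len(seq) // 2
    name = header.split()[0]  # drop the description after the first space
    mate1 = (name + "/1", seq[:mate1_len], qual[:mate1_len])
    mate2 = (name + "/2", seq[mate1_len:], qual[mate1_len:])
    return mate1, mate2
```

Re-dumping from the SRA with fastq-dump is still the safer route, because it does not depend on guessing where one mate ends and the other begins.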
