  • cerebralrust
    Junior Member
    • Jan 2012
    • 8

    Too few reads mapping back to contigs

    I assembled plant transcriptome 454 data (non-normalised) using Trinity after the following steps:

    1) pre-processing (removal of adaptors and vector contamination)
    2) removal of rRNA sequences
    3) removal of chloroplast and mitochondrial genes using bwa

    From 370,929 reads, I got 21,486 contigs. When I mapped the reads back to the contigs using bwa, only 44,678 reads mapped, suggesting that only these were used in the assembly. What am I doing wrong here? I BLASTed a random selection of the contigs and found that they share over 90% similarity with proteins from related legumes (although many were hypothetical).
    However, only a small percentage of the contigs align to the transcript assemblies of related legumes when mapped using bwa.

    The Velvet assembly of the same data resulted in 15,323 contigs with a lower N50, N90, maximum length, etc.
    The MIRA assembly resulted in more contigs and more reads being used, but a lower N50, N90 and average contig length.
    Why are only 44,678 reads being used? Any advice is greatly appreciated.
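For readers comparing assemblies by N50 and N90 as above, here is a minimal sketch of how these statistics are usually computed from a list of contig lengths. The function name `nx` and the example lengths are made up for illustration; they are not taken from any of the assemblies in this thread.

```python
def nx(lengths, fraction=0.5):
    """Return the Nx statistic for a list of contig lengths.

    N50 (fraction=0.5) is the length L such that contigs of length >= L
    together contain at least half of all assembled bases; N90 uses 0.9.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig down until the target share is reached.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length


# Hypothetical contig lengths, for illustration only.
example_lengths = [100, 200, 300, 400, 500]
print("N50:", nx(example_lengths, 0.5))
print("N90:", nx(example_lengths, 0.9))
```

A higher N50 on its own does not mean a better transcriptome assembly, so it is worth reading it alongside the number of reads that map back.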
  • seqret
    Junior Member
    • Sep 2009
    • 1

    #2
    I assume you used Newbler, and that you had about 370 thousand reads? Did you check the 454NewblerMetrics.txt file and/or the 454ReadStatus.txt file to determine how many reads the assembler thought it used? I would guess that the assembly was very fragmented, so that many of the reads ended up in contigs that were too small to report. When doing transcriptome assemblies, Newbler has some rules about what gets reported as isotigs, contigs, or not reported at all. I don't remember them all off the top of my head.

    Also, you did tell the assembler that this is a cDNA assembly project, correct?

    • westerman
      Rick Westerman
      • Jun 2008
      • 1104

      #3
      @seqret ... note his first line. He used Trinity, not newbler. Then he used Velvet and MIRA.

      Originally posted by cerebralrust View Post
      I assembled plant transcriptome 454 data (non-normalised) using Trinity

      I have been thinking about this problem. It is hard to tell without looking at the data. However, it is possible that Trinity, Velvet and MIRA are not up to the task. If you are recommending Newbler, then I heartily agree with that idea.

      • kmcarr
        Senior Member
        • May 2008
        • 1181

        #4
        I'm wondering if the problem is not with the assembly but with the mapping. Is bwa the best tool to use here, or were the options used appropriate? (I'm asking because I'm not that familiar with bwa.) Frankly, if I had a set of contigs (putative transcripts) and wanted to map raw 454 reads back to them just to count I would use blat.

        • lh3
          Senior Member
          • Feb 2008
          • 686

          #5
          For 454, I recommend bwasw, bowtie2, smalt or tmap. Blat is a bit slow and does not output SAM.
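Whichever of these mappers is used, the number of reads that map back can be tallied from the SAM output. Below is a minimal sketch that counts distinct mapped reads per contig. It assumes single-end reads (as with 454 data), so each read name is counted once. The function name and the demo records are hypothetical. The flag bits used (0x4 unmapped, 0x100 secondary, 0x800 supplementary) are those defined in the SAM specification.

```python
from collections import Counter


def count_mapped_reads(sam_lines):
    """Count distinct mapped reads per reference from SAM text lines.

    Returns (reads per contig, number of mapped reads, number of reads seen).
    Assumes single-end reads, so a read name identifies exactly one read.
    """
    per_contig = Counter()
    mapped = set()
    all_reads = set()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and SAM header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        name, flag, ref = fields[0], int(fields[1]), fields[2]
        all_reads.add(name)
        if flag & 0x4:
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary or supplementary alignment
        if name in mapped:
            continue  # count a read only once, even with several hits
        mapped.add(name)
        per_contig[ref] += 1
    return per_contig, len(mapped), len(all_reads)


# Hypothetical SAM records, for illustration only.
demo_sam = [
    "@SQ\tSN:contig1\tLN:500",
    "read1\t0\tcontig1\t1\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t16\tcontig2\t5\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t256\tcontig1\t9\t0\t4M\t*\t0\t0\tACGT\tIIII",
    "read3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
per_contig, n_mapped, n_total = count_mapped_reads(demo_sam)
print(dict(per_contig), n_mapped, n_total)
```

For a real file, the same function can be given an open file handle in place of the demo list.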

          • Jeremy
            Senior Member
            • Nov 2009
            • 190

            #6
            I would recommend Newbler since it has been specifically designed for 454 data.
            I am assuming that by mapping the reads back you are trying to get read counts per contig/isotig/isogroup, yes?

            If you use Newbler you can get read counts per contig from the 454ReadStatus.txt file that is produced when you perform a transcriptome assembly. Just grep for 'Assembled' and count the number of times each contig appears; if you have different samples in different lanes you can do the appropriate grep to subset them as well. This file lists the 3' and 5' match of each read, so you effectively count each read twice. I don't think that is a problem since the reads are generally pretty long to begin with. This method means that some contigs may have a zero or low read count, but it does count every read, so that should not be a problem after you sum the read counts of contigs to form read counts per isotig.
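The counting described above can be sketched in Python as follows. This is only an illustration. It assumes that 454ReadStatus.txt is tab-delimited with the columns accession, read status, 5' contig, 5' position, 5' strand, 3' contig, 3' position, 3' strand. That layout is an assumption and should be checked against the header line of your own file. The function name and the demo rows are hypothetical.

```python
from collections import Counter


def reads_per_contig(status_lines, five_prime_col=2, three_prime_col=5):
    """Tally 5' and 3' contig assignments of reads with status 'Assembled'.

    Column indices are 0-based and reflect an ASSUMED column layout:
    accession, status, 5' contig, 5' pos, 5' strand, 3' contig, 3' pos, 3' strand.
    Each read contributes two counts (one per end), as described in the post.
    """
    counts = Counter()
    for line in status_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the header, short rows and any read whose status is not 'Assembled'.
        if len(fields) <= three_prime_col or fields[1] != "Assembled":
            continue
        counts[fields[five_prime_col]] += 1
        counts[fields[three_prime_col]] += 1
    return counts


# Hypothetical rows in the assumed layout, for illustration only.
demo_status = [
    "Accno\tRead Status\t5' Contig\t5' Pos\t5' Strand\t3' Contig\t3' Pos\t3' Strand",
    "read1\tAssembled\tcontig00001\t10\t+\tcontig00001\t400\t+",
    "read2\tAssembled\tcontig00001\t50\t-\tcontig00002\t20\t+",
    "read3\tSingleton",
]
print(dict(reads_per_contig(demo_status)))
```

Unlike a plain grep for 'Assembled', this sketch compares the whole status field, so it would not pick up any other status value that merely contains that word.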

            Alternatively, you can grep 'Assembled', make a subset of the assembled reads, and then map them back to your contigs using GSMapper. I recommend only using reads with the Assembled status to minimise false mapping. I also use the mapping for SNP discovery, so I set -ais 1, which means that the mapped read needs to be a very good match.
            Last edited by Jeremy; 02-23-2012, 10:22 PM.

            • cerebralrust
              Junior Member
              • Jan 2012
              • 8

              #7
              Thank you for all your suggestions, members!

              @seqret : As Rick pointed out, I have never used Newbler.

              @Rick : Using Newbler is not an option, I guess, since it is not open source and we got the sequence data from a collaborator in the US. Perhaps my only option is to optimise the MIRA parameters to improve the assembly?

              @kmcarr : I was wondering about the mapping too. I will try mapping with bwasw and bowtie2, as lh3 suggested, since I also need the results in SAM format.

              @lh3 : I will try them all, compare and pick the best one.

              @Jeremy : As I mentioned before, Newbler is not an option since it is not open source and I'm a poor undergraduate student. But I will keep your suggestions in mind for the future.

              I suppose I'm left with the option of using MIRA with various combinations of parameters to get the best assembly.

              In case it helps anyone: I should not have used Trinity for this data, going by the opinion of Brian J. Haas, one of the key developers of Trinity:

              "Ultimately, Trinity might not be the best tool for assembling 454 data, since coverage won't be anywhere near what is expected from Illumina in most cases, and Trinity exploits the high coverage data as part of reconstructing transcripts. The current version of Newbler is supposed to work especially well for 454 transcriptome data, so I encourage you to give that a try if you haven't already."

              • kmcarr
                Senior Member
                • May 2008
                • 1181

                #8
                Originally posted by cerebralrust View Post
                @Jeremy : As I mentioned before, Newbler is not an option since it is not open source and I'm a poor undergraduate student. But I will keep your suggestions in mind for the future.
                Newbler may be proprietary but proprietary != $. You can obtain Newbler free of charge by completing the software request at this webpage. Note: I'm not sure if there are any restrictions for non-USA distribution.

                • Jeremy
                  Senior Member
                  • Nov 2009
                  • 190

                  #9
                  Once you do get Newbler, you should use the .sff file(s) for assembly and mapping. This file has the quality scores as well as the fasta sequence so it will produce much better results than just a .txt of the sequence.
