  • bowtie mapping with R1&R2 and R1

    Hello,

    I have been having a problem mapping short reads generated on an Illumina HiSeq to an assembly generated with Velvet. I went through other threads so as not to post a repeated question, but they did not solve my problem.

    Basically, it gives me very low mapping rates. Also, when I used only the R1 reads, I got a higher mapping rate, though it is still low. Does this mean I used a wrong script to split the interleaved reads into separate R1 and R2 files? Any suggestion would be appreciated.

    1. When R1&R2 are used.
    <Command>
    bowtie index -q -1 R1.fastq -2 R2.fastq output

    <Output>
    # reads processed: 1705514
    # reads with at least one reported alignment: 31904 (1.87%)
    # reads that failed to align: 1673610 (98.13%)
    Reported 31904 paired-end alignments to 1 output stream(s)

    2. When only R1 used.
    <Command>
    bowtie index -q R1.fastq output

    <Output>
    # reads processed: 1705514
    # reads with at least one reported alignment: 291167 (17.07%)
    # reads that failed to align: 1414347 (82.93%)
    Reported 291167 alignments to 1 output stream(s)

  • #2
    Okay, so we'll start with the assumption that there is some kind of mismatch between what your paired reads are and what bowtie expects them to be, and that this mismatch is causing bowtie not to make the right alignments.

    One thing: bowtie has a default maximum expected insert size (-X, which defaults to 250). Maybe that default is too small for your reads, so try cranking up the maximum expected insert size.

    The other thing to check: make sure that the nth read of R1 is the mate of the nth read of R2. If you've sorted the reads for some reason, or used certain trimmers, that can shuffle the reads around. Bowtie is not going to figure out what is mated to what by name; it will assume that the first reads of each file are mates, and so on, so if that's wrong, it could cause bowtie to assign mapping locations incorrectly.
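For the nth-read check, a quick shell spot-check along these lines may help (a minimal sketch; the file names and reads below are fabricated so it runs as-is, so point the last command at your real files):

```shell
# Spot-check that the nth record of R1 is the mate of the nth record of R2.
# R1.fastq / R2.fastq and their contents are fabricated for this demo.
printf '@read_1/1\nACGT\n+\nIIII\n@read_2/1\nTTTT\n+\nIIII\n' > R1.fastq
printf '@read_1/2\nTGCA\n+\nIIII\n@read_2/2\nAAAA\n+\nIIII\n' > R2.fastq

# Compare the header lines with any /1 or /2 suffix stripped; no diff
# output means the two files are pair-wise in sync.
diff <(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' R1.fastq) \
     <(awk 'NR % 4 == 1 { sub(/\/[12]$/, ""); print }' R2.fastq) \
  && echo "files are in sync"
```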

    Another thing to try: use bwa instead. bwa aligns each read of a pair separately, then puts them together to make the .sam (or you can keep the two separate, if you want), and it doesn't make any a priori assumptions about insert size and orientation.



    • #3
      Thanks, swbarnes2!

      Based on your suggestion,

      1. The insert size is 560 bp according to the information provided by the sequencing facility I used. I added -X 650 to the command; the mapping rate increased when R1 and R2 were used and stayed the same when only R1 was used.

      <R1&R2>
      # reads processed: 1705514
      # reads with at least one reported alignment: 108218 (6.35%)
      # reads that failed to align: 1597296 (93.65%)
      Reported 108218 paired-end alignments to 1 output stream(s)

      <R1>
      # reads processed: 1705514
      # reads with at least one reported alignment: 291167 (17.07%)
      # reads that failed to align: 1414347 (82.93%)
      Reported 291167 alignments to 1 output stream(s)

      2. The nth read of R1 is the mate to the nth read of R2.

      I still have no idea why the mapping rates are low and why using single-end reads gives me a higher mapping rate. I am about to try bwa and will post what I get.

      Also, I've heard that sometimes low mapping is due to a high diversity of species in a sample. Could someone help me clarify this? Thanks!
      Last edited by morning latte; 11-18-2013, 05:26 PM.



      • #4
        Are you mapping to an assembly that was *not* created using these reads? That is not clear.

        Were the original reads you received interleaved and you then split them into two files?

        Are the reads QC'ed/appropriately trimmed?



        • #5
          Sorry that I was not clear.

          I am mapping to an assembly that was created using these reads. I got separate R1 and R2 files from the sequencing core and interleaved them before QC. After QC, I extracted one file containing only pairs and another containing orphans, for the Velvet assembly. For the bowtie mapping, I further split the pairs-only file into R1 and R2 files. I've just got the result from the bwa mapping; this is my first time running bwa, so hopefully I didn't make any mistakes, but I guess something is wrong with the pairs. Does forward strand mean R1 and reverse strand mean R2? Shouldn't those each be about 50%? Can you share a script I can use to split interleaved reads into separate R1 and R2 files? Thanks a lot for your help in advance.

          Total reads: 3411028
          Mapped reads: 635188 (18.6216%)
          Forward strand: 3023182 (88.6296%)
          Reverse strand: 387846 (11.3704%)
          Failed QC: 0 (0%)
          Duplicates: 0 (0%)
          Paired-end reads: 3411028 (100%)
          'Proper-pairs': 403004 (11.8147%)
          Both pairs mapped: 403075 (11.8168%)
          Read 1: 1705514
          Read 2: 1705514
          Singletons: 232113 (6.80478%)



          • #6
            There are several scripts floating around that can do the split: https://github.com/faircloth-lab/phy...split_reads.py



            That said, hopefully your interleaved file does not have scrambled reads (i.e. the two reads of each pair should be next to each other).

            Why not start with the original read files that you received from the core and then use a paired-end-aware trimming tool (like Trimmomatic or cutadapt) that can keep the R1/R2 files in sync? Then use the trimmed files to do the alignment.

            Most aligners expect paired-end reads to be in the same order in the R1 and R2 files. Perhaps your R1/R2 files have somehow gotten out of order (can you check that?), and that may be causing the low percentage of alignment.
            Last edited by GenoMax; 11-18-2013, 06:32 PM.



            • #7
              I've been playing with a fun little script that does the de-interleaving. Not mine, but I'll share it with everyone:

              Code:
              #!/bin/bash
              # Usage: deinterleave_fastq.sh < interleaved.fastq f.fastq r.fastq
              #
              # Deinterleaves a FASTQ file of paired reads into two FASTQ
              # files specified on the command line.
              #
              # Can deinterleave 100 million paired reads (200 million total
              # reads; a 43Gbyte file), in memory (/dev/shm), in 4m15s (255s)
              #
              # Latest code: https://gist.github.com/3521724
              # Also see my interleaving script: https://gist.github.com/4544979
              #
              # Inspired by Torsten Seemann's blog post:
              # http://thegenomefactory.blogspot.com.au/2012/05/cool-use-of-unix-paste-with-ngs.html
              
              paste - - - - - - - - | tee >(cut -f 1-4 | tr "\t" "\n" > $1) | cut -f 5-8 | tr "\t" "\n" > $2
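A two-pass variant of the same paste idea can be easier to script, since it avoids the case where tee's process substitution is still writing when the pipeline returns; the two-pair interleaved file and all file names below are fabricated for the demo:

```shell
# Two-pass variant of the paste one-liner above: read the interleaved file
# twice, once for the R1 records (fields 1-4) and once for R2 (fields 5-8).
# The input file and all names here are fabricated.
printf '@p1/1\nAC\n+\nII\n@p1/2\nGT\n+\nII\n@p2/1\nTT\n+\nII\n@p2/2\nCC\n+\nII\n' > interleaved.fastq

paste - - - - - - - - < interleaved.fastq | cut -f 1-4 | tr "\t" "\n" > R1.fastq
paste - - - - - - - - < interleaved.fastq | cut -f 5-8 | tr "\t" "\n" > R2.fastq

head -1 R1.fastq   # @p1/1
head -1 R2.fastq   # @p1/2
```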
              What is the coverage for the regions you have assembled? Velvet will tell you what fraction of your reads were actually used; this gives an indirect upper bound on how much mapping you would expect for your assembly. For instance, from one of my velvet runs:

              Final graph has 124617 nodes and n50 of 10722, max 81190, total 23259331, using 49339054/119833599 reads
              An assembly using 41% of my reads might not have more than 41% map back to it via bwa. And if your assembly doesn't have the content that you are trying to map, then it won't map. Grepping the assembly for the nucleotide content of one of your unmapped reads might at least tell you if the read was used in the assembly.
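The grep check might look like this (a minimal sketch; contigs.fa and the query sequence are fabricated, and note that a plain grep misses matches that span FASTA line wraps or lie on the reverse strand):

```shell
# Look for a read's sequence verbatim in the assembly FASTA.
# contigs.fa and the query are fabricated; a plain grep will miss matches
# that span FASTA line wraps or sit on the reverse strand.
printf '>contig_1\nAAACCCGGGTTTACGTACGT\n' > contigs.fa
read_seq="CCCGGGTTTACG"

if grep -q "$read_seq" contigs.fa; then
  echo "read sequence found in assembly"
else
  echo "read sequence not found (did not check the reverse complement)"
fi
```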

              Velvet also has a perl script that extracts the reads used for particular contigs.
              Last edited by ctseto; 11-19-2013, 05:47 AM.



              • #8
                Thanks ctseto.

                From the Log file of my velvet run below, 73% of reads were used for assembly.

                Final graph has 297031 nodes and n50 of 149, max 2332, total 24838030, using 2709100/3703537 reads

                I tried the commands suggested by you and GenoMax to split the paired-end reads, but I'm getting the same result: a 17% mapping rate with single-end reads and 6% with paired-end reads. I am totally lost and do not know what else I should try. This is so frustrating.



                • #9
                  Okay, now I have more fundamental questions. Could somebody help me figure out two questions? I really appreciate any advice. Thanks a lot in advance.

                  1. Is it possible that mapping to the assembly with single-end reads gives a higher mapping rate than mapping with paired-end reads? Shouldn't paired reads give a higher mapping rate in any case?

                  2. My metagenome is from an environmental sample. Is it possible that the low mapping rate back to the assembly is due to the characteristics of the metagenome rather than to errors in the assembly or mapping?



                  • #10
                    Your N50 is low and your max contig is 2332 bp.

                    Perhaps try mapping just to your largest contig to assess coverage for now?

                    Also, are you running MetaVelvet downstream of this?



                    • #11
                      Thanks for your response.

                      What do you mean by running MetaVelvet downstream of this? Do you mean the assembler I am using? I am using Velvet.

                      My metagenome consists of viruses from an environmental sample. I am not surprised that the max contig is small. I should have mentioned it earlier.

                      I will map to the largest contig as you suggested, but I am not certain how to do that. Basically, I extract the longest contig, build an index from it, and map? Is there a command to extract the longest contig from a file? Thanks a lot for your help.



                      • #12
                        Originally posted by morning latte View Post
                        Okay, now I have more fundamental questions. Could somebody help me figure out two questions? I really appreciate any advice. Thanks a lot in advance.

                        1. Is it possible that mapping to the assembly with single-end reads gives a higher mapping rate than mapping with paired-end reads? Shouldn't paired reads give a higher mapping rate in any case?

                        2. My metagenome is from an environmental sample. Is it possible that the low mapping rate back to the assembly is due to the characteristics of the metagenome rather than to errors in the assembly or mapping?
                        1. Mapping single-end gives you greater flexibility in forcing maps, but paired-end maps are generally considered more reliable, since for a pair to map, both reads must meet the required conditions. In SE mapping, your only constraint is the identity of the read. In PE mapping, your constraints are the orientation of the pair (innie/outie), the distance between the reads, and the content of both reads matching the reference sequence.

                        --------------
                        Based on what your Velvet Log is telling you, you used a fair amount of the reads in the assembly. But when you back-map this doesn't appear to be the case.

                        Originally posted by morning latte View Post
                        Thanks for your response.

                        What do you mean by running MetaVelvet downstream of this? Do you mean the assembler I am using? I am using Velvet.

                        My metagenome consists of viruses from an environmental sample. I am not surprised that the max contig is small. I should have mentioned it earlier.

                        I will map to the largest contig as you suggested, but I am not certain how to do that. Basically, I extract the longest contig, build an index from it, and map? Is there a command to extract the longest contig from a file? Thanks a lot for your help.


                        A metagenomic assembler is theoretically supposed to be able to deal with a mixed-species assembly (though in practice, most de novo assembly is a mixed-species problem, with the species of interest intermixed with contaminants). meta-velvetg is run after velvetg, but it might not be a silver bullet in this case.

                        An N50 of 149 is very small (it means that half of your draft assembly is contained in contigs of 149 nt or longer), and a largest contig of ~2.3 kb is also quite small. Before determining coverage, it might be worth BLASTing that longest contig to see what it is. If it's viral, then index it, map the reads to it, and observe how the coverage looks. I don't recall exactly how to pull the biggest node out of velvet in a one-liner; hopefully someone else can post one.
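One hedged sketch of such a one-liner, using awk on a FASTA file like velvet's contigs.fa (the three-record input below is fabricated for the demo):

```shell
# Pull the longest sequence out of a FASTA file (e.g. velvet's contigs.fa).
# The three-record input is fabricated.
printf '>node_1\nACGT\n>node_2\nACGTACGTACGT\n>node_3\nACGTACGT\n' > contigs.fa

# Accumulate each record's sequence, keep the longest seen, print it at END.
awk '/^>/ { if (seq != "" && length(seq) > max) { max = length(seq); best = hdr "\n" seq }
            hdr = $0; seq = ""; next }
     { seq = seq $0 }
     END { if (length(seq) > max) best = hdr "\n" seq; print best }' contigs.fa > longest.fa

cat longest.fa
# >node_2
# ACGTACGTACGT
```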

                        velvetg also has a -min_contig_lgth option; bumping it up screens out contigs that are too small to be very meaningful. This may be helpful if you wish to do contig mapping; for instance, you could set -min_contig_lgth to 1 kb to screen out a lot of your small contigs.

                        Which k-mer lengths did you attempt to assemble at? Did you use VelvetOptimiser to determine them? Also, there are other assemblers out there: benchmark a few and determine which ones work best for you.
                        Last edited by ctseto; 11-20-2013, 11:21 AM.



                        • #13
                          I really appreciate your advice and the time you've spent on this. Now I am clear about using single-end versus paired-end reads for mapping.

                          For the Velvet assembly, I tried k-mers from 21 to 43 and chose 21, as it gave me the highest number of contigs (do you think I should use VelvetOptimiser to determine it?). As far as I know, the N50 value in the Log file is not correct, so I used a script to compute the N50 and got 466 bp. Before using bowtie, I extracted the contigs longer than 300 bp; 300 bp is a commonly used cutoff for viral metagenomes as they have small genomes.
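The length filter described here can be sketched with awk (a minimal sketch assuming one sequence line per record; real contigs.fa files are often line-wrapped, so linearize first, and the file names below are made up):

```shell
# Keep only contigs of at least 300 bp. Assumes one sequence line per
# FASTA record; the input and file names are fabricated for the demo.
printf '>short\n%s\n>long\n%s\n' \
  "$(printf 'A%.0s' $(seq 1 100))" \
  "$(printf 'A%.0s' $(seq 1 400))" > contigs.fa

awk '/^>/ { hdr = $0; next } length($0) >= 300 { print hdr "\n" $0 }' \
  contigs.fa > contigs_300.fa

grep -c '^>' contigs_300.fa   # 1
```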



                          • #14
                            Fair enough. I'm sure you're aware of NCBI's Viral Genomes Resource; it might be a good place to check for reference genomes if you haven't done so already. In the bacteria/eukaryote world there are lists of essential genes to look for in each contig; perhaps in the virus world you could look for capsid proteins, reverse transcriptase for retroviruses, viral proteases, etc.

                            Also, isn't 300 nt a little small? Sure, viruses can use alternative splicing to get a lot out of a small genome, but 300 still seems short. I will defer to your subject-matter expertise, though.
                            Last edited by ctseto; 11-20-2013, 02:28 PM.

