  • JJenks
    Junior Member
    • May 2012
    • 6

    Merging paired end reads for BLAST

    Hi All,

    I've just read the various threads about dealing with paired end reads, but none seemed to address my problem.
    I've got several metagenomic datasets consisting of paired end reads from Illumina MiSeq technology, which we are planning on BLASTing. Reads are 100bp in length and are from a 300-400 bp fraction, so will not overlap. I'd like to know if there is a way in which I can combine each pair into a single file, which can be BLASTed to increase the accuracy of the BLAST.
    Also, would I be correct in saying that I need to reverse complement the R2 read before combining?

    Sorry if this is a little vague, I can provide more information if required.

    Thanks
    Joe
  • JackieBadger
    Senior Member
    • Mar 2009
    • 385

    #2
    Yes. Reverse complement R2 using the FASTX-Toolkit.
    Then you can upload your files to Galaxy: convert from FASTQ to tabular format, and use the cut/merge column functions under Text Manipulation to join the reads end to end.


    • swbarnes2
      Senior Member
      • May 2008
      • 910

      #3
      Yes, rev-comp read 2.

      Just write a script to do it. It would be pretty straightforward.

      Next-gen sequencing is kind of hard to do without a little unix and scripting ability.
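
A minimal sketch of the kind of script suggested above, assuming plain 4-line FASTQ records and two files that list the pairs in the same order with matching titles. The function names and the optional `spacer` argument are illustrative, not from any toolkit mentioned in the thread.

```python
# Sketch only: join each R1 read to the reverse complement of its R2 mate.
# Assumes 4-line FASTQ records and identically ordered R1/R2 inputs.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(r1_seq, r2_seq, spacer=""):
    """Return R1 followed by an optional spacer and the reverse complement of R2."""
    return r1_seq + spacer + reverse_complement(r2_seq)


def read_fastq(lines):
    """Yield (name, sequence) from an iterable of FASTQ lines (file handle or list)."""
    it = iter(lines)
    for title in it:
        title = title.strip()
        if not title:
            continue
        seq = next(it, "").strip()
        next(it, "")  # '+' separator line
        next(it, "")  # quality line, not needed for FASTA output
        yield title[1:].split()[0], seq


def join_fastq_pairs(r1_lines, r2_lines, spacer=""):
    """Yield one FASTA record per read pair, failing loudly if the pairs get out of step."""
    for (name1, seq1), (name2, seq2) in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        if name1 != name2:
            raise ValueError(f"pair mismatch: {name1} vs {name2}")
        yield f">{name1}\n{join_pair(seq1, seq2, spacer)}"
```

Passing two open file handles to `join_fastq_pairs` and writing each yielded record to a FASTA file gives one joined sequence per pair. Whether to put a run of Ns between the mates (`spacer="NNNNNNNNNN"`, say) is a judgment call that this sketch leaves to the caller.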


      • ucpete
        Member
        • Dec 2008
        • 35

        #4
        I'm not sure what you're aiming for, really. I'm not aware of BLAST having any special way of using paired-end information to make alignments more accurate, especially because your reads don't overlap. If what you're trying to do is resolve discordant read pair alignments using BLAST as your aligner, you definitely do NOT need to take the reverse complement first, and you definitely do NOT want to merge the R1 and R2 data before aligning: each read in the pair has the exact same title (the only difference in identification being the file from which they derive), so you won't be able to deconvolve the results afterwards!

        What I do in these situations is run two BLASTs, one for read 1 and another for read 2, then parse the results to output the concordant information. For example, R1 aligns to organisms A, B, and C equally well, and R2 aligns well to organism B. That pair would be called as deriving from organism B. Oftentimes I will consider the E-value when making these calls as well, whether that means using only the top-scoring hit per query or using the score to break ties.

        Also, taking the reverse complement is nonsensical if this is your situation, as BLAST already searches both strands. Hope this helps!
        Last edited by ucpete; 02-22-2013, 04:34 PM. Reason: typo
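
The concordance call described in this post can be sketched roughly as follows. This assumes the hits for each read have already been parsed into a dictionary mapping organism (or subject) to bit score; the function names and the `tolerance` parameter are hypothetical, not part of BLAST.

```python
# Sketch only: call a read pair from the organisms that are top hits for BOTH reads.
def top_subjects(hits, tolerance=0.0):
    """Return the subjects whose bit score is within `tolerance` of the best score."""
    if not hits:
        return set()
    best = max(hits.values())
    return {subject for subject, score in hits.items() if best - score <= tolerance}


def call_pair(r1_hits, r2_hits, tolerance=0.0):
    """Return the set of subjects that are top hits for both R1 and R2.

    One element means an unambiguous call; an empty set or several
    elements means the pair stays unresolved.
    """
    return top_subjects(r1_hits, tolerance) & top_subjects(r2_hits, tolerance)


# The example from the post: R1 hits A, B and C equally well, R2 hits B.
print(call_pair({"A": 60.0, "B": 60.0, "C": 60.0}, {"B": 55.0}))
```

Using E-values instead of bit scores, or loosening `tolerance` so that near-ties count as equally good, are variations on the same idea.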


        • JJenks
          Junior Member
          • May 2012
          • 6

          #5
          Thanks for the advice! We wanted to combine the pairs for two reasons: first, to increase the amount of sequence available for the BLAST search, and second, to reduce our dataset size. Surely a BLAST of the combined reads would decrease the likelihood of returning multiple alignments, since there would be 2x the amount of sequence?


          • ucpete
            Member
            • Dec 2008
            • 35

            #6
            BLASTing each read file separately or BLASTing them in the same file will search the same amount of sequence. There will be no reduction in search space unless reads 1 and 2 are overlapping, and you merged them first by assembly. But you said there is no overlap. BLAST returns all valid alignments with E-values less than your threshold so you will get the same number of alignments whether you have all the reads in one file or you BLAST both reads separately. But if you merge them without modifying the FASTA/Q title, you will have the problem of not being able to distinguish which read is which as the read titles are exactly the same for each read in the pair. Why not run two BLASTs?! It takes the same amount of time, it produces the same output, but you will actually know which read is which!
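
If both reads do have to go into a single query file, the title clash mentioned above can be avoided by tagging each title first. A minimal sketch, assuming FASTA input (in FASTQ a quality line can itself begin with ">" or "@", so this line-by-line approach is not safe there); the function name is illustrative.

```python
# Sketch only: append /1 or /2 to the read name in every FASTA title line.
def tag_fasta_titles(lines, mate):
    """Yield FASTA lines with `/<mate>` appended to the first word of each title."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            name, _, rest = line[1:].partition(" ")
            line = f">{name}/{mate}" + (f" {rest}" if rest else "")
        yield line


print("\n".join(tag_fasta_titles([">read1 1:N:0\n", "ACGT\n"], 1)))
```

Run once with `mate=1` on the R1 file and once with `mate=2` on the R2 file; the tagged query IDs then show up in the first column of the BLAST output.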


            • JJenks
              Junior Member
              • May 2012
              • 6

              #7
              Ok, that makes more sense. Thanks!


              • felvis56
                Member
                • Dec 2012
                • 11

                #8
                I am looking to combine R1 and R2 for a BLAST search, as we sequenced amplicons and I want to BLAST the full 500 bp rather than run 2 x 250 bp searches. Can I combine the files for this?


                • JJenks
                  Junior Member
                  • May 2012
                  • 6

                  #9
                  For amplicon data, try PANDAseq (PAired-eND Assembler for DNA sequences; neufeld/pandaseq on GitHub). It merges read pairs that have a region of overlap, and can also be used to remove primers/barcodes.
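
To illustrate what merging overlapping pairs means, here is a toy sketch. It is NOT PANDAseq's algorithm, which scores overlaps using base qualities; this version only accepts an exact match between the end of R1 and the start of the reverse-complemented R2. The function names and `min_overlap` are hypothetical.

```python
# Toy sketch only: exact-match overlap merge, not a substitute for a real merger.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_overlapping(r1, r2, min_overlap=10):
    """Merge R1 with reverse-complemented R2 at their longest exact overlap.

    Returns the merged sequence, or None if no overlap of at least
    `min_overlap` bases is found.
    """
    r2_rc = reverse_complement(r2)
    for k in range(min(len(r1), len(r2_rc)), min_overlap - 1, -1):
        if r1[-k:] == r2_rc[:k]:
            return r1 + r2_rc[k:]
    return None
```

Real reads carry sequencing errors, especially towards their 3' ends, which is why an exact-match rule like this one would drop many true overlaps.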


                  • indugun
                    Junior Member
                    • Apr 2013
                    • 4

                    #10
                    Parsing the blast results

                    Hello,
                    I have BLASTed reads R1 and R2 separately. The results for one read pair are shown below. Could you please suggest how to parse the R1 and R2 results to select the appropriate protein?

                    For R1
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2YXR6|GLPK_STAAB 65.1 43 15 0 136 8 438 480 1.4e-10 65.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6GHD5|GLPK_STAAR 65.1 43 15 0 136 8 438 480 1.4e-10 65.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K643|GLPK_THEPX 67.4 43 14 0 136 8 437 479 2.3e-10 65.1
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6U1B8|GLPK_STAA2 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P63741|GLPK_STAAM 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8NWX7|GLPK_STAAW 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7X1U3|GLPK_STAA1 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FHD9|GLPK_STAA3 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5ISI2|GLPK_STAA9 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P99113|GLPK_STAAN 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6G9R3|GLPK_STAAS 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A8Z1X0|GLPK_STAAT 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FYZ5|GLPK_STAA8 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6QGJ8|GLPK_STAAE 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HGD2|GLPK_STAAC 62.8 43 16 0 136 8 438 480 3.9e-10 64.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K754|GLPK_THEP3 65.1 43 15 0 136 8 437 479 5.1e-10 63.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q9KDW8|GLPK_BACHD 62.8 43 16 0 136 8 439 481 6.7e-10 63.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8R8J4|GLPK_CALS4 60.5 43 17 0 136 8 437 479 1.5e-09 62.4
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8CSS0|GLPK_STAES 65.1 43 15 0 136 8 438 480 2.6e-09 61.6
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HPP1|GLPK_STAEQ 65.1 43 15 0 136 8 438 480 2.6e-09 61.6
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C4ZGB4|GLPK_AGARV 56.8 44 19 0 136 5 437 480 4.4e-09 60.8
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C6C1M7|GLPK_DESAD 60.5 43 17 0 136 8 437 479 9.7e-09 59.7
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B2FI02|GLPK_STRMK 58.1 43 18 0 136 8 439 481 1.3e-08 59.3
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B4SJT3|GLPK_STRM5 58.1 43 18 0 136 8 439 481 1.7e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B2I618|GLPK_XYLF2 53.7 41 19 0 136 14 439 479 2.2e-08 58.5

                    For R2:
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K643|GLPK_THEPX 77.1 35 8 0 13 117 428 462 5.8e-10 63.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B0K754|GLPK_THEP3 74.3 35 9 0 13 117 428 462 1.3e-09 62.4
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C4ZGB4|GLPK_AGARV 71.4 35 10 0 13 117 428 462 1.7e-09 62.0
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|C6C1M7|GLPK_DESAD 71.4 35 10 0 13 117 428 462 2.9e-09 61.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8R8J4|GLPK_CALS4 71.4 35 10 0 13 117 428 462 2.9e-09 61.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2YXR6|GLPK_STAAB 71.4 35 10 0 13 117 429 463 4.9e-09 60.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6GHD5|GLPK_STAAR 71.4 35 10 0 13 117 429 463 4.9e-09 60.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6U1B8|GLPK_STAA2 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P63741|GLPK_STAAM 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q8NWX7|GLPK_STAAW 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7X1U3|GLPK_STAA1 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FHD9|GLPK_STAA3 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5ISI2|GLPK_STAA9 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|P99113|GLPK_STAAN 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q6G9R3|GLPK_STAAS 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A8Z1X0|GLPK_STAAT 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q9KDW8|GLPK_BACHD 68.6 35 11 0 13 117 430 464 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q2FYZ5|GLPK_STAA8 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A6QGJ8|GLPK_STAAE 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A4J8E6|GLPK_DESRM 68.6 35 11 0 13 117 431 465 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|Q5HGD2|GLPK_STAAC 68.6 35 11 0 13 117 429 463 1.4e-08 58.9
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B8FXS7|GLPK_DESHD 71.9 32 9 0 10 105 428 459 1.9e-08 58.5
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A7FX30|GLPK_CLOB1 63.9 36 13 0 10 117 427 462 2.4e-08 58.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|A5I5M0|GLPK_CLOBH 63.9 36 13 0 10 117 427 462 2.4e-08 58.2
                    NS500568:58:HC2VKAFXX:4:21612:4814:20388 sp|B1IKJ7|GLPK_CLOBK 63.9 36 13 0 10 117 427 462 2.4e-08 58.2




                    Last edited by indugun; 11-05-2018, 10:45 AM.
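
One possible way to parse results like these, offered as a sketch rather than a definitive answer: keep only the proteins hit by both reads and rank them by summed bit score. This assumes the rows are standard 12-column tabular output with the subject ID in column 2 and the bit score in column 12; the function names are illustrative.

```python
# Sketch only: rank proteins hit by both R1 and R2 by their summed bit score.
def best_hits(lines):
    """Map each subject ID to its highest bit score in 12-column tabular rows."""
    best = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or malformed rows
        subject, bitscore = fields[1], float(fields[11])
        best[subject] = max(bitscore, best.get(subject, 0.0))
    return best


def rank_shared_subjects(r1_lines, r2_lines):
    """Return (subject, summed bit score) for subjects hit by both reads, best first."""
    r1, r2 = best_hits(r1_lines), best_hits(r2_lines)
    shared = set(r1) & set(r2)
    return sorted(((s, r1[s] + r2[s]) for s in shared), key=lambda p: (-p[1], p[0]))
```

Fed the R1 and R2 rows for one read pair, the first element of the returned list is the candidate with the strongest combined support. Many of the hits above differ by a fraction of a bit, so a near-tie at the top is better reported as ambiguous than forced into a single call.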

