Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • vas72985
    Member
    • Jul 2010
    • 20

    Pulling out paired reads containing a specific sequence in one pair

    I am wondering if anyone knows of an easy way (or tool) to take paired Illumina fastq data and isolate only those read pairs that contain a specific sequence (say a 20 or 30mer) as the first 20 (or 30) nucleotides of one of the reads.

    I will have a 2x150 MiSeq library where some percentage of the total reads (the percentage that I want) will have one read in the pair that starts with a specific nucleotide sequence. Ideally I would like to collect all read pairs where one read in the pair starts with the specific sequence as well as have that sequence trimmed from the read containing it before I map the pairs to the genome.

    Thanks.
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    BBDuk might work for that. It can bin (or trim) reads by the presence and absence of specific kmers, like this:

    bbduk.sh in=reads.fq outm=matching.fq out=unmatching.fq literal=ATGTTACGTCT k=11

    However, it looks at all of the kmers in a read, not just the first one. Does this sequence ever occur in the middle of the reads, and if so, what would you want to do with those reads?

    Comment

    • vas72985
      Member
      • Jul 2010
      • 20

      #3
      The sequence most likely always occurs at the beginning of the read, but I suppose if something were to be slightly off with the prep, it could occur later in the read. In that case I would also like to pull out those reads. So if I understand correctly, this might work for that purpose?

      Comment

      • vas72985
        Member
        • Jul 2010
        • 20

        #4
        However, does BBDuk allow for paired data? Ie, if the kmer is in read 1, will it allow for isolation of read 1 reads containing the kmer but also of read 2 pairs for those reads it identifies?)

        Comment

        • Brian Bushnell
          Super Moderator
          • Jan 2014
          • 2709

          #5
          1) Yes, it will work perfectly, in this case.
          2) BBDuk always keeps pairs together, as long as it knows the input is paired. For twin files, the command would be:

          bbduk.sh in1=reads1.fq in2=reads2.fq outm1=matching1.fq outm2=matching2.fq out1=unmatching1.fq out2=unmatching2.fq literal=ATGTTACGTCT k=11

          You can later trim the reads with the "ktrim=l" flag.

          Comment

          • vas72985
            Member
            • Jul 2010
            • 20

            #6
            So I tried this on a very small test data set where I artificially inserted a specific 12mer (GACCAGCTAGTG) and it found all of the ones that I artificially inserted (as well as one that I didn't realize was there to begin with), but it also output a few read pairs as matches that look like they shouldn't belong. For example the read pair below:

            @IRIS:7:32:32:1772#0/1
            AAGGCTTTAGTCATGTGTTCAAGATCGAAAAAGGAA
            +
            aaaaaaaaaa`abab`a^aabaaa`ab`a`aaa`]a

            @IRIS:7:32:32:1772#0/2
            GAAGAAACCTCACAAGACTTTCACTAGATGGTCAGA
            +
            abbbaab^aaa``_aaa]`^_Z\X`W]^_a_TQ[]Z


            Any ideas why it would be making some improper calls?

            Comment

            • vas72985
              Member
              • Jul 2010
              • 20

              #7
              Basically it found all 11 sequences that I know match the 12mer, but it also pulled out an additional 9 sequences that I have no idea why they are being called matching.

              Comment

              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

                #8
                Oh - by default, it looks for both a kmer AND its reverse compliment, and ignores the middle letter of the kmer to increase sensitivity. To disable these, add these flags:

                rcomp=f mm=f

                (where rcomp means 'look for reverse-compliments of kmers' and mm means 'mask middle').

                In this case, reverse-compliment of GACCAGCTAGTG = CACTAGCTGGTC, and:
                Code:
                                     [B]CACTAG[COLOR="Red"]C[/COLOR]TGGTC[/B]
                GAAGAAACCTCACAAGACTTT[B]CACTAGATGGTC[/B]AGA
                ...the middle base is masked. So it matches read 2.

                Comment

                • vas72985
                  Member
                  • Jul 2010
                  • 20

                  #9
                  Ah, brilliant. Now it works like a charm. Thanks for the help. I'll give it a try on my actual dataset whenever I get it back. Now if only you could make that happen faster

                  Comment

                  • vas72985
                    Member
                    • Jul 2010
                    • 20

                    #10
                    Now I may be getting greedy, but is there an option that would allow me to set a threshold for mismatches between my sequence and kmer. For example I know my kmer is exact, but it's possible that I would want to allow 1 or 2 mismatches from my kmer in sequences and still have them be called "matched". Is there an option for this? It wasn't immediately obvious to me looking at the usage.

                    Thanks

                    Comment

                    • Brian Bushnell
                      Super Moderator
                      • Jan 2014
                      • 2709

                      #11
                      Yes! It is possible. You can set "hdist=1" for one mismatch or "hdist=2" for 2 (that stands for Hamming distance). You can also allow indels but that shouldn't be necessary with Illumina data.

                      Comment

                      Latest Articles

                      Collapse

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by SEQadmin2, 06-05-2026, 10:09 AM
                      0 responses
                      14 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-04-2026, 08:59 AM
                      0 responses
                      26 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-02-2026, 12:03 PM
                      0 responses
                      33 views
                      0 reactions
                      Last Post SEQadmin2  
                      Started by SEQadmin2, 06-02-2026, 11:40 AM
                      0 responses
                      23 views
                      0 reactions
                      Last Post SEQadmin2  
                      Working...