  • how to parse reads containing a particular sequence in any orientation

    I'm pretty new to bioinformatics, so sorry if this isn't worth asking here.

    I have a FLASH-stitched FASTQ file from my paired-end data, from which I want to extract the reads containing a particular sequence, or part of that sequence, in any orientation, and write them to a new FASTQ file. Is there an easy tool or piece of code to do that?
    Last edited by PoorSeq; 10-30-2013, 10:25 PM.

  • #2
    I have a FLASH-stitched FASTQ file from my paired-end data, from which I want to extract the reads containing a particular sequence
    Okay, that's doable with grep, and very quick.

    or part of that sequence
    Wait, what? You want any subsequence? Will one base pair do? What are your limits?

    • #3
      If you are trying to get reads that align to a particular sequence, try bowtie2. Not sure if that's your purpose though?

      • #4
        Originally posted by PoorSeq
        I'm pretty new to bioinformatics, so sorry if this isn't worth asking here.

        I have a FLASH-stitched FASTQ file from my paired-end data, from which I want to extract the reads containing a particular sequence, or part of that sequence, in any orientation, and write them to a new FASTQ file. Is there an easy tool or piece of code to do that?
        Hi, as gringer pointed out, you should clarify what you are trying to do. Anyway, this sequence of Unix commands reads a FASTQ file and outputs, in FASTQ format, the reads matching a regular expression. See if it helps:

        Code:
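        ## Get reads whose sequence line contains AAA or its revcomp TTT.
        ## [^\t]* keeps the match inside the sequence field, so a hit in the
        ## read name, comment or quality string does not select the read.
        gunzip -c fastq.fq.gz \
        | paste - - - - \
        | grep -P '^@[^\t]*\t[^\t]*(AAA|TTT)[^\t]*\t\+' \
        | tr '\t' '\n' \
        | gzip > sub.fq.gz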
        
        ## Example input fastq:
        @seq1
        ACTGAAACTG
        +comment
        IIIIIIIIII
        @seq2
        ACTGNNNCTGTTT
        +comment
        BBBBBBBBBBBBB
        @seq3
        CCCCCCCCCCCCC
        +comment
        BBBBBBBBBBTTT
        @seq4
        AAACCCCCCCCCC
        +comment
        BBBBBBBBBBTTT
        
        ## Output sub.fq.gz
        @seq1
        ACTGAAACTG
        +comment
        IIIIIIIIII
        @seq2
        ACTGNNNCTGTTT
        +comment
        BBBBBBBBBBBBB
        @seq4
        AAACCCCCCCCCC
        +comment
        BBBBBBBBBBTTT
        If your input is not gzipped, use "paste - - - - < fastq.fq" instead of "gunzip -c fastq.fq.gz | paste - - - -".

        • #5
          Thanks all for the answers, especially dariober for the code. I later worked out a bioawk command that did the job:

          bioawk -c fastx '/SEQUENCE/ {print "@"$name; print $seq; print "+"; print $qual }' input.fq > output.fq

          Yes, my aim is to collect all the reads that contain a sequence (or a subsequence of at least 14 nt) and write them to a new file.

          • #6
            Originally posted by PoorSeq
            bioawk -c fastx '/SEQUENCE/ {print "@"$name; print $seq; print "+"; print $qual }' input.fq > output.fq
            Hmm, bioawk seems pretty neat.

            Note that what you've got there won't match reads in the reverse-complement orientation, so you'll need to include both the forward and the reverse-complement pattern. Also, picking any 14+ nt substring of SEQUENCE (or its reverse complement) will be a bit trickier to implement.
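
One way to handle both points, as a rough sketch rather than a tested pipeline: list every k-mer of the target and of its reverse complement, then keep the records whose sequence line contains any of them. The function names (revcomp, kmers, filter_fastq) are made up for illustration, the code assumes four-line FASTQ records with no tabs in the headers, and it relies on the common rev, tr, paste, sort and awk utilities.

```shell
# Sketch only: revcomp, kmers and filter_fastq are hypothetical helper names.

# Reverse-complement a DNA string (N and other letters pass through unchanged).
revcomp() {
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

# Print every k-length substring of a sequence ($1), one per line; k is $2.
kmers() {
    awk -v s="$1" -v k="$2" 'BEGIN {
        for (i = 1; i + k - 1 <= length(s); i++) print substr(s, i, k)
    }'
}

# Read FASTQ on stdin and write the records whose sequence line contains
# any k-mer of the target ($1) or of its reverse complement; k is $2.
filter_fastq() {
    # Space-separated list of unique k-mers from both orientations.
    pats=$( { kmers "$1" "$2"; kmers "$(revcomp "$1")" "$2"; } | sort -u | tr '\n' ' ' )
    paste - - - - \
    | awk -F'\t' -v pats="$pats" '
        BEGIN { n = split(pats, p, " ") }
        { for (i = 1; i <= n; i++) if (index($2, p[i])) { print; next } }' \
    | tr '\t' '\n'
}
```

For example: gunzip -c fastq.fq.gz | filter_fastq SEQUENCE 14 | gzip > sub.fq.gz, with SEQUENCE standing in for the real target. Matching any 14-mer also catches every longer shared substring, since a longer match always contains a 14-mer. Each read is checked against every pattern in turn, which should be fine for a short target but would be slow for a long one.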

            • #7
              Good point, gringer! However, because I have already FLASH-stitched the reads, I expect the sequence I'm looking for to be in the same orientation in all reads. Still, I'll check for any pitfalls. Also, I chose 14 nt because my samples are from a bacterium with a genome of 4.2 million bases, so I expect anything of 12 nt or longer to be unique.
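
As a back-of-the-envelope check on that cutoff, assuming a random genome with equal base frequencies (a simplification, since real genomes are not random), the expected number of chance occurrences of one specific k-mer on both strands is roughly 2 × G / 4^k. The helper name expected_hits is hypothetical:

```shell
# Sketch only: expected_hits is a hypothetical helper name.
# Usage: expected_hits GENOME_SIZE K
# Prints the approximate number of chance occurrences of one specific
# K-mer in a random genome of GENOME_SIZE bases, counting both strands.
expected_hits() {
    awk -v g="$1" -v k="$2" 'BEGIN { printf "%.3f\n", 2 * g / 4 ^ k }'
}
```

With G = 4200000, this gives about 0.5 for k = 12 and about 0.03 for k = 14, so under this model 14 nt leaves a more comfortable margin than 12 nt.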
