Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting specific sequence stretches in fastq files

    Hey all,

    For a while ive been trying to find a way how to count specific sequence stretches directly from fastq files. For example i would like to see how many reads contain the following sequence: "TGACCCGATAGA". It seems to work with grep -c when i use fasta files but not fastq files.

    Is there anything im missing?

    Best,
    Exo

  • #2
    You can still use grep -c, but pipe the file through awk first:

    Code:
    awk '{if(NR%4 == 2) print $0}' sample.fastq | grep -c ...

    Comment


    • #3
      This is a perfect application for BBDuk. I would not recommend grep for this approach with multiline fasta as it does not distinguish between lines and reads, and cannot handle newlines in the middle of the target sequence. And I have no idea why grep would fail with fastq files; that should not be a problem. But anyway -

      bbduk.sh in=reads.fq literal=TGACCCGATAGA k=12 mm=f rcomp=f int=f

      That will tell you exactly how many reads contain that sequence. If you are also interested in reads containing the reverse-complement of the sequence, remove the flag "rcomp=f". The "int=f" flag tells it to treat reads individually rather than as pairs, if your reads are interleaved. k is the kmer length, which in this case is the length of your sequence, since you want an exact match. Incidentally, BBDuk can also tell you how many reads contain a sequence that is 1 substitution away from the target with the flag "hdist=1", and so forth. It can also handle degenerate symbols like 'N' if you add the flag "copyundefined", which is occasionally useful when dealing with amplification primers.
      Last edited by Brian Bushnell; 06-12-2016, 09:19 AM.

      Comment


      • #4
        Originally posted by dpryan View Post
        You can still use grep -c, but pipe the file through awk first:

        Code:
        awk '{if(NR%4 == 2) print $0}' sample.fastq | grep -c ...
        Returns at max 1 hit per sequence though..
        savetherhino.org

        Comment


        • #5
          Originally posted by rhinoceros View Post
          Returns at max 1 hit per sequence though..
          Yes, though that's typically what's desired.

          Comment


          • #6
            Originally posted by rhinoceros View Post
            Returns at max 1 hit per sequence though..
            Hi- Is that a problem? The OP asks for the number of reads containing pattern. So if a read contains the pattern twice it should still be counted as 1, which is what grep -c does, isn't it?

            Comment

            Latest Articles

            Collapse

            • seqadmin
              Latest Developments in Precision Medicine
              by seqadmin



              Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

              Somatic Genomics
              “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
              05-24-2024, 01:16 PM
            • seqadmin
              Recent Advances in Sequencing Analysis Tools
              by seqadmin


              The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
              05-06-2024, 07:48 AM

            ad_right_rmr

            Collapse

            News

            Collapse

            Topics Statistics Last Post
            Started by seqadmin, Today, 06:55 AM
            0 responses
            8 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 05-30-2024, 03:16 PM
            0 responses
            23 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 05-29-2024, 01:32 PM
            0 responses
            27 views
            0 likes
            Last Post seqadmin  
            Started by seqadmin, 05-24-2024, 07:15 AM
            0 responses
            214 views
            0 likes
            Last Post seqadmin  
            Working...
            X