Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • lyw1
    Member
    • Sep 2014
    • 33

    tool to find how many counts for each sequence

    Hi all,

    It is my first time to do gene sequencing. I am sequencing short PCR products and want to get the counts for each sequence (I know the sequences in the sample for PCR). Is there anyone knew any tool for this?

    Thank you for the help.

    Zheng
  • Brian Bushnell
    Super Moderator
    • Jan 2014
    • 2709

    #2
    How long are the sequences, and do you expect 100% identity, or what? Do you already know the sequences you are expecting?

    Comment

    • lyw1
      Member
      • Sep 2014
      • 33

      #3
      30 to 90 bp.

      Yes, I expect 100% identity and know the sequences.

      Comment

      • Brian Bushnell
        Super Moderator
        • Jan 2014
        • 2709

        #4
        In that case, just map the reads to a reference fasta file you make of the sequences. For example, with BBMap, you can output a file containing the counts of reads that mapped to each reference sequence:

        bbmap.sh -Xmx2g ref=sequences.fasta in=reads.fastq scafstats=stats.txt

        What kind of sequencing are you doing? You may need to adapter-trim the input reads first, for example, if you have a fixed read length with NGS reads.

        Comment

        • lyw1
          Member
          • Sep 2014
          • 33

          #5
          Thanks a lot. I use MiSeq.

          Comment

          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            #6
            In that case, you will need to either trim the adapters first, or use a different approach such as matching substrings, if the reference sequences are sufficiently different.

            You can trim adapters like this:

            bbduk.sh -Xmx1g in=reads.fq out=clean.fq ktrim=r ref=nextera.fa.gz,truseq.fa.gz k=25 mink=12 hdist=1

            ...where those nextera and truseq adapter files are included with the BBTools package (which contains BBMap and BBDuk) in the /resources/ directory.

            Alternately, you can generate a report directly from BBDuk like this:

            bbduk.sh -Xmx1g in=reads.fastq ref=sequences.fasta stats=stats.txt k=29 fbm

            ...if the reference sequences are sufficiently different that they do not share any 29-mers.

            Comment

            • lyw1
              Member
              • Sep 2014
              • 33

              #7
              If I add a few samples that include inserts with 30 bp unknown sequences, is it possible for me to count reads for 50 sequences that appear most frequently?

              Comment

              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

                #8
                The above tools require known sequences. If some are unknown, you could assemble the adapter-trimmed reads (with, say, Velvet) which should result in a fasta file containing one copy of each sequence present in your library, then map to that assembly to get the counts, as in post 4.

                Comment

                • lyw1
                  Member
                  • Sep 2014
                  • 33

                  #9
                  Brian, it is very helpful. Thanks again.

                  Comment

                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    #10
                    Oh, I should probably add, this may be difficult to assemble since the coverage is likely to be extremely high. You may need to normalize first. Also, what length are the reads, and are they paired or single?

                    Comment

                    • lyw1
                      Member
                      • Sep 2014
                      • 33

                      #11
                      The reads are from both sides and 30 for each side.

                      Comment

                      • Brian Bushnell
                        Super Moderator
                        • Jan 2014
                        • 2709

                        #12
                        In that case, there's an easier solution, if all of the reads of the same product are expected to start at the same location.

                        kmercountexact.sh in=reads.fq outk=counts.txt mincount=4 k=30

                        That will produce a file "counts.txt" containing the count of every 30bp string in the file, excluding strings present fewer than 4 times.

                        Comment

                        • lyw1
                          Member
                          • Sep 2014
                          • 33

                          #13
                          It works good

                          until

                          zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          java -ea -Xmx11628m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16]

                          ways=37
                          Initial:
                          Memory: free=187m, used=64m

                          Final:
                          Memory: free=125m, used=193m

                          Input: 396838 reads 11111464 bases.
                          Result: 0 reads (0.00%) 0 bases (0.00%)
                          Unique Kmers: 4345841
                          Exception in thread "main" java.lang.AssertionError: ACCTTTTACCTAGTTT 3

                          at kmer.KmerNode.dumpKmersAsText(KmerNode.java:155)
                          at kmer.HashForest.dumpKmersAsText(HashForest.java:211)
                          at kmer.HashArray.dumpKmersAsText(HashArray.java:249)
                          at jgi.CountKmersExact.dumpKmersAsText(CountKmersExact.java:749)
                          at jgi.CountKmersExact.process(CountKmersExact.java:383)
                          at jgi.CountKmersExact.main(CountKmersExact.java:58)

                          Here I tried k=25, 20, 16. It works for k=25 or 20,

                          Some sequences at both ends may show "N". Do this command search short sequence from different start positions? Or always start from a fixed position? thanks,

                          Zheng

                          Comment

                          • Brian Bushnell
                            Super Moderator
                            • Jan 2014
                            • 2709

                            #14
                            Oh, sorry about that; I left a debug assertion in. I will remove that from the next version. Until then, add the "-da" flag to your command and it will work correctly.

                            Any subsequence containing an 'N' will be ignored. Otherwise, it looks at all valid subsequences. So, for example:

                            NNTTANGNNCT

                            for K=2 would produce:
                            TT 1
                            TA 1
                            CT 1

                            but nothing with the G that is surrounded by Ns. Thus, the reason I suggest K=30 for length-30 reads is that you will produce exactly one substring per read, while K=20 would produce 11 substrings per read. But if a 30bp read contains even a single N, it will yield no substrings.

                            Comment

                            • lyw1
                              Member
                              • Sep 2014
                              • 33

                              #15
                              Thanks.

                              Seems still have a problem after with flag -da.

                              zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              java -da -Xmx11773m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16, -da]

                              ways=37
                              Exception in thread "main" java.lang.RuntimeException: Output file ERa-05RNS160m_S6_L001_R2_001_counts16.txt already exists, and overwrite=false
                              at jgi.CountKmersExact.<init>(CountKmersExact.java:335)
                              at jgi.CountKmersExact.main(CountKmersExact.java:55)

                              Comment

                              Latest Articles

                              Collapse

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by SEQadmin2, 06-05-2026, 10:09 AM
                              0 responses
                              13 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-04-2026, 08:59 AM
                              0 responses
                              24 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 12:03 PM
                              0 responses
                              28 views
                              0 reactions
                              Last Post SEQadmin2  
                              Started by SEQadmin2, 06-02-2026, 11:40 AM
                              0 responses
                              22 views
                              0 reactions
                              Last Post SEQadmin2  
                              Working...