Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tool to find how many counts for each sequence

    Hi all,

    It is my first time to do gene sequencing. I am sequencing short PCR products and want to get the counts for each sequence (I know the sequences in the sample for PCR). Is there anyone knew any tool for this?

    Thank you for the help.

    Zheng

  • #2
    How long are the sequences, and do you expect 100% identity, or what? Do you already know the sequences you are expecting?

    Comment


    • #3
      30 to 90 bp.

      Yes, I expect 100% identity and know the sequences.

      Comment


      • #4
        In that case, just map the reads to a reference fasta file you make of the sequences. For example, with BBMap, you can output a file containing the counts of reads that mapped to each reference sequence:

        bbmap.sh -Xmx2g ref=sequences.fasta in=reads.fastq scafstats=stats.txt

        What kind of sequencing are you doing? You may need to adapter-trim the input reads first, for example, if you have a fixed read length with NGS reads.

        Comment


        • #5
          Thanks a lot. I use MiSeq.

          Comment


          • #6
            In that case, you will need to either trim the adapters first, or use a different approach such as matching substrings, if the reference sequences are sufficiently different.

            You can trim adapters like this:

            bbduk.sh -Xmx1g in=reads.fq out=clean.fq ktrim=r ref=nextera.fa.gz,truseq.fa.gz k=25 mink=12 hdist=1

            ...where those nextera and truseq adapter files are included with the BBTools package (which contains BBMap and BBDuk) in the /resources/ directory.

            Alternately, you can generate a report directly from BBDuk like this:

            bbduk.sh -Xmx1g in=reads.fastq ref=sequences.fasta stats=stats.txt k=29 fbm

            ...if the reference sequences are sufficiently different that they do not share any 29-mers.

            Comment


            • #7
              If I add a few samples that include inserts with 30 bp unknown sequences, is it possible for me to count reads for 50 sequences that appear most frequently?

              Comment


              • #8
                The above tools require known sequences. If some are unknown, you could assemble the adapter-trimmed reads (with, say, Velvet) which should result in a fasta file containing one copy of each sequence present in your library, then map to that assembly to get the counts, as in post 4.

                Comment


                • #9
                  Brian, it is very helpful. Thanks again.

                  Comment


                  • #10
                    Oh, I should probably add, this may be difficult to assemble since the coverage is likely to be extremely high. You may need to normalize first. Also, what length are the reads, and are they paired or single?

                    Comment


                    • #11
                      The reads are from both sides and 30 for each side.

                      Comment


                      • #12
                        In that case, there's an easier solution, if all of the reads of the same product are expected to start at the same location.

                        kmercountexact.sh in=reads.fq outk=counts.txt mincount=4 k=30

                        That will produce a file "counts.txt" containing the count of every 30bp string in the file, excluding strings present fewer than 4 times.

                        Comment


                        • #13
                          It works good

                          until

                          zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          java -ea -Xmx11628m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16]

                          ways=37
                          Initial:
                          Memory: free=187m, used=64m

                          Final:
                          Memory: free=125m, used=193m

                          Input: 396838 reads 11111464 bases.
                          Result: 0 reads (0.00%) 0 bases (0.00%)
                          Unique Kmers: 4345841
                          Exception in thread "main" java.lang.AssertionError: ACCTTTTACCTAGTTT 3

                          at kmer.KmerNode.dumpKmersAsText(KmerNode.java:155)
                          at kmer.HashForest.dumpKmersAsText(HashForest.java:211)
                          at kmer.HashArray.dumpKmersAsText(HashArray.java:249)
                          at jgi.CountKmersExact.dumpKmersAsText(CountKmersExact.java:749)
                          at jgi.CountKmersExact.process(CountKmersExact.java:383)
                          at jgi.CountKmersExact.main(CountKmersExact.java:58)

                          Here I tried k=25, 20, 16. It works for k=25 or 20,

                          Some sequences at both ends may show "N". Do this command search short sequence from different start positions? Or always start from a fixed position? thanks,

                          Zheng

                          Comment


                          • #14
                            Oh, sorry about that; I left a debug assertion in. I will remove that from the next version. Until then, add the "-da" flag to your command and it will work correctly.

                            Any subsequence containing an 'N' will be ignored. Otherwise, it looks at all valid subsequences. So, for example:

                            NNTTANGNNCT

                            for K=2 would produce:
                            TT 1
                            TA 1
                            CT 1

                            but nothing with the G that is surrounded by Ns. Thus, the reason I suggest K=30 for length-30 reads is that you will produce exactly one substring per read, while K=20 would produce 11 substrings per read. But if a 30bp read contains even a single N, it will yield no substrings.

                            Comment


                            • #15
                              Thanks.

                              Seems still have a problem after with flag -da.

                              zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              java -da -Xmx11773m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16, -da]

                              ways=37
                              Exception in thread "main" java.lang.RuntimeException: Output file ERa-05RNS160m_S6_L001_R2_001_counts16.txt already exists, and overwrite=false
                              at jgi.CountKmersExact.<init>(CountKmersExact.java:335)
                              at jgi.CountKmersExact.main(CountKmersExact.java:55)

                              Comment

                              Latest Articles

                              Collapse

                              • performena
                                Reply to Recent Advances in Sequencing Technologies
                                by performena
                                Performena is a digital-first creative agency based in the UAE. We craft resonant brand experiences through strategic consultancy, creative services, media planning, programmatic advertising, and custom technology solutions.

                                Our journey began with a vision to revolutionize digital marketing. We combine data-driven insights with creative storytelling to deliver impactful campaigns that drive measurable results. Our team of wild hearts and tech wizards work tirelessly to unlock your brand's...
                                Today, 05:26 AM
                              • seqadmin
                                Recent Advances in Sequencing Technologies
                                by seqadmin







                                Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

                                Long-Read Sequencing
                                Long-read sequencing has...
                                Yesterday, 01:49 PM
                              • seqadmin
                                Genetic Variation in Immunogenetics and Antibody Diversity
                                by seqadmin



                                The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
                                11-06-2024, 07:24 PM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, Yesterday, 09:29 AM
                              0 responses
                              14 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 09:06 AM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, Yesterday, 08:03 AM
                              0 responses
                              11 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 11-22-2024, 07:36 AM
                              0 responses
                              65 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X