Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • tool to find how many counts for each sequence

    Hi all,

    It is my first time to do gene sequencing. I am sequencing short PCR products and want to get the counts for each sequence (I know the sequences in the sample for PCR). Is there anyone knew any tool for this?

    Thank you for the help.

    Zheng

  • #2
    How long are the sequences, and do you expect 100% identity, or what? Do you already know the sequences you are expecting?

    Comment


    • #3
      30 to 90 bp.

      Yes, I expect 100% identity and know the sequences.

      Comment


      • #4
        In that case, just map the reads to a reference fasta file you make of the sequences. For example, with BBMap, you can output a file containing the counts of reads that mapped to each reference sequence:

        bbmap.sh -Xmx2g ref=sequences.fasta in=reads.fastq scafstats=stats.txt

        What kind of sequencing are you doing? You may need to adapter-trim the input reads first, for example, if you have a fixed read length with NGS reads.

        Comment


        • #5
          Thanks a lot. I use MiSeq.

          Comment


          • #6
            In that case, you will need to either trim the adapters first, or use a different approach such as matching substrings, if the reference sequences are sufficiently different.

            You can trim adapters like this:

            bbduk.sh -Xmx1g in=reads.fq out=clean.fq ktrim=r ref=nextera.fa.gz,truseq.fa.gz k=25 mink=12 hdist=1

            ...where those nextera and truseq adapter files are included with the BBTools package (which contains BBMap and BBDuk) in the /resources/ directory.

            Alternately, you can generate a report directly from BBDuk like this:

            bbduk.sh -Xmx1g in=reads.fastq ref=sequences.fasta stats=stats.txt k=29 fbm

            ...if the reference sequences are sufficiently different that they do not share any 29-mers.

            Comment


            • #7
              If I add a few samples that include inserts with 30 bp unknown sequences, is it possible for me to count reads for 50 sequences that appear most frequently?

              Comment


              • #8
                The above tools require known sequences. If some are unknown, you could assemble the adapter-trimmed reads (with, say, Velvet) which should result in a fasta file containing one copy of each sequence present in your library, then map to that assembly to get the counts, as in post 4.

                Comment


                • #9
                  Brian, it is very helpful. Thanks again.

                  Comment


                  • #10
                    Oh, I should probably add, this may be difficult to assemble since the coverage is likely to be extremely high. You may need to normalize first. Also, what length are the reads, and are they paired or single?

                    Comment


                    • #11
                      The reads are from both sides and 30 for each side.

                      Comment


                      • #12
                        In that case, there's an easier solution, if all of the reads of the same product are expected to start at the same location.

                        kmercountexact.sh in=reads.fq outk=counts.txt mincount=4 k=30

                        That will produce a file "counts.txt" containing the count of every 30bp string in the file, excluding strings present fewer than 4 times.

                        Comment


                        • #13
                          It works good

                          until

                          zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          java -ea -Xmx11628m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16
                          Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16]

                          ways=37
                          Initial:
                          Memory: free=187m, used=64m

                          Final:
                          Memory: free=125m, used=193m

                          Input: 396838 reads 11111464 bases.
                          Result: 0 reads (0.00%) 0 bases (0.00%)
                          Unique Kmers: 4345841
                          Exception in thread "main" java.lang.AssertionError: ACCTTTTACCTAGTTT 3

                          at kmer.KmerNode.dumpKmersAsText(KmerNode.java:155)
                          at kmer.HashForest.dumpKmersAsText(HashForest.java:211)
                          at kmer.HashArray.dumpKmersAsText(HashArray.java:249)
                          at jgi.CountKmersExact.dumpKmersAsText(CountKmersExact.java:749)
                          at jgi.CountKmersExact.process(CountKmersExact.java:383)
                          at jgi.CountKmersExact.main(CountKmersExact.java:58)

                          Here I tried k=25, 20, 16. It works for k=25 or 20,

                          Some sequences at both ends may show "N". Do this command search short sequence from different start positions? Or always start from a fixed position? thanks,

                          Zheng

                          Comment


                          • #14
                            Oh, sorry about that; I left a debug assertion in. I will remove that from the next version. Until then, add the "-da" flag to your command and it will work correctly.

                            Any subsequence containing an 'N' will be ignored. Otherwise, it looks at all valid subsequences. So, for example:

                            NNTTANGNNCT

                            for K=2 would produce:
                            TT 1
                            TA 1
                            CT 1

                            but nothing with the G that is surrounded by Ns. Thus, the reason I suggest K=30 for length-30 reads is that you will produce exactly one substring per read, while K=20 would produce 11 substrings per read. But if a 30bp read contains even a single N, it will yield no substrings.

                            Comment


                            • #15
                              Thanks.

                              Seems still have a problem after with flag -da.

                              zheng@zheng-XPS-8500:~/Desktop/bbmap/20140916ngs$ kmercountexact.sh in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              java -da -Xmx11773m -cp /home/zheng/Desktop/bbmap/current/ jgi.CountKmersExact in=ERa-05RNS160m_S6_L001_R2_001.fastq outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt mincount=3 k=16 -da
                              Executing jgi.CountKmersExact [in=ERa-05RNS160m_S6_L001_R2_001.fastq, outk=ERa-05RNS160m_S6_L001_R2_001_counts16.txt, mincount=3, k=16, -da]

                              ways=37
                              Exception in thread "main" java.lang.RuntimeException: Output file ERa-05RNS160m_S6_L001_R2_001_counts16.txt already exists, and overwrite=false
                              at jgi.CountKmersExact.<init>(CountKmersExact.java:335)
                              at jgi.CountKmersExact.main(CountKmersExact.java:55)

                              Comment

                              Latest Articles

                              Collapse

                              • seqadmin
                                Exploring the Dynamics of the Tumor Microenvironment
                                by seqadmin




                                The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                07-08-2024, 03:19 PM
                              • seqadmin
                                Exploring Human Diversity Through Large-Scale Omics
                                by seqadmin


                                In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                06-25-2024, 06:43 AM

                              ad_right_rmr

                              Collapse

                              News

                              Collapse

                              Topics Statistics Last Post
                              Started by seqadmin, 07-10-2024, 07:30 AM
                              0 responses
                              19 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-03-2024, 09:45 AM
                              0 responses
                              197 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-03-2024, 08:54 AM
                              0 responses
                              207 views
                              0 likes
                              Last Post seqadmin  
                              Started by seqadmin, 07-02-2024, 03:00 PM
                              0 responses
                              190 views
                              0 likes
                              Last Post seqadmin  
                              Working...
                              X