How do I make next-gen SEQ data non-redundant?


  • How do I make next-gen SEQ data non-redundant?

    I want to create a non-redundant fasta file of the data from a next-generation sequencing run. I would like the output fasta file to include a count of the number of identical reads. Does anybody have a perl script that could do this?

    Thanks for your help!
    Last edited by PRJ; 03-22-2010, 03:28 PM.
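
    No script was posted in the thread, so here is a minimal sketch of the hash-counting approach in Python rather than perl. The function names are made up for illustration, it assumes plain fasta input, and it holds every unique sequence in memory (see the memory caveat discussed further down the thread):

    ```python
    from collections import Counter

    def collapse_fasta(lines):
        """Count identical sequences in fasta-formatted lines.

        Returns a Counter mapping each distinct sequence to the number
        of reads that had exactly that sequence.
        """
        counts = Counter()
        seq_parts = []
        for line in lines:
            line = line.strip()
            if line.startswith(">"):
                # Defline: finish the previous record, if any.
                if seq_parts:
                    counts["".join(seq_parts)] += 1
                seq_parts = []
            elif line:
                # Sequence lines may be wrapped; accumulate them.
                seq_parts.append(line)
        if seq_parts:
            counts["".join(seq_parts)] += 1
        return counts

    def write_collapsed(counts):
        """Emit a non-redundant fasta string with the count in each defline."""
        out = []
        for i, (seq, n) in enumerate(sorted(counts.items(), key=lambda kv: -kv[1])):
            out.append(">seq%d_count=%d" % (i, n))
            out.append(seq)
        return "\n".join(out)
    ```

    For a real run you would stream the file with `open(path)` instead of a list of lines; the counting logic is unchanged.
    
    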

  • #2
    Not sure if this is what you want but,

    I suggest you align the data first and then dump it in a BAM file. After that you can mark the PCR duplicates. The BAM will contain all the reads from your sequencing. Then you can write your own tool using any of the multiple BAM libraries to report any stats you want.
    -drd



    • #3
      You could maybe look into the fastx_collapser tool from this package: http://hannonlab.cshl.edu/fastx_tool...mmandline.html
      But I don't think it will report any counts.

      Perhaps you could try using wordcount from EMBOSS (http://emboss.sourceforge.net/apps/c...wordcount.html) and setting the wordsize to your read length? This will give you a list of all unique sequences and how many times each occurs. Then you can write a script to convert it back to fasta. However, I'm not sure whether it can handle a large fasta file.



      • #4
        So sequences which are substrings of other sequences - are they redundant?
        How about reverse complements?

        You can use vmatch to cluster them:
        http://jermdemo.blogspot.com/2009/11...ssemblies.html

        If not then just load a hash with your seqs as keys and your deflines as values.
        --
        Jeremy Leipzig
        Bioinformatics Programmer
        --
        My blog
        Twitter



        • #5
          Originally posted by drio View Post
          Not sure if this is what you want but,

          I suggest you align the data first and then dump it in a BAM file. After that you can mark the PCR duplicates. The BAM will contain all the reads from your sequencing. Then you can write your own tool using any of the multiple BAM libraries to report any stats you want.
          I am a bit confused about determining PCR duplicates in RNA-seq data. How will you differentiate whether a redundant read is the result of PCR duplication or a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?



          • #6
            Originally posted by thinkRNA View Post
            I am a bit confused about determining PCR duplicates in RNA-seq data. How will you differentiate whether a redundant read is the result of PCR duplication or a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?
            He didn't specify that he was running an RNA-seq experiment, but your question is still interesting. The answer is that you can't know for certain whether a read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high.
            That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
            -drd
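
            A toy sketch of the coordinate-plus-strand heuristic drd describes, in Python. The dict keys here are hypothetical; real pipelines do this on aligned BAM records with tools such as Picard's MarkDuplicates, and keep the highest-quality read at each site rather than the first one seen:

            ```python
            def mark_duplicates(reads):
                """Flag every read after the first at each (chrom, pos, strand) site.

                `reads` is a list of dicts with hypothetical keys chrom, pos,
                and strand, plus an optional mate_pos for PE/MP data, where the
                mate's coordinate tightens the key as described above. Adds a
                'duplicate' boolean to each read and returns the list.
                """
                seen = set()
                for r in reads:
                    key = (r["chrom"], r["pos"], r["strand"], r.get("mate_pos"))
                    r["duplicate"] = key in seen
                    seen.add(key)
                return reads
            ```

            Note how a read on the opposite strand at the same position is not flagged, which matches the "same coordinate, same direction" criterion.
            
            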



            • #7
              While the method of hashing all the data should work in theory, if you don't have roughly X times more memory than the size of your file you will run out of memory (where X is some factor greater than 1 that depends on details of the hashtable implementation I won't claim to know).

              One other approach would be to convert the file to tab-delimited format, sort on the sequence column, and then compress duplicates by comparing each row to the previous one. I believe UNIX sort can deal with files that are quite large and will use on-disk storage for the intermediates during the sort.
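
              That sort-and-scan idea can be sketched in a few lines of Python. For brevity this sorts in memory; for files larger than RAM you would pipe the tab-delimited file through Unix sort instead, which spills intermediates to disk:

              ```python
              from itertools import groupby

              def count_by_sorting(seqs):
                  """Sort the sequences, then collapse each run of identical
                  entries into a (sequence, count) pair -- the same comparison
                  of each row against the previous one that a post-sort scan
                  over a tab-delimited file would perform.
                  """
                  return [(seq, sum(1 for _ in grp)) for seq, grp in groupby(sorted(seqs))]
              ```

              Because identical sequences are adjacent after sorting, only one group needs to be held at a time, which is what keeps the streaming version memory-cheap.
              
              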



              • #8
                Originally posted by drio View Post
                He didn't specify that he was running an RNA-seq experiment, but your question is still interesting. The answer is that you can't know for certain whether a read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high.
                That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
                Thanks so much for responding; I was dying to know the answer. From your reply, I understand that PCR duplicates cannot be inferred from single-end reads in RNA-seq data. However, for DNA-sequencing paired-end reads, one can determine them. But there can be high copy numbers at certain genomic locations (for example, c-Myc genomic amplification in B-cell lymphomas). Isn't that another parameter one should take into consideration when removing duplicates? I can see how someone would not care about this if they are only interested in SNP variants, but I think it is exactly these reads you don't want to throw away if you are looking for genomic translocations, amplifications, and deletions.

                Finally, one last question: do all popular platforms (Illumina, 454 and Solid) implement the PCR step so that the possibility of PCR-duplication is present in all of them?
                Last edited by thinkRNA; 03-23-2010, 09:25 PM.



                • #9
                  Efficient tools to filter exact sequence duplicates

                  This question seems to be asked frequently, so here's a detailed tutorial on how it can be done efficiently with tens of millions of reads.

                  Goby provides an efficient implementation to filter out non-unique reads from a large set of reads (see http://icbtools.med.cornell.edu/goby/).

                  For the purpose of this example, we will use a small input Fasta file, data/with-redundancy.fasta, the content of which is shown below.
                  >0
                  AAAAAAA
                  >1
                  AAAAAAA
                  >2
                  ACACACA
                  >3
                  ACACACA
                  >4
                  ACACACA
                  >5
                  ACATTTT

                  If you have a fasta/fastq format, first convert to compact format. This can be done as follows:

                  java -Xmx3g -jar goby.jar -m fasta-to-compact data/with-redundancy.fasta

                  (The file with-redundancy.compact-reads should now have been created.)

                  Use the tally-reads mode to calculate how many times each sequence appears in the input:

                  java -Xmx3g -jar goby.jar -m tally-reads -i data/with-redundancy.compact-reads -o myfilter

                  The tally-reads mode leverages sequence digests and works in two passes to minimize memory usage. Input files can contain tens of millions of reads.

                  Convert back to fasta, keeping only the first occurrence of each sequence that appears more than once:

                  java -Xmx3g -jar goby.jar -m compact-to-fasta -i data/with-redundancy.compact-reads -f myfilter-keep.filter -o unique-reads.fa

                  The file unique-reads.fa correctly excludes repeat occurrences of reads whose sequence appears more than once in the input. This file should now look like:

                  >0
                  AAAAAAA
                  >2
                  ACACACA
                  >5
                  ACATTTT


                  Starting with Goby version 1.4.1 (see latest release at http://icbtools.med.cornell.edu/goby/download.html), you can also convert the compressed read-set to text format, to obtain multiplicity information for each read in the input.

                  java -jar goby.jar -m set-to-text myfilter -o out.tsv
                  The file out.tsv should now contain:

                  queryIndex multiplicity
                  0 2
                  1 0
                  2 3
                  3 0
                  4 0
                  5 1


                  Please note that the read set filter is stored by Goby in a compressed format. The tab delimited file can be very large compared to the compressed form.



                  • #10
                    Originally posted by thinkRNA View Post
                    Finally, one last question: do all popular platforms (Illumina, 454 and Solid) implement the PCR step so that the possibility of PCR-duplication is present in all of them?
                    First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular".

                    Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.



                    • #11
                      Originally posted by krobison View Post
                      First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular".

                      Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.
                      Thanks so much for this paper, it will make a good read. I am curious to know how prevalent PCR duplicates are in a typical experiment and how much more expensive FRT-Seq is.



                      • #12
                        Originally posted by drio View Post
                        He didn't specify that he was running an RNA-seq experiment, but your question is still interesting. The answer is that you can't know for certain whether a read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high.
                        That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
                        I am also very interested in this question.
                        If you remove duplicates in an RNA-seq experiment, doesn't this drastically reduce the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region basically becomes limited by the size of that region. Am I wrong?
                        Is this saturation effect better than the risk of being affected by PCR artifacts?
                        Do people remove identical reads before computing RPKM, for instance?



                        • #13
                          Originally posted by steven View Post
                          I am also very interested in this question.
                          If you remove duplicates in an RNA-seq experiment, doesn't this drastically reduce the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region basically becomes limited by the size of that region. Am I wrong?
                          Is this saturation effect better than the risk of being affected by PCR artifacts?
                          Do people remove identical reads before computing RPKM, for instance?
                          This is exactly my thought: you cannot remove duplicates in RNA-seq, because then how would you measure mRNA expression? Now, if you have two overlapping reads and you notice a discrepancy, that could be a result of PCR duplicates, provided the reads don't land on an exon-exon junction. But I wonder how often this happens.



                          • #14
                            Originally posted by thinkRNA View Post
                            This is exactly my thought: you cannot remove duplicates in RNA-seq, because then how would you measure mRNA expression? Now, if you have two overlapping reads and you notice a discrepancy, that could be a result of PCR duplicates, provided the reads don't land on an exon-exon junction. But I wonder how often this happens.
                            I would say quite often.
                            Variations in coverage can result from a lot of things: PCR artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc.
                            Not to mention that what is really in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc.).



                            • #15
                              Originally posted by steven View Post
                              I would say quite often.
                              Variations in coverage can result from a lot of things: PCR artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc.
                              Not to mention that what is really in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc.).
                              Do you mean that two overlapping reads show discrepancy often or do you mean that PCR duplicates occur often? Sorry, I just want to clarify to be sure.

