Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • RockChalkJayhawk
    replied
    Originally posted by drio View Post
    Code:
    drio@ned:~/tmp **$ cat test.txt 
    a:1
    z:2
    t:1
    r:4
    x:1
    drio@ned:~/tmp **$ cat test.txt  | gsort -t: -k2,2 -u
    a:1
    z:2
    r:4
    Yep...that is much simpler. Now, I am filtering my SAM file to contain correctly matched pairs while removing duplicates by:
    Code:
    awk '($7=="=")' accepted_hits.sam |sort -k10,10 -u > reads.filtered

    Leave a comment:


  • drio
    replied
    Originally posted by RockChalkJayhawk View Post
    What UNIX command do you use to compare multiple lines in the same file? The only way I would know to do iit is to sort, cut the sequence from the SAM file, uniq, then join that list with the SAM. There is an easier way than this, right?
    Code:
    drio@ned:~/tmp **$ cat test.txt 
    a:1
    z:2
    t:1
    r:4
    x:1
    drio@ned:~/tmp **$ cat test.txt  | gsort -t: -k2,2 -u
    a:1
    z:2
    r:4

    Leave a comment:


  • RockChalkJayhawk
    replied
    Originally posted by krobison View Post

    One other approach would be to convert the files to tab-delimited; sort on the sequence column & then you can compress duplicates by just comparing the current row to the previous one -- I believe UNIX sort can deal with files that are quite large & perhaps use on-disk storing of the intermediates during the sort.
    What UNIX command do you use to compare multiple lines in the same file? The only way I would know to do iit is to sort, cut the sequence from the SAM file, uniq, then join that list with the SAM. There is an easier way than this, right?

    Leave a comment:


  • steven
    replied
    Originally posted by thinkRNA View Post
    Do you mean that two overlapping reads show discrepancy often or do you mean that PCR duplicates occur often? Sorry, I just want to clarify to be sure.
    I could not say if it is because of PCR duplicates or the other reasons i mentioned, but my feeling is that coverage variation is substantial (see also the link to a dedicated thread in my previous post). I have no precise measurement of this variation though. Maybe a large scale estimation of this variability could be an interesting experiment -if not already performed? Note that Helicos data could be used as a negative control regarding the potential contribution of the PCR step.

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by steven View Post
    I would say quite often.
    Variations in coverage can result from a lot of things: PCR-artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc..
    Not mentioning that what really is in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc).
    Do you mean that two overlapping reads show discrepancy often or do you mean that PCR duplicates occur often? Sorry, I just want to clarify to be sure.

    Leave a comment:


  • steven
    replied
    Originally posted by thinkRNA View Post
    This is exactly my thought that you cannot remove duplicates in RNA-seq because then how will you know your mRNA expression. Now, if you have two overlapping reads and you notice a discrepancy, that could be a result of PCR-duplicates given that the reads don't land on an exon-exon junction. But I wonder how often this happens.
    I would say quite often.
    Variations in coverage can result from a lot of things: PCR-artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc..
    Not mentioning that what really is in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc).

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by steven View Post
    I am also very interested in this question.
    If you remove duplicates in an RNA-seq experiment, doesn't this result in a drastic reduction of the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region will then become limited by the size of this area basically. Am i wrong?
    Is this saturation effect better than the risk of getting affected by PCR artifacts?
    Do people remove identical reads before computing RPKM for instance?
    This is exactly my thought that you cannot remove duplicates in RNA-seq because then how will you know your mRNA expression. Now, if you have two overlapping reads and you notice a discrepancy, that could be a result of PCR-duplicates given that the reads don't land on an exon-exon junction. But I wonder how often this happens.

    Leave a comment:


  • steven
    replied
    Originally posted by drio View Post
    He didn't specify he was running a RNA-seq experiment. Your question is
    still interesting. The answer is you don't know for certain if that read is
    coming from the same template or not. But chances that two reads are dups
    when they map to the same coordinate with the same direction is pretty high.
    That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
    I am also very interested in this question.
    If you remove duplicates in an RNA-seq experiment, doesn't this result in a drastic reduction of the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region will then become limited by the size of this area basically. Am i wrong?
    Is this saturation effect better than the risk of getting affected by PCR artifacts?
    Do people remove identical reads before computing RPKM for instance?

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by krobison View Post
    First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular"

    Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.
    Thanks so much for this paper, it will make a good read. I am curious to know how prevalent PCR duplicates are in a typical experiment and how much more expensive FRT-Seq is.

    Leave a comment:


  • krobison
    replied
    Originally posted by thinkRNA View Post
    Finally, one last question: do all popular platforms (Illumina, 454 and Solid) implement the PCR step so that the possibility of PCR-duplication is present in all of them?
    First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular"

    Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.

    Leave a comment:


  • Fabien Campagne
    replied
    Efficient tools to filter exact sequence duplicates

    This question seems to be asked frequently, so here's a detailed tutorial of how this can be done efficiently with tens of millions of reads.

    Goby provides an efficient implementation to filter out non-unique reads from a large set of reads (see http://icbtools.med.cornell.edu/goby/).

    For the purpose of this example, we will use a small input Fasta file, data/with-redundancy.fasta, the content of which is shown below.
    >0
    AAAAAAA
    >1
    AAAAAAA
    >2
    ACACACA
    >3
    ACACACA
    >4
    ACACACA
    >5
    ACATTTT

    If you have a fasta/fastq format, first convert to compact format. This can be done as follows:

    java -Xmx3g -jar goby.jar -m fasta-to-compact data/with-redundancy.fasta

    (The file with-redundancy.compact-reads should now have been created.)

    Use the tally-reads mode to calculate how many times each sequence appears in the input:

    java -Xmx3g -jar goby.jar -m tally-reads -i data/with-redundancy.compact-reads -o myfilter

    The tally-reads mode leverages sequence digests and works in two passes to minimize memory usage. Input files can contain tens of millions of reads.

    Convert back to fasta, excluding sequences that appear more than once:

    java -Xmx3g -jar goby.jar -m compact-to-fasta -i data/with-redundancy.compact-reads -f myfilter-keep.filter -o unique-reads.fa

    The file unique-reads.fa correctly excludes repeat occurrences of reads whose sequence appear more than once in the input. This file should now look like:

    >0
    AAAAAAA
    >2
    ACACACA
    >5
    ACATTTT


    Starting with Goby version 1.4.1 (see latest release at http://icbtools.med.cornell.edu/goby/download.html), you can also convert the compressed read-set to text format, to obtain multiplicity information for each read in the input.

    java -jar goby.jar -m set-to-text myfilter -o out.tsv
    The file out.tsv should now contain:

    queryIndex multiplicity
    0 2
    1 0
    2 3
    3 0
    4 0
    5 1


    Please note that the read set filter is stored by Goby in a compressed format. The tab delimited file can be very large compared to the compressed form.

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by drio View Post
    He didn't specify he was running a RNA-seq experiment. Your question is
    still interesting. The answer is you don't know for certain if that read is
    coming from the same template or not. But chances that two reads are dups
    when they map to the same coordinate with the same direction is pretty high.
    That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
    Thanks so much for responding-I was dying to know the answer. From your reply, I understand that the PCR duplicates cannot be inferred in single reads in RNA-seq data . However, for DNA-sequencing paired-end reads, one can determine it. But there can be high copy numbers in certain genomic locations (Example, Cmyc genomic amplification in b-cell lymphomas). Is it not going to be another parameter one should take in to consideration when removing duplicates? Although, I can see how someone will not be interested in this if they are only interested in SNP variants. But, I think, it is exactly these reads you don't want to throw if you are looking for genomic translocations, amplifications, deletions.

    Finally, one last question: do all popular platforms (Illumina, 454 and Solid) implement the PCR step so that the possibility of PCR-duplication is present in all of them?
    Last edited by thinkRNA; 03-23-2010, 09:25 PM.

    Leave a comment:


  • krobison
    replied
    While the method of hashing all the data should work in theory, if you don't have X more memory than the size of your file you will run out of memory (by X I mean some factor which is dependent on a lot of details of the hashtable implementation that I won't claim to know, but X is certainly greater than 1).

    One other approach would be to convert the files to tab-delimited; sort on the sequence column & then you can compress duplicates by just comparing the current row to the previous one -- I believe UNIX sort can deal with files that are quite large & perhaps use on-disk storing of the intermediates during the sort.

    Leave a comment:


  • drio
    replied
    Originally posted by thinkRNA View Post
    I am a bit confused with determining PCR duplicates in RAN-seq data. How will you differentiate if a redundant read is from a result of PCR duplication versus a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?
    He didn't specify he was running a RNA-seq experiment. Your question is
    still interesting. The answer is you don't know for certain if that read is
    coming from the same template or not. But chances that two reads are dups
    when they map to the same coordinate with the same direction is pretty high.
    That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.

    Leave a comment:


  • thinkRNA
    replied
    Originally posted by drio View Post
    Not sure if this is what you want but,

    I suggest you align the data first and then dump it in a BAM file. After that you can mark the PCR duplicates. The BAM will contain all the reads from your sequencing. Then you can write your own tool using any of the multiple BAM libraries to report any stats you want.
    I am a bit confused with determining PCR duplicates in RAN-seq data. How will you differentiate if a redundant read is from a result of PCR duplication versus a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Exploring the Dynamics of the Tumor Microenvironment
    by seqadmin




    The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
    07-08-2024, 03:19 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 07-25-2024, 06:46 AM
0 responses
9 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-24-2024, 11:09 AM
0 responses
26 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-19-2024, 07:20 AM
0 responses
160 views
0 likes
Last Post seqadmin  
Started by seqadmin, 07-16-2024, 05:49 AM
0 responses
127 views
0 likes
Last Post seqadmin  
Working...
X