How do I make next-gen SEQ data non-redundant?


  • How do I make next-gen SEQ data non-redundant?

    I want to create a non-redundant fasta file of the data from a next-generation sequencing run. I would like the output fasta file to include a count of the number of identical reads. Does anybody have a perl script that could do this?

    Thanks for your help!
    Last edited by PRJ; 03-22-2010, 03:28 PM.
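
    No script was posted in the thread, so here is a minimal sketch of the hash-counting approach in Python rather than perl. The function names are made up for illustration, it assumes plain fasta input, and it holds every unique sequence in memory (see the memory caveat discussed further down the thread):

    ```python
    from collections import Counter

    def collapse_fasta(lines):
        """Count identical sequences in fasta-formatted lines.

        Returns a Counter mapping each distinct sequence to the number
        of reads that had exactly that sequence.
        """
        counts = Counter()
        seq_parts = []
        for line in lines:
            line = line.strip()
            if line.startswith(">"):
                # Defline: finish the previous record, if any.
                if seq_parts:
                    counts["".join(seq_parts)] += 1
                seq_parts = []
            elif line:
                # Sequence lines may be wrapped; accumulate them.
                seq_parts.append(line)
        if seq_parts:
            counts["".join(seq_parts)] += 1
        return counts

    def write_collapsed(counts):
        """Emit a non-redundant fasta string with the count in each defline."""
        out = []
        for i, (seq, n) in enumerate(sorted(counts.items(), key=lambda kv: -kv[1])):
            out.append(">seq%d_count=%d" % (i, n))
            out.append(seq)
        return "\n".join(out)
    ```

    For a real run you would stream the file with `open(path)` instead of a list of lines; the counting logic is unchanged.
    
    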

  • #2
    Not sure if this is what you want but,

    I suggest you align the data first and then dump it in a BAM file. After that you can mark the PCR duplicates. The BAM will contain all the reads from your sequencing. Then you can write your own tool using any of the multiple BAM libraries to report any stats you want.
    -drd



    • #3
      You could maybe look into the fastx_collapser tool from this package: http://hannonlab.cshl.edu/fastx_tool...mmandline.html
      But I don't think it will report any counts.

      Perhaps you could try using wordcount from EMBOSS (http://emboss.sourceforge.net/apps/c...wordcount.html) and setting the wordsize to your read length? This will give you a list of all unique sequences and how many times each occurs. Then you can write a script to convert it back to fasta. However, I'm not sure whether it can handle a large fasta file.



      • #4
        So sequences which are substrings of other sequences - are they redundant?
        How about reverse complements?

        You can use vmatch to cluster them:
        http://jermdemo.blogspot.com/2009/11...ssemblies.html

        If not then just load a hash with your seqs as keys and your deflines as values.
        --
        Jeremy Leipzig
        Bioinformatics Programmer
        --
        My blog
        Twitter



        • #5
          Originally posted by drio View Post
          Not sure if this is what you want but,

          I suggest you align the data first and then dump it in a BAM file. After that you can mark the PCR duplicates. The BAM will contain all the reads from your sequencing. Then you can write your own tool using any of the multiple BAM libraries to report any stats you want.
          I am a bit confused about determining PCR duplicates in RNA-seq data. How will you differentiate whether a redundant read is the result of PCR duplication or a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?



          • #6
            Originally posted by thinkRNA View Post
            I am a bit confused about determining PCR duplicates in RNA-seq data. How will you differentiate whether a redundant read is the result of PCR duplication or a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?
            He didn't specify that he was running an RNA-seq experiment, but your question is still interesting. The answer is that you can't know for certain whether a read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high.
            That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
            -drd
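
            A toy sketch of the coordinate-plus-strand heuristic drd describes, in Python. The dict keys here are hypothetical; real pipelines do this on aligned BAM records with tools such as Picard's MarkDuplicates, and keep the highest-quality read at each site rather than the first one seen:

            ```python
            def mark_duplicates(reads):
                """Flag every read after the first at each (chrom, pos, strand) site.

                `reads` is a list of dicts with hypothetical keys chrom, pos,
                and strand, plus an optional mate_pos for PE/MP data, where the
                mate's coordinate tightens the key as described above. Adds a
                'duplicate' boolean to each read and returns the list.
                """
                seen = set()
                for r in reads:
                    key = (r["chrom"], r["pos"], r["strand"], r.get("mate_pos"))
                    r["duplicate"] = key in seen
                    seen.add(key)
                return reads
            ```

            Note how a read on the opposite strand at the same position is not flagged, which matches the "same coordinate, same direction" criterion.
            
            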



            • #7
              While the method of hashing all the data should work in theory, if you don't have roughly X times more memory than the size of your file you will run out of memory (where X is some factor greater than 1 that depends on details of the hashtable implementation I won't claim to know).

              One other approach would be to convert the file to tab-delimited format, sort on the sequence column, and then compress duplicates by comparing each row to the previous one. I believe UNIX sort can deal with files that are quite large and will use on-disk storage for the intermediates during the sort.
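
              That sort-and-scan idea can be sketched in a few lines of Python. For brevity this sorts in memory; for files larger than RAM you would pipe the tab-delimited file through Unix sort instead, which spills intermediates to disk:

              ```python
              from itertools import groupby

              def count_by_sorting(seqs):
                  """Sort the sequences, then collapse each run of identical
                  entries into a (sequence, count) pair -- the same comparison
                  of each row against the previous one that a post-sort scan
                  over a tab-delimited file would perform.
                  """
                  return [(seq, sum(1 for _ in grp)) for seq, grp in groupby(sorted(seqs))]
              ```

              Because identical sequences are adjacent after sorting, only one group needs to be held at a time, which is what keeps the streaming version memory-cheap.
              
              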



              • #8
                Originally posted by drio View Post
                He didn't specify that he was running an RNA-seq experiment, but your question is still interesting. The answer is that you can't know for certain whether a read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high.
                That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
                Thanks so much for responding; I was dying to know the answer. From your reply, I understand that PCR duplicates cannot be inferred from single-end reads in RNA-seq data. However, for DNA-sequencing paired-end reads, one can determine them. But there can be high copy numbers at certain genomic locations (for example, c-Myc genomic amplification in B-cell lymphomas). Isn't that another parameter one should take into consideration when removing duplicates? I can see how someone would not care about this if they are only interested in SNP variants, but I think it is exactly these reads you don't want to throw away if you are looking for genomic translocations, amplifications, and deletions.

                Finally, one last question: do all popular platforms (Illumina, 454 and Solid) implement the PCR step so that the possibility of PCR-duplication is present in all of them?
                Last edited by thinkRNA; 03-23-2010, 09:25 PM.



                • #9
                  Efficient tools to filter exact sequence duplicates

                  This question seems to be asked frequently, so here's a detailed tutorial on how it can be done efficiently with tens of millions of reads.

                  Goby provides an efficient implementation to filter out non-unique reads from a large set of reads (see http://icbtools.med.cornell.edu/goby/).

                  For the purpose of this example, we will use a small input Fasta file, data/with-redundancy.fasta, the content of which is shown below.
                  >0
                  AAAAAAA
                  >1
                  AAAAAAA
                  >2
                  ACACACA
                  >3
                  ACACACA
                  >4
                  ACACACA
                  >5
                  ACATTTT

                  If you have a fasta/fastq format, first convert to compact format. This can be done as follows:

                  java -Xmx3g -jar goby.jar -m fasta-to-compact data/with-redundancy.fasta

                  (The file with-redundancy.compact-reads should now have been created.)

                  Use the tally-reads mode to calculate how many times each sequence appears in the input:

                  java -Xmx3g -jar goby.jar -m tally-reads -i data/with-redundancy.compact-reads -o myfilter

                  The tally-reads mode leverages sequence digests and works in two passes to minimize memory usage. Input files can contain tens of millions of reads.

                  Convert back to fasta, keeping only the first occurrence of each sequence that appears more than once:

                  java -Xmx3g -jar goby.jar -m compact-to-fasta -i data/with-redundancy.compact-reads -f myfilter-keep.filter -o unique-reads.fa

                  The file unique-reads.fa correctly excludes repeat occurrences of reads whose sequence appears more than once in the input. This file should now look like:

                  >0
                  AAAAAAA
                  >2
                  ACACACA
                  >5
                  ACATTTT


                  Starting with Goby version 1.4.1 (see latest release at http://icbtools.med.cornell.edu/goby/download.html), you can also convert the compressed read-set to text format, to obtain multiplicity information for each read in the input.

                  java -jar goby.jar -m set-to-text myfilter -o out.tsv
                  The file out.tsv should now contain:

                  queryIndex multiplicity
                  0 2
                  1 0
                  2 3
                  3 0
                  4 0
                  5 1


                  Please note that the read set filter is stored by Goby in a compressed format. The tab delimited file can be very large compared to the compressed form.



                  • #10
                    Originally posted by thinkRNA View Post
                    Finally, one last question: do all popular platforms (Illumina, 454 and Solid) implement the PCR step so that the possibility of PCR-duplication is present in all of them?
                    First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular".

                    Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.



                    • #11
                      Originally posted by krobison View Post
                      First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular".

                      Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.
                      Thanks so much for this paper, it will make a good read. I am curious to know how prevalent PCR duplicates are in a typical experiment and how much more expensive FRT-Seq is.



                      • #12
                        Originally posted by drio View Post
                        He didn't specify that he was running an RNA-seq experiment, but your question is still interesting. The answer is that you can't know for certain whether a read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high.
                        That's only valid for fragment data. If you have MP or PE data you can add the mapping information of the mate to recover duplicates.
                        I am also very interested in this question.
                        If you remove duplicates in an RNA-seq experiment, doesn't this drastically reduce the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region basically becomes limited by the size of that region. Am I wrong?
                        Is this saturation effect better than the risk of being affected by PCR artifacts?
                        Do people remove identical reads before computing RPKM, for instance?



                        • #13
                          Originally posted by steven View Post
                          I am also very interested in this question.
                          If you remove duplicates in an RNA-seq experiment, doesn't this drastically reduce the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region basically becomes limited by the size of that region. Am I wrong?
                          Is this saturation effect better than the risk of being affected by PCR artifacts?
                          Do people remove identical reads before computing RPKM, for instance?
                          This is exactly my thought: you cannot remove duplicates in RNA-seq, because then how would you measure mRNA expression? Now, if you have two overlapping reads and you notice a discrepancy, that could be a result of PCR duplicates, provided the reads don't land on an exon-exon junction. But I wonder how often this happens.



                          • #14
                            Originally posted by thinkRNA View Post
                            This is exactly my thought: you cannot remove duplicates in RNA-seq, because then how would you measure mRNA expression? Now, if you have two overlapping reads and you notice a discrepancy, that could be a result of PCR duplicates, provided the reads don't land on an exon-exon junction. But I wonder how often this happens.
                            I would say quite often.
                            Variations in coverage can result from a lot of things: PCR artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc.
                            Not to mention that what is really in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc.).



                            • #15
                              Originally posted by steven View Post
                              I would say quite often.
                              Variations in coverage can result from a lot of things: PCR artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc.
                              Not to mention that what is really in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc.).
                              Do you mean that two overlapping reads show discrepancy often or do you mean that PCR duplicates occur often? Sorry, I just want to clarify to be sure.

