How do I make next-gen SEQ data non-redundant?

RockChalkJayhawk replied

08-23-2010, 07:04 AM
Originally posted by drio

Code:

drio@ned:~/tmp $ cat test.txt
a:1
z:2
t:1
r:4
x:1
drio@ned:~/tmp $ cat test.txt | gsort -t: -k2,2 -u
a:1
z:2
r:4

Yep...that is much simpler. Now, I am filtering my SAM file to contain correctly matched pairs while removing duplicates by:

Code:

awk '($7=="=")' accepted_hits.sam |sort -k10,10 -u > reads.filtered
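A minimal sketch of a variant of this filter, assuming a standard tab-delimited SAM file (the file names and records below are made up for illustration). It keeps the header lines, which the plain `$7=="="` test drops, and removes repeated sequences without sorting by remembering each sequence it has already printed. Note that `$7 == "="` only checks that the mate maps to the same reference; it is not equivalent to the proper-pair bit in the FLAG column.

```shell
# Hypothetical tiny SAM file standing in for accepted_hits.sam (11 mandatory columns).
printf '@HD\tVN:1.0\n' > demo.sam
printf 'r1\t99\tchr1\t100\t60\t7M\t=\t200\t107\tAAAAAAA\tIIIIIII\n' >> demo.sam
printf 'r2\t99\tchr1\t100\t60\t7M\t=\t200\t107\tAAAAAAA\tIIIIIII\n' >> demo.sam
printf 'r3\t65\tchr1\t300\t60\t7M\tchr2\t500\t0\tACACACA\tIIIIIII\n' >> demo.sam
printf 'r4\t99\tchr1\t400\t60\t7M\t=\t480\t87\tACATTTT\tIIIIIII\n' >> demo.sam

# Keep header lines, keep records whose mate maps to the same reference
# (column 7 is "="), and keep only the first record seen for each sequence
# (column 10). No sorting is needed, so the input order is preserved.
awk -F'\t' '/^@/ { print; next } $7 == "=" && !seen[$10]++' demo.sam > demo.filtered.sam
cat demo.filtered.sam
```

Because every distinct sequence is held in awk's memory, this suits files with a manageable number of distinct reads; the sort-based one-liner scales further.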
drio replied

08-23-2010, 06:19 AM
Originally posted by RockChalkJayhawk

What UNIX command do you use to compare multiple lines in the same file? The only way I would know to do it is to sort, cut the sequence from the SAM file, uniq, then join that list with the SAM. There is an easier way than this, right?

Code:

drio@ned:~/tmp $ cat test.txt
a:1
z:2
t:1
r:4
x:1
drio@ned:~/tmp $ cat test.txt | gsort -t: -k2,2 -u
a:1
z:2
r:4
RockChalkJayhawk replied

08-22-2010, 06:05 AM
Originally posted by krobison

One other approach would be to convert the files to tab-delimited; sort on the sequence column & then you can compress duplicates by just comparing the current row to the previous one -- I believe UNIX sort can deal with files that are quite large & perhaps use on-disk storing of the intermediates during the sort.

What UNIX command do you use to compare multiple lines in the same file? The only way I would know to do it is to sort, cut the sequence from the SAM file, uniq, then join that list with the SAM. There is an easier way than this, right?
steven replied

03-25-2010, 09:52 AM
Originally posted by thinkRNA

Do you mean that two overlapping reads show discrepancy often or do you mean that PCR duplicates occur often? Sorry, I just want to clarify to be sure.

I could not say whether it is because of PCR duplicates or the other reasons I mentioned, but my feeling is that coverage variation is substantial (see also the link to a dedicated thread in my previous post). I have no precise measurement of this variation, though. Maybe a large-scale estimation of this variability would be an interesting experiment, if it has not already been performed? Note that Helicos data could be used as a negative control regarding the potential contribution of the PCR step.
thinkRNA replied

03-25-2010, 09:06 AM
Originally posted by steven

I would say quite often.
Variations in coverage can result from a lot of things: PCR artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc.
Not to mention that what really is in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc.).

Do you mean that two overlapping reads show discrepancy often or do you mean that PCR duplicates occur often? Sorry, I just want to clarify to be sure.
steven replied

03-25-2010, 08:59 AM
Originally posted by thinkRNA

This is exactly my thought: you cannot remove duplicates in RNA-seq, because then how would you know your mRNA expression? Now, if you have two overlapping reads and you notice a discrepancy, that could be the result of PCR duplicates, given that the reads don't land on an exon-exon junction. But I wonder how often this happens.

I would say quite often.
Variations in coverage can result from a lot of things: PCR artifacts, heterogeneous fragmentation, low sampling of weakly expressed transcripts, mapping issues, etc.
Not to mention that what really is in our cells is not limited to what is already reported in our annotation databases (unknown genes, splicing variants, etc.).
thinkRNA replied

03-25-2010, 08:27 AM
Originally posted by steven

I am also very interested in this question.
If you remove duplicates in an RNA-seq experiment, doesn't this result in a drastic reduction of the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region will then basically be limited by the size of that region. Am I wrong?
Is this saturation effect better than the risk of being affected by PCR artifacts?
Do people remove identical reads before computing RPKM for instance?

This is exactly my thought: you cannot remove duplicates in RNA-seq, because then how would you know your mRNA expression? Now, if you have two overlapping reads and you notice a discrepancy, that could be the result of PCR duplicates, given that the reads don't land on an exon-exon junction. But I wonder how often this happens.
steven replied

03-25-2010, 02:13 AM
Originally posted by drio

He didn't specify that he was running an RNA-seq experiment. Your question is still interesting, though. The answer is that you don't know for certain whether that read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high. That's only valid for fragment data. If you have mate-pair (MP) or paired-end (PE) data, you can add the mapping information of the mate to recover duplicates.

I am also very interested in this question.
If you remove duplicates in an RNA-seq experiment, doesn't this result in a drastic reduction of the dynamic range of the expression values? I mean, the maximum number of reads corresponding to a given genomic region will then basically be limited by the size of that region. Am I wrong?
Is this saturation effect better than the risk of being affected by PCR artifacts?
Do people remove identical reads before computing RPKM for instance?
thinkRNA replied

03-24-2010, 09:53 AM
Originally posted by krobison

First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular".

Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.

Thanks so much for this paper; it will make a good read. I am curious how prevalent PCR duplicates are in a typical experiment and how much more expensive FRT-Seq is.
krobison replied

03-24-2010, 06:19 AM
Originally posted by thinkRNA

Finally, one last question: do all popular platforms (Illumina, 454 and SOLiD) include a PCR step, so that PCR duplication is possible in all of them?

First answer: Yes. Of released platforms, only Helicos lacks a PCR step in standard sample prep. But you did specify "popular".

Second answer: Sanger Centre has published an RNA-Seq protocol on Illumina called FRT-Seq.
Fabien Campagne replied

03-24-2010, 05:10 AM
Efficient tools to filter exact sequence duplicates

This question seems to be asked frequently, so here's a detailed tutorial of how this can be done efficiently with tens of millions of reads.

Goby provides an efficient implementation to filter out non-unique reads from a large set of reads (see http://icbtools.med.cornell.edu/goby/).

For the purpose of this example, we will use a small input Fasta file, data/with-redundancy.fasta, the content of which is shown below.
>0
AAAAAAA
>1
AAAAAAA
>2
ACACACA
>3
ACACACA
>4
ACACACA
>5
ACATTTT

If your reads are in FASTA or FASTQ format, first convert them to Goby's compact format. This can be done as follows:

java -Xmx3g -jar goby.jar -m fasta-to-compact data/with-redundancy.fasta

(The file with-redundancy.compact-reads should now have been created.)

Use the tally-reads mode to calculate how many times each sequence appears in the input:

java -Xmx3g -jar goby.jar -m tally-reads -i data/with-redundancy.compact-reads -o myfilter

The tally-reads mode leverages sequence digests and works in two passes to minimize memory usage. Input files can contain tens of millions of reads.

Convert back to FASTA, keeping only the first occurrence of each sequence:

java -Xmx3g -jar goby.jar -m compact-to-fasta -i data/with-redundancy.compact-reads -f myfilter-keep.filter -o unique-reads.fa

The file unique-reads.fa excludes repeat occurrences of reads whose sequence appears more than once in the input. This file should now look like:

>0
AAAAAAA
>2
ACACACA
>5
ACATTTT

Starting with Goby version 1.4.1 (see latest release at http://icbtools.med.cornell.edu/goby/download.html), you can also convert the compressed read-set to text format, to obtain multiplicity information for each read in the input.

java -jar goby.jar -m set-to-text myfilter -o out.tsv

The file out.tsv should now contain:

queryIndex multiplicity
0 2
1 0
2 3
3 0
4 0
5 1

Please note that the read set filter is stored by Goby in a compressed format. The tab delimited file can be very large compared to the compressed form.
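For a file as small as the example, the same multiplicity table can be reproduced with plain awk as a sanity check. This is only a sketch and not how Goby works internally: it assumes one sequence line per FASTA record and holds every distinct sequence in memory, so it will not scale the way the digest-based tally-reads mode does. The output name tally.tsv is made up for illustration.

```shell
# Recreate the small example file from the tutorial.
mkdir -p data
printf '>0\nAAAAAAA\n>1\nAAAAAAA\n>2\nACACACA\n>3\nACACACA\n>4\nACACACA\n>5\nACATTTT\n' > data/with-redundancy.fasta

# The file is read twice. Pass 1 (NR == FNR) counts each sequence.
# Pass 2 reports the full count on the first read carrying a sequence
# and 0 on every later copy of it.
awk '
  BEGIN { OFS = "\t"; print "queryIndex", "multiplicity" }
  NR == FNR { if ($0 !~ /^>/) count[$0]++; next }
  /^>/ { id = substr($0, 2); next }
  { print id, (seen[$0]++ ? 0 : count[$0]) }
' data/with-redundancy.fasta data/with-redundancy.fasta > tally.tsv
cat tally.tsv
```

Reads with a non-zero multiplicity are the ones kept in unique-reads.fa above.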
thinkRNA replied

03-23-2010, 09:20 PM
Originally posted by drio

He didn't specify that he was running an RNA-seq experiment. Your question is still interesting, though. The answer is that you don't know for certain whether that read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high. That's only valid for fragment data. If you have mate-pair (MP) or paired-end (PE) data, you can add the mapping information of the mate to recover duplicates.

Thanks so much for responding; I was dying to know the answer. From your reply, I understand that PCR duplicates cannot be inferred from single reads in RNA-seq data. For paired-end DNA sequencing reads, however, one can determine them. But there can be high copy numbers at certain genomic locations (for example, c-Myc genomic amplification in B-cell lymphomas). Isn't that another parameter one should take into consideration when removing duplicates? I can see how someone would not care about this if they are only interested in SNP variants. But I think these are exactly the reads you don't want to throw away if you are looking for genomic translocations, amplifications, or deletions.

Finally, one last question: do all popular platforms (Illumina, 454 and SOLiD) include a PCR step, so that PCR duplication is possible in all of them?

Last edited by thinkRNA; 03-23-2010, 09:25 PM.
krobison replied

03-23-2010, 07:59 PM
While the method of hashing all the data should work in theory, you will run out of memory unless you have X times more memory than the size of your file (by X I mean some factor that depends on many details of the hashtable implementation that I won't claim to know, but X is certainly greater than 1).

One other approach would be to convert the files to tab-delimited; sort on the sequence column & then you can compress duplicates by just comparing the current row to the previous one -- I believe UNIX sort can deal with files that are quite large & perhaps use on-disk storing of the intermediates during the sort.
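A rough sketch of that sort-then-compare idea, assuming the reads have already been converted to a two-column tab-delimited file (name, sequence). The file names and records are hypothetical.

```shell
# Hypothetical tab-delimited input: read name in column 1, sequence in column 2.
printf 'r0\tAAAAAAA\nr1\tACACACA\nr2\tAAAAAAA\nr3\tACATTTT\nr4\tACACACA\n' > reads.tsv

# A literal tab for sort -t, written portably.
TAB="$(printf '\t')"

# Sort on the sequence column (sort spills to temporary files on disk when
# the input does not fit in memory), then print a row only when its
# sequence differs from the sequence of the previous row.
LC_ALL=C sort -t "$TAB" -k2,2 reads.tsv \
  | awk -F'\t' '$2 != prev { print } { prev = $2 }' > reads.nr.tsv
cat reads.nr.tsv
```

Only the previous sequence is held in memory during the comparison step, so the memory cost is carried by sort rather than by a hash table.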
drio replied

03-23-2010, 06:55 PM
Originally posted by thinkRNA

I am a bit confused about determining PCR duplicates in RNA-seq data. How will you differentiate whether a redundant read is the result of PCR duplication or a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?

He didn't specify that he was running an RNA-seq experiment. Your question is still interesting, though. The answer is that you don't know for certain whether that read comes from the same template or not. But the chances that two reads are duplicates when they map to the same coordinate in the same direction are pretty high. That's only valid for fragment data. If you have mate-pair (MP) or paired-end (PE) data, you can add the mapping information of the mate to recover duplicates.
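As an illustration of the same-coordinate, same-direction rule for fragment (single-end) data, here is a simplified awk sketch on a made-up SAM file. It keys on the POS column as reported, which is the leftmost aligned base; dedicated duplicate-marking tools work from the 5' end of the read and account for clipping, so treat this as a demonstration of the idea rather than a replacement for them.

```shell
# Hypothetical single-end alignments: name, FLAG, reference, position, ...
printf 'r1\t0\tchr1\t100\t60\t7M\t*\t0\t0\tAAAAAAA\tIIIIIII\n'  > frag.sam
printf 'r2\t0\tchr1\t100\t60\t7M\t*\t0\t0\tAAAAAAA\tIIIIIII\n' >> frag.sam
printf 'r3\t16\tchr1\t100\t60\t7M\t*\t0\t0\tAAAAAAA\tIIIIIII\n' >> frag.sam
printf 'r4\t0\tchr1\t250\t60\t7M\t*\t0\t0\tACACACA\tIIIIIII\n' >> frag.sam

# Strand is bit 0x10 of the FLAG column; int(flag / 16) % 2 extracts it
# without relying on gawk-only bitwise functions.
awk -F'\t' '
  /^@/ { print; next }                      # pass header lines through
  {
    strand = int($2 / 16) % 2               # 0 = forward, 1 = reverse
    key = $3 SUBSEP $4 SUBSEP strand        # reference, position, strand
    if (!(key in seen)) { seen[key] = 1; print }
  }' frag.sam > frag.dedup.sam
cut -f1 frag.dedup.sam
```

Here r2 is dropped as a likely duplicate of r1, while r3 survives because it maps to the opposite strand.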
thinkRNA replied

03-23-2010, 08:11 AM
Originally posted by drio

Not sure if this is what you want but,

I suggest you align the data first and then dump it in a BAM file. After that you can mark the PCR duplicates. The BAM will contain all the reads from your sequencing. Then you can write your own tool using any of the multiple BAM libraries to report any stats you want.

I am a bit confused about determining PCR duplicates in RNA-seq data. How will you differentiate whether a redundant read is the result of PCR duplication or a real read indicating mRNA expression? Obviously, I am missing something very simple, but can someone clarify please?