Unconfigured Ad

ech · 08-06-2009, 08:54 AM

samtools rmdup

It looks like removal of duplicates for paired end reads is limited to cases when both reads map to the same chromosome. Is this correct? If yes, is there an alternative?

lh3 · 08-05-2009, 08:40 AM

Originally posted by wjeck View Post

This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair

I can hack a workaround, but I imagine I am not the only one who will run into this problem.

Could you send a small example to me? If read names are ended up with "/[12]", they should be converted in principle.

By the way, thank Nils very much for replying these questions. I am not here in the forum as often as before.

lh3 · 08-05-2009, 08:37 AM

Originally posted by griffon42 View Post

To add a bit to xguo's questions regarding the varFilter in samtools:

I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.

Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?

Thanks!

The caller is pretty much the same, but the default rules used in filtering are slightly different. When I use maq and samtools to call SNPs on maq alignment, 99% of them are identical. What parameters are you using (in particular the maximum depth)?

lh3 · 08-05-2009, 08:30 AM

Originally posted by xguo View Post

I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?

Yes, you need this unless you are doing target sequencing in which case the read depth is expected to vary a lot.

Originally posted by xguo

The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?

RMS=root mean square. It is more stable than maximum mapping quality in filtration.

Originally posted by xguo

Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?

You should try it out. Different project may prefer different threshold.

In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?

You can only calculate by yourself. The number of occurrences is stored in the X0 tag which is specific to BWA.

nilshomer · 08-04-2009, 01:40 PM

Originally posted by wjeck View Post

This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair

I can hack a workaround, but I imagine I am not the only one who will run into this problem.

You could always move past MAQ, onto its successor BWA (same author). Most likely the SAM output of BWA is compliant (hl3 is the author of the C version of SAMtools, MAQ and BWA). There also other aligners out there that I believe support the SAM format, like Bowtie, SOAP or BFAST (my own admittedly).

wjeck · 08-04-2009, 01:27 PM

Originally posted by nilshomer View Post

If it is in proper SAM format, then there in the flag field there should be two bits that identify if it is "read1" or "read2" (see 2.2.2 http://samtools.sourceforge.net/SAM1.pdf). The latest trunk C version has a "-X" option in the "samtools view" command that can better display the flag field for you.

This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair

I can hack a workaround, but I imagine I am not the only one who will run into this problem.

nilshomer · 08-04-2009, 01:12 PM

Originally posted by wjeck View Post

Quick additional question about maq2sam-long. I am finding that when I move maq alignments that include paired end data into sam format I loose the unique identifiers for each of the reads. Instead, both reads of the pair end up receiving the same name. This would be fine, except that when I then export the data for use with the Integrative Genomics Viewer I can't visualize the pairing information as I would like to.

Is there a way to get maq's map files into SAM format while preserving the unique pair names and pairing information? Perhaps this stripping of unique names is appropriate and I am confused about the standard SAM format? Any help would be appreciated.

If it is in proper SAM format, then there in the flag field there should be two bits that identify if it is "read1" or "read2" (see 2.2.2 http://samtools.sourceforge.net/SAM1.pdf). The latest trunk C version has a "-X" option in the "samtools view" command that can better display the flag field for you.

wjeck · 08-04-2009, 01:00 PM

Quick additional question about maq2sam-long. I am finding that when I move maq alignments that include paired end data into sam format I loose the unique identifiers for each of the reads. Instead, both reads of the pair end up receiving the same name. This would be fine, except that when I then export the data for use with the Integrative Genomics Viewer I can't visualize the pairing information as I would like to.

Is there a way to get maq's map files into SAM format while preserving the unique pair names and pairing information? Perhaps this stripping of unique names is appropriate and I am confused about the standard SAM format? Any help would be appreciated.

griffon42 · 08-04-2009, 09:39 AM

To add a bit to xguo's questions regarding the varFilter in samtools:

I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.

Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?

Thanks!

nilshomer · 08-01-2009, 10:09 AM

I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".

The question is how do you identify PCR duplicates with high coverage data? If you have 50bp reads with 1000x coverage, then on average you should get 20 reads with the same start site at a given position. So if you remove PCR duplicates in this situation, you should on average reduce your coverage by a factor of 20.

Now if you have paired end reads, and the insert distribution is not tight (say >500bp), you could enforce both ends must map to the same start site, which would only remove say >1% of the true data (making this up).

If you are trying to measure gene expression, then I would not filter out PCR duplicates at high coverage data. If you are trying to find novel splice events, then the case is more ambiguous.

xguo · 08-01-2009, 06:45 AM

[QUOTE=nilshomer;7023]Why did you restrict it in the first place? I am probably missing something obvious.

I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".

nilshomer · 07-31-2009, 07:42 PM

Originally posted by xguo View Post

nilshomer, thanks for your prompt reply.

I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?

Why did you restrict it in the first place? I am probably missing something obvious.

The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?

If you use MAQ then the maximum mapping quality is the same. If you go to this page you will see a description of the consensus calling, which mentions the maximum mapping quality among other things. I don't see RMS though.

Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?

Assuming they represent true probabilities (a liberal assumption) then you should select it to match your maximum false-positive rate to obtain the largest number of SNPS.

In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?

The above is a very crude approximation, due to the extreme variability of coverage as well as the large dependency on the local repeat structure (context) when it comes to mapping. In the pileup format, there is a column for the # of reads covering the given consensus call.

Sorry for so many questions. Thanks a lot for any insight.

I don't use MAQ very much, since I am an author of my own aligner BFAST. On the other hand, samtools was developed by the same author as MAQ (Heng Li) so you should be safe making some of the above assumptions about MAQ and samtools.

xguo · 07-31-2009, 11:05 AM

nilshomer, thanks for your prompt reply.

I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth? The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq? Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them? In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?

Sorry for so many questions. Thanks a lot for any insight.

nilshomer · 07-31-2009, 09:52 AM

Originally posted by xguo View Post

Can anyone help me with two options in samtools view?

-f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
-F INT Skip alignments with bits present in INT [0]

I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?

See the SAM format for a description of the "FLAG" field. With the "FLAG" field, you can filter out unmapped reads, unpaired reads (paired end reads where one end maps), among other things. To remove low quality mappings (enforce a mapping quality of 50 or greater), I would just use the following command:

Code:

samtools view <in.bam> | awk '{ if($5 > 50) { print $0 }}'

You can save the above to a file, or import it back into a "bam" file.

In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?

thanks
Xiang

In my opinion it is better to use as much information during mapping as the alignment program should be able to squeeze out extra information from the last five bases. You can then "filter" the alignment file (SAM) and remove the last five bases (you will have to do this yourself).

xguo · 07-31-2009, 06:26 AM

Can anyone help me with two options in samtools view?

-f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
-F INT Skip alignments with bits present in INT [0]

I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?

In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?

thanks
Xiang

Topics	Statistics	Last Post
Large-Scale Protein Screen Uncovers Hidden Regulators of Alternative Polyadenylation by SEQadmin2 Started by SEQadmin2, 06-26-2026, 11:10 AM	0 responses 12 views 0 reactions	Last Post by SEQadmin2 06-26-2026, 11:10 AM
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population by SEQadmin2 Started by SEQadmin2, 06-17-2026, 06:09 AM	0 responses 46 views 0 reactions	Last Post by SEQadmin2 06-17-2026, 06:09 AM
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism by SEQadmin2 Started by SEQadmin2, 06-09-2026, 11:58 AM	0 responses 106 views 0 reactions	Last Post by SEQadmin2 06-09-2026, 11:58 AM
A New Method Makes Hantavirus Genome Analysis Faster and More Accessible by SEQadmin2 Started by SEQadmin2, 06-05-2026, 10:09 AM	0 responses 125 views 0 reactions	Last Post by SEQadmin2 06-05-2026, 10:09 AM

Unconfigured Ad

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News