samtools rmdup
It looks like removal of duplicates for paired end reads is limited to cases when both reads map to the same chromosome. Is this correct? If yes, is there an alternative?
Unconfigured Ad
Collapse
X
-
Could you send a small example to me? If read names are ended up with "/[12]", they should be converted in principle.Originally posted by wjeck View PostThis seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair
I can hack a workaround, but I imagine I am not the only one who will run into this problem.
By the way, thank Nils very much for replying these questions. I am not here in the forum as often as before.
Leave a comment:
-
-
The caller is pretty much the same, but the default rules used in filtering are slightly different. When I use maq and samtools to call SNPs on maq alignment, 99% of them are identical. What parameters are you using (in particular the maximum depth)?Originally posted by griffon42 View PostTo add a bit to xguo's questions regarding the varFilter in samtools:
I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.
Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?
Thanks!
Leave a comment:
-
-
Yes, you need this unless you are doing target sequencing in which case the read depth is expected to vary a lot.Originally posted by xguo View PostI got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?
RMS=root mean square. It is more stable than maximum mapping quality in filtration.Originally posted by xguoThe mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?
You should try it out. Different project may prefer different threshold.Originally posted by xguoConsensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?
You can only calculate by yourself. The number of occurrences is stored in the X0 tag which is specific to BWA.In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?
Leave a comment:
-
-
You could always move past MAQ, onto its successor BWA (same author). Most likely the SAM output of BWA is compliant (hl3 is the author of the C version of SAMtools, MAQ and BWA). There also other aligners out there that I believe support the SAM format, like Bowtie, SOAP or BFAST (my own admittedly).Originally posted by wjeck View PostThis seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair
I can hack a workaround, but I imagine I am not the only one who will run into this problem.
Leave a comment:
-
-
This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pairOriginally posted by nilshomer View PostIf it is in proper SAM format, then there in the flag field there should be two bits that identify if it is "read1" or "read2" (see 2.2.2 http://samtools.sourceforge.net/SAM1.pdf). The latest trunk C version has a "-X" option in the "samtools view" command that can better display the flag field for you.
I can hack a workaround, but I imagine I am not the only one who will run into this problem.
Leave a comment:
-
-
If it is in proper SAM format, then there in the flag field there should be two bits that identify if it is "read1" or "read2" (see 2.2.2 http://samtools.sourceforge.net/SAM1.pdf). The latest trunk C version has a "-X" option in the "samtools view" command that can better display the flag field for you.Originally posted by wjeck View PostQuick additional question about maq2sam-long. I am finding that when I move maq alignments that include paired end data into sam format I loose the unique identifiers for each of the reads. Instead, both reads of the pair end up receiving the same name. This would be fine, except that when I then export the data for use with the Integrative Genomics Viewer I can't visualize the pairing information as I would like to.
Is there a way to get maq's map files into SAM format while preserving the unique pair names and pairing information? Perhaps this stripping of unique names is appropriate and I am confused about the standard SAM format? Any help would be appreciated.
Leave a comment:
-
-
Quick additional question about maq2sam-long. I am finding that when I move maq alignments that include paired end data into sam format I loose the unique identifiers for each of the reads. Instead, both reads of the pair end up receiving the same name. This would be fine, except that when I then export the data for use with the Integrative Genomics Viewer I can't visualize the pairing information as I would like to.
Is there a way to get maq's map files into SAM format while preserving the unique pair names and pairing information? Perhaps this stripping of unique names is appropriate and I am confused about the standard SAM format? Any help would be appreciated.
Leave a comment:
-
-
To add a bit to xguo's questions regarding the varFilter in samtools:
I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.
Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?
Thanks!
Leave a comment:
-
-
The question is how do you identify PCR duplicates with high coverage data? If you have 50bp reads with 1000x coverage, then on average you should get 20 reads with the same start site at a given position. So if you remove PCR duplicates in this situation, you should on average reduce your coverage by a factor of 20.I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".
Now if you have paired end reads, and the insert distribution is not tight (say >500bp), you could enforce both ends must map to the same start site, which would only remove say >1% of the true data (making this up).
If you are trying to measure gene expression, then I would not filter out PCR duplicates at high coverage data. If you are trying to find novel splice events, then the case is more ambiguous.
Leave a comment:
-
-
[QUOTE=nilshomer;7023]Why did you restrict it in the first place? I am probably missing something obvious.
I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".
Leave a comment:
-
-
Why did you restrict it in the first place? I am probably missing something obvious.Originally posted by xguo View Postnilshomer, thanks for your prompt reply.
I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?
If you use MAQ then the maximum mapping quality is the same. If you go to this page you will see a description of the consensus calling, which mentions the maximum mapping quality among other things. I don't see RMS though.The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?
Assuming they represent true probabilities (a liberal assumption) then you should select it to match your maximum false-positive rate to obtain the largest number of SNPS.Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?
The above is a very crude approximation, due to the extreme variability of coverage as well as the large dependency on the local repeat structure (context) when it comes to mapping. In the pileup format, there is a column for the # of reads covering the given consensus call.In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?
I don't use MAQ very much, since I am an author of my own aligner BFAST. On the other hand, samtools was developed by the same author as MAQ (Heng Li) so you should be safe making some of the above assumptions about MAQ and samtools.Sorry for so many questions. Thanks a lot for any insight.
Leave a comment:
-
-
nilshomer, thanks for your prompt reply.
I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth? The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq? Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them? In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?
Sorry for so many questions. Thanks a lot for any insight.
Leave a comment:
-
-
See the SAM format for a description of the "FLAG" field. With the "FLAG" field, you can filter out unmapped reads, unpaired reads (paired end reads where one end maps), among other things. To remove low quality mappings (enforce a mapping quality of 50 or greater), I would just use the following command:Originally posted by xguo View PostCan anyone help me with two options in samtools view?
-f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
-F INT Skip alignments with bits present in INT [0]
I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?
You can save the above to a file, or import it back into a "bam" file.Code:samtools view <in.bam> | awk '{ if($5 > 50) { print $0 }}'
In my opinion it is better to use as much information during mapping as the alignment program should be able to squeeze out extra information from the last five bases. You can then "filter" the alignment file (SAM) and remove the last five bases (you will have to do this yourself).In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?
thanks
Xiang
Leave a comment:
-
-
Can anyone help me with two options in samtools view?
-f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
-F INT Skip alignments with bits present in INT [0]
I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?
In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?
thanks
Xiang
Leave a comment:
-
Latest Articles
Collapse
-
by SEQadmin2
I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.
Here are nine questions we think about, in roughly the order they matter, before...-
Channel: Articles
06-18-2026, 07:11 AM -
-
by SEQadmin2
Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.
The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
...-
Channel: Articles
06-02-2026, 10:05 AM -
ad_right_rmr
Collapse
News
Collapse
| Topics | Statistics | Last Post | ||
|---|---|---|---|---|
|
Started by SEQadmin2, 06-26-2026, 11:10 AM
|
0 responses
12 views
0 reactions
|
Last Post
by SEQadmin2
06-26-2026, 11:10 AM
|
||
|
Whole-Genome Sequencing Traces Faroe Islands Ancestry to a North Atlantic Founder Population
by SEQadmin2
Started by SEQadmin2, 06-17-2026, 06:09 AM
|
0 responses
46 views
0 reactions
|
Last Post
by SEQadmin2
06-17-2026, 06:09 AM
|
||
|
Sequencing the Two-Toed Sloth Genome Reveals Jumping Genes Tied to Its Extreme Metabolism
by SEQadmin2
Started by SEQadmin2, 06-09-2026, 11:58 AM
|
0 responses
106 views
0 reactions
|
Last Post
by SEQadmin2
06-09-2026, 11:58 AM
|
||
|
Started by SEQadmin2, 06-05-2026, 10:09 AM
|
0 responses
125 views
0 reactions
|
Last Post
by SEQadmin2
06-05-2026, 10:09 AM
|
Leave a comment: