Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • ech
    replied
    samtools rmdup

    It looks like removal of duplicates for paired end reads is limited to cases when both reads map to the same chromosome. Is this correct? If yes, is there an alternative?

    Leave a comment:


  • lh3
    replied
    Originally posted by wjeck View Post
    This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair

    I can hack a workaround, but I imagine I am not the only one who will run into this problem.
    Could you send a small example to me? If read names are ended up with "/[12]", they should be converted in principle.

    By the way, thank Nils very much for replying these questions. I am not here in the forum as often as before.

    Leave a comment:


  • lh3
    replied
    Originally posted by griffon42 View Post
    To add a bit to xguo's questions regarding the varFilter in samtools:

    I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.

    Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?

    Thanks!
    The caller is pretty much the same, but the default rules used in filtering are slightly different. When I use maq and samtools to call SNPs on maq alignment, 99% of them are identical. What parameters are you using (in particular the maximum depth)?

    Leave a comment:


  • lh3
    replied
    Originally posted by xguo View Post
    I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?
    Yes, you need this unless you are doing target sequencing in which case the read depth is expected to vary a lot.

    Originally posted by xguo
    The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?
    RMS=root mean square. It is more stable than maximum mapping quality in filtration.

    Originally posted by xguo
    Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?
    You should try it out. Different project may prefer different threshold.

    In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?
    You can only calculate by yourself. The number of occurrences is stored in the X0 tag which is specific to BWA.

    Leave a comment:


  • nilshomer
    replied
    Originally posted by wjeck View Post
    This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair

    I can hack a workaround, but I imagine I am not the only one who will run into this problem.
    You could always move past MAQ, onto its successor BWA (same author). Most likely the SAM output of BWA is compliant (hl3 is the author of the C version of SAMtools, MAQ and BWA). There also other aligners out there that I believe support the SAM format, like Bowtie, SOAP or BFAST (my own admittedly).

    Leave a comment:


  • wjeck
    replied
    Originally posted by nilshomer View Post
    If it is in proper SAM format, then there in the flag field there should be two bits that identify if it is "read1" or "read2" (see 2.2.2 http://samtools.sourceforge.net/SAM1.pdf). The latest trunk C version has a "-X" option in the "samtools view" command that can better display the flag field for you.
    This seems to be the crux of the problem, then. For whatever reason that information is not surviving the conversion by maq2sam-long. Specifically neither the 0x0040 nor 0x0080 is being tacked on to identify the first or second read in the pair

    I can hack a workaround, but I imagine I am not the only one who will run into this problem.

    Leave a comment:


  • nilshomer
    replied
    Originally posted by wjeck View Post
    Quick additional question about maq2sam-long. I am finding that when I move maq alignments that include paired end data into sam format I loose the unique identifiers for each of the reads. Instead, both reads of the pair end up receiving the same name. This would be fine, except that when I then export the data for use with the Integrative Genomics Viewer I can't visualize the pairing information as I would like to.

    Is there a way to get maq's map files into SAM format while preserving the unique pair names and pairing information? Perhaps this stripping of unique names is appropriate and I am confused about the standard SAM format? Any help would be appreciated.
    If it is in proper SAM format, then there in the flag field there should be two bits that identify if it is "read1" or "read2" (see 2.2.2 http://samtools.sourceforge.net/SAM1.pdf). The latest trunk C version has a "-X" option in the "samtools view" command that can better display the flag field for you.

    Leave a comment:


  • wjeck
    replied
    Quick additional question about maq2sam-long. I am finding that when I move maq alignments that include paired end data into sam format I loose the unique identifiers for each of the reads. Instead, both reads of the pair end up receiving the same name. This would be fine, except that when I then export the data for use with the Integrative Genomics Viewer I can't visualize the pairing information as I would like to.

    Is there a way to get maq's map files into SAM format while preserving the unique pair names and pairing information? Perhaps this stripping of unique names is appropriate and I am confused about the standard SAM format? Any help would be appreciated.

    Leave a comment:


  • griffon42
    replied
    To add a bit to xguo's questions regarding the varFilter in samtools:

    I'm also struggling to understand the differences in varFilter relative to MAQ's SNPfilter. Using the exact same set of bowtie alignments (converted to either SAM format or MAQ format) I get dramatically different (3x) numbers of SNPs post-filtering step in samtools varFilter versus MAQ's SNPfilter.

    Can anyone (Heng?) shed some additional light on the differences in SNP calling policies employed by samtools varFilter?

    Thanks!

    Leave a comment:


  • nilshomer
    replied
    I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".
    The question is how do you identify PCR duplicates with high coverage data? If you have 50bp reads with 1000x coverage, then on average you should get 20 reads with the same start site at a given position. So if you remove PCR duplicates in this situation, you should on average reduce your coverage by a factor of 20.

    Now if you have paired end reads, and the insert distribution is not tight (say >500bp), you could enforce both ends must map to the same start site, which would only remove say >1% of the true data (making this up).

    If you are trying to measure gene expression, then I would not filter out PCR duplicates at high coverage data. If you are trying to find novel splice events, then the case is more ambiguous.

    Leave a comment:


  • xguo
    replied
    [QUOTE=nilshomer;7023]Why did you restrict it in the first place? I am probably missing something obvious.

    I read somewhere that the covered region may be a repeat if the coverage is too high, so the SNPs called may not be accurate. I don't know if it's the reason to restrict the maximum coverage. However, RNASeq is different from DNA sequencing. There may be hundreds of RNA copies in the cell, resulting in extremely high coverage for certain transcripts. Removing duplicate reads merely reduces PCR artifacts. Highly expressed transcripts still have very high coverage. Is it right? On a side note, is it necessary to remove duplicates and use only reads with unique starting sites for the gene expression and alternative splicing analysis? For my data, half of the mapped reads will be removed if I run "samtools rmdupse".

    Leave a comment:


  • nilshomer
    replied
    Originally posted by xguo View Post
    nilshomer, thanks for your prompt reply.

    I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth?
    Why did you restrict it in the first place? I am probably missing something obvious.

    The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq?
    If you use MAQ then the maximum mapping quality is the same. If you go to this page you will see a description of the consensus calling, which mentions the maximum mapping quality among other things. I don't see RMS though.

    Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them?
    Assuming they represent true probabilities (a liberal assumption) then you should select it to match your maximum false-positive rate to obtain the largest number of SNPS.

    In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?
    The above is a very crude approximation, due to the extreme variability of coverage as well as the large dependency on the local repeat structure (context) when it comes to mapping. In the pileup format, there is a column for the # of reads covering the given consensus call.

    Sorry for so many questions. Thanks a lot for any insight.
    I don't use MAQ very much, since I am an author of my own aligner BFAST. On the other hand, samtools was developed by the same author as MAQ (Heng Li) so you should be safe making some of the above assumptions about MAQ and samtools.

    Leave a comment:


  • xguo
    replied
    nilshomer, thanks for your prompt reply.

    I got a list of candidate SNPs using BWA and samtools for RNASeq data, and am trying to weigh various filtering options. "samtools.pl varFilter" gives a list of filtering criterion with default setting. The maximum read depth is set at 100. Given that duplicate reads have been removed by "samtools rmdup", do I still need to limit the maximum read depth? The mapping quality is called RMS. Is it different from the maximum mapping quality used by maq? Consensus and SNP quality are not included in samtools.pl varFilter. What do you think are reasonable thresholds for them? In maq, the average number of hits of reads covering the candidate SNP site is said to give the copy number of the flanking region in the reference genome. Is it still possible to get this value since BWA uses a different alignment algorithm than maq?

    Sorry for so many questions. Thanks a lot for any insight.

    Leave a comment:


  • nilshomer
    replied
    Originally posted by xguo View Post
    Can anyone help me with two options in samtools view?

    -f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
    -F INT Skip alignments with bits present in INT [0]

    I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?
    See the SAM format for a description of the "FLAG" field. With the "FLAG" field, you can filter out unmapped reads, unpaired reads (paired end reads where one end maps), among other things. To remove low quality mappings (enforce a mapping quality of 50 or greater), I would just use the following command:

    Code:
    samtools view <in.bam> | awk '{ if($5 > 50) { print $0 }}'
    You can save the above to a file, or import it back into a "bam" file.

    In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?

    thanks
    Xiang
    In my opinion it is better to use as much information during mapping as the alignment program should be able to squeeze out extra information from the last five bases. You can then "filter" the alignment file (SAM) and remove the last five bases (you will have to do this yourself).

    Leave a comment:


  • xguo
    replied
    Can anyone help me with two options in samtools view?

    -f INT Only output alignments with all bits in INT present in the FLAG field. INT can be in hex in the format of /^0x[0-9A-F]+/ [0]
    -F INT Skip alignments with bits present in INT [0]

    I'm not sure about how these options can be used. Can they be used to filter out reads with low base quality?

    In addition, if I want to reduce the effect of an unreliable position (e.g. last 5 nt) in the read to the accuracy of SNP calling, is it better to mask it before alignment or discard the individual alignment after the mapping using full-length reads?

    thanks
    Xiang

    Leave a comment:

Latest Articles

Collapse

  • SEQadmin2
    Nine Things a Sample Prep Scientist Thinks About Before Sequencing
    by SEQadmin2


    I’m not a sequencing expert. I’m a purification scientist who uses NGS to evaluate workflows my group develops. With this perspective, we think about the sample first and the NGS workflow second. The sequencer is an exceptionally honest reporter, but it can only report on what you give it, so whether you get clean, interpretable data from an NGS workflow is largely determined before you begin.

    Here are nine questions we think about, in roughly the order they matter, before...
    06-18-2026, 07:11 AM
  • SEQadmin2
    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
    by SEQadmin2


    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
    ...
    06-02-2026, 10:05 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by SEQadmin2, 06-26-2026, 11:10 AM
0 responses
12 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-17-2026, 06:09 AM
0 responses
46 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-09-2026, 11:58 AM
0 responses
106 views
0 reactions
Last Post SEQadmin2  
Started by SEQadmin2, 06-05-2026, 10:09 AM
0 responses
125 views
0 reactions
Last Post SEQadmin2  
Working...