Compare mapped reads in SAM/BAM files

sschavan replied

06-15-2015, 11:09 AM
Thank you Brian, this information certainly helps. I will try it out.

SSC
Leave a comment:
Brian Bushnell replied

06-10-2015, 11:49 AM
BBSplit supports all the flags supported by BBMap, including these:

Code:

idfilter=0 Independant of minid; sets exact minimum identity allowed for alignments to be printed. Range 0 to 1. subfilter=-1 Ban alignments with more than this many substitutions. insfilter=-1 Ban alignments with more than this many insertions. delfilter=-1 Ban alignments with more than this many deletions. indelfilter=-1 Ban alignments with more than this many indels. editfilter=-1 Ban alignments with more than this many edits. inslenfilter=-1 Ban alignments with an insertion longer than this. dellenfilter=-1 Ban alignments with a deletion longer than this.

Those are postfiltering operations - they take place after alignment, after ambiguity was decided, etc. and just suppress alignments failing the filters.

You can alternately map with the "semiperfectmode" flag, which will only allow alignments in which there are no mismatches or indels (this is different from "perfectmode" in that semiperfectmode allows a read to go off the end of a contig; perfectmode requires all bases to match the reference). Or, you could use the flag "minid=0.99" which will require 99% identity. These latter options are faster than mapping at default sensitivity, then imposing a specific filter.
Last edited by Brian Bushnell; 06-15-2015, 12:07 PM.
Leave a comment:
sschavan replied

06-10-2015, 10:27 AM
Hello Brian,

Thank you for the detailed explanation, I appreciate your response.

I was wondering if I can set BBSplit to allow (a) maximum one mismatch and (b) 0 mismatch. If I could set it in this way probably I could use it to separate Mouse and Human reads, at a high stringency level, to start with.

About the error in filterbyname.sh, the result output (OnlyHumReads_BBv4.sam) was an incomplete file with only two header lines, presumably its truncated output. I am re-running it with -da flag, I will update if that worked for me.

Thanks,
SSC
Leave a comment:
Brian Bushnell replied

06-08-2015, 08:59 AM
By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive.

BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).

As for your specific error - that seems to be possibly related to a read with an invalid cigar string with more clipped bases than actual bases, or a read of zero length, or something like that. The program may have problems with secondary alignments in which the bases are suppressed (with a * symbol instead) - I have not extensively tested it on sam files, just fasta/fastq. I'll see if I can replicate and fix it (if it's a bug), and at least make it print out the actual line that causes the problem, since the current error message is not very useful. In the meantime, the error message appears to be harmless and you can skip it by disabling assertions with the -da flag:

bbmap/filterbyname.sh in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f -da
Leave a comment:
sschavan replied

06-08-2015, 08:27 AM
Hello GenoMax,

Thank you for the suggestion, yes I am trying to separate Human and Mouse reads, I will look into BBSplit as well, however, I was wondering about the alignment stringency that is considered by BBSplit to call a read as mapped to Human or Mouse reference, and how much that would differ as compared to BWA/Stampy.

SSC
Leave a comment:
GenoMax replied

06-08-2015, 01:17 AM
@sschavan: It appears that you are trying to separate human/mouse data. While Brian responds to your specific question you can do the separation using BBSplit with original reads: http://seqanswers.com/forums/showthread.php?t=41288
Leave a comment:
sschavan replied

06-07-2015, 09:06 PM
Hello Brian,

Thanks for making this tool available. I am trying to use it to compare two sam files and find the reads that are unique to one of the bam files. However, I encountered the below error. Could you possibly point out the problem here?

Your help is highly appreciated.

Thank you,
SSC

Command:
bbmap/filterbyname.sh in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f

Console:
java -ea -Xmx55181m -cp /Align2MouseGenome/bbmap/current/ driver.FilterReadsByName in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f
Executing driver.FilterReadsByName [in=HumReads_BBv4.sam, names=MouReads_BBv4.sam, out=OnlyHumReads_BBv4.sam, include=f]

java.lang.AssertionError: 24056740, 24056739
at stream.SamLine.toRead(SamLine.java:1490)
at stream.SamLine.toRead(SamLine.java:1451)
at stream.SamReadInputStream.toReadList(SamReadInputStream.java:125)
at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:98)
at stream.SamReadInputStream.nextList(SamReadInputStream.java:81)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:667)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:646)
Time: 1900.461 seconds.
Reads Processed: 17281200 9.09k reads/sec
Bases Processed: 1732191375 0.91m bases/sec
Reads Out: 0
Bases Out: 0
Exception in thread "main" java.lang.RuntimeException: driver.FilterReadsByName terminated in an error state; the output may be corrupt.
at driver.FilterReadsByName.process(FilterReadsByName.java:363)
at driver.FilterReadsByName.main(FilterReadsByName.java:42)
Leave a comment:
Marisa_Miller replied

02-12-2015, 02:39 PM
Originally posted by Brian Bushnell View Post

OK, I've fixed that too, in v34.48 I had never actually run it before on sam files, just fasta and fastq, so thanks for finding those bugs!

FYI, the reason it does not complain if the filename is wrong for "names" is because if it is a filename, it will use the names in the file; if it is not a valid filename, it will use that as a literal list of names to filter against.

Thank you! Just tested the new version and it's working now. And that makes sense about the "names" input. I will keep that in mind for the future.
Leave a comment:
Brian Bushnell replied

02-12-2015, 01:46 PM
OK, I've fixed that too, in v34.48 I had never actually run it before on sam files, just fasta and fastq, so thanks for finding those bugs!

FYI, the reason it does not complain if the filename is wrong for "names" is because if it is a filename, it will use the names in the file; if it is not a valid filename, it will use that as a literal list of names to filter against.
Leave a comment:

Marisa_Miller replied

02-12-2015, 12:31 PM

Originally posted by Brian Bushnell View Post

Wow, I'm just full of fail! Yes, that would work. I will fix that bug in the next release.

Sorry to bug you with another bug! I thought filterbyname.sh was working fine but I realized made an error in my command. Due to a typing error I was using a file that doesn't exist as my "names" file for filterbyname.sh. I realized I can input any made up file name for the "names" file, but if I put in a made up or wrong file name for the "in" file I will get an error message saying no such file exists.

When I run the command using both the "in" and "names" sam files with no headers, I am now getting the following error:

Code:

 java -ea -Xmx5534m -cp /home/kianians/millerm2/scripts/bbmap/current/ driver.FilterReadsByName in=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/mitochondria/chinese_spring/2_cs_mt_mapped_noheader.sam names=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/chloroplast/chinese_spring/2_cs_chlr_mapped_noheader.sam out=2_cs_mt_chlr_shared.sam include=t
Executing driver.FilterReadsByName [in=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/mitochondria/chinese_spring/2_cs_mt_mapped_noheader.sam, names=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/chloroplast/chinese_spring/2_cs_chlr_mapped_noheader.sam, out=2_cs_mt_chlr_shared.sam, include=t]

Input is being processed as unpaired
Exception in thread "Thread-3" java.lang.AssertionError: Sam format cannot be used as input to this program when no genome build is loaded.
Please index the reference first and rerun with e.g. 'build=1', or use a different input format.
	at stream.SamLine.<init>(SamLine.java:93)
	at stream.ReadStreamByteWriter.writeSam(ReadStreamByteWriter.java:442)
	at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:86)
	at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
	at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
Exception in thread "main" java.lang.RuntimeException: Writing to a terminated thread.
	at stream.ConcurrentGenericReadOutputStream.write(ConcurrentGenericReadOutputStream.java:198)
	at stream.ConcurrentGenericReadOutputStream.addDisordered(ConcurrentGenericReadOutputStream.java:193)
	at stream.ConcurrentGenericReadOutputStream.add(ConcurrentGenericReadOutputStream.java:97)
	at driver.FilterReadsByName.process(FilterReadsByName.java:354)
	at driver.FilterReadsByName.main(FilterReadsByName.java:42)

Any suggestions? And thanks again for all the support you provide for your package!

Leave a comment:

Brian Bushnell replied

02-12-2015, 10:52 AM
Wow, I'm just full of fail! Yes, that would work. I will fix that bug in the next release.
Leave a comment:
Marisa_Miller replied

02-12-2015, 06:41 AM
Originally posted by Brian Bushnell View Post

Hi Marisa,

I think you're using an old version; Reformat did not use to be able to input and output sam files, but it can now. Just kill that and rerun it with the latest version (34.46).

-Brian

Hi Brian,

I downloaded the new version and I was able to successfully filter out the mapped reads only.

When I try to get the intersection between the two files I am getting this error:

Code:

Exception in thread "main" java.lang.AssertionError: Tried to make a SamLine from a header: @HD at stream.SamLine.<init>(SamLine.java:420) at stream.SamLine.<init>(SamLine.java:49) at driver.FilterReadsByName.<init>(FilterReadsByName.java:144) at driver.FilterReadsByName.main(FilterReadsByName.java:41)

Can I fix this by removing the headers from the sam files containing only mapped reads?

Thank you,
Marisa
Leave a comment:
Brian Bushnell replied

02-11-2015, 10:12 AM
Hi Marisa,

I think you're using an old version; Reformat did not use to be able to input and output sam files, but it can now. Just kill that and rerun it with the latest version (34.46).

-Brian

Last edited by Brian Bushnell; 02-11-2015, 10:23 AM.
Leave a comment:
Marisa_Miller replied

02-11-2015, 09:35 AM
Originally posted by Brian Bushnell View Post

It's possible to do so with BBTools like this:

reformat.sh in=file1.sam out=mapped1.sam mappedonly
reformat.sh in=file2.sam out=mapped2.sam mappedonly

That gets you the mapped reads only. Then:

filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

...which gets you the set intersection;

filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f

...which get you the set subtractions.

Hi Brian,

Thanks again for the help. I am currently running the program now and it gave me an error message:

Code:

java -ea -Xmx200m -cp /home/kianians/millerm2/scripts/bbmap/current/ jgi.ReformatReads in=1_trimq10_100bpminlen_PE_cs_mt.sam out=1_trimq10_100bpminlen_PE_cs_mt_mapped.sam mappedonly Executing jgi.ReformatReads [in=1_trimq10_100bpminlen_PE_cs_mt.sam, out=1_trimq10_100bpminlen_PE_cs_mt_mapped.sam, mappedonly] Input is being processed as unpaired Exception in thread "Thread-3" java.lang.AssertionError: Sam format cannot be used as input to this program when no genome build is loaded. Please index the reference first and rerun with e.g. 'build=1', or use a different input format. at stream.SamLine.<init>(SamLine.java:138) at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:117) at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:19)

The program is still running so I'm not sure if it will complete normally or not. An output file has been created but it is 0 bytes currently.
Leave a comment:
dpryan replied

02-11-2015, 12:07 AM
This is a simple enough sort of thing to write in python with pysam or C with htslib that I'd personally just do that. For example, here is a small C program that will accept two BAM files and tell you which alignments have different mapping positions or CIGAR strings. It'd be simple to modify that to compare only mapped vs unmapped and keep a count.

Note: That's not production code, just an ad hoc solution with no error checking.

Last edited by dpryan; 02-11-2015, 12:18 AM.
Leave a comment:

Previous 1 2 template Next

Exploring the Dynamics of the Tumor Microenvironment

by seqadmin

The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
- Channel: Articles
07-08-2024, 03:19 PM

Topics	Statistics	Last Post
Gene Misexpression in the Healthy Human Population by seqadmin Started by seqadmin, Yesterday, 06:46 AM	0 responses 9 views 0 likes	Last Post by seqadmin Yesterday, 06:46 AM
New Method for Rapid Genetic Diagnosis of Mendelian Disorders by seqadmin Started by seqadmin, 07-24-2024, 11:09 AM	0 responses 24 views 0 likes	Last Post by seqadmin 07-24-2024, 11:09 AM
Advancing Nanopore Technology for Portable Sensing Devices by seqadmin Started by seqadmin, 07-19-2024, 07:20 AM	0 responses 159 views 0 likes	Last Post by seqadmin 07-19-2024, 07:20 AM
New RNA-Based Gene Writing Technology Achieves Precise Gene Integration by seqadmin Started by seqadmin, 07-16-2024, 05:49 AM	0 responses 127 views 0 likes	Last Post by seqadmin 07-16-2024, 05:49 AM

Seqanswers Leaderboard Ad

Announcement

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Leave a comment:

Latest Articles

ad_right_rmr

News