Seqanswers Leaderboard Ad

**cmbetts** · 02-10-2015, 02:58 PM

If you can get the read names into two R character vectors, you could use the various set operations like setdiff.
There's probably a much more elegant (and faster) way to do this, but quick and dirty:

library("GenomicAlignments");
reads_alnmt1 <- names(readGAlignments(bam_file1, use.names=TRUE));
reads_alnmt2 <- names(readGAlignments(bam_file2, use.names=TRUE));

unique1 <- setdiff(reads_alnmt1, reads_alnmt2);
unique2 <- setdiff(reads_alnmt2, reads_alnmt1);

**Brian Bushnell** · 02-10-2015, 03:15 PM

It's possible to do so with BBTools like this:

reformat.sh in=file1.sam out=mapped1.sam mappedonly
reformat.sh in=file2.sam out=mapped2.sam mappedonly

That gets you the mapped reads only. Then:

filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

...which gets you the set intersection;

filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f

...which get you the set subtractions.

**dpryan** · 02-11-2015, 12:07 AM

This is a simple enough sort of thing to write in python with pysam or C with htslib that I'd personally just do that. For example, here is a small C program that will accept two BAM files and tell you which alignments have different mapping positions or CIGAR strings. It'd be simple to modify that to compare only mapped vs unmapped and keep a count.

Note: That's not production code, just an ad hoc solution with no error checking.

**Marisa_Miller** · 02-11-2015, 09:35 AM

Originally posted by Brian Bushnell View Post

It's possible to do so with BBTools like this:

reformat.sh in=file1.sam out=mapped1.sam mappedonly
reformat.sh in=file2.sam out=mapped2.sam mappedonly

That gets you the mapped reads only. Then:

filterbyname.sh in=mapped1.sam names=mapped2.sam out=shared.sam include=t

...which gets you the set intersection;

filterbyname.sh in=mapped1.sam names=mapped2.sam out=only1.sam include=f
filterbyname.sh in=mapped2.sam names=mapped1.sam out=only2.sam include=f

...which get you the set subtractions.

Hi Brian,

Thanks again for the help. I am currently running the program now and it gave me an error message:

Code:

java -ea -Xmx200m -cp /home/kianians/millerm2/scripts/bbmap/current/ jgi.ReformatReads in=1_trimq10_100bpminlen_PE_cs_mt.sam out=1_trimq10_100bpminlen_PE_cs_mt_mapped.sam mappedonly
Executing jgi.ReformatReads [in=1_trimq10_100bpminlen_PE_cs_mt.sam, out=1_trimq10_100bpminlen_PE_cs_mt_mapped.sam, mappedonly]

Input is being processed as unpaired
Exception in thread "Thread-3" java.lang.AssertionError: Sam format cannot be used as input to this program when no genome build is loaded.
Please index the reference first and rerun with e.g. 'build=1', or use a different input format.
	at stream.SamLine.<init>(SamLine.java:138)
	at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:117)
	at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:19)

The program is still running so I'm not sure if it will complete normally or not. An output file has been created but it is 0 bytes currently.

**Brian Bushnell** · 02-11-2015, 10:12 AM

Hi Marisa,

I think you're using an old version; Reformat did not use to be able to input and output sam files, but it can now. Just kill that and rerun it with the latest version (34.46).

-Brian

**Marisa_Miller** · 02-12-2015, 06:41 AM

Originally posted by Brian Bushnell View Post

Hi Marisa,

I think you're using an old version; Reformat did not use to be able to input and output sam files, but it can now. Just kill that and rerun it with the latest version (34.46).

-Brian

Hi Brian,

I downloaded the new version and I was able to successfully filter out the mapped reads only.

When I try to get the intersection between the two files I am getting this error:

Code:

Exception in thread "main" java.lang.AssertionError: Tried to make a SamLine from a header: @HD
	at stream.SamLine.<init>(SamLine.java:420)
	at stream.SamLine.<init>(SamLine.java:49)
	at driver.FilterReadsByName.<init>(FilterReadsByName.java:144)
	at driver.FilterReadsByName.main(FilterReadsByName.java:41)

Can I fix this by removing the headers from the sam files containing only mapped reads?

Thank you,
Marisa

**Brian Bushnell** · 02-12-2015, 10:52 AM

Wow, I'm just full of fail! Yes, that would work. I will fix that bug in the next release.

**Marisa_Miller** · 02-12-2015, 12:31 PM

Originally posted by Brian Bushnell View Post

Wow, I'm just full of fail! Yes, that would work. I will fix that bug in the next release.

Sorry to bug you with another bug! I thought filterbyname.sh was working fine but I realized made an error in my command. Due to a typing error I was using a file that doesn't exist as my "names" file for filterbyname.sh. I realized I can input any made up file name for the "names" file, but if I put in a made up or wrong file name for the "in" file I will get an error message saying no such file exists.

When I run the command using both the "in" and "names" sam files with no headers, I am now getting the following error:

Code:

 java -ea -Xmx5534m -cp /home/kianians/millerm2/scripts/bbmap/current/ driver.FilterReadsByName in=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/mitochondria/chinese_spring/2_cs_mt_mapped_noheader.sam names=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/chloroplast/chinese_spring/2_cs_chlr_mapped_noheader.sam out=2_cs_mt_chlr_shared.sam include=t
Executing driver.FilterReadsByName [in=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/mitochondria/chinese_spring/2_cs_mt_mapped_noheader.sam, names=/home/kianians/millerm2/mitochondria_sequencing/samples1_14/mapping/chloroplast/chinese_spring/2_cs_chlr_mapped_noheader.sam, out=2_cs_mt_chlr_shared.sam, include=t]

Input is being processed as unpaired
Exception in thread "Thread-3" java.lang.AssertionError: Sam format cannot be used as input to this program when no genome build is loaded.
Please index the reference first and rerun with e.g. 'build=1', or use a different input format.
	at stream.SamLine.<init>(SamLine.java:93)
	at stream.ReadStreamByteWriter.writeSam(ReadStreamByteWriter.java:442)
	at stream.ReadStreamByteWriter.processJobs(ReadStreamByteWriter.java:86)
	at stream.ReadStreamByteWriter.run2(ReadStreamByteWriter.java:41)
	at stream.ReadStreamByteWriter.run(ReadStreamByteWriter.java:27)
Exception in thread "main" java.lang.RuntimeException: Writing to a terminated thread.
	at stream.ConcurrentGenericReadOutputStream.write(ConcurrentGenericReadOutputStream.java:198)
	at stream.ConcurrentGenericReadOutputStream.addDisordered(ConcurrentGenericReadOutputStream.java:193)
	at stream.ConcurrentGenericReadOutputStream.add(ConcurrentGenericReadOutputStream.java:97)
	at driver.FilterReadsByName.process(FilterReadsByName.java:354)
	at driver.FilterReadsByName.main(FilterReadsByName.java:42)

Any suggestions? And thanks again for all the support you provide for your package!

**Brian Bushnell** · 02-12-2015, 01:46 PM

OK, I've fixed that too, in v34.48

I had never actually run it before on sam files, just fasta and fastq, so thanks for finding those bugs!

FYI, the reason it does not complain if the filename is wrong for "names" is because if it is a filename, it will use the names in the file; if it is not a valid filename, it will use that as a literal list of names to filter against.

**Marisa_Miller** · 02-12-2015, 02:39 PM

Originally posted by Brian Bushnell View Post

OK, I've fixed that too, in v34.48

I had never actually run it before on sam files, just fasta and fastq, so thanks for finding those bugs!

FYI, the reason it does not complain if the filename is wrong for "names" is because if it is a filename, it will use the names in the file; if it is not a valid filename, it will use that as a literal list of names to filter against.

Thank you! Just tested the new version and it's working now. And that makes sense about the "names" input. I will keep that in mind for the future.

**sschavan** · 06-07-2015, 09:06 PM

Hello Brian,

Thanks for making this tool available. I am trying to use it to compare two sam files and find the reads that are unique to one of the bam files. However, I encountered the below error. Could you possibly point out the problem here?

Your help is highly appreciated.

Thank you,
SSC

Command:
bbmap/filterbyname.sh in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f

Console:
java -ea -Xmx55181m -cp /Align2MouseGenome/bbmap/current/ driver.FilterReadsByName in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f
Executing driver.FilterReadsByName [in=HumReads_BBv4.sam, names=MouReads_BBv4.sam, out=OnlyHumReads_BBv4.sam, include=f]

java.lang.AssertionError: 24056740, 24056739
at stream.SamLine.toRead(SamLine.java:1490)
at stream.SamLine.toRead(SamLine.java:1451)
at stream.SamReadInputStream.toReadList(SamReadInputStream.java:125)
at stream.SamReadInputStream.fillBuffer(SamReadInputStream.java:98)
at stream.SamReadInputStream.nextList(SamReadInputStream.java:81)
at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:667)
at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:646)
Time: 1900.461 seconds.
Reads Processed: 17281200 9.09k reads/sec
Bases Processed: 1732191375 0.91m bases/sec
Reads Out: 0
Bases Out: 0
Exception in thread "main" java.lang.RuntimeException: driver.FilterReadsByName terminated in an error state; the output may be corrupt.
at driver.FilterReadsByName.process(FilterReadsByName.java:363)
at driver.FilterReadsByName.main(FilterReadsByName.java:42)

**GenoMax** · 06-08-2015, 01:17 AM

@sschavan: It appears that you are trying to separate human/mouse data. While Brian responds to your specific question you can do the separation using BBSplit with original reads: http://seqanswers.com/forums/showthread.php?t=41288

**sschavan** · 06-08-2015, 08:27 AM

Hello GenoMax,

Thank you for the suggestion, yes I am trying to separate Human and Mouse reads, I will look into BBSplit as well, however, I was wondering about the alignment stringency that is considered by BBSplit to call a read as mapped to Human or Mouse reference, and how much that would differ as compared to BWA/Stampy.

SSC

**Brian Bushnell** · 06-08-2015, 08:59 AM

By default BBSplit uses fairly strict mapping parameters; you can get the same sensitivity as BBMap by adding the flags "minid=0.76 maxindel=16k minhits=1". With those parameters it is extremely sensitive.

BBSplit has different ambiguity settings for dealing with reads that map to multiple genomes. In any case, if the alignment score is higher to one genome than another, it will be associated with that genome only (this considers the combined scores of read pairs - pairs are always kept together). But when a read or pair has two identically-scoring mapping locations, on different genomes, the behavior is controlled by the "ambig2" flag - "ambig2=toss" will discard the read, "all" will send it to all output files, and "split" will send it to a separate file for ambiguously-mapped reads (one per genome to which it maps).

As for your specific error - that seems to be possibly related to a read with an invalid cigar string with more clipped bases than actual bases, or a read of zero length, or something like that. The program may have problems with secondary alignments in which the bases are suppressed (with a * symbol instead) - I have not extensively tested it on sam files, just fasta/fastq. I'll see if I can replicate and fix it (if it's a bug), and at least make it print out the actual line that causes the problem, since the current error message is not very useful. In the meantime, the error message appears to be harmless and you can skip it by disabling assertions with the -da flag:

bbmap/filterbyname.sh in=HumReads_BBv4.sam names=MouReads_BBv4.sam out=OnlyHumReads_BBv4.sam include=f -da

Topics	Statistics	Last Post
ASHG 2024 Highlights – Part Two by seqadmin Started by seqadmin, Today, 11:09 AM	0 responses 22 views 0 likes	Last Post by seqadmin Today, 11:09 AM
ASHG 2024 Highlights – Part One by seqadmin Started by seqadmin, Today, 06:13 AM	0 responses 19 views 0 likes	Last Post by seqadmin Today, 06:13 AM
Seq-Scope Expands Possibilities for High-Resolution Gene Expression Analysis by seqadmin Started by seqadmin, 11-01-2024, 06:09 AM	0 responses 30 views 0 likes	Last Post by seqadmin 11-01-2024, 06:09 AM
New Model Aims to Explain Polygenic Diseases by Connecting Genomic Mutations and Regulatory Networks by seqadmin Started by seqadmin, 10-30-2024, 05:31 AM	0 responses 21 views 0 likes	Last Post by seqadmin 10-30-2024, 05:31 AM

Seqanswers Leaderboard Ad

Announcement

Compare mapped reads in SAM/BAM files

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News