  • Originally posted by Brian Bushnell
    If you know the sequence numbers, you can use getreads.sh:
    I have tried it, but I get this Java error:

    Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
    at jgi.GetReads.<init>(GetReads.java:321)
    at jgi.GetReads.main(GetReads.java:37)

    What could be the cause? The input is a normal .fasta file with 22,000 reads from an Illumina MiSeq.



    • Can you please post the exact command you used, and the entire screen output?



      • Originally posted by Brian Bushnell
        Can you please post the exact command you used, and the entire screen output?
        getreads.sh in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594

        java -ea -Xmx200m -cp /common/WORK/bbmap/current/ jgi.GetReads in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594
        Executing jgi.GetReads [in=af_svi_0.05Perc_22633.fasta, out=bla.fasta, id=6272,19594]

        Input is unpaired
        java.lang.AssertionError: Attempting to read from a closed file. Current header: MISEQ:70:000000000-A4FTJ:1:2111:9262:20597 1:N:0:
        at stream.FastaReadInputStream.nextBases(FastaReadInputStream.java:357)
        at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:237)
        at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:210)
        at stream.FastaReadInputStream.nextList(FastaReadInputStream.java:121)
        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
        Time: 0.167 seconds.
        Reads Processed: 19595 117.09k reads/sec
        Bases Processed: 4621k 27.62m bases/sec
        Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
        at jgi.GetReads.<init>(GetReads.java:321)
        at jgi.GetReads.main(GetReads.java:37)

        I get the output file and it seems normal, but I don't know why I get this error.
        Last edited by JeanLove; 05-29-2016, 05:08 AM. Reason: forgot to write something



        • BBMap

          1) I am interested in understanding how BBMap simultaneously aligns reads to multiple genomes. Is this different from using bwa to align to multiple genomes sequentially?

          2) I used BBSplit to split my fastq into hg19 reads and mouse reads. The output was only in SAM format (although I tried to name the outputs _1.fastq.gz and _2.fastq.gz). Is it possible to directly get fastq outputs after aligning to multiple genomes, with something like this:
          out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq

          3) The SAM output headers for both the .mm10.sam and .hg19.sam files contained both mouse and human chromosomes. I had to edit them manually before samtools could do the post-processing. Is there a way around this?



          • 1) The difference is only in highly conserved regions. If you first map to mouse, then map the unmapped reads to human, human reads from regions with high identity to mouse will already have been removed, because they mapped to the mouse. When mapping to both genomes at once, the reads will go to the organism they actually came from because they (should) match that one better.

            2) Can you give your entire command line? I just tried it and verified that I got fastq output when I used this command:

            bbsplit.sh in=reads.fq basename=o_%.fq

            3) Sorry, that's kind of inevitable. Actually, I do not recommend using sam output from BBSplit, only fastq output. In ambig=all mode, sam output will list ambiguously-mapped reads as mapped to only one organism. For example, if a read maps to human and mouse equally well, it will be output in both sam files, but in each case it will be listed as mapped to the same location (which is either human or mouse).
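Brian's point (1) can be sketched with a toy model in Python. The similarity scores and the 0.90 threshold below are made up purely for illustration and have nothing to do with BBMap's actual scoring; the sketch just shows why mapping order matters for reads from conserved regions.

```python
# Toy illustration of sequential vs. simultaneous read assignment.
# Scores are invented alignment identities, not BBMap internals.
reads = {
    "human_conserved": {"mouse": 0.97, "human": 0.99},  # conserved region
    "human_unique":    {"mouse": 0.10, "human": 0.98},
    "mouse_unique":    {"mouse": 0.99, "human": 0.12},
}
THRESHOLD = 0.90

def sequential(reads, first="mouse", second="human"):
    """Map to `first`; only reads that failed to map go on to `second`."""
    assign = {}
    for name, scores in reads.items():
        if scores[first] >= THRESHOLD:
            assign[name] = first
        elif scores[second] >= THRESHOLD:
            assign[name] = second
    return assign

def simultaneous(reads):
    """Assign each read to whichever genome it matches best."""
    return {name: max(scores, key=scores.get) for name, scores in reads.items()}

# The conserved human read is captured by mouse in sequential mode...
print(sequential(reads)["human_conserved"])    # mouse
# ...but goes to the better-matching genome when mapped to both at once.
print(simultaneous(reads)["human_conserved"])  # human
```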



            • Hi Brian,

              Thanks for responding.

              1) That is what I assumed; I wanted to hear it from someone else here.

              2) These are things I am trying with bbsplit:

              Trial 1:
              Code:
              java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19=SL118119_t4_hg19_gatkorder_R1.fastq, out_hg19=SL118119_t4_hg19_gatkorder_R2.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R1.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
              Trial 2:
              Code:
              java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
              3) Thanks for clarifying the bam issue. I see that fastq is the recommended output.

              Thanks,
              Deepthi



              • Oh, it looks like the problem is the commas. There should not be any commas in the command line. BBTools is interpreting "SL118119_S5_L001_R2_001.fastq.gz," as a filename, and since it does not know what format "fastq.gz," is, it defaults to sam. I'll change that program to default to fastq.
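The mechanism can be sketched with a toy Python function. This is only a guess at the general idea (BBTools' real parser is Java and may differ): suffix-based format detection no longer recognizes the filename once a stray comma is appended, so it falls through to a default.

```python
# Hypothetical sketch of suffix-based format detection.
# Not BBTools' actual code; just illustrates the failure mode.
def guess_format(filename, default="sam"):
    name = filename.lower()
    if name.endswith(".gz"):
        name = name[:-3]  # strip the compression suffix first
    for ext, fmt in ((".fastq", "fastq"), (".fq", "fastq"),
                     (".fasta", "fasta"), (".fa", "fasta"),
                     (".sam", "sam")):
        if name.endswith(ext):
            return fmt
    return default  # unrecognized suffix: fall back to the default format

print(guess_format("reads.fastq.gz"))   # fastq
print(guess_format("reads.fastq.gz,"))  # sam -- the stray comma hides the suffix
```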



                • Hmm, I pasted that from the program's stderr/stdout log, so maybe that is how it was formatted there. This is how my command looks:
                  Code:
                  java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss
                  The only comma is between the ref files. Does this look okay to you?

                  Also, thanks for fixing the default input.



                  • Hmm, that command looks fine. Just to clarify: is that the specific command that produced sam output?



                    • Hi Brian,
                      The command below worked for me. It looks like naming the output files individually (e.g. out1_hg19=, out2_hg19=, out1_mm10=, etc.) doesn't work, but if I use the basename option, the fastqs look fine!

                      java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss



                      • Originally posted by moistplus
                        Is it possible to compute the insert size (and make a histogram of the distribution) from a .bam file?
                        I think so (use unsorted bam). If reads are spliced you will get some odd results. @Brian may have a suggestion later on how to handle that.

                        Make sure samtools is available in your $PATH.

                        Code:
                        reformat.sh in=your_file.bam ihist=hist.txt



                        • As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
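For the curious, the idea behind such a histogram can be sketched in a few lines of Python that tally the TLEN field (column 9 of a SAM record). This is only an illustration of the concept, not a replacement for reformat.sh's ihist output; the SAM records below are minimal made-up examples.

```python
from collections import Counter

def insert_size_histogram(sam_lines):
    """Tally absolute template lengths (TLEN) from SAM text, one per pair."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):          # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        tlen = int(fields[8])             # column 9: observed template length
        if tlen > 0:                      # count each pair once (leftmost mate)
            hist[tlen] += 1
    return hist

sam = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t50M\t=\t350\t300\tACGT\tFFFF",
    "r1\t147\tchr1\t350\t60\t50M\t=\t100\t-300\tACGT\tFFFF",
    "r2\t99\tchr1\t500\t60\t50M\t=\t760\t310\tACGT\tFFFF",
]
print(insert_size_histogram(sam))  # two pairs: insert sizes 300 and 310
```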



                          • Originally posted by Brian Bushnell
                            As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
                            Hi Brian and Geno, I didn't want to start a new thread, so I figured I'd quote and post to get your attention.

                            I started using BBDedupe and normalization in Geneious on merged paired reads. These reads are not yet assembled, but will be assembled de novo after Dedupe/normalization. I'm doing this because I have about 50,000 reads covering a 10 kb PCR amplicon, and I think the depth is causing assembly issues.

                            I wasn't able to find much information on Dedupe. Does it remove all 100%-matching reads that are encompassed in another read, thereby reducing my coverage to 1? If so, is there a way to keep coverage at, say, 50-100x instead?

                            I'm getting a few areas with low coverage (less than 10), and I generally require a minimum coverage of 15 to call a base in the consensus sequence; otherwise I call a gap. Coverage is usually >500-1000x everywhere, but occasionally some places drop below 10x. I don't think this is amplicon bias; rather, I think it is contaminating sequence. I use a core service for the Illumina sequencing, and nearly everyone who submits samples works on HIV, hence my need for a coverage threshold.

                            Thoughts on this?

                            Thanks,
                            Jake



                            • Hi Jake,

                              By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".

                              If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.

                              Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.
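The containment-versus-exact-duplicate distinction can be illustrated with a toy Python sketch. This is not Dedupe's implementation, just the concept: by default, reads fully contained in a longer read are absorbed; with absorb_containments=False (standing in for the ac=f flag), only exact duplicates are removed.

```python
def dedupe(reads, absorb_containments=True):
    """Toy dedup: drop exact duplicates; optionally also drop reads
    fully contained in an already-kept longer read."""
    kept = []
    # set() removes exact duplicates; process longest reads first so
    # shorter contained reads can be absorbed by them.
    for r in sorted(set(reads), key=len, reverse=True):
        if absorb_containments and any(r in k for k in kept):
            continue  # r is a substring of a kept read: absorb it
        kept.append(r)
    return kept

reads = ["ACGTACGTAC", "GTACGT", "ACGTACGTAC", "TTTT"]
print(dedupe(reads))                             # contained read absorbed
print(dedupe(reads, absorb_containments=False))  # only exact dupes removed
```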



                              • Originally posted by Brian Bushnell
                                Thanks Brian,

                                I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated, so it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters, requiring ~30 bp overlap with at most 1 mismatch, I can force contaminants to assemble into separate contigs. I can then extract a consensus sequence from each contig and filter so that only contigs greater than ~150 bp are used in a subsequent map to reference, which should remove any minor contaminants. Finally, I can extract the consensus from this alignment, which should represent the original PCR amplicon; any contaminants that made it into my contig should be lost, as they are outnumbered by the hundreds.

                                Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that SPAdes might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm nervous about investing the money, lest we never get the thing to work.

                                Are you available for hire as a consultant?

                                Thanks,
                                Jake
                                Last edited by JVGen; 09-07-2016, 11:34 AM.

