  • JeanLove
    Junior Member
    • Jan 2016
    • 5

    Originally posted by Brian Bushnell View Post
    If you know the sequence numbers, you can use getreads.sh:
I have tried it, but I get this Java error:

    Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
    at jgi.GetReads.<init>(GetReads.java:321)
    at jgi.GetReads.main(GetReads.java:37)

What could be the cause? The input file is a normal .fasta file with 22,000 reads from an Illumina MiSeq.


    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      Can you please post the exact command you used, and the entire screen output?


      • JeanLove
        Junior Member
        • Jan 2016
        • 5

        Originally posted by Brian Bushnell View Post
        Can you please post the exact command you used, and the entire screen output?
Code:
getreads.sh in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594

        java -ea -Xmx200m -cp /common/WORK/bbmap/current/ jgi.GetReads in=af_svi_0.05Perc_22633.fasta out=bla.fasta id=6272,19594
        Executing jgi.GetReads [in=af_svi_0.05Perc_22633.fasta, out=bla.fasta, id=6272,19594]

        Input is unpaired
        java.lang.AssertionError: Attempting to read from a closed file. Current header: MISEQ:70:000000000-A4FTJ:1:2111:9262:20597 1:N:0:
        at stream.FastaReadInputStream.nextBases(FastaReadInputStream.java:357)
        at stream.FastaReadInputStream.generateRead(FastaReadInputStream.java:237)
        at stream.FastaReadInputStream.fillList(FastaReadInputStream.java:210)
        at stream.FastaReadInputStream.nextList(FastaReadInputStream.java:121)
        at stream.ConcurrentGenericReadInputStream$ReadThread.readLists(ConcurrentGenericReadInputStream.java:656)
        at stream.ConcurrentGenericReadInputStream$ReadThread.run(ConcurrentGenericReadInputStream.java:635)
        Time: 0.167 seconds.
        Reads Processed: 19595 117.09k reads/sec
        Bases Processed: 4621k 27.62m bases/sec
        Exception in thread "main" java.lang.RuntimeException: GetReads terminated in an error state; the output may be corrupt.
        at jgi.GetReads.<init>(GetReads.java:321)
        at jgi.GetReads.main(GetReads.java:37)

I get the output file and it seems normal, but I don't know why I get this error.
        Last edited by JeanLove; 05-29-2016, 05:08 AM. Reason: forgot to write something


        • Canadian_philosophy
          Junior Member
          • Apr 2010
          • 9

          BBMap

1) I am interested in understanding how BBMap simultaneously aligns reads to multiple genomes. Is this different from using bwa to align to multiple genomes sequentially?

2) I used BBSplit to split my fastq into hg19 reads and mouse reads. The output was only in SAM format (although I tried to name the outputs as _1.fastq.gz and _2.fastq.gz). Is it possible to directly get fastq outputs after aligning to multiple genomes, with something like this:
          out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq

3) The SAM output headers for both the .mm10.sam and .hg19.sam files had both mouse and human chromosomes. I had to edit them manually for samtools to do post-processing. Is there a way around this?


          • Brian Bushnell
            Super Moderator
            • Jan 2014
            • 2709

            1) The difference is only in highly conserved regions. If you first map to mouse, then map the unmapped reads to human, human reads from regions with high identity to mouse will already have been removed, because they mapped to the mouse. When mapping to both genomes at once, the reads will go to the organism they actually came from because they (should) match that one better.

            2) Can you give your entire command line? I just tried it and verified that I got fastq output when I used this command:

            bbsplit.sh in=reads.fq basename=o_%.fq

3) Sorry, that's kind of inevitable. Actually, I do not recommend using sam output from BBSplit, only fastq output. Sam output will list ambiguously-mapped reads as mapped to only one organism if you are in ambig=all mode. For example, if a read maps to human and mouse equally well, it will be output in both sam files, but in each case it will be listed as mapped to the same location (which is either human or mouse).
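For readers following along, the fastq-output route recommended here can be sketched as follows (filenames are placeholders; `%` in `basename` is replaced by each reference's name and `#` by the read-pair number):

```shell
# Hypothetical paired-end BBSplit run that writes one fastq pair per reference.
# ambiguous2= controls reads that map equally well to both genomes
# (best, toss, split, or all).
bbsplit.sh in1=reads_R1.fq.gz in2=reads_R2.fq.gz \
    ref=hg19.fa,mm10.fa \
    basename=out_%_#.fq \
    ambiguous2=toss refstats=refstats.txt
# Expected outputs: out_hg19_1.fq, out_hg19_2.fq, out_mm10_1.fq, out_mm10_2.fq
```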


            • Canadian_philosophy
              Junior Member
              • Apr 2010
              • 9

              Hi Brian,

              Thanks for responding.

1) That is what I assumed; I wanted to hear it from someone else here.

2) These are the things I am trying with BBSplit:

              Trial 1 :
Code:
java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19=SL118119_t4_hg19_gatkorder_R1.fastq, out_hg19=SL118119_t4_hg19_gatkorder_R2.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R1.fastq, out_mm10=SL118119_t4_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
              Trial 2:
              Code:
              java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz, in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz, out_hg19_1=SL118119_t3_hg19_gatkorder_R1.fastq, out_hg19_2=SL118119_t3_hg19_gatkorder_R2.fastq, out_mm10_1=SL118119_t3_mm10_gatkorder_R1.fastq, out_mm10_2=SL118119_t3_mm10_gatkorder_R2.fastq, ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa, ambiguous2=toss
3) Thanks for clarifying the bam issue. I saw that fastq is the recommended output.

              Thanks,
              Deepthi


              • Brian Bushnell
                Super Moderator
                • Jan 2014
                • 2709

Oh, it looks like the problem is the commas. There should not be any commas in the command line. BBTools is interpreting "SL118119_S5_L001_R2_001.fastq.gz," as a filename, and since it does not know what format "fastq.gz," is, it defaults to sam. I'll change it to default to fastq for that program.
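In other words, BBTools arguments are whitespace-separated; the only place a comma belongs is inside a single argument's value, such as the `ref=` list. A comma-free sketch of the earlier Trial 2 command (paths abbreviated; note the thread later suggests the `basename=` form is the reliable way to get per-reference fastq output):

```shell
# Same idea as Trial 2, but without the separating commas (paths shortened).
# The comma inside ref= is correct: it separates the two reference fastas.
bbsplit.sh in1=SL118119_R1.fastq.gz in2=SL118119_R2.fastq.gz \
    qtrim=rl untrim=t trimq=6 maxindel=20 \
    ref=ucsc.hg19.fasta,mm10_gatkorder.fa \
    basename=out_%_#.fastq ambiguous2=toss
```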


                • Canadian_philosophy
                  Junior Member
                  • Apr 2010
                  • 9

Hmmm... I pasted that from the program's stderr/stdout file; maybe that is how it was formatted there. This is how my command looks:
                  Code:
                  java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss
The only comma is between the ref files. Does this look okay to you?

Also, thanks for fixing the default output format.


                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    Hmmm, that command looks fine. Just to clarify, is that the specific command that produced sam output?


                    • Canadian_philosophy
                      Junior Member
                      • Apr 2010
                      • 9

                      Hi Brian,
The command below worked for me. It looks like naming the output files does not work (e.g. out1_hg19=, out2_hg19=, out1_mm10=, etc.). If I use the basename option, the fastqs look fine!

Code:
java -Djava.library.path=/data/davelab/dr153/softwares/bbmap/jni/ -ea -Xmx56G -cp /data/davelab/dr153/softwares/bbmap/current/ align2.BBSplitter ow=t fastareadlen=500 minhits=1 minratio=0.56 maxindel=20 qtrim=rl untrim=t trimq=6 in1=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R1_001.fastq.gz in2=/data/davelab/projects/Xenomousie/fastq_link/SL118119_S5_L001_R2_001.fastq.gz interleaved=True basename=o5%_#.fq ref=/data/davelab/bix/resources/genomes/hg19/ucsc.hg19.fasta,/data/davelab/bix/resources/genomes/mm10/mm10_gatkorder.fa refstats=refstats.txt ambiguous2=toss


                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        Originally posted by moistplus
                        Is it possible to compute the insert size (and make a histogram distribution) from a .bam file ?
I think so (use an unsorted bam). If reads are spliced, you will get some odd results. @Brian may have a suggestion later on how to handle that.

                        Make sure samtools is available in your $PATH.

                        Code:
                        reformat.sh in=your_file.bam ihist=hist.txt


                        • Brian Bushnell
                          Super Moderator
                          • Jan 2014
                          • 2709

                          As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
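So if a bam has already been coordinate-sorted, one workable approach (assuming samtools is on the PATH; filenames hypothetical) is to name-sort it first and then run reformat.sh on the result:

```shell
# Name-sort a coordinate-sorted bam so read pairs are adjacent again,
# then compute the insert-size histogram.
samtools sort -n your_file.bam -o namesorted.bam
reformat.sh in=namesorted.bam ihist=hist.txt
```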


                          • JVGen
                            Member
                            • Jul 2016
                            • 41

                            Originally posted by Brian Bushnell View Post
                            As long as the file is unsorted or name-sorted, that command will work. Spliced reads shouldn't cause a problem, unless the splice junction is in the unsequenced area between the reads, of course.
                            Hi Brian and Geno, I didn't want to start a new thread, so I figured I'd quote and post to get your attention.

I started using Dedupe and normalization in Geneious on merged paired reads. These reads are not yet assembled, but will be assembled de novo after Dedupe/normalization. I'm doing this because I have about 50,000 reads covering a 10 kb PCR amplicon, and I think the depth is causing assembly issues.

I wasn't able to find much information on Dedupe. Does it remove all 100%-matching reads that are encompassed in another read, thereby reducing my coverage to 1? If so, is there a way to increase my coverage, to say 50-100x instead?

I'm getting a few areas with low coverage, less than 10x, and I generally require a minimum coverage of 15 to call a base in the consensus sequence; otherwise I call a gap. I generally have >500-1000x coverage everywhere, but occasionally some places have <10x. I don't think this is amplicon bias; rather, I think it is contaminating sequence. I use a core service for the Illumina sequencing, but nearly everyone who submits samples is working on HIV, hence my need for a coverage threshold.

                            Thoughts on this?

                            Thanks,
                            Jake


                            • Brian Bushnell
                              Super Moderator
                              • Jan 2014
                              • 2709

                              Hi Jake,

                              By default, Dedupe will remove all but one copy of reads that are fully contained within other reads. This may not be a good idea prior to assembly of merged reads (since many will be contained by others due to the variable insert size); just using normalization alone with a target of 100x may work better. You can set Dedupe to only remove reads that are identical to others, though, by adding the flag "ac=f".
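A minimal sketch of that advice, assuming merged reads in a hypothetical `merged.fq`:

```shell
# Normalize merged reads to a 100x target instead of removing containments.
bbnorm.sh in=merged.fq out=normalized.fq target=100

# If deduplication is still wanted, restrict Dedupe to exact duplicates only
# (ac=f disables absorption of contained reads).
dedupe.sh in=merged.fq out=deduped.fq ac=f
```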

                              If you suspect your data is contaminated, I suggest you try to decontaminate it prior to normalization/assembly. If you are working on non-HIV and it is contaminated with HIV, that's fairly straightforward; you can use BBSplit or Seal (if you have both references) or just BBMap (if you only have the HIV reference but nothing for your amplicon) to separate the reads by mapping. Alternately, if you are confident that all of the low-coverage stuff is contamination, you can assemble the low-coverage portion of your data using Tadpole, which allows coverage thresholds (e.g. "mincr=1 maxcr=15" will make it assemble only those contigs with coverage between 1x and 15x). Then use the resulting assembly to filter all of the reads via mapping, keeping only the unmapped reads.

                              Incidentally, Tadpole tends to do a good job assembling very-high-coverage short sequences.
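The coverage-based decontamination route described above might look like this (filenames hypothetical; thresholds taken from the post):

```shell
# 1) Assemble only the low-coverage fraction, which is suspected contamination.
tadpole.sh in=merged.fq out=lowcov_contigs.fa mincr=1 maxcr=15

# 2) Map all reads against those contigs and keep only the unmapped reads.
bbmap.sh in=merged.fq ref=lowcov_contigs.fa outu=clean.fq nodisk
```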


                              • JVGen
                                Member
                                • Jul 2016
                                • 41

Originally posted by Brian Bushnell View Post
                                Thanks Brian,

I took the de novo approach to try to eliminate contaminating sequences. We're all working on HIV, and it's highly mutated: it's hard to remove contaminants by DNA sequence alone. However, since I know the approximate size of my PCR amplicons (it's different in each case because of DNA deletions), I know what size my contig should be after de novo assembly. By using stringent assembly parameters that require ~30 bp overlap with maybe 1 mismatch, I can force contaminants to be assembled into a separate contig. I can then extract consensus sequences from each of these contigs and filter them so that only contigs greater than ~150 bp are used in a subsequent map to reference, which should remove any minor contaminants that were present. I can then extract the consensus from this alignment, which should represent the original PCR amplicon; any contaminants that made it into my contig should be lost, as they are outnumbered by the hundreds.

Does this workflow make sense for my application? I'm working on a desktop Mac, so I have limited options. I've been told that SPAdes might be a better assembler for me, but I think I'd need to purchase a server. With the little coding experience I have, I'm a bit nervous to invest the money, lest we never get the thing to work.

                                Are you available for hire as a consultant?

                                Thanks,
                                Jake
                                Last edited by JVGen; 09-07-2016, 11:34 AM.

