Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dho
    Junior Member
    • Sep 2017
    • 5

    Hi Brian,

    I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

    The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

    Thanks for your help this week and I hope my solution helps others who encounter the same issue!

    dave

    Comment

    • Brian Bushnell
      Super Moderator
      • Jan 2014
      • 2709

      Originally posted by dho View Post
      Hi Brian,

      I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

      The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

      Thanks for your help this week and I hope my solution helps others who encounter the same issue!

      dave
      Hi Dave,

      Thanks for the followup. I apologize for not getting back to you in a timely fashion, I'm pretty swamped currently!

      Comment

      • dho
        Junior Member
        • Sep 2017
        • 5

        Hi Brian,

        No problem! Thanks for all that you do.

        For posterity, the solution I proposed above worked for the most part, but I encountered a few cases where it didn't. Since then, I've decided to be fully exhaustive and map against each reference sequence individually. This is computationally intensive but gives the most robust results. Even with these settings, though, reads mapping to very small reference sequences (<50bp) doesn't seem to work as consistently as I'd like, though I haven't figured out the source of this inconsistency.

        --dave

        Comment

        • gokhulkrishnakilaru
          Member
          • Jul 2011
          • 39

          Adding Unique Identifier To Paired End Reads. Editing FASTQ Read ID based on randomer

          Hi Brian-

          We have a paired end rnaseq data.

          For every sequence in read2, we want to extract the first 6nucleotides and append them to the read id in read2 and also to the respective read id in read1.

          Please see example below.

          Code:
          [I]cat read1[/I]
          @NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
          CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
          +
          AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
          @NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
          CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
          +
          AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
          @NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
          ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
          +
          AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
          @NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
          CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
          +
          AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E
          
          
          [I]cat read2[/I]
          @NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
          GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
          +
          AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
          @NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
          TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
          +
          AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
          @NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
          TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
          +
          /AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
          @NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
          TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
          +
          AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE




          Desired Output


          Code:
          [I]cat read1[/I]
          @[COLOR="Red"]GCTGGG[/COLOR]_NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
          CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
          +
          AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
          @TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
          CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
          +
          AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
          @TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
          ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
          +
          AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
          @TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
          CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
          +
          AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E
          
          
          [I]cat read2[/I]
          @[COLOR="Red"]GCTGGG[/COLOR]_NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
          GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
          +
          AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
          @TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
          TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
          +
          AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
          @TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
          TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
          +
          /AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
          @TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
          TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
          +
          AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE
          Do you know if the BB suite can achieve this? I tried fastx toolkit but it prints the full sequence instead of a partial sequence.

          Comment

          • santiagorevale
            Member
            • Dec 2016
            • 18

            Hi Brian,

            I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

            The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

            When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

            Here is the error:

            java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
            Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.util.Arrays.copyOfRange(Arrays.java:3664)
            at java.lang.String.<init>(String.java:207)
            at java.lang.String.toLowerCase(String.java:2647)
            at java.lang.String.toLowerCase(String.java:2670)
            at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
            at driver.FilterReadsByName.main(FilterReadsByName.java:40)

            Thank you very much in advance.

            Best regards,
            Santiago

            Comment

            • bio_d
              Junior Member
              • Oct 2017
              • 6

              problem using bbmap

              Hi Brian,

              I am trying a Denovo assembly of a reptilian genome (size comparable to human genome) and have the following Illumina sequence libraries (paired-end, mate-pair and long mate-pair with estimated insert sizes 200bp, 5.2kb and 10kb respectively.).

              After quality and adapter trimming (using FASTQc toolkit), I have used bbmap.sh to align the reads to a reference (assembly of contigs from CLC workbench). However, I am unable to generate bam file (using samtools) from the sam file generated using bbmap. I have used "rcomp=t" and "rcs=f" flag for both the mate pair and long mate pair libraries. I had earlier tried with "rcomp=t" flag only but that didn't help either.

              Please help.

              Best,
              D

              SNIPPET submit script:

              srun ./bbmap/bbmap.sh rcomp=t rcs=f in1=mate_pair_1.fq.gz in2=mate_pair_2.fq.gz outu=mate_pair_u.sam outm=mate_pair_m.sam
              srun ./samtools-1.6/samtools view -b -o mate_pair_m.bam mate_pair_m.sam
              srun ./samtools-1.6/samtools view -b -o mate_pair_u.bam mate_pair_u.sam

              snippet Error:

              [E::sam_parse1] SEQ and QUAL are of different length
              [W::sam_read1] Parse error at line 2051364
              [main_samview] truncated file.
              [E::sam_parse1] SEQ and QUAL are of different length
              [W::sam_read1] Parse error at line 2051364
              [main_samview] truncated file.
              srun: error: cluster01: task 0: Exited with exit code 1
              srun: error: cluster01: task 1: Exited with exit code 1

              Comment

              • GenoMax
                Senior Member
                • Feb 2008
                • 7142

                Can you check what the lines in question look like? Use "zcat mate_pair_1.fq.gz | sed -n '2051364,2051366p' filename" to extract those lines.

                Comment

                • bio_d
                  Junior Member
                  • Oct 2017
                  • 6

                  Hi,

                  I got this with the command you suggested.

                  BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
                  @HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
                  CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC

                  Best,
                  D

                  Comment

                  • GenoMax
                    Senior Member
                    • Feb 2008
                    • 7142

                    Compare the sequence and quality lengths of record that the first line belongs to, so let us try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p" instead.

                    If the lengths of sequence and quality values lines are different then you have a malformed fastq record. You could delete that record from both R1/R2 files as a work around. Was this data trimmed? Do you have the original files?

                    Comment

                    • bio_d
                      Junior Member
                      • Oct 2017
                      • 6

                      Hi,

                      zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'


                      TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
                      +
                      BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F

                      I checked the lengths of the sequence and quality values it is the same 125. Yes, the fastq files used were trimmed using trim galore toolkit and quality checked using fastqc toolkit. I do have the raw data (illumina) as well.

                      Is it because trimming was done by the above mentioned tools and not using bbduk.sh ? Can't really understand what is going wrong.

                      Best,
                      D

                      Comment

                      • GenoMax
                        Senior Member
                        • Feb 2008
                        • 7142

                        Can you check the R2 file by the same method?

                        Comment

                        • bio_d
                          Junior Member
                          • Oct 2017
                          • 6

                          Hi,

                          There seems to be nothing wrong with the second read too

                          zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

                          @HWI-D00294:282:CAB4VANXX:7:1101:13812:601352:N:0:GCCAAT
                          TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
                          +
                          @BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG
                          Last edited by bio_d; 10-23-2017, 08:34 PM.

                          Comment

                          • santiagorevale
                            Member
                            • Dec 2016
                            • 18

                            Originally posted by santiagorevale View Post
                            Hi Brian,

                            I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

                            The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

                            When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

                            Here is the error:

                            java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
                            Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

                            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
                            at java.util.Arrays.copyOfRange(Arrays.java:3664)
                            at java.lang.String.<init>(String.java:207)
                            at java.lang.String.toLowerCase(String.java:2647)
                            at java.lang.String.toLowerCase(String.java:2670)
                            at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
                            at driver.FilterReadsByName.main(FilterReadsByName.java:40)

                            Thank you very much in advance.

                            Best regards,
                            Santiago
                            Hi there,

                            Any hint on what I've previously asked?

                            Thanks!

                            Comment

                            • TomHarrop
                              Member
                              • Jul 2014
                              • 20

                              Hi Brian & others,

                              I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

                              Code:
                              Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
                                  at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
                                  at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
                                  at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
                                  at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)
                              I'm wondering if I used a bad combination of parameters. Here's the call:

                              Code:
                              java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt
                              Otherwise, is it supported to use bbnorm with larger kmer sizes, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?

                              I've posted the log on pastebin: https://pastebin.com/jPkKagFs

                              Thanks again for the bbmap suite!

                              Tom

                              Comment

                              • boulund
                                Member
                                • Jan 2017
                                • 18

                                Hi, just want to make sure I'm not missing anything here, but randomreads.sh cannot produce metagenomes according to a specific profile, right? I only find information about it drawing random numbers from an exponential distribution for each reference sequence and thus produces a simulated metagenome from a set of reference sequences, which of course is awesome, but right now I would like to produce a simulated metagenome with a very specific composition.

                                Comment

                                Latest Articles

                                Collapse

                                • SEQadmin2
                                  From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                                  by SEQadmin2


                                  Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                                  The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                                  ...
                                  06-02-2026, 10:05 AM
                                • SEQadmin2
                                  Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                                  by SEQadmin2


                                  With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                                  Introduction

                                  Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                                  05-22-2026, 06:42 AM
                                • SEQadmin2
                                  Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                                  by SEQadmin2

                                  Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                                  Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                                  05-06-2026, 09:04 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by SEQadmin2, Today, 08:59 AM
                                0 responses
                                10 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 12:03 PM
                                0 responses
                                21 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 06-02-2026, 11:40 AM
                                0 responses
                                17 views
                                0 reactions
                                Last Post SEQadmin2  
                                Started by SEQadmin2, 05-28-2026, 11:40 AM
                                0 responses
                                31 views
                                0 reactions
                                Last Post SEQadmin2  
                                Working...