
  • vzinche
    replied
    Sorry, I didn't describe the problem well enough in the previous message.

    Mapping is neither the main goal nor the main problem.
    I need to simulate a huge metagenomics dataset (1000 genomes) for further use, but I need to carefully keep track of the positions of the reads on the genomes.
    The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
    When I manually compared the genome sequence between the positions stated in a read's header with the actual read sequence, most of the reads were too different (a BLAST alignment of these sequences showed no similarity), though some matched perfectly. I checked only + strand reads for simplicity.
    That's why I had the idea of running BBMap to estimate the number of reads that can't even be mapped to the original genomes. I ran it with all default parameters, and it could map only around 35% of the reads.

    But when I repeated the same with 100 genomes (randomly sampled from these 1000), I couldn't find these 'messed up' reads and could map more than 99%.
    As the number of genomes increased, the percentage of mapped reads decreased.

    Genomes are not very closely related, and changing the number of genomes being used didn't really affect their similarity.

    Thus, my main concern is not the mapping itself, but the source of these 'messed up' reads.
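
    The manual check described above (read sequence versus the genome interval named in the read header) can be sketched in a few lines of Python. This is only an illustration: how the coordinates are parsed from the header, whether they are 0-based or 1-based, and the 5% mismatch tolerance are all assumptions rather than documented randomreads.sh behaviour, and `read_matches_origin` is a hypothetical helper name. The comparison is position-by-position, so simulated indels would show up as many mismatches.

```python
# Sketch: compare a simulated read with the genome interval named in its header.
# Coordinate convention and mismatch tolerance are assumptions, not tool facts.

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (A/C/G/T/N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of differing positions; a length difference counts as mismatches."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    diffs = sum(x != y for x, y in zip(a.upper(), b.upper()))
    return (diffs + abs(len(a) - len(b))) / longest


def read_matches_origin(genome, start, stop, read, minus_strand=False,
                        one_based=False, max_mismatch=0.05):
    """True if the read agrees with genome[start..stop] (inclusive) within max_mismatch."""
    if one_based:
        start -= 1   # 1-based inclusive start -> 0-based; stop becomes the exclusive end
    else:
        stop += 1    # 0-based inclusive stop -> exclusive end
    ref = genome[start:stop]
    if minus_strand:
        ref = reverse_complement(ref)
    return mismatch_fraction(ref, read) <= max_mismatch


genome = "AAACCCGGGTTT"
print(read_matches_origin(genome, 3, 8, "CCCGGG"))   # same interval, 0-based
print(read_matches_origin(genome, 4, 9, "CCCGGG"))   # shifted by one base
```

    Running the same check with both coordinate conventions is a cheap way to rule out an off-by-one before concluding that the reads themselves are wrong.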


  • GenoMax
    replied
    Originally posted by vzinche View Post
    Hello!

    I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
    The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?
    How similar are those 1000 genomes? What parameters related to multi-mapping are you using with BBMap? As the number of similar genomes increases, the number of reads that multi-map will go up as well. You could use "ambig=all" to allow reads to map to every location/genome, and that will likely take the % of aligned reads up. But you are losing specificity at that point. Another thing you could do is generate longer reads, which will increase mapping specificity.

    Can you say what is the reason behind this exercise and what exact parameters you used for the randomreads.sh and bbmap.sh runs?


  • vzinche
    replied
    randomreads.sh for huge data

    Hello!

    I am trying to simulate reads from many genomes using the metagenome mode of randomreads.
    The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes, but with 1000 genomes I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?


  • santiagorevale
    replied
    Originally posted by GenoMax View Post
    While that is an odd restriction it is what it is when one is using shared compute resources.

    Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.
    I tried running it on a computer with 8 GB of RAM and on cluster nodes using an -Xmx limit of 18 GB and 24 GB (the nodes have between 96 and 128 GB of memory).

    I wasn't saying earlier that keeping headers in memory takes lots of RAM. I was trying to say that I couldn't understand why it ran out of memory with 24 GB: even if the program loaded both files (the FastQ and the IDs file) into memory (I don't currently know how the program works), that would add up to 17.1 GB. So even in that scenario it should not have run out of memory.
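
    As a back-of-the-envelope illustration of why file size can underestimate heap use, here is a rough sketch of what a Java hash set of read names might cost per entry. Every object size below (string header, array header, 2 bytes per character, linked hash map entry, table slot) is an assumed, typical 64-bit JVM figure and not something measured from BBTools; the name count and IDs file size are taken from the numbers quoted in this thread, read as binary units and assuming one ID per retained read.

```python
# Rough heap estimate for read names stored as Java Strings in a hash set.
# All per-object sizes are assumptions (typical 64-bit JVM), not measurements.

def estimate_name_set_bytes(n_names, avg_name_len, bytes_per_char=2,
                            string_header=24, array_header=16,
                            map_entry=40, table_slot=8):
    """Approximate bytes needed to hold n_names strings in a linked hash set."""
    array_bytes = array_header + bytes_per_char * avg_name_len
    array_bytes = (array_bytes + 7) // 8 * 8   # objects are usually 8-byte aligned
    per_name = string_header + array_bytes + map_entry + table_slot
    return n_names * per_name


n_names = 85 * 2**20                 # "85 Mi" reads to keep
ids_file_bytes = 3.1 * 2**30         # "3.1G" IDs file
avg_len = round(ids_file_bytes / n_names) - 1   # average ID length without the newline

total_bytes = estimate_name_set_bytes(n_names, avg_len)
print(f"average ID length: {avg_len} characters")
print(f"estimated heap for the name set: {total_bytes / 2**30:.1f} GiB")
```

    Under these assumptions the name set alone comes to roughly 13 GiB, about four times the size of the IDs file, before any temporary copies of the strings or garbage-collector headroom are counted.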

    I ran the command on 232 sets of files with -Xmx18G, with the following results:
    - Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times

    Code:
    Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
            at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
            at java.util.HashMap.putVal(HashMap.java:641)
            at java.util.HashMap.put(HashMap.java:611)
            at java.util.HashSet.add(HashSet.java:219)
            at shared.Tools.addNames(Tools.java:456)
            at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
            at driver.FilterReadsByName.main(FilterReadsByName.java:40)
    - Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times

    Code:
    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.lang.StringCoding.decode(StringCoding.java:187)
            at java.lang.StringCoding.decode(StringCoding.java:254)
            at java.lang.String.<init>(String.java:546)
            at java.lang.String.<init>(String.java:566)
            at shared.Tools.addNames(Tools.java:456)
            at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
            at driver.FilterReadsByName.main(FilterReadsByName.java:40)
    - java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times

    Code:
    java.lang.OutOfMemoryError: GC overhead limit exceeded
            at java.util.Arrays.copyOfRange(Arrays.java:3520)
            at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
            at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
            at shared.Tools.addNames(Tools.java:454)
            at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
            at driver.FilterReadsByName.main(FilterReadsByName.java:40)
    
    This program ran out of memory.
    Try increasing the -Xmx flag and using tool-specific memory-related parameters.
    I couldn't identify a particular reason for each of the three different errors. What I can tell is that the failures are driven by the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads. The first one that did not fail was retaining 50,519,102 paired-end reads.

    One thing I realised could be causing the crash is that there is no way of limiting the threads it uses, so it always uses all the available cores in the machine. Even if you launch it with the option "threads=1" (which is currently not listed as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.

    I don't want you to make this a priority because I managed to work around the problem. But I think it is something worth checking. Also, I think limiting the threads should be possible for every command, because in most scenarios they will be run on shared servers/clusters.

    Thanks for your help!


  • GenoMax
    replied
    Originally posted by santiagorevale View Post
    Hi GenoMax,

    Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

    However, isn't it weird for the script to require much more memory than the size of both uncompressed FastQ plus IDs files together?

    Thanks!
    While that is an odd restriction it is what it is when one is using shared compute resources.

    Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.


  • santiagorevale
    replied
    Originally posted by GenoMax View Post
    Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use
    Hi GenoMax,

    Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

    However, isn't it weird for the script to require much more memory than the size of both uncompressed FastQ plus IDs files together?

    Thanks!


  • GenoMax
    replied
    Originally posted by santiagorevale View Post
    Hi there,

    Any hint on what I've previously asked?

    Thanks!
    Perhaps. But if you have more memory why not allocate more and see if that helps. Unless you are being charged by every megabyte you use


  • boulund
    replied
    Hi, I just want to make sure I'm not missing anything here, but randomreads.sh cannot produce metagenomes according to a specific profile, right? I can only find information about it drawing a random number from an exponential distribution for each reference sequence and thereby producing a simulated metagenome from a set of reference sequences, which of course is awesome, but right now I would like to produce a simulated metagenome with a very specific composition.
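
    To make the difference between the two approaches concrete, here is a small Python sketch. It is not randomreads.sh code: `exponential_profile` only mimics the behaviour described above (one exponential draw per reference, normalised), and `reads_per_genome` is a hypothetical helper that turns a user-specified profile into read counts, assuming the profile expresses relative coverage, so that reads scale with abundance times genome length.

```python
import random


def exponential_profile(genome_names, seed=0):
    """Random abundances: one exponential draw per genome, normalised to sum to 1."""
    rng = random.Random(seed)
    weights = {name: rng.expovariate(1.0) for name in genome_names}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


def reads_per_genome(profile, genome_lengths, total_reads):
    """Split total_reads so per-genome coverage is proportional to the profile."""
    # Reads needed for a given coverage scale with abundance * genome length.
    mass = {g: profile[g] * genome_lengths[g] for g in profile}
    total = sum(mass.values())
    return {g: round(total_reads * m / total) for g, m in mass.items()}


# A fixed, user-chosen composition instead of a random one (made-up numbers).
profile = {"genome_a": 0.5, "genome_b": 0.5}
lengths = {"genome_a": 1000, "genome_b": 3000}
print(reads_per_genome(profile, lengths, total_reads=400))
```

    With per-genome read counts like these, one possible workaround would be to simulate each reference separately and concatenate the results, at the cost of running the simulator once per genome.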


  • TomHarrop
    replied
    Hi Brian & others,

    I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

    Code:
    Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
        at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
        at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
        at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
        at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)
    I'm wondering if I used a bad combination of parameters. Here's the call:

    Code:
    java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt
    Otherwise, is it supported to use bbnorm with larger kmer sizes, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?

    I've posted the log on pastebin: https://pastebin.com/jPkKagFs

    Thanks again for the bbmap suite!

    Tom


  • santiagorevale
    replied
    Originally posted by santiagorevale View Post
    Hi Brian,

    I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

    The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

    When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't that an excessive use of memory for just filtering a FastQ file? Also, among the script arguments there is no "threads" option, yet the script uses all available cores. Is there any way of limiting both memory and thread usage?

    Here is the error:

    java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
    Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

    Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOfRange(Arrays.java:3664)
    at java.lang.String.<init>(String.java:207)
    at java.lang.String.toLowerCase(String.java:2647)
    at java.lang.String.toLowerCase(String.java:2670)
    at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
    at driver.FilterReadsByName.main(FilterReadsByName.java:40)

    Thank you very much in advance.

    Best regards,
    Santiago
    Hi there,

    Any hint on what I've previously asked?

    Thanks!


  • bio_d
    replied
    Hi,

    There seems to be nothing wrong with the second read either.

    zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

    @HWI-D00294:282:CAB4VANXX:7:1101:13812:601352:N:0:GCCAAT
    TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
    +
    @BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG
    Last edited by bio_d; 10-23-2017, 08:34 PM.


  • GenoMax
    replied
    Can you check the R2 file by the same method?


  • bio_d
    replied
    Hi,

    zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'


    TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
    +
    BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F

    I checked the lengths of the sequence and quality lines; both are 125. Yes, the fastq files used were trimmed with the Trim Galore toolkit and quality-checked with FastQC. I do have the raw (Illumina) data as well.

    Is it because the trimming was done with the tools mentioned above and not with bbduk.sh? I can't really understand what is going wrong.

    Best,
    D


  • GenoMax
    replied
    Compare the sequence and quality lengths of the record that the first line belongs to, so let us try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'" instead.

    If the lengths of the sequence and quality lines are different, then you have a malformed fastq record. You could delete that record from both R1/R2 files as a workaround. Was this data trimmed? Do you have the original files?
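
    Rather than inspecting one record at a time with sed, the same length check can be run over a whole file. The following is a minimal Python sketch, assuming a plain 4-line-per-record FASTQ (no wrapped sequence lines); `find_malformed_records` is a hypothetical helper, not part of BBTools.

```python
import gzip


def find_malformed_records(path, limit=10):
    """Return (first_line_number, reason) for malformed 4-line FASTQ records."""
    problems = []
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        line_no = 1                      # 1-based line number of the record's header
        while True:
            header = handle.readline()
            if not header:               # end of file
                break
            seq = handle.readline().rstrip("\r\n")
            plus = handle.readline().rstrip("\r\n")
            qual = handle.readline().rstrip("\r\n")
            if not header.startswith("@"):
                problems.append((line_no, "header does not start with '@'"))
            elif not plus.startswith("+"):
                problems.append((line_no, "third line does not start with '+'"))
            elif len(seq) != len(qual):
                problems.append((line_no, "sequence and quality lengths differ"))
            if len(problems) >= limit:   # stop early; later records may just be shifted
                break
            line_no += 4
    return problems
```

    Calling `find_malformed_records("mate_pair_1.fq.gz")` returns line numbers that can be passed straight to sed for a closer look. Because the scan assumes four lines per record, a single missing or extra line makes every following record look broken, so the first reported problem is the one to inspect.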


  • bio_d
    replied
    Hi,

    I got this with the command you suggested.

    BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
    @HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
    CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC

    Best,
    D
