
  • Originally posted by santiagorevale View Post
    Hi there,

    Any hint on what I've previously asked?

    Thanks!
    Perhaps. But if you have more memory, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.

    Comment


    • Originally posted by GenoMax View Post
      Perhaps. But if you have more memory, why not allocate more and see if that helps? Unless you are being charged for every megabyte you use.
      Hi GenoMax,

      Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

      However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

      Thanks!

      Comment


      • Originally posted by santiagorevale View Post
        Hi GenoMax,

        Because I'm running this on a cluster, getting more memory means requesting more cores (slots), and processes requiring more cores take longer to be executed. Also, I was running this command alongside other commands requiring the same number of cores.

        However, isn't it weird for the script to require much more memory than the combined size of the uncompressed FastQ and IDs files?

        Thanks!
        While that is an odd restriction it is what it is when one is using shared compute resources.

        Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.

        Comment


        • Originally posted by GenoMax View Post
          While that is an odd restriction it is what it is when one is using shared compute resources.

          Just for kicks have you tried to run this on a local desktop that has a decent amount of RAM (16G)? Just keeping fastq headers in memory should not take a large amount of RAM as you speculate.
          I tried running it on a computer with 8 GB of RAM and on cluster nodes with -Xmx limits of 18 GB and 24 GB (the nodes have between 96 and 128 GB of memory).

          Earlier, I wasn't saying that keeping headers in memory takes a lot of RAM. I was trying to say that I couldn't understand why it ran out of memory with 24 GB: even if the program loaded both files (the FastQ and IDs files) entirely into memory (I don't currently know how the program works), that would add up to 17.1 GB. So even in that scenario it should not have run out of memory.

          I ran the command on 232 sets of files with -Xmx18G, with the following results:
          - Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded, 39 times

          Code:
          Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
                  at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
                  at java.util.HashMap.putVal(HashMap.java:641)
                  at java.util.HashMap.put(HashMap.java:611)
                  at java.util.HashSet.add(HashSet.java:219)
                  at shared.Tools.addNames(Tools.java:456)
                  at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
                  at driver.FilterReadsByName.main(FilterReadsByName.java:40)
          - Exception in thread "main" java.lang.OutOfMemoryError: Java heap space, 8 times

          Code:
          Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
                  at java.lang.StringCoding.decode(StringCoding.java:187)
                  at java.lang.StringCoding.decode(StringCoding.java:254)
                  at java.lang.String.<init>(String.java:546)
                  at java.lang.String.<init>(String.java:566)
                  at shared.Tools.addNames(Tools.java:456)
                  at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
                  at driver.FilterReadsByName.main(FilterReadsByName.java:40)
          - java.lang.OutOfMemoryError: GC overhead limit exceeded, 5 times

          Code:
          java.lang.OutOfMemoryError: GC overhead limit exceeded
                  at java.util.Arrays.copyOfRange(Arrays.java:3520)
                  at stream.KillSwitch.copyOfRange(KillSwitch.java:300)
                  at fileIO.ByteFile1.nextLine(ByteFile1.java:164)
                  at shared.Tools.addNames(Tools.java:454)
                  at driver.FilterReadsByName.<init>(FilterReadsByName.java:138)
                  at driver.FilterReadsByName.main(FilterReadsByName.java:40)
          
          This program ran out of memory.
          Try increasing the -Xmx flag and using tool-specific memory-related parameters.
          I couldn't identify a particular reason for each of the three different errors. What I can tell is that failure is driven by the number of reads kept: all of the processes that failed were trying to retain at least 56,881,244 paired-end reads. The largest run that did not fail retained 50,519,102 paired-end reads.
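The stack traces above show the names being added to a java.util.HashSet (backed by LinkedHashMap nodes), so each stored name costs considerably more than its raw bytes on disk. The following is only a rough back-of-the-envelope sketch: the per-object sizes are assumptions for a 64-bit JVM with compressed pointers and 2-byte chars, the 40-character name length is hypothetical, and real usage depends on the JVM version and settings.

```python
# Rough heap estimate for holding read names in a Java HashSet<String>.
# All byte sizes below are ASSUMPTIONS (64-bit JVM, compressed pointers,
# 2 bytes per char); they are not measured values.
ENTRY_BYTES = 40        # assumed size of one LinkedHashMap node
STRING_BYTES = 24       # assumed size of one String object header + fields
ARRAY_HEADER_BYTES = 16 # assumed header of the backing char array
TABLE_SLOT_BYTES = 8    # assumed share of the hash table per entry


def align8(n):
    """Round n up to the next multiple of 8 (object alignment)."""
    return (n + 7) // 8 * 8


def estimate_name_set_bytes(num_names, name_len, bytes_per_char=2):
    """Estimate heap bytes needed to keep num_names names in the set."""
    char_array = align8(ARRAY_HEADER_BYTES + bytes_per_char * name_len)
    per_name = ENTRY_BYTES + STRING_BYTES + char_array + TABLE_SLOT_BYTES
    return num_names * per_name


if __name__ == "__main__":
    # 56,881,244 is the read count from the post; 40 is a hypothetical length.
    total = estimate_name_set_bytes(56_881_244, 40)
    print(f"estimated heap for names: {total / 2**30:.1f} GiB")
```

Under these assumptions each name occupies several times its on-disk size, which is one plausible reason memory use can exceed the size of the input files. Garbage-collection overhead and temporary buffers would come on top of this figure.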

          One thing I realised could be causing the crash is that there is no way to limit the number of threads it uses, so it always uses all the available cores in the machine. Even if you launch it with the option "threads=1" (which is not currently listed as an option for "filterbyname"), you get the message "Set threads to 1" but it still uses all of them.

          I don't want you to make this a priority, because I managed to work around the problem. But I think it is worth checking. Also, I think limiting the threads should be possible for every command, because in most scenarios they will be run on shared servers/clusters.

          Thanks for your help!

          Comment


          • randomreads.sh for huge data

            Hello!

            I am trying to simulate the reads from many genomes using metagenomics mode of randomreads.
            The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. With 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?

            Comment


            • Originally posted by vzinche View Post
              Hello!

              I am trying to simulate the reads from many genomes using metagenomics mode of randomreads.
              The problem is that the more genomes I use, the worse the quality of the reads. For example, when I use 100 genomes, around 99% of the reads can be mapped back (using bbmap) to the original genomes. With 1000 genomes, however, I can map only around 30-40% of the generated reads. Is there a reasonable explanation for this?
              How similar are those 1000 genomes? What parameters are you using related to multi-mapping with BBMap? As the number of similar genomes increases, the number of reads that multi-map will go up as well. You could use "ambig=all" to allow reads to map to every location/genome, and that will likely raise the % of aligned reads, but you lose specificity at that point. Another thing you could do is generate longer reads, which will increase mapping specificity.

              Can you say what is the reason behind this exercise and what exact parameters you used for the randomreads.sh and bbmap.sh runs?

              Comment


              • Sorry, I didn't describe the problem well enough in the previous message.

                The mapping isn't the main goal and the main problem.
                I need to simulate a huge metagenomics dataset (1000 genomes) for further usage, but I need to carefully keep track of the positions of the reads on genomes.
                The dataset was simulated with the following parameters: len=250 paired coverage=5 metagenome=t simplenames snprate=0.02
                When I manually compared the genome sequence between the positions stated in the read header with the actual read sequence, most of the reads were too different (BLAST alignment of these sequences showed no similarity), though some matched perfectly. I checked only + strand reads for simplicity.
                That's why I had the idea to run BBMap to estimate the number of reads that can't even be mapped to the original genomes. I ran it with all default parameters and it could map only around 35% of the reads.

                But when I repeated the same procedure with 100 genomes (randomly sampled from these 1000), I couldn't find these 'messed up' reads and could map more than 99%.
                As the number of genomes increased, the percentage of mapped reads decreased.

                Genomes are not very closely related, and changing the number of genomes being used didn't really affect their similarity.

                Thus, my main concern is not the mapping itself, but the source of these 'messed up' reads.
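The manual check described in this post (comparing the genome sequence at the positions stated in the read header with the read itself) could be sketched roughly as below. This is an illustration only: it assumes the start position has already been parsed from the header and converted to a 0-based coordinate, and the function names are hypothetical, not part of BBTools.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    return seq.translate(table)[::-1]


def identity_at_position(genome, start, read, strand="+"):
    """Fraction of read bases matching the genome at a 0-based start.

    For '-' strand reads the read is reverse-complemented first.
    Returns 0.0 if the read is empty or runs past the genome end.
    """
    window = genome[start:start + len(read)]
    if not read or len(window) != len(read):
        return 0.0
    if strand == "-":
        read = reverse_complement(read)
    matches = sum(a == b for a, b in zip(window.upper(), read.upper()))
    return matches / len(read)


if __name__ == "__main__":
    genome = "ACGTACGTAC"  # toy sequence, not real data
    print(identity_at_position(genome, 2, "GTACG"))       # exact + strand read
    print(identity_at_position(genome, 2, "CGTAC", "-"))  # same locus, - strand
```

A read that truly originates from the stated position should score close to 1.0 apart from simulated variants, whereas a read with near-random identity would be one of the 'messed up' reads described above.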

                Comment


                • @Brian will likely have to weigh in on this (especially "positions stated in read header with the actual read sequence, for most of the reads they were too different ") but be aware that he has been behind on support of late.

                  A few things to check that I can think of:

                  1. If you are only going to check the + strands then perhaps you should have used the samestrand=t option when generating the reads.
                  2. Default value for BBMap is ambig=best. Can you try mapping with ambig=all to see if that improves alignments?
                  3. Do you know why the remaining reads are not mapping (are they chimeras)?

                  Comment


                  • I will try that, thank you.

                    And regarding the third question, that is actually a problem. I have no idea where these reads come from. I tried to search them or parts of them in the original genomes, but apparently with no success. Could be chimeras made up of short sequences, but I can't say for sure.

                    The first thought was that it could be some memory problem, since it gets worse when increasing the size of the initial file, but it's just a random idea.

                    Comment


                    • Have you looked through the logs and such to see if there is any indication of any issues? There is always the possibility that @Brian has not tested an extreme use case like this for randomreads.sh, and this may be a genuine bug that is clearly a roadblock.

                      Since you have said that 100 genomes seem to work fine you could do 10 runs of 100 genomes each and then perhaps merge the data. A thought.
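The batching idea above could be sketched as follows. This only shows how a list of genome files might be split into groups of 100; the file names are hypothetical, and running randomreads.sh on each batch and merging the outputs is left out.

```python
def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [items[i:i + size] for i in range(0, len(items), size)]


if __name__ == "__main__":
    # Hypothetical file names standing in for the 1000 genomes.
    genomes = [f"genome_{i:04d}.fa" for i in range(1, 1001)]
    for n, batch in enumerate(batches(genomes, 100), start=1):
        print(f"batch {n}: {len(batch)} genomes, first = {batch[0]}")
```

If the runs are merged afterwards, it would be worth checking that read names remain unique across batches.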

                      Comment


                      • BBSplit ambig=toss

                        Hi Brian et al.,
                        When I run BBsplit with ambig=toss, the ambiguous reads are not written to unmapped.fq; but when I run BBmap, they are. Is this the expected behavior? I'd like to be able to retrieve the ambiguous reads from BBsplit (both within/between two references).
                        Thanks,
                        MC

                        Comment


                        • summarizing mapped reads by orf

                          Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

                          While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

                          Thanks,
                          MCMC

                          Comment


                          • Originally posted by mcmc View Post
                            Is there a way to use BBtools to summarize reads mapped to a genome (using BBmap/BBsplit, in a sam file) by orf? I see that pileup.sh will take a prodigal-output fasta file with orf info, but I've got a genome downloaded from refseq with all the ncbi files (gff, cds fasta, gb). Can BBtools parse one of these to summarize my sam file by orf?

                            While I could map to the orfs.fna instead, I'm interested in intergenics too, e.g. for orf/RNA discovery.

                            Thanks,
                            MCMC
                            BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.
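To illustrate what such a per-ORF summary involves, here is a minimal sketch that assigns alignment start positions to ORF intervals and keeps an "intergenic" bucket. It is not how featureCounts works internally; it assumes non-overlapping ORFs with 1-based inclusive coordinates on a single sequence, and all names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter


def count_reads_by_orf(orfs, read_positions):
    """Count reads per ORF by leftmost alignment position.

    orfs: list of (name, start, end), 1-based inclusive, non-overlapping.
    read_positions: iterable of 1-based leftmost alignment positions.
    Reads falling outside every ORF are counted as 'intergenic'.
    """
    orfs = sorted(orfs, key=lambda o: o[1])
    starts = [o[1] for o in orfs]
    counts = Counter()
    for pos in read_positions:
        i = bisect_right(starts, pos) - 1  # last ORF starting at or before pos
        if i >= 0 and pos <= orfs[i][2]:
            counts[orfs[i][0]] += 1
        else:
            counts["intergenic"] += 1
    return counts


if __name__ == "__main__":
    toy_orfs = [("orfA", 10, 50), ("orfB", 100, 200)]  # toy coordinates
    print(count_reads_by_orf(toy_orfs, [5, 10, 50, 51, 150]))
```

In practice the intervals would come from the GFF file and the positions from the SAM file, and a dedicated tool also handles strand, overlapping features, and reads spanning feature boundaries.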

                            Comment


                            • Originally posted by GenoMax View Post
                              BBTools currently has no count utilities. They may be on the wish list since many have asked Brian. For now, your best bet is to use featureCounts.
                              Thanks! I'm surprised there's something BBTools doesn't do

                              Comment


                              • Ultimately, I'd like to do variant calling on a combined PacBio/Illumina whole viral genome dataset. I am working with BBMap right now because it has the intuitive minid flag, which seems desirable. As a first step, I'm trying to optimize my mapping as much as possible on one of the samples that is most divergent from the reference.

                                Here is my working command:
                                bbmap/mapPacBio.sh in=200335185_usedreads.fastq.gz ref=200303013.fa maxindel=40 minid=0.4 vslow k=8 out=200335185.sam overwrite=t bamscript=bs.sh; sh bs.sh

                                It optimizes the number of reads mapped (4148/4164) and minimizes the number of ambiguous mapping reads (1).

                                Given k=8 and minid=0.4, the Geneious mapper maps all 4164 reads for maxindel ranging from 20-500. If it is in the cards, I'd like to be able to map the remaining stragglers, but I don't know what other BBMap flags I should try. Also, I'm curious why BBMap is so much more sensitive to the value of maxindel. Here are selected BBMap results:
                                maxindel  num reads  num ambiguous
                                20        4145       2
                                40        4148       1
                                60        4137       1
                                100       4130       3
                                200       4125       4
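For comparison across settings, the table can be turned into mapped fractions of the 4164 input reads. This is just arithmetic on the numbers posted above; the variable and function names are illustrative.

```python
TOTAL_READS = 4164  # total reads, as stated in the post

# maxindel -> (reads mapped, ambiguous reads), copied from the table above
results = {
    20: (4145, 2),
    40: (4148, 1),
    60: (4137, 1),
    100: (4130, 3),
    200: (4125, 4),
}


def mapped_fraction(mapped, total=TOTAL_READS):
    """Fraction of all reads that were mapped."""
    return mapped / total


# Setting with the most mapped reads
best = max(results, key=lambda m: results[m][0])

if __name__ == "__main__":
    for maxindel, (mapped, ambiguous) in sorted(results.items()):
        frac = mapped_fraction(mapped)
        print(f"maxindel={maxindel:<4} mapped={frac:.2%} ambiguous={ambiguous}")
    print("most reads mapped at maxindel =", best)
```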

                                Comment
