Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Hi Brian,

    I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

    The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

    Thanks for your help this week and I hope my solution helps others who encounter the same issue!

    dave

    Comment


    • Originally posted by dho View Post
      Hi Brian,

      I figured out the problem and a workaround.The problem is that even with your recommended settings reads that extended beyond the ends of both short and long reference sequences would only be mapped to the longer reference sequences.

      The workaround is to avoid having longer and shorter reference sequences as mapping targets at the same time. I subdivided by reference sequences into multiple reference sequences that each contain sequences of the same size and then map against each of these individually. I can then merge the output from all of these mappings.

      Thanks for your help this week and I hope my solution helps others who encounter the same issue!

      dave
      Hi Dave,

      Thanks for the followup. I apologize for not getting back to you in a timely fashion, I'm pretty swamped currently!

      Comment


      • Hi Brian,

        No problem! Thanks for all that you do.

        For posterity, the solution I proposed above worked for the most part, but I encountered a few cases where it didn't. Since then, I've decided to be fully exhaustive and map against each reference sequence individually. This is computationally intensive but gives the most robust results. Even with these settings, though, reads mapping to very small reference sequences (<50bp) doesn't seem to work as consistently as I'd like, though I haven't figured out the source of this inconsistency.

        --dave

        Comment


        • Adding Unique Identifier To Paired End Reads. Editing FASTQ Read ID based on randomer

          Hi Brian-

          We have a paired end rnaseq data.

          For every sequence in read2, we want to extract the first 6nucleotides and append them to the read id in read2 and also to the respective read id in read1.

          Please see example below.

          Code:
          [I]cat read1[/I]
          @NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
          CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
          +
          AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
          @NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
          CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
          +
          AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
          @NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
          ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
          +
          AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
          @NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
          CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
          +
          AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E
          
          
          [I]cat read2[/I]
          @NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
          GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
          +
          AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
          @NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
          TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
          +
          AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
          @NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
          TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
          +
          /AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
          @NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
          TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
          +
          AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE




          Desired Output


          Code:
          [I]cat read1[/I]
          @[COLOR="Red"]GCTGGG[/COLOR]_NB501293:231:HV3CTBGX2:1:11101:2280:1047 1:N:0:CAGATC
          CCGCANGTTGCAGAGCGGGTGGGAGCCNCTNCGGGCGCGGCACTGNAGCCCTGANACTGAACCCCGAACCCGAGCC
          +
          AAAAA#EEEEEEE/EEEEEEEEE6E//#<E#/EEE/E/EAEAEEE#/EA/AEEE#/EAAAEE/EEEAEEE/EE//6
          @TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 1:N:0:CAGATC
          CTCAANGGGAGAGACCTTAGATGATACNCANGATGACAGTAGGTANAGGGAACTTATAGAGCCACCTCCATCAGGA
          +
          AAAAA#EEEEEEEEEAEEEEEEEEEEE#EE#EEEEEEEEEEEEEE#EEEEEEEEEEEEEEEEEAEEEEEEEEEEEE
          @TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 1:N:0:CAGATC
          ACGGANCTCTGGCTGTTGTATGGAAAGNTANGCTGTAACACGCACNGACAGAAGAGAGCCATTTTCTCCCTGAACT
          +
          AA/AA#EEEEEAEAEAEEEEE/EEEEE#E<#6EEAEEAAAE///E#EEAEE//A/</EE</EE//E6E6EEEEEE6
          @TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 1:N:0:CAGATC
          CCTGGNAGCCGCCGCAAGCGCCGGACCNCANGCACTCCCAGGCGCGCGCGCTTCTTCTGCAAAAAGTTGAGGGCTC
          +
          AAAAA#EEAEEEEEE6EEEEEEEEAE/#EE#EE/EE//E/EEAEEEEEEEEAEEEEEEEE/EEEEEEEEEEEEE/E
          
          
          [I]cat read2[/I]
          @[COLOR="Red"]GCTGGG[/COLOR]_NB501293:231:HV3CTBGX2:1:11101:2280:1047 2:N:0:CAGATC
          GCTGGGCGAGTAGCTTCTGGATCCTGGCCTCCTGAGCCTGTGGCCCGGGCTAGGCTCGGGGCTCGGGTTCGGGGTT
          +
          AAAAAEEEEEEE/AEEAEEEEEEEE/AEEEAAAE/EEEEEAEAEEEE6EEEAEEEEEEEEEEE/EAEAAE6EEE<A
          @TGACTG_NB501293:231:HV3CTBGX2:1:11101:9866:1047 2:N:0:CAGATC
          TGACTGTGGGGTGGCAACCCCATTCCTCACTTGATGTCCTGTCTTCCTGATGGAGGTGGCTCTATAAGTTCCCTCT
          +
          AAAAAEEEEEEEEEE/AEEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE
          @TCGCAT_NB501293:231:HV3CTBGX2:1:11101:24301:1048 2:N:0:CAGATC
          TCGCATCATCCTACACAGCACGGACGCTAGATGACAGGACGTGCCATGACAGTCTAAGTTCAGGGAGAAAATGGCT
          +
          /AAAAEE//EEEEEEEEEEAEAE/EEEEEE/EA/EEA/AEEEEEEE/E/EEA/EEEEE/EEEEEEEEAEEEEA/EE
          @TGACCA_NB501293:231:HV3CTBGX2:1:11101:16754:1048 2:N:0:CAGATC
          TGACCAGCCATTGGCTGGTGGGAGTAGTGATGTCACCCATATGACACCCTGATAACGAGTTGAGAGAGAGCCCTCA
          +
          AA/AAEEEEEAEEEEEEEEEAEEEEEEEEAEEEEEA/EEEEE</</EAEEEEE<EEEAEEEEEEEEAA/EE<6/EE
          Do you know if the BB suite can achieve this? I tried fastx toolkit but it prints the full sequence instead of a partial sequence.

          Comment


          • Hi Brian,

            I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

            The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

            When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

            Here is the error:

            java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
            Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
            at java.util.Arrays.copyOfRange(Arrays.java:3664)
            at java.lang.String.<init>(String.java:207)
            at java.lang.String.toLowerCase(String.java:2647)
            at java.lang.String.toLowerCase(String.java:2670)
            at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
            at driver.FilterReadsByName.main(FilterReadsByName.java:40)

            Thank you very much in advance.

            Best regards,
            Santiago

            Comment


            • problem using bbmap

              Hi Brian,

              I am trying a Denovo assembly of a reptilian genome (size comparable to human genome) and have the following Illumina sequence libraries (paired-end, mate-pair and long mate-pair with estimated insert sizes 200bp, 5.2kb and 10kb respectively.).

              After quality and adapter trimming (using FASTQc toolkit), I have used bbmap.sh to align the reads to a reference (assembly of contigs from CLC workbench). However, I am unable to generate bam file (using samtools) from the sam file generated using bbmap. I have used "rcomp=t" and "rcs=f" flag for both the mate pair and long mate pair libraries. I had earlier tried with "rcomp=t" flag only but that didn't help either.

              Please help.

              Best,
              D

              SNIPPET submit script:

              srun ./bbmap/bbmap.sh rcomp=t rcs=f in1=mate_pair_1.fq.gz in2=mate_pair_2.fq.gz outu=mate_pair_u.sam outm=mate_pair_m.sam
              srun ./samtools-1.6/samtools view -b -o mate_pair_m.bam mate_pair_m.sam
              srun ./samtools-1.6/samtools view -b -o mate_pair_u.bam mate_pair_u.sam

              snippet Error:

              [E::sam_parse1] SEQ and QUAL are of different length
              [W::sam_read1] Parse error at line 2051364
              [main_samview] truncated file.
              [E::sam_parse1] SEQ and QUAL are of different length
              [W::sam_read1] Parse error at line 2051364
              [main_samview] truncated file.
              srun: error: cluster01: task 0: Exited with exit code 1
              srun: error: cluster01: task 1: Exited with exit code 1

              Comment


              • Can you check what the lines in question look like? Use "zcat mate_pair_1.fq.gz | sed -n '2051364,2051366p' filename" to extract those lines.

                Comment


                • Hi,

                  I got this with the command you suggested.

                  BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F
                  @HWI-D00294:282:CAB4VANXX:7:1101:13889:60146 1:N:0:GCCAAT
                  CCAATGGGGAATGGCGAAAGCACTGCTCAGCATTTCTGGCTCTGCCTGAGGCTGGAATGCAGAAAACCCTGCAGTAGAGGGGGATCTTCTCTTTGGGGTGCTCCTCGTGCCTCCCCCTTACTGC

                  Best,
                  D

                  Comment


                  • Compare the sequence and quality lengths of record that the first line belongs to, so let us try "zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p" instead.

                    If the lengths of sequence and quality values lines are different then you have a malformed fastq record. You could delete that record from both R1/R2 files as a work around. Was this data trimmed? Do you have the original files?

                    Comment


                    • Hi,

                      zcat mate_pair_1.fq.gz | sed -n '2051361,2051364p'


                      TACACATCTAGATGTGTATAAGAGACAGGTAATGGGATTGCCAGGTTCCCCCTCACTTGTAGTTTTGGATTTGGATTTATTATTCTTAATGTATGTATGTAGCACCATAGCTATGTGTGCTCAGG
                      +
                      BBBCCGC>DE1>FC1FCCFFGEEEBB1FGGDDCD111CF@EG1FBFFFDC>C><F>CC1:FF:11119:1111B1111E@>GC>G1=:11:=<CFD<1F:0=:F@0BFCG>>0FFB>F000;00F

                      I checked the lengths of the sequence and quality values it is the same 125. Yes, the fastq files used were trimmed using trim galore toolkit and quality checked using fastqc toolkit. I do have the raw data (illumina) as well.

                      Is it because trimming was done by the above mentioned tools and not using bbduk.sh ? Can't really understand what is going wrong.

                      Best,
                      D

                      Comment


                      • Can you check the R2 file by the same method?

                        Comment


                        • Hi,

                          There seems to be nothing wrong with the second read too

                          zcat mate_pair_2.fq.gz | sed -n '2051361,2051364p'

                          @HWI-D00294:282:CAB4VANXX:7:1101:13812:601352:N:0:GCCAAT
                          TTGAAGCAGCAGTTCAAAAACATTGTCTCAGTCTGTCTTAATTTGGTATAATCCCCTGAATCTATTAAACCAAGACCAGCTGTCTGACATTTTTCACTATTTTCTTTTCTCCGCTTGTTCTTTTC
                          +
                          @BBB@FG1EBGG@D1FFGFGGGGGEEGGGFCDGGE=@>DF@F@1@@>FG>FG>DEG1E>1@FDGGGCGECBGEGBE>1@>GGFC=D>FGE@FFGGGG00E>F>DCGGGGGGGGGDF=@>C0B@FG
                          Last edited by bio_d; 10-23-2017, 08:34 PM.

                          Comment


                          • Originally posted by santiagorevale View Post
                            Hi Brian,

                            I'm using "filterbyname.sh" script from bbmap v37.60 (using Java 1.8.0_102) to extract some reads from a FastQ file given a list of IDs.

                            The current FastQ file has 196 Mi reads and I want to keep 85 Mi. Uncompressed FastQ file size is 14G while compressed is only 1.4G. IDs file is 3.1G.

                            When running the script using 24G of RAM it dies with OutOfMemoryError. Isn't it an excessive use of memory for just filtering a FastQ file? Also, among the script arguments the is no "threads" option, however the script is using all available cores. Any way of limiting both memory as well as threads usage?

                            Here is the error:

                            java -ea -Xmx24G -cp /software/bbmap-37.60/current/ driver.FilterReadsByName -Xmx24G include=t in=Sample1.I1.fastq.gz out=filtered.Sample1.I1.fastq.gz names=reads.ids
                            Executing driver.FilterReadsByName [-Xmx24G, include=t, in=Sample1.I1.fastq.gz, out=filtered.Sample1.I1.fastq.gz, names=reads.ids]

                            Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
                            at java.util.Arrays.copyOfRange(Arrays.java:3664)
                            at java.lang.String.<init>(String.java:207)
                            at java.lang.String.toLowerCase(String.java:2647)
                            at java.lang.String.toLowerCase(String.java:2670)
                            at driver.FilterReadsByName.<init>(FilterReadsByName.java:145)
                            at driver.FilterReadsByName.main(FilterReadsByName.java:40)

                            Thank you very much in advance.

                            Best regards,
                            Santiago
                            Hi there,

                            Any hint on what I've previously asked?

                            Thanks!

                            Comment


                            • Hi Brian & others,

                              I tried to run bbnorm with a kmer size of 99, but it crashed with the following error:

                              Code:
                              Exception in thread "Thread-371" Exception in thread "Thread-357" Exception in thread "Thread-368" Exception in thread "Thread-377" Exception in thread "Thread-367" Exception in thread "Thread-362" Exception in thread "Thread-380" Exception in thread "Thread-363" Exception in thread "Thread-365" Exception in thread "Thread-364" Exception in thread "Thread-366" Exception in thread "Thread-358" Exception in thread "Thread-361" Exception in thread "Thread-360" Exception in thread "Thread-381" Exception in thread "Thread-387" Exception in thread "Thread-372" Exception in thread "Thread-399" java.lang.AssertionError: this function not tested with k>31
                                  at jgi.KmerNormalize.correctErrors(KmerNormalize.java:2124)
                                  at jgi.KmerNormalize.access$19(KmerNormalize.java:2121)
                                  at jgi.KmerNormalize$ProcessThread.normalizeInThread(KmerNormalize.java:3043)
                                  at jgi.KmerNormalize$ProcessThread.run(KmerNormalize.java:2806)
                              I'm wondering if I used a bad combination of parameters. Here's the call:

                              Code:
                              java -ea -Xmx132160m -Xms132160m -cp PATH/TO/bbmap/current/ jgi.KmerNormalize bits=32 in=output/trim_decon/reads.fastq.gz threads=50 out=output/k_99/norm/normalised.fastq.gz zl=9 hist=output/k_99/norm/hist_before.txt histout=output/k_99/norm/hist_after.txt target=50 min=5 prefilter ecc k=99 peaks=output/k_99/norm/peaks.txt
                              Otherwise, is it supported to use bbnorm with larger kmer sizes, or would you recommend estimating the target coverage for k = 99 based on the coverage at k = 31?

                              I've posted the log on pastebin: https://pastebin.com/jPkKagFs

                              Thanks again for the bbmap suite!

                              Tom

                              Comment


                              • Hi, just want to make sure I'm not missing anything here, but randomreads.sh cannot produce metagenomes according to a specific profile, right? I only find information about it drawing random numbers from an exponential distribution for each reference sequence and thus produces a simulated metagenome from a set of reference sequences, which of course is awesome, but right now I would like to produce a simulated metagenome with a very specific composition.

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Exploring the Dynamics of the Tumor Microenvironment
                                  by seqadmin




                                  The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                  07-08-2024, 03:19 PM
                                • seqadmin
                                  Exploring Human Diversity Through Large-Scale Omics
                                  by seqadmin


                                  In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                  06-25-2024, 06:43 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, 07-10-2024, 07:30 AM
                                0 responses
                                30 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-03-2024, 09:45 AM
                                0 responses
                                201 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-03-2024, 08:54 AM
                                0 responses
                                212 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-02-2024, 03:00 PM
                                0 responses
                                194 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X