Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Originally posted by darthsequencer View Post
    Thanks that's helpful. On the note of loading references - is there a way to use wildcards with the input and output of bbwrap?
    Ah, sorry, but nope. You can, however, do something like...

    Code:
    ls *.fastq.gz > ls.txt
    ...then replace '\n' with ',' using a text editor (like Notepad++). I'm sure there's a simpler sed/awk solution, too.

    Comment


    • Hi Brian,
      thank you so much for this useful tool. I would like to ask you a question. Does BBMap generate a list of transcripts as a result of the mapping of the reads? What I mean is if this tool generates a fasta file of reconstructed transcripts based on the reference genome and the reads used for mapping.
      I need this because the genome that I am using is not annotated (I mean, I do not have a gff) and I want to generate a reference transcriptome using my reads and the genome.

      Thank you again!
      Best,
      Lucila.

      Comment


      • Originally posted by lucila View Post
        Hi Brian,
        thank you so much for this useful tool. I would like to ask you a question. Does BBMap generate a list of transcripts as a result of the mapping of the reads? What I mean is if this tool generates a fasta file of reconstructed transcripts based on the reference genome and the reads used for mapping.
        I need this because the genome that I am using is not annotated (I mean, I do not have a gff) and I want to generate a reference transcriptome using my reads and the genome.
        Hi Lucila,

        No, it does not, but I have been considering adding gff output, at least for exon boundaries, if not full transcripts. For generating a reference transcriptome, you can do either mapping or assembly; so, assemblers like Trinity are also an option.

        I have not tried it, but you may also want to examine this:
        Genome Annotation Without Nightmares. Contribute to enormandeau/gawn development by creating an account on GitHub.

        Comment


        • Originally posted by Brian Bushnell View Post
          Hi Lucila,

          No, it does not, but I have been considering adding gff output, at least for exon boundaries, if not full transcripts. For generating a reference transcriptome, you can do either mapping or assembly; so, assemblers like Trinity are also an option.

          I have not tried it, but you may also want to examine this:
          https://github.com/enormandeau/gawn
          Thank you Brian for the information!!
          Cheers,
          Lucila.

          Comment


          • Hi Brian,

            I'm having an issue I'm wondering you can help with.

            I am using bbmap to align to a small viral genome. The resulting bam or sam file give me an error something like : Segmentation fault (core dumped) - every time I run the program lofreq viterbi, which is used before variant calling with lofreq. When I generate a bam with the same reference file and fastq files with bowtie2 I do not get the same error.

            I have tried using reformat.sh to convert to cigar 1.3 but this doesn't seem to work. The fasta file has only one sequence and there are no spaces in the name.

            Here is an example of the error with the command -
            Code:
            lofreq viterbi -f ZIKV_PRVABC59_CDS.fa -o bbmapv.bam bbmap.bam
            Segmentation fault (core dumped)
            It runs OK with bowtie produced bam.

            Here is the output from the command head with either bowtie or bbmap produced sam file.

            bbmap
            Code:
            @HD     VN:1.4  SO:unsorted
            @SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
            @PG     ID:BBMap        PN:BBMap        VN:37.47        CL:java -Djava.library.path=/home/weger/bb
            p/jni/ -ea -Xmx216859m align2.BBMap build=1 overwrite=true fastareadlen=500 in1=1_ctff_R1.fq.gz in
            1_ctff_R2.fq.gz out=1_bbmap.sam ref=ZIKV_PRVABC59_CDS.fa maq=25
            NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        77      *       0       0
            TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEAE/EEEA/AEEEE/EA
            NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        141     *       0       0
            AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
            AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA      AAAE<6<<<66<<<<<6<<6666666666666E66666666E66E666A6666A66AA
            E6EAA6EAAAEAEAEAEEAEEAEEEEEAE6EAEAAEAAAEEEAEAEAAEEEEAEAEEAEEEEEEEEAEEEEEEEE
            NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 77      *       0       0       *
            TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEEAEEE//EEEEE//E
            NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 141     *       0       0       *
            AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 6666666666<<<<<<666666666666666666666666
            NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
            TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
            TTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAE<EE<A/EA//E
            NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        141     *       0       0
            AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
            AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A6A6<<<6<<E<<<<<6<<6<6<<<<<6666<<<<6666E66E6666666EE6666E66EEEE666
            6AAAAAAAEA6EAAEAAE6E6EAE6EAEEAEAAEAEAEEEAEEEAAEEAEEEEEEAEEEEAEAE
            NS500697:37:HH5V7BGXY:2:22101:16007:4646 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
            TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT      AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEE/AE/EEA//A
            bowtie2
            Code:
            @HD     VN:1.0  SO:unsorted
            @SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
            @PG     ID:bowtie2      PN:bowtie2      VN:2.2.9        CL:"/apps/bowtie2-2.2.9/bowtie2-align-s --
            apper basic-0 -x /data/ebel_lab/Weger/indexed_fasta_bowtie/ZIKV__PRVABC59_CDS -S 1_bowtie.sam -1 /
            p/31806.inpipe1 -2 /tmp/31806.inpipe2"
            NS500697:37:HH5V7BGXY:4:13502:9896:14279        77      *       0       0       *       *       0
            AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EE        YT:Z:UP
            NS500697:37:HH5V7BGXY:4:13502:9896:14279        141     *       0       0       *       *       0
            CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
            CAGAAAGACTTGTTTTTTTTTTTTTAAT    AAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEE/AEEEEEEEEEEEEEEEE/<E   YT:Z:UP
            NS500697:37:HH5V7BGXY:4:12605:4827:14803        77      *       0       0       *       *       0
            AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAACAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEAEE        YT:Z:UP
            NS500697:37:HH5V7BGXY:4:12605:4827:14803        141     *       0       0       *       *       0
            CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
            CAGAAAGACTTGTTTTTTTTTTTTTAATT   AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA<AEEEEEEEEEEEEEEEEEEEEEEE<<<E  YT:Z:UP
            NS500697:37:HH5V7BGXY:4:22607:4886:15556        77      *       0       0       *       *       0
            AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEAEEAEEEEEEEE
            E/EEEAE/AAEEEEEEAAEEEEEAA6EEE6EEEEEEEE/E        YT:Z:UP
            NS500697:37:HH5V7BGXY:4:22607:4886:15556        141     *       0       0       *       *       0
            AGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTATG
            AGAAAGACTTGTTTTTTTTTTTTTTAATTA  AAAAEEEEEEA/EAEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEE
            EEEEEEEEEEAEEAEEEEEEEAEEEEEEAEAAE/AEAAEEEEE6E/EEEAEEAEEEEA//<EA YT:Z:UP
            NS500697:37:HH5V7BGXY:4:12507:9984:15827        77      *       0       0       *       *       0
            AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAAGTCTTTAT     AAAAAEAEEEEEEEEEEE
            EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6E6EEE/EEEEE/EE/E6EAE/E        YT:Z:UP
            Any ideas you might have would be greatly appreciated.

            Thanks in advance.
            Last edited by GenoMax; 08-18-2017, 03:23 AM. Reason: Added [CODE] tags

            Comment


            • I wonder if viterbi program is having trouble with (NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG ) the tag sequence which BBMap preserves but Bowtie does not.

              Can you "reformat" the bbmap alignment using "trd=t" flag and see if that helps?

              Comment


              • So I tried

                reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t sam=1.3

                and just

                reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t

                I then tried running lofreq viterbi either with sorting or without and same error.

                Segmentation fault (core dumped)

                I did notice that bbmap bams have the N and bowtie does not.

                Any more thoughts?

                Thanks for the quick reply.

                Comment


                • I can't think of anything now. Let us see if Brian has any other suggestions.

                  lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.

                  Comment


                  • Originally posted by GenoMax View Post
                    I can't think of anything now. Let us see if Brian has any other suggestions.

                    lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.
                    It seems to end up calling weird indels when viterbi is not run that are clearly not real. Thanks for the help. I'll wait for Brian and see what he thinks.

                    Comment


                    • Hi Brian,

                      I'm trying to call some kmers with kmercountexact.sh and am running into some troubles.

                      I am merging paired end reads with bbmerge then mapping to a small viral reference file with bbmap. I am then using reformat.sh to trim to a smaller read length and then calling/counting kmers with kmercountexact.

                      The problem I am having is that after mapping I'm getting reads that are both in the forward and reverse sense. So when calling kmers I get them from both ends of the sequence I'm interested in. Is there any way to reverse complement just the reads that are in the reverse complement in relation to the reference sequence?

                      Thanks in advance.

                      Comment


                      • After mapping, you can split the sam file like this:

                        Code:
                        splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam
                        reformat.sh in=plus.sam out=plus.fq
                        reformat.sh in=minus.sam out=minus.fq rcomp
                        cat plus.fq minus.fq > all.fq
                        As for your previous question (sorry, somehow I did not notice it) - I think Genomax may be right that the issue is the read names, but it's hard to say. Also, it's hard to see what's going on here since none of the reads are mapped...

                        Incidentally, what do you mean by this?

                        I did notice that bbmap bams have the N and bowtie does not.
                        Do you mean N in the read name, or in the cigar string, or in the read sequence?

                        Also, I'd be interested to see how the output of callvariants.sh (with appropriate sensitivity parameters) compares to lofreq, if you want to give it a try.
                        Last edited by Brian Bushnell; 08-23-2017, 03:06 PM.

                        Comment


                        • Brian,

                          Thanks for the splitsam, that worked great!

                          As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

                          BBmap
                          NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

                          Bowtie
                          NS500697:37:HH5V7BGXY:4:13502:9896:14279

                          Comment


                          • As for the comparison for lofreq and callvariants, I can never seem to get callvariants to call what look like good snps for my samples. Maybe I am using it wrong in some way.

                            The sensitivity parameters are a little trick to play with in lofreq. But with callvariants and lofreq with standard settings at rarity=0.01, callvariants calls completely different snps as compared to lofreq. After manual inspection of the bam file it looks like lofreq is more accurate. Could definitely be some paremeter issues though. I'm open to any suggestions at changing these, as I'm a big fan of the bbmap suite.

                            Here is a link to the vcf files and the script I'm running if you are interested.


                            Access Google Drive with a Google account (for personal use) or Google Workspace account (for business use).


                            Thanks as always.

                            Comment


                            • Originally posted by jweger1988 View Post
                              Brian,

                              Thanks for the splitsam, that worked great!

                              As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

                              BBmap
                              NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

                              Bowtie
                              NS500697:37:HH5V7BGXY:4:13502:9896:14279
                              Yep, BBMap uses the full read name whereas Bowtie truncates the read name at the first whitespace. As Genomax noted, you can add the flag "trd" (trimreaddescriptions) to BBMap to produce truncated names, you you can reprocess the already-mapped sam file with Reformat using the "trd" flag to do that also.

                              Thanks for the feedback on CallVariants! It's certainly odd that there seems to be nothing in common between the two results. Would it be possible for you to email me the bam and reference so I can examine it in more detail?
                              Last edited by Brian Bushnell; 08-24-2017, 10:46 AM.

                              Comment


                              • Unfortunately, it looks like your data is very low quality and has inaccurate quality scores, things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions. I recommend:

                                1) Adapter-trimming (you have >5% of reads with adapters)
                                2) Mapping
                                3) Recalibrating the raw reads
                                4) Adapter-trimming and quality-trimming the recalibrated reads, with "ftm=5"
                                5) Mapping again
                                6) Calling variants

                                E.g.
                                Code:
                                bbduk.sh in=reads.fq.gz out=trimmed0.fq.gz ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
                                bbmap.sh in=trimmed0.fq.gz ref=ref.fa out=mapped0.sam nodisk qahist=qahist.txt qhist=qhist.txt mhist=mhist.txt
                                rm trimmed0.fq.gz
                                calctruequality.sh in=mapped0.sam
                                rm mapped0.sam
                                bbduk.sh in=reads.fq.gz out=recal.fq.gz
                                bbduk.sh in=recal.fq.gz out=trimmed.fq.gz qtrim=rl trimq=14 ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
                                bbmap.sh in=trimmed.fq ref=ref.fa out=mapped.sam nodisk bs=bs.sh  qahist=qahist_rc.txt qhist=qhist_rc.txt mhist=mhist_rc.txt; sh bs.sh

                                Result of recalibration:
                                Before:



                                After:



                                However, I am not sure you will get anything useful out of this data. We have found that NextSeq runs are just not trustworthy for calling variants at a frequency of below 5%. I looked at lofreq's highest-quality variant, with a Q-score of 1392, and it was seen in 2 reads, with a Q-score of 14 in each case, once within 3 bp of the read end (probably in adapter sequence because there are 2 mismatches and an insertion next to it there) and both reads had a lot of other mismatches. That is the the highest quality variation, and it has zero chance of being correct.

                                The reason you were not getting any variant calls with CallVariants except at the ends is because for some reason you have stranded data; all the reads only map to the minus strand. In that situation you need to set "usebias=f" and "minstrandratio=0" to turn off strand-bias filters. It still won't call much, though. 95 variants are called using this command (on the trimmed recalibrated reads):

                                Code:
                                callvariants.sh in=recal.sam.gz ref=ref.fa rarity=0.01 out=foo3.txt vcf=foo3.vcf minreads=2 minscore=10 usebias=f minstrandratio=0
                                But only 12 are above Q20 which is my normal cutoff (default is minscore=20, which I recommend), so I'd suggest perhaps looking at those. Maybe they're real. The insertion at 9225 looks plausible.
                                Attached Files

                                Comment

                                Latest Articles

                                Collapse

                                • seqadmin
                                  Exploring the Dynamics of the Tumor Microenvironment
                                  by seqadmin




                                  The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                                  07-08-2024, 03:19 PM
                                • seqadmin
                                  Exploring Human Diversity Through Large-Scale Omics
                                  by seqadmin


                                  In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                                  06-25-2024, 06:43 AM

                                ad_right_rmr

                                Collapse

                                News

                                Collapse

                                Topics Statistics Last Post
                                Started by seqadmin, Yesterday, 05:49 AM
                                0 responses
                                15 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-15-2024, 06:53 AM
                                0 responses
                                26 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-10-2024, 07:30 AM
                                0 responses
                                37 views
                                0 likes
                                Last Post seqadmin  
                                Started by seqadmin, 07-03-2024, 09:45 AM
                                0 responses
                                204 views
                                0 likes
                                Last Post seqadmin  
                                Working...
                                X