Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Brian Bushnell
    replied
    Originally posted by jweger1988 View Post
    I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.
    I would not use the "Q30" number for anything; I consider it marketing fluff. The only way to determine data quality is to actually map it and count the number of sequencing errors. You can see from the graph that when you do that, the average starts at Q11 (which is incredibly low and very unusual), goes up to a peak of just under 30, and drops to Q12 at the end, which is also very low. At no point does the average quality even reach Q30.

    You can see this visually in IGV - the screen is always littered with random blips indicating indels or mismatches to the reference, which is very unusual for Illumina data (at least, HiSeq 2500 Illumina data). The screen should look very clean except where mismatches to the reference form an obvious column, indicating a real mutation.

    We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?
    You might want to make sure you are using the latest chemistry and software, and talk to your Illumina rep about your concerns with the quality. We are still using NextSeq routinely, and the quality is usually much better than that, but I don't know the reason.

    I don't think the platform is appropriate for low-frequency variant-calling anyway, but on any platform, I'd recommend paired-end reads for that (preferably overlapping so they can be merged), and definitely not strand-specific protocols, which incur bias (particularly, certain genomic features, read in a certain direction, are prone to miscalls). How did you end up with only minus-strand reads, anyway?

    Leave a comment:


  • jweger1988
    replied
    Originally posted by GenoMax View Post
    A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.
    Thanks Genomax.

    I'm just curious why Brian said the data was really low quality. We are using a Nextseq 500 and V2 chemistry. The Q30>30 was 80% and it was actually underclustered.

    We only have access to Nextseq at the moment, is there anything we can do to improve this or is it just how it is with Nextseq?

    Leave a comment:


  • GenoMax
    replied
    A MiSeq/HiSeq generally generates better data so if you do have the option of using one of those then do switch. Especially for lowfreq SNP calls.

    Leave a comment:


  • jweger1988
    replied
    Brian,

    Wow, that's very surprising to me.

    things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions
    Are you still currently using a nextseq? Do you recommend a MiSeq or a Hiseq instead? I assume by chemistry and software revisions you mean those put out by Illumina? I wonder if we don't have the latest software?

    Any suggestion to improve this in the future?

    Leave a comment:


  • Brian Bushnell
    replied
    Unfortunately, it looks like your data is very low quality and has inaccurate quality scores, things we used to see on our NextSeqs until it eventually got better after various chemistry and software revisions. I recommend:

    1) Adapter-trimming (you have >5% of reads with adapters)
    2) Mapping
    3) Recalibrating the raw reads
    4) Adapter-trimming and quality-trimming the recalibrated reads, with "ftm=5"
    5) Mapping again
    6) Calling variants

    E.g.
    Code:
    bbduk.sh in=reads.fq.gz out=trimmed0.fq.gz ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
    bbmap.sh in=trimmed0.fq.gz ref=ref.fa out=mapped0.sam nodisk qahist=qahist.txt qhist=qhist.txt mhist=mhist.txt
    rm trimmed0.fq.gz
    calctruequality.sh in=mapped0.sam
    rm mapped0.sam
    bbduk.sh in=reads.fq.gz out=recal.fq.gz
    bbduk.sh in=recal.fq.gz out=trimmed.fq.gz qtrim=rl trimq=14 ktrim=r k=23 mink=11 hdist=1 minlen=100 ref=adapters
    bbmap.sh in=trimmed.fq ref=ref.fa out=mapped.sam nodisk bs=bs.sh  qahist=qahist_rc.txt qhist=qhist_rc.txt mhist=mhist_rc.txt; sh bs.sh

    Result of recalibration:
    Before:



    After:



    However, I am not sure you will get anything useful out of this data. We have found that NextSeq runs are just not trustworthy for calling variants at a frequency of below 5%. I looked at lofreq's highest-quality variant, with a Q-score of 1392, and it was seen in 2 reads, with a Q-score of 14 in each case, once within 3 bp of the read end (probably in adapter sequence because there are 2 mismatches and an insertion next to it there) and both reads had a lot of other mismatches. That is the the highest quality variation, and it has zero chance of being correct.

    The reason you were not getting any variant calls with CallVariants except at the ends is because for some reason you have stranded data; all the reads only map to the minus strand. In that situation you need to set "usebias=f" and "minstrandratio=0" to turn off strand-bias filters. It still won't call much, though. 95 variants are called using this command (on the trimmed recalibrated reads):

    Code:
    callvariants.sh in=recal.sam.gz ref=ref.fa rarity=0.01 out=foo3.txt vcf=foo3.vcf minreads=2 minscore=10 usebias=f minstrandratio=0
    But only 12 are above Q20 which is my normal cutoff (default is minscore=20, which I recommend), so I'd suggest perhaps looking at those. Maybe they're real. The insertion at 9225 looks plausible.
    Attached Files

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by jweger1988 View Post
    Brian,

    Thanks for the splitsam, that worked great!

    As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

    BBmap
    NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

    Bowtie
    NS500697:37:HH5V7BGXY:4:13502:9896:14279
    Yep, BBMap uses the full read name whereas Bowtie truncates the read name at the first whitespace. As Genomax noted, you can add the flag "trd" (trimreaddescriptions) to BBMap to produce truncated names, you you can reprocess the already-mapped sam file with Reformat using the "trd" flag to do that also.

    Thanks for the feedback on CallVariants! It's certainly odd that there seems to be nothing in common between the two results. Would it be possible for you to email me the bam and reference so I can examine it in more detail?
    Last edited by Brian Bushnell; 08-24-2017, 10:46 AM.

    Leave a comment:


  • jweger1988
    replied
    As for the comparison for lofreq and callvariants, I can never seem to get callvariants to call what look like good snps for my samples. Maybe I am using it wrong in some way.

    The sensitivity parameters are a little trick to play with in lofreq. But with callvariants and lofreq with standard settings at rarity=0.01, callvariants calls completely different snps as compared to lofreq. After manual inspection of the bam file it looks like lofreq is more accurate. Could definitely be some paremeter issues though. I'm open to any suggestions at changing these, as I'm a big fan of the bbmap suite.

    Here is a link to the vcf files and the script I'm running if you are interested.


    Access Google Drive with a Google account (for personal use) or Google Workspace account (for business use).


    Thanks as always.

    Leave a comment:


  • jweger1988
    replied
    Brian,

    Thanks for the splitsam, that worked great!

    As for the other question by the Ns I meant read name I think? Sorry my grasp of nomenclature with these things is not great.

    BBmap
    NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG

    Bowtie
    NS500697:37:HH5V7BGXY:4:13502:9896:14279

    Leave a comment:


  • Brian Bushnell
    replied
    After mapping, you can split the sam file like this:

    Code:
    splitsam.sh mapped.sam plus.sam minus.sam unmapped.sam
    reformat.sh in=plus.sam out=plus.fq
    reformat.sh in=minus.sam out=minus.fq rcomp
    cat plus.fq minus.fq > all.fq
    As for your previous question (sorry, somehow I did not notice it) - I think Genomax may be right that the issue is the read names, but it's hard to say. Also, it's hard to see what's going on here since none of the reads are mapped...

    Incidentally, what do you mean by this?

    I did notice that bbmap bams have the N and bowtie does not.
    Do you mean N in the read name, or in the cigar string, or in the read sequence?

    Also, I'd be interested to see how the output of callvariants.sh (with appropriate sensitivity parameters) compares to lofreq, if you want to give it a try.
    Last edited by Brian Bushnell; 08-23-2017, 03:06 PM.

    Leave a comment:


  • jweger1988
    replied
    Hi Brian,

    I'm trying to call some kmers with kmercountexact.sh and am running into some troubles.

    I am merging paired end reads with bbmerge then mapping to a small viral reference file with bbmap. I am then using reformat.sh to trim to a smaller read length and then calling/counting kmers with kmercountexact.

    The problem I am having is that after mapping I'm getting reads that are both in the forward and reverse sense. So when calling kmers I get them from both ends of the sequence I'm interested in. Is there any way to reverse complement just the reads that are in the reverse complement in relation to the reference sequence?

    Thanks in advance.

    Leave a comment:


  • jweger1988
    replied
    Originally posted by GenoMax View Post
    I can't think of anything now. Let us see if Brian has any other suggestions.

    lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.
    It seems to end up calling weird indels when viterbi is not run that are clearly not real. Thanks for the help. I'll wait for Brian and see what he thinks.

    Leave a comment:


  • GenoMax
    replied
    I can't think of anything now. Let us see if Brian has any other suggestions.

    lofreq viterbi seems to be for re-alignment. I wonder if you can do without that with BBMap generated files.

    Leave a comment:


  • jweger1988
    replied
    So I tried

    reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t sam=1.3

    and just

    reformat.sh in=1_bbmap.sam out=1_bbmap_reform1.bam trd=t

    I then tried running lofreq viterbi either with sorting or without and same error.

    Segmentation fault (core dumped)

    I did notice that bbmap bams have the N and bowtie does not.

    Any more thoughts?

    Thanks for the quick reply.

    Leave a comment:


  • GenoMax
    replied
    I wonder if viterbi program is having trouble with (NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG ) the tag sequence which BBMap preserves but Bowtie does not.

    Can you "reformat" the bbmap alignment using "trd=t" flag and see if that helps?

    Leave a comment:


  • jweger1988
    replied
    Hi Brian,

    I'm having an issue I'm wondering you can help with.

    I am using bbmap to align to a small viral genome. The resulting bam or sam file give me an error something like : Segmentation fault (core dumped) - every time I run the program lofreq viterbi, which is used before variant calling with lofreq. When I generate a bam with the same reference file and fastq files with bowtie2 I do not get the same error.

    I have tried using reformat.sh to convert to cigar 1.3 but this doesn't seem to work. The fasta file has only one sequence and there are no spaces in the name.

    Here is an example of the error with the command -
    Code:
    lofreq viterbi -f ZIKV_PRVABC59_CDS.fa -o bbmapv.bam bbmap.bam
    Segmentation fault (core dumped)
    It runs OK with bowtie produced bam.

    Here is the output from the command head with either bowtie or bbmap produced sam file.

    bbmap
    Code:
    @HD     VN:1.4  SO:unsorted
    @SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
    @PG     ID:BBMap        PN:BBMap        VN:37.47        CL:java -Djava.library.path=/home/weger/bb
    p/jni/ -ea -Xmx216859m align2.BBMap build=1 overwrite=true fastareadlen=500 in1=1_ctff_R1.fq.gz in
    1_ctff_R2.fq.gz out=1_bbmap.sam ref=ZIKV_PRVABC59_CDS.fa maq=25
    NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        77      *       0       0
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT   AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEAE/EEEA/AEEEE/EA
    NS500697:37:HH5V7BGXY:2:23303:13708:4313 1:N:0:TAAGGCGA+ATATAGAG        141     *       0       0
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA      AAAE<6<<<66<<<<<6<<6666666666666E66666666E66E666A6666A66AA
    E6EAA6EAAAEAEAEAEEAEEAEEEEEAE6EAEAAEAAAEEEAEAEAAEEEEAEAEEAEEEEEEEEAEEEEEEEE
    NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 77      *       0       0       *
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEAEEE//EEEEE//E
    NS500697:37:HH5V7BGXY:2:21207:1798:4356 1:N:0:TAAGGCGA+ATAGAGAC 141     *       0       0       *
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 6666666666<<<<<<666666666666666666666666
    NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
    TTTTTTTTTTTTTTTTTTTTTT  AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEEEEAE<EE<A/EA//E
    NS500697:45:HJGYTAFXX:2:11101:23195:4485 1:N:0:TAAGGCGA+ATAGAGAG        141     *       0       0
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA A6A6<<<6<<E<<<<<6<<6<6<<<<<6666<<<<6666E66E6666666EE6666E66EEEE666
    6AAAAAAAEA6EAAEAAE6E6EAE6EAEEAEAAEAEAEEEAEEEAAEEAEEEEEEAEEEEAEAE
    NS500697:37:HH5V7BGXY:2:22101:16007:4646 1:N:0:TAAGGCGA+ATAGAGAG        77      *       0       0
    TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT      AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEE/AE/EEA//A
    bowtie2
    Code:
    @HD     VN:1.0  SO:unsorted
    @SQ     SN:ZIKV_PRVABC59_CDS    LN:10272
    @PG     ID:bowtie2      PN:bowtie2      VN:2.2.9        CL:"/apps/bowtie2-2.2.9/bowtie2-align-s --
    apper basic-0 -x /data/ebel_lab/Weger/indexed_fasta_bowtie/ZIKV__PRVABC59_CDS -S 1_bowtie.sam -1 /
    p/31806.inpipe1 -2 /tmp/31806.inpipe2"
    NS500697:37:HH5V7BGXY:4:13502:9896:14279        77      *       0       0       *       *       0
    AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE/EE        YT:Z:UP
    NS500697:37:HH5V7BGXY:4:13502:9896:14279        141     *       0       0       *       *       0
    CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
    CAGAAAGACTTGTTTTTTTTTTTTTAAT    AAAAAEEEEEEE/EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEAEEE/AEEEEEEEEEEEEEEEE/<E   YT:Z:UP
    NS500697:37:HH5V7BGXY:4:12605:4827:14803        77      *       0       0       *       *       0
    AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAACAA     AAAAAEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEAEEAEEEEEEEEEEEEEEEEAEE        YT:Z:UP
    NS500697:37:HH5V7BGXY:4:12605:4827:14803        141     *       0       0       *       *       0
    CAGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTAT
    CAGAAAGACTTGTTTTTTTTTTTTTAATT   AAAAAEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEA<AEEEEEEEEEEEEEEEEEEEEEEE<<<E  YT:Z:UP
    NS500697:37:HH5V7BGXY:4:22607:4886:15556        77      *       0       0       *       *       0
    AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAA     AAAAAEEEEEEEEEAEEAEEEEEEEE
    E/EEEAE/AAEEEEEEAAEEEEEAA6EEE6EEEEEEEE/E        YT:Z:UP
    NS500697:37:HH5V7BGXY:4:22607:4886:15556        141     *       0       0       *       *       0
    AGAGACATACCCCTCCTTACAAATAAAAATGTTCTTTATAGATGTAAATTTATTTTACAAAAATGTTTCAAAATGACCAGATGAAAATCATCCTTATG
    AGAAAGACTTGTTTTTTTTTTTTTTAATTA  AAAAEEEEEEA/EAEEEEEEAEEEEEEEEEEEEEEEEEEEEAEEEEEEEEEEEEEEEEEEEEEEEE
    EEEEEEEEEEAEEAEEEEEEEAEEEEEEAEAAE/AEAAEEEEE6E/EEEAEEAEEEEA//<EA YT:Z:UP
    NS500697:37:HH5V7BGXY:4:12507:9984:15827        77      *       0       0       *       *       0
    AGCCTAAGTATGTCAATACAACAAATACTTACTGTTTCATTTCTAGTAATGAAAAAAAAAAAAAAAAGTCTTTAT     AAAAAEAEEEEEEEEEEE
    EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE6E6EEE/EEEEE/EE/E6EAE/E        YT:Z:UP
    Any ideas you might have would be greatly appreciated.

    Thanks in advance.
    Last edited by GenoMax; 08-18-2017, 03:23 AM. Reason: Added [CODE] tags

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Current Approaches to Protein Sequencing
    by seqadmin


    Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
    04-04-2024, 04:25 PM
  • seqadmin
    Strategies for Sequencing Challenging Samples
    by seqadmin


    Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
    03-22-2024, 06:39 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 04-11-2024, 12:08 PM
0 responses
31 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 10:19 PM
0 responses
32 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-10-2024, 09:21 AM
0 responses
28 views
0 likes
Last Post seqadmin  
Started by seqadmin, 04-04-2024, 09:00 AM
0 responses
53 views
0 likes
Last Post seqadmin  
Working...
X