  • BBmap generation and mapping of artificial paired-end fastq

    Hello everyone.
    I have discovered BBMap and have just started using it. I want to benchmark some variant callers, and I was thinking of making a set of paired-end Illumina fastq files where I know exactly the sample composition per position (e.g. position 100: 80% A (REF), 20% T (MUT)), so I can run the files through the pipelines and assess how they perform.

    So, I generate my set of files with:
    Code:
    ./randomreads.sh ref=reference.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t mininsert=1 maxinsert=12 snprate=0.1 insrate=0.1 delrate=0.1 minq=12 -Xmx10g
    The command works just fine and I get a set of fastq files. I can see (and also read somewhere in the documentation) that the read headers actually contain the info about the changes relative to the reference. So, I can write a small python script to parse them and gather that info, but my first question is whether there is any tool in the software suite that can do this for me. I thought that maybe reformat.sh could, but its bhist option seems to output statistics by read position for the R1 and R2 files.
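
    In case it helps to show what I mean, here is a minimal sketch that just dumps the read headers for inspection (this is not a dedicated BBTools feature, and the exact header layout depends on the randomreads.sh version, so I would check a few headers before writing a parser around specific fields):
    Code:
    # dump the FASTQ header lines (every 4th line) so the embedded mutation info can be inspected
    awk 'NR % 4 == 1' Sample1_R1.fastq | head
    # save all headers to a text file for a custom parser
    awk 'NR % 4 == 1' Sample1_R1.fastq > R1_headers.txt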

    Then my other question is about bbmap.sh. I tried to map the generated fastq files to my reference using the command:

    Code:
    ./bbmap.sh ref=reference.fa local=t in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
    but I got a very low average coverage per position (225, versus the ~10K expected from the input), and the mapping percentage is very low at 2.25%.
    The reported variant rates are also very high,
    HTML Code:
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	  2.2502% 	    22429 	  2.2502% 	     3364350
    unambiguous:     	  2.2502% 	    22429 	  2.2502% 	     3364350
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    
    perfect best site:	  0.2408% 	     2400 	  0.2408% 	      360000
    semiperfect site:	  0.2408% 	     2400 	  0.2408% 	      360000
    rescued:         	  0.4520% 	     4505
    
    Match Rate:      	      NA 	       NA 	 95.1746% 	     3202707
    Error Rate:      	 86.5130% 	    19404 	  4.1666% 	      140208
    Sub Rate:        	 60.4441% 	    13557 	  0.8741% 	       29415
    Del Rate:        	  0.1338% 	       30 	  0.0219% 	         736
    Ins Rate:        	 65.3707% 	    14662 	  3.2706% 	      110057
    N Rate:          	 14.2583% 	     3198 	  0.6589% 	       22171
    
    
    Read 2 data:      	pct reads	num reads 	pct bases	   num bases
    
    mapped:          	  2.2421% 	    22348 	  2.2421% 	     3352200
    unambiguous:     	  2.2421% 	    22348 	  2.2421% 	     3352200
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.2039% 	     2032 	  0.2039% 	      304800
    
    perfect best site:	  0.4693% 	     4678 	  0.4693% 	      701700
    semiperfect site:	  0.4693% 	     4678 	  0.4693% 	      701700
    rescued:         	  0.3604% 	     3592
    
    Match Rate:      	      NA 	       NA 	 96.8958% 	     3248505
    Error Rate:      	 76.8301% 	    17170 	  2.7130% 	       90955
    Sub Rate:        	 60.3454% 	    13486 	  0.8803% 	       29513
    Del Rate:        	  0.1253% 	       28 	  0.0112% 	         376
    Ins Rate:        	 39.6680% 	     8865 	  1.8215% 	       61066
    N Rate:          	 10.8198% 	     2418 	  0.3912% 	       13116
    but that seems contradictory given my input. So, my second question is: what am I missing here? I was aiming for files with only a few events (substitutions, indels), but the end result seems quite extreme.

    I also ran a second test where I lowered the rates drastically
    Code:
    snprate=0.1 insrate=0.01 delrate=0.01
    , but that yielded an even poorer mapping percentage of 0.2%.

    I appreciate any pointers, thanks for reading!

  • #2
    It may be best to use "mutate.sh" (see its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

    I think using "local=t" is causing the alignment issues. It is meant for local alignments, for cases where you expect errors at the ends of reads.
    Last edited by GenoMax; 04-30-2020, 05:49 AM.

    • #3
      Originally posted by GenoMax:
      It may be best to use "mutate.sh" (see its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

      I think using "local=t" is causing the alignment issues. It is meant for local alignments, for cases where you expect errors at the ends of reads.
      Thank you for your ultra-fast reply, GenoMax! I have a question about your comment: did you mean to use "mutate.sh" before generating the reads? I actually tried that approach:
      1) First mutate my original sequence:
      Code:
      ./mutate.sh in=reference.fa out=reference_Mut.fa vcf=Variants.vcf subrate=0.01 indelrate=0.005 maxindel=3 overwrite=t
      The vcf file indeed looks super!
      2) Then generate reads based on the new sequence, without any additional flags (only min quality):
      Code:
      ./randomreads.sh ref=reference_Mut.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t minq=12 -Xmx10g
      3) Map the reads to the mutated reference (exclude "local" option):
      Code:
      ./bbmap.sh ref=reference_Mut.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
      This indeed yields 99.905% mapping, which makes sense (I assume the small remainder is lost to some quality filter).
      4) Map the reads to the original reference.
      Code:
      ./bbmap.sh ref=reference.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
      This yielded 99.875% mapping, which also makes sense.

      My follow-up question, then: can I safely assume that the fastq files I'm generating with the method above contain the variants at a rate of 100%?

      If that assumption is correct, what would be the best way to regulate the variation rate? I'm thinking along the lines of:
      1) Generate perfect reads from un-mutated reference.
      2) Generate perfect reads from mutated sequence.
      3) Mix the two file sets in various proportions to get the desired effect and make a new set of files (roughly as sketched below).
      Is that logic correct, or is there already a way to do that?
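
      For step 3, I imagine something like the following, using reformat.sh's samplerate option to subsample each pair of files and then concatenating the results. The 0.8/0.2 split and the file names are just placeholders, and since the subsampling is random the per-position fractions will only be approximate:
      Code:
      # keep ~80% of the un-mutated-reference read pairs and ~20% of the mutated-reference read pairs
      ./reformat.sh in1=Ref_R1.fastq in2=Ref_R2.fastq out1=Ref80_R1.fastq out2=Ref80_R2.fastq samplerate=0.8 sampleseed=7
      ./reformat.sh in1=Mut_R1.fastq in2=Mut_R2.fastq out1=Mut20_R1.fastq out2=Mut20_R2.fastq samplerate=0.2 sampleseed=7
      # combine into the final mixed sample
      cat Ref80_R1.fastq Mut20_R1.fastq > Mix_R1.fastq
      cat Ref80_R2.fastq Mut20_R2.fastq > Mix_R2.fastq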

      Thank you in advance, I really appreciate it!

      • #4
        Perhaps I am missing a subtle point, but since you can control mutation types/rates with mutate.sh, would it not be better to go with just that data (#2 in the list above)? If you mix reads from the mutated and un-mutated references, can you be sure that you will maintain the fraction of mutations at the same level in the new pool?

        • #5
          Hi GenoMax! Yes, you are right, the fraction of mutations may change in the new pool, but, based on the mixing percentages in my scenario, I will be able to control the level of mutation per position. So, say at position 100, I could go from 100% MUT to 20% MUT by mixing my original-reference fastq and the mutated-reference fastq at 80/20 percent of reads respectively.

          • #6
            You could try it out and see if it works (I have a hunch it won't be perfect).

            If that does not work acceptably, then I suggest you run mutate.sh multiple times with new values and create new datasets, roughly as sketched below.
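
            For example, something along these lines; the rates and file names are only placeholders, and the flags are just the ones already used earlier in this thread:
            Code:
            # one mutated reference plus one synthetic read set per substitution rate
            for rate in 0.001 0.005 0.01; do
                ./mutate.sh in=reference.fa out=reference_Mut_${rate}.fa vcf=Variants_${rate}.vcf subrate=${rate} indelrate=0.005 maxindel=3 overwrite=t
                ./randomreads.sh ref=reference_Mut_${rate}.fa out1=Sample_${rate}_R1.fastq out2=Sample_${rate}_R2.fastq coverage=10000 paired=t minq=12 -Xmx10g
            done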

            • #7
              Thank you, I'll try it and report back!
