  • BBmap generation and mapping of artificial paired-end fastq

    Hello everyone.
    I have discovered BBmap and have just started using it. I want to run a benchmark on some variant callers and I was thinking of making a set of paired-end Illumina fastq files where I know exactly the sample composition per position (e.g. position 100, 80% A (REF), 20% T (MUT)) and then I can run the file through the pipelines and assess how they perform.

    So, I generate my set of files with:
    ./randomreads.sh ref=reference.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t mininsert=1 maxinsert=12 snprate=0.1 insrate=0.1 delrate=0.1 minq=12 -Xmx10g
    The command works just fine and I get a set of fastq files. I can see (and also read somewhere in the documentation) that the read headers actually contain the information about the changes relative to the reference. So, I can write a small Python script to parse them and gather that info, but my first question is whether there is any tool in the software suite that can do this for me. I thought that maybe reformat.sh could do this, but its bhist option seems to output statistics based on read position for the R1 and R2 files.
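
    If nothing ready-made exists, a small Python script along these lines is what I had in mind. Note that the header layout assumed below is purely a guess on my part, so the regular expression would need to be adapted to whatever randomreads.sh actually writes into the read names:

    import re
    from collections import Counter

    # CAUTION: the header format here is hypothetical. Inspect a few read names
    # produced by your randomreads.sh version and adjust the pattern accordingly.
    # I assume SNPs are encoded in the name as e.g. "S100A>T"
    # (substitution at reference position 100, A changed to T).
    snp_pattern = re.compile(r"S(\d+)([ACGTN])>([ACGTN])")

    counts = Counter()  # (position, ref_base, alt_base) -> number of reads
    with open("Sample1_R1.fastq") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 0:  # every 4th line is a read header
                for pos, ref, alt in snp_pattern.findall(line):
                    counts[(int(pos), ref, alt)] += 1

    for (pos, ref, alt), n in sorted(counts.items()):
        print(pos, ref, alt, n)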

    Then my other question is about bbmap.sh. I tried to map the generated fastq files to my reference, using the command:

    ./bbmap.sh ref=reference.fa local=t in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
    but I got a very low average coverage per position (225, whereas I expected ~10,000 based on the input) and a very low mapping percentage of 2.25%.
    The reported variant rates are also very high:
    Read 1 data:      	pct reads	num reads 	pct bases	   num bases
    mapped:          	  2.2502% 	    22429 	  2.2502% 	     3364350
    unambiguous:     	  2.2502% 	    22429 	  2.2502% 	     3364350
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.0000% 	        0 	  0.0000% 	           0
    perfect best site:	  0.2408% 	     2400 	  0.2408% 	      360000
    semiperfect site:	  0.2408% 	     2400 	  0.2408% 	      360000
    rescued:         	  0.4520% 	     4505
    Match Rate:      	      NA 	       NA 	 95.1746% 	     3202707
    Error Rate:      	 86.5130% 	    19404 	  4.1666% 	      140208
    Sub Rate:        	 60.4441% 	    13557 	  0.8741% 	       29415
    Del Rate:        	  0.1338% 	       30 	  0.0219% 	         736
    Ins Rate:        	 65.3707% 	    14662 	  3.2706% 	      110057
    N Rate:          	 14.2583% 	     3198 	  0.6589% 	       22171
    Read 2 data:      	pct reads	num reads 	pct bases	   num bases
    mapped:          	  2.2421% 	    22348 	  2.2421% 	     3352200
    unambiguous:     	  2.2421% 	    22348 	  2.2421% 	     3352200
    ambiguous:       	  0.0000% 	        0 	  0.0000% 	           0
    low-Q discards:  	  0.2039% 	     2032 	  0.2039% 	      304800
    perfect best site:	  0.4693% 	     4678 	  0.4693% 	      701700
    semiperfect site:	  0.4693% 	     4678 	  0.4693% 	      701700
    rescued:         	  0.3604% 	     3592
    Match Rate:      	      NA 	       NA 	 96.8958% 	     3248505
    Error Rate:      	 76.8301% 	    17170 	  2.7130% 	       90955
    Sub Rate:        	 60.3454% 	    13486 	  0.8803% 	       29513
    Del Rate:        	  0.1253% 	       28 	  0.0112% 	         376
    Ins Rate:        	 39.6680% 	     8865 	  1.8215% 	       61066
    N Rate:          	 10.8198% 	     2418 	  0.3912% 	       13116
    but that seems contradictory given my input. So, my second question is: what am I missing here? I was aiming for files with only a few events (substitutions, indels), but the end result seems quite extreme.

    I also ran a second test where I lowered the rates drastically (snprate=0.1 insrate=0.01 delrate=0.01), but that yielded an even poorer mapping percentage of 0.2%.

    I appreciate any pointers, thanks for reading!

  • #2
    It may be best to use "mutate.sh" (look at its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

    I think using "local=t" is causing the alignment issues. It is meant for local alignments, where you expect errors at the ends of reads.
    Last edited by GenoMax; 04-30-2020, 05:49 AM.


    • #3
      Originally posted by GenoMax
      It may be best to use "mutate.sh" (look at its in-line help) to introduce the mutations after you generate the reads with randomreads.sh. You will get a VCF file of the changes.

      I think using "local=t" is causing the alignment issues. It is meant for local alignments, where you expect errors at the ends of reads.
      Thank you for your ultra-fast reply, GenoMax! I have a question about your comment. Did you mean to say to use "mutate.sh" before generating the reads? I actually tried that approach:
      1) First mutate my original sequence:
      ./mutate.sh in=reference.fa out=reference_Mut.fa vcf=Variants.vcf subrate=0.01 indelrate=0.005 maxindel=3 overwrite=t
      The vcf file indeed looks super!
      2) Then generate reads based on the new sequence, without any additional flags (only min quality):
      ./randomreads.sh ref=reference_Mut.fa out1=Sample1_R1.fastq out2=Sample1_R2.fastq coverage=10000 paired=t minq=12 -Xmx10g
      3) Map the reads to the mutated reference (excluding the "local" option):
      ./bbmap.sh ref=reference_Mut.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
      This indeed yields a 99.905% mapping rate, which makes sense given some quality filtering, I assume.
      4) Map the reads to the original reference:
      ./bbmap.sh ref=reference.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq covstats=Covstats.cov
      This yielded a 99.875% mapping rate, which also makes sense.

      My follow-up question then is can I safely assume that the fastq files I'm generating, using the method above, contain the variants at a rate of 100%?
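
      One way I am thinking of checking that assumption is to map the reads back to the un-mutated reference, call variants on the alignment (e.g. with callvariants.sh from the same suite), and compare the allele fractions in the resulting VCF against Variants.vcf. The flags below are from memory, so the in-line help is worth double-checking:
      ./bbmap.sh ref=reference.fa in1=Sample1_R1.fastq in2=Sample1_R2.fastq out=Mapped.sam
      ./callvariants.sh in=Mapped.sam ref=reference.fa vcf=Called.vcf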

      If that assumption is correct, what would be the best way to regulate the variation rate? I'm thinking along the lines of:
      1) Generate perfect reads from un-mutated reference.
      2) Generate perfect reads from mutated sequence.
      3) Mix the file sets in various percentages to get the desired effect and make a new set of files, for example along these lines:
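      (File names here are just placeholders for the read sets from steps 1 and 2, and samplerate is the reformat.sh option I would use for the subsampling; worth confirming against its help.)
      ./reformat.sh in1=Ref_R1.fastq in2=Ref_R2.fastq out1=Ref80_R1.fastq out2=Ref80_R2.fastq samplerate=0.8
      ./reformat.sh in1=Mut_R1.fastq in2=Mut_R2.fastq out1=Mut20_R1.fastq out2=Mut20_R2.fastq samplerate=0.2
      cat Ref80_R1.fastq Mut20_R1.fastq > Mix_R1.fastq
      cat Ref80_R2.fastq Mut20_R2.fastq > Mix_R2.fastq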
      Is that correct logic, or is there already a way to do that?

      Thank you in advance, I really appreciate it!


      • #4
        Perhaps I am missing a subtle point, but since you can control the mutation type/rates with mutate.sh, would it not be better to go with just that data (#2 in the list above)? If you mix reads from the mutated and un-mutated references, can you be sure that you will maintain the fraction of mutations at the intended level in the new pool?


        • #5
          Hi GenoMax! Yes, you are right, the fraction of mutations may change in the new pool, but, based on the mixing percentages in my scenario, I'll be able to control the level of mutations per position. So, say at position 100, I go from 100% MUT to 20% MUT if I mix my original-reference fastq and mutated-reference fastq using 80/20 percent of reads respectively.
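
          Just to spell out the arithmetic I have in mind (a toy check, assuming every read from the mutated set carries the variant at that position):

          # Expected ALT fraction at a position after pooling the two read sets.
          frac_mut = 0.2      # fraction of reads taken from the mutated-reference set
          alt_in_mut = 1.0    # ALT fraction at that position within the mutated set
          alt_in_ref = 0.0    # ALT fraction within the un-mutated set
          expected_alt = frac_mut * alt_in_mut + (1 - frac_mut) * alt_in_ref
          print(expected_alt)  # 0.2 -> ~20% MUT, 80% REF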


          • #6
            You could try it out and see if it works (I have a hunch it won't be perfect).

            If that does not work acceptably, then I suggest you run mutate.sh multiple times with new rate values and create new datasets.
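
            For example, re-using the mutate.sh call from above with a range of rates (the loop is just a sketch and the values are illustrative):
            for r in 0.001 0.005 0.01 0.05; do
                ./mutate.sh in=reference.fa out=reference_Mut_${r}.fa vcf=Variants_${r}.vcf subrate=${r} indelrate=0.005 maxindel=3 overwrite=t
            done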


            • #7
              Thank you, I'll try it and report back!

