bbmap inflating read count and not finding one sequence after header


  • bbmap inflating read count and not finding one sequence after header

    I'm using bbmap to map transcriptome reads to a set of target loci. I'm working with 12 samples with paired-end reads and 1 sample with single-end reads, all from NCBI's SRA. bbmap reads the paired-end data and completes those analyses correctly. It's the one sample with single-end reads that's causing two issues:

    1. The first issue is that the input fasta file only has 288915 reads. I have confirmed this with grep ">" file.fasta | wc -l. However, bbmap reports "Reads used: 308655". I have no idea why the read count is inflated; again, this is not an issue with the paired-end data.

    2. bbmap fails to recognize a sequence immediately after the fasta header: "Warning: A fasta header with no sequence was encountered: SRR768524.9631". The sequence in question is formatted exactly like all the others in the file, and I have checked the EOL characters, which are fine. Below is a snippet of the fasta file; the 3rd sequence is the one bbmap complains about.

    >SRR768524.9629
    CTATCAAAGGGAAATCCCGCTGGCCTGCTATCATACAGTCTTGAACCTCCACATGCAATATCAGCTGAATCTCCAACGTGGCCGCTGACAAATGGAGTAACTACGACTGCCAAAACGAAAGCGCGACCTCCTTTCCATCCCATGGGTAATTGGAGTCTTTGAGGAAATCCACATCGGCCAGCCTCTGAATAATGGAATGGTTCCTTCTGGTTGAGTGCACTGTTTATTGAAGTGTAAAGAGACCTGAATCCTTCTTGGTCATGGATAAAGAGGGGAGAATCTGTTGAACTTCTTTCCCACACATTAACACCCTCAGTCAATTTTACTGGGAAACGGTCAATTTCAAATAAGCTAGTTTTGCTTGGTCATAAGTGAGGAAATGTTCATCAGAATCATATTTCGGTCCAATGAATACTCTCACAATGGCATCATCAGCTTTT
    >SRR768524.9630
    ATAATGCAATTATAGATTGTTGGAGTGCAGGTAAAGCTACCACTGTTATGATTAAAGATAATCCAAAGGTTGAAATTCTTGATGTAGAAGATGTTAAGGTTGGAAAGATAAGACAATTTTGTGAGTTGGACTTGGCATTGAACATGGCCTTACGAAAGTATTTTGGTAGTGTGTTTGATAAAATGGCAGTTACATCTAATGAAACGCCGTGGAAAGTTGCTTGGAATCCATATTTTATGCCTCATCACATCGTGGCGATAGAGAACGACAAGTACGATGTCTTTTGTATAGATGTGAAAAGAATGGATAAGAATTTACCAGTCCAATTCACTGAGATATTGTGT
    >SRR768524.9631
    TGTTACTGGGTAGGGCTGTGGCACTGGGACCTTGACTGGATAGGGTACATGCTTCTCGACTGGTACTGGGTATGGCTTGGGAATGTGGACAGGGTATGGCCTATCGACTGGGACCTTCACGGGATAGGGAACTTTCTTCTCGATATGGACTGGATAAGGTACTGGTACCTCCTTGTGGATGGTGATAGCTTTAATGTGGCCGTGTTCCTCATGTCCTCCGAGTTCATAGCCACCTCCATATCCGTATCCTCCTAAGTCATGTCCACCTCCATATCCTCCATGTCCACCGCCGTATCCTCCAAGCTCATAACCACCTCCATATCCTAAGCCAAGAAGTCCTCGTTTCTCCTGCTTCTTGTCATCTGTTGGTGCCGCTGCTTTGTCGGTCTTGGATTCGGCCTTCTTCTCTTCAGCTGATGCTGTGGCAAGCAGTGCCAACAGCCCTACCCACAGTACCTTGGATTGCATTGTTGAGTCGTGGTGTGGTCGGCGTCTCCCAA
    >SRR768524.9632
    AATTCCCAACGACCAAGTATCTGAACATGAGTGGATCAATGCTGAGATCCTGCCTGCTACTGCTATTCCTAGCTTACTGTGTGTCCTGCTATAGAAGTCATGTTCCTAGAGGCGGGAGTTACTCTCTACCGCCTGGAGTTAATCCAACATTCCCAGGAAGGAACCAAGGACTGCCTCCGGCTTATCATGGAAAATTCAAGAGATCACTGGAAGGAGGTTTAGAACCTGAAGATGGTGGTGTCCTTGCAGTTGATGAACCTGCTGATTATCTGAAAGTCAAAAGGTCAGTGGAAGATGTTGAAGGTGAATTCCTTGTGAACGAAGAACCTCAAGAATTTGAGACACTGAGAGCGCGCCGTGACGTCAGAATAATTCATCCAACT
    >SRR768524.9633
    TTCCAAACTGTCGATTCATGATGTACACAATACCAAAAAAGGCAAATAAGAAATAAAAGT
    I'm at a loss as to how to resolve these issues. I'd appreciate any help getting this to run properly on this last fasta file.
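
The checks described above (counting headers, looking for a header with no sequence) can be repeated independently of bbmap. The following is only a minimal sketch, not anything bbmap does internally: the function names `iter_fasta` and `summarize_fasta` are made up for illustration, and the `max_len=500` cutoff is an assumed, arbitrary threshold for flagging unusually long reads, which may be worth ruling out as a factor.

```python
import sys


def iter_fasta(lines):
    """Yield (header, sequence) pairs; sequences may span several lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def summarize_fasta(lines, max_len=500):
    """Count records, and list empty records and reads longer than max_len.

    max_len is an illustrative assumption, not a documented bbmap setting.
    """
    total = 0
    empty, long_reads = [], []
    for header, seq in iter_fasta(lines):
        total += 1
        if not seq:
            empty.append(header)
        elif len(seq) > max_len:
            long_reads.append((header, len(seq)))
    return {"records": total, "empty": empty, "long": long_reads}


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_fasta.py file.fasta
    with open(sys.argv[1]) as handle:
        summary = summarize_fasta(handle)
    print("records:", summary["records"])
    print("headers with no sequence:", summary["empty"])
    print("reads longer than cutoff:", len(summary["long"]))
```

If `records` matches the grep count and `empty` is an empty list, the file itself is probably well formed, and the difference is more likely to come from how the reads are being processed.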

  • #2
    Is there a specific reason you converted these reads to fasta format? If this is data from SRA then you should be able to map the fastq reads directly.

    You may be getting secondary alignments and that may be the reason why your read count seems inflated.
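
One way to test the secondary-alignment idea is to tally the SAM output by flag. This is a rough sketch that assumes the output was written as uncompressed SAM; `count_alignments` is a hypothetical helper name. It relies only on the standard SAM flag bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary) and says nothing about how bbmap computes its own "Reads used" figure.

```python
import sys


def count_alignments(sam_lines):
    """Tally SAM records by type and count distinct read names."""
    counts = {"primary": 0, "secondary": 0, "supplementary": 0, "unmapped": 0}
    names = set()
    for line in sam_lines:
        # Skip header lines and blank lines.
        if line.startswith("@") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        names.add(name)
        if flag & 0x4:
            counts["unmapped"] += 1
        elif flag & 0x100:
            counts["secondary"] += 1
        elif flag & 0x800:
            counts["supplementary"] += 1
        else:
            counts["primary"] += 1
    counts["distinct_reads"] = len(names)
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python count_sam.py mapped.sam
    with open(sys.argv[1]) as handle:
        for key, value in count_alignments(handle).items():
            print(key, value)
```

For single-end data, `distinct_reads` should equal the number of input reads. A total record count well above that, with a nonzero `secondary` tally, would be consistent with secondary alignments being reported.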
