
  • Human whole-genome sequencing data analysis with low mapping rate

    Hi, everyone
    I received human WGS data for four samples a few days ago and want to identify variants as well as CNVs. After the QC and mapping steps of my analysis workflow, I found that the mapping rate of every sample is very low, as listed below:

    samples mapping rate

    sample1_H04C3ALXX_L4 57.62%
    sample1_H04C3ALXX_L5 8.67%
    sample1_H04C3ALXX_L6 13.68%
    sample1_H04C3ALXX_L7 26.78%
    sample1_H04C3ALXX_L8 28.19%

    sample2_H04C3ALXX_L1 2.49%
    sample2_H04C3ALXX_L2 2.17%
    sample2_H04C3ALXX_L3 31.80%
    sample2_H04C3ALXX_L4 32.57%
    sample2_H04C3ALXX_L5 31.81%
    sample2_H04C3ALXX_L6 31.63%
    sample2_H04C3ALXX_L7 31.87%
    sample2_H04C3ALXX_L8 31.81%

    sample3_H04B1ALXX_L3 4.36%
    sample3_H04B1ALXX_L4 59.36%
    sample3_H04B1ALXX_L5 2.49%
    sample3_H04B1ALXX_L6 3.21%

    sample4_H04C3ALXX_L5 27.06%
    sample4_H04C3ALXX_L6 26.67%
    sample4_H04C3ALXX_L7 27.52%
    sample4_H04C3ALXX_L8 27.79%
    sample4_H04C3ALXX_L1 14.82%
    sample4_H04C3ALXX_L2 13.96%
    sample4_H04C3ALXX_L3 24.75%
    sample4_H04C3ALXX_L4 24.75%

    The mapping software was BWA, version 0.7.10-r789.
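The thread does not say how the rates above were calculated. As a minimal sketch, assuming "mapping rate" means the share of primary SAM records whose unmapped flag (0x4) is clear, it could be computed from the aligner's SAM output with awk. The function name `mapping_rate` is hypothetical and not part of BWA or any other tool.

```shell
# Hypothetical helper: percentage of mapped reads in a SAM stream on stdin.
# Assumption: "mapping rate" = primary records with the 0x4 (unmapped) bit
# clear, divided by all primary records. Secondary (0x100) and supplementary
# (0x800) records are skipped so that each read is counted once.
mapping_rate() {
  awk -F '\t' '
    /^@/ { next }                              # skip SAM header lines
    {
      flag = $2
      if (int(flag / 256) % 2 == 1) next       # secondary alignment
      if (int(flag / 2048) % 2 == 1) next      # supplementary alignment
      total++
      if (int(flag / 4) % 2 == 0) mapped++     # unmapped bit is not set
    }
    END {
      if (total == 0) { print "0.00"; exit }
      printf "%.2f\n", 100 * mapped / total
    }'
}
```

Usage would be `mapping_rate < file.sam`, run once per lane. A different definition (for example, counting only properly paired reads) would give a different number, so it is worth stating which one is used when reporting rates.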

    To figure out why the mapping rate was so low, I randomly picked 1000 unmapped reads and ran BLAST against the nt database. I kept the best hit for each read, and most of the aligned sequences are human clone fragments such as:

    Human DNA sequence from clone RP3-376K6, complete sequence
    Homo sapiens Chromosome 16 BAC clone CIT987SK-A-926E7, complete sequence
    Homo sapiens chromosome 18, clone RP11-529J17, complete sequence
    Homo sapiens chromosome 18, clone CTD-2504O24, complete sequence

    So my questions are:
    What are these sequences (CDS or genomic sequence)?
    Are my samples contaminated?

    What causes the extremely low mapping rate in these lanes:
    sample2_H04C3ALXX_L1 2.49%
    sample2_H04C3ALXX_L2 2.17%
    sample3_H04B1ALXX_L5 2.49%
    sample3_H04B1ALXX_L6 3.21%
    Is it the samples or the software?

    Any comment will be greatly appreciated, thank you very much!
    Last edited by zinky; 11-05-2014, 05:43 AM.

  • #2
    It would help if you run FastQC and post the output, as well as your QC steps, and mapping command line. As it stands, the reason could be anything.


    • #3
      I used NGS QC Toolkit for QC, and the result showed that more than 80% of the reads passed as high-quality filtered reads, so I went on to the mapping step. My mapping command lines are:
      bwa aln -t 5 genome.fa file_1.fastq > file_1.fastq.sai
      bwa aln -t 5 genome.fa file_2.fastq > file_2.fastq.sai
      bwa sampe -A -a 600 -r '@RG\tID:noID\tPL:ILLUMINA\tLB:noLB\tSM:"file"' genome file_1.fastq.sai file_2.fastq.sai file_1.fastq file_2.fastq > file.sam


      • #4
        You may have short inserts and thus high adapter contamination. You can get an insert size distribution with BBMerge, like this: bbmerge.sh in1=file_1.fastq in2=file_2.fastq ihist=ihist.txt

        If a lot of reads have insert sizes shorter than read length, that will indicate adapter contamination which needs to be removed (e.g. with BBDuk).

        Also, I don't recommend bwa aln, particularly in recent versions of bwa. You will achieve higher speed and accuracy with bwa mem or BBMap, which can also generate some useful diagnostic plots (such as mhist).

        But I still recommend you post FastQC results.
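To turn the insert-size check into a single number, here is a minimal sketch that reads an insert-size histogram and reports the fraction of pairs whose insert is shorter than the read length. It assumes the histogram is a two-column, tab-separated file (insert size, count) with header lines starting with `#`. The function name `short_insert_fraction` is hypothetical, and the read length must be supplied by the user since the thread does not state it.

```shell
# Hypothetical helper: fraction of read pairs whose insert size is below the
# read length, computed from an insert-size histogram file.
# Assumed layout: lines starting with "#" are headers; every other line is
# "<insert_size><TAB><count>".
# Arguments: $1 = histogram file, $2 = read length in bases.
short_insert_fraction() {
  awk -F '\t' -v len="$2" '
    /^#/ || NF < 2 { next }                    # skip header and blank lines
    {
      total += $2
      if ($1 + 0 < len + 0) n_short += $2      # insert shorter than the read
    }
    END {
      if (total == 0) { print "0.0000"; exit }
      printf "%.4f\n", n_short / total
    }' "$1"
}
```

For example, `short_insert_fraction ihist.txt 150` would print the fraction for 150-base reads (150 is only a placeholder here). A large value would point toward adapter read-through as a contributor to the low mapping rate, while a value near zero would suggest looking elsewhere.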


        • #5
          Thanks for your suggestion. I asked the sequencing staff and was told the insert size is 350 bp, so I set the -a parameter to 600 to tolerate larger insert sizes, aiming to improve the mapping rate. Before that, I also used FastQC to assess read quality. The QC report was good, which suggested no index contamination (the k-mer distribution and overrepresented sequences checks were both green) and high sequencing quality.
          PS: I don't know why my pictures cannot be uploaded here.

          So I suspect the sample was mixed with DNA of non-human origin, as I mentioned above (actually, I don't know what they are).
          Also, I will try the tools you suggested. Thanks, Brian.

