Selecting the best alignment BAM file


  • Selecting the best alignment BAM file

    Hi,

    I have a paired-end (PE) dataset with 300 bp inserts from an Illumina MiSeq. I aligned the raw data with BWA-MEM. The mapping statistics from samtools flagstat are below.

    5541008 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    76008 + 0 supplementary
    0 + 0 duplicates
    5413610 + 0 mapped (97.70% : N/A)
    5465000 + 0 paired in sequencing
    2732500 + 0 read1
    2732500 + 0 read2
    5266140 + 0 properly paired (96.36% : N/A)
    5319406 + 0 with itself and mate mapped
    18196 + 0 singletons (0.33% : N/A)
    32368 + 0 with mate mapped to a different chr
    8821 + 0 with mate mapped to a different chr (mapQ>=5)

    I also ran Trimmomatic on the same dataset: ILLUMINACLIP to remove adapter sequences, a 4:10 sliding window, removal of leading and trailing bases with quality below 3, and removal of reads shorter than 39 bp. I aligned the trimmed set with BWA-MEM and got the results below.

    5529752 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 secondary
    65642 + 0 supplementary
    0 + 0 duplicates
    5396698 + 0 mapped (97.59% : N/A)
    5464110 + 0 paired in sequencing
    2732055 + 0 read1
    2732055 + 0 read2
    5263982 + 0 properly paired (96.34% : N/A)
    5308488 + 0 with itself and mate mapped
    22568 + 0 singletons (0.41% : N/A)
    23856 + 0 with mate mapped to a different chr
    4865 + 0 with mate mapped to a different chr (mapQ>=5)

    1) Can I use this information to select the best alignment based on the mapped percentage? The raw data gave 97.70% mapping, which is higher than the trimmed data (97.59%). Can I therefore select the BAM from the raw data as the best?

    2) I used "samtools view -c -f 3 data.bam" to count the properly paired reads, but for both datasets the value differs from the one flagstat reports. I checked other categories, such as "with itself and mate mapped", and they also gave different results. What could be the reason?

    Appreciate your answers.
    Thanks in advance.

    Regards
    Rangika

  • #2
    Hi Rangika,

    Second question first:
    You need to be aware that samtools flagstat produces statistics on alignments, not on reads. A read can align multiple times and will then be counted multiple times in the flagstat output. You can check your alignment file with, e.g., bam_stat.py from the RSeQC tools.
    Furthermore, I'd check the read files with FastQC before and after trimming.
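
    A minimal Python sketch of why the two counts can differ, assuming the discrepancy comes from supplementary records (the flagstat outputs above list 76008 and 65642 of them). "samtools view -c -f 3" counts every record that has the paired and proper-pair bits set, supplementary records included, whereas flagstat versions that print a separate "supplementary" line appear to report "properly paired" for primary records only. The three flag values are hypothetical and chosen for illustration; the bit values are those of the SAM specification.

```python
# SAM flag bits, as defined in the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def matches(flag, require=0, exclude=0):
    """Mimic `samtools view -f require -F exclude` for a single flag value."""
    return (flag & require) == require and (flag & exclude) == 0


# Hypothetical flags for three records of one read pair:
# read1 primary (99), read2 primary (147), read2 supplementary (2048 + 147).
flags = [99, 147, 2195]

# Equivalent of `-f 3`: all records with both bits set.
with_f3 = sum(matches(f, require=PAIRED | PROPER_PAIR) for f in flags)

# Same filter, but ignoring secondary and supplementary records.
primary_only = sum(
    matches(f, require=PAIRED | PROPER_PAIR, exclude=SECONDARY | SUPPLEMENTARY)
    for f in flags
)

print(with_f3, primary_only)
```

    If that is the cause, adding "-F 0x900" to the samtools view command (to exclude secondary and supplementary records) should bring the count closer to the flagstat value. Treat this as something to verify on your own files.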

    So:
    1) I'd compare several different datasets before deciding which way to go. Also, I would not rely on the %mapped from samtools flagstat.
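
    To make the caveat about %mapped concrete, here is a small Python sketch that recomputes the mapping rate per read instead of per alignment record, using only the flagstat lines posted above. The function names are made up for this example, and the calculation assumes that every secondary and supplementary record is counted as mapped by flagstat.

```python
import re

# Subset of the flagstat lines posted above (the QC-passed count comes first).
RAW = """\
5541008 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
76008 + 0 supplementary
5413610 + 0 mapped (97.70% : N/A)
"""
TRIMMED = """\
5529752 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
65642 + 0 supplementary
5396698 + 0 mapped (97.59% : N/A)
"""


def parse_flagstat(text):
    """Return {category: QC-passed count} from flagstat text output."""
    counts = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\d+) \+ \d+ (.*)$", line)
        if m:
            # Drop only a trailing percentage such as " (97.70% : N/A)".
            key = re.sub(r" \(\d.*%.*\)$", "", m.group(2))
            counts[key] = int(m.group(1))
    return counts


def primary_mapped_rate(counts):
    """Fraction of reads (primary records) that are mapped."""
    extra = counts["secondary"] + counts["supplementary"]
    total = counts["in total (QC-passed reads + QC-failed reads)"] - extra
    return (counts["mapped"] - extra) / total


for label, text in (("raw", RAW), ("trimmed", TRIMMED)):
    rate = primary_mapped_rate(parse_flagstat(text))
    print(label, f"{100 * rate:.2f}%")
```

    With the numbers from this thread, the per-read rates come out at about 97.67% (raw) and 97.56% (trimmed). The gap stays around 0.1 percentage points, which is too small to pick a "best" BAM on this metric alone.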

    Cheers,
    Michael



    • #3
      Thank you Michael. My dataset is DNA-seq. Can I use the RSeQC tools to check alignments for DNA data as well? Do you think the RSeQC statistics would lead to a better BAM selection?

      I'd appreciate it if you could clarify this a bit more, since I'm new to this.

      Regards
      Rangika



      • #4
        I suggested bam_stat.py because it also works for DNA-seq alignments (hopefully you won't see any spliced reads).
        You can also have a look at the QC metrics from Picard tools, or have a look at GATK.
        Or you can extract the aligned reads (samtools view) and count, e.g., how often each read is aligned. Without trimming you might get a high mapping rate from samtools flagstat, but you don't know how many reads were aligned with high confidence to a single position or to a few positions.
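
        A rough Python sketch of that last idea: it takes SAM text lines (as printed by samtools view), counts alignment records per read end, and counts primary alignments at or above a MAPQ cutoff. The function name, the cutoff of 20, and the demo records are assumptions made for illustration; they are not values from this thread.

```python
from collections import Counter


def summarize_sam(lines, min_mapq=20):
    """Count records per read end and confidently placed primary alignments."""
    records_per_read = Counter()
    confident = 0
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag, mapq = fields[0], int(fields[1]), int(fields[4])
        if flag & 0x4:  # read is unmapped
            continue
        # Distinguish read1 (0x40) from read2 (0x80) of the same pair.
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        records_per_read[(name, mate)] += 1
        # Primary record (neither secondary nor supplementary) with good MAPQ.
        if not flag & 0x900 and mapq >= min_mapq:
            confident += 1
    multi = sum(1 for n in records_per_read.values() if n > 1)
    return records_per_read, multi, confident


def record(name, flag, chrom, pos, mapq):
    """Build a minimal, hypothetical SAM line for the demo."""
    cols = [name, flag, chrom, pos, mapq, "4M", "*", 0, 0, "ACGT", "IIII"]
    return "\t".join(str(c) for c in cols)


demo = [
    record("r1", 99, "chr1", 100, 60),    # read1, primary
    record("r1", 147, "chr1", 300, 60),   # read2, primary
    record("r1", 2195, "chr2", 500, 3),   # read2, supplementary
    record("r2", 73, "chr1", 900, 0),     # read1, mapped with MAPQ 0
    record("r2", 133, "*", 0, 0),         # read2, unmapped
]

per_read, multi, confident = summarize_sam(demo)
print(len(per_read), "mapped read ends,", multi, "with several records,",
      confident, "primary with MAPQ >= 20")
```

        Running the same summary on the raw and the trimmed BAM would show whether one of them places more reads with high confidence, which %mapped alone cannot tell you.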

        Most library prep protocols also include a short section on how to handle the analysis. Additionally, there is a plethora of publications describing approaches to DNA-seq analysis.

        Cheers,
        Michael



        • #5
          Thank you Michael for your answer.

          Regards
          Sumudu
