  • bbimber
    Member
    • Jan 2010
    • 12

    Questions about manipulating SAM format

    We are building an analysis pipeline for 454 sequence based genotyping. We compare each 454 read against a reference cDNA library and want to ask which sequences the read aligns against, and how many mismatches occur for each hit.

    I have currently approached this by aligning the query reads against a ref library using various aligners, including bowtie and blat. The result is converted to SAM using perl, unless the aligner already outputs SAM.

    From this SAM file, I am interested in extracting which reference sequences each read aligned against, and the SNPs associated with these alignments.

    I have a few questions about manipulating SAM data:

    1. From an alignment in SAM format, is there an established approach to annotate each alignment with the number of mismatches, their positions and/or identities (e.g. G354C)? The normal CIGAR string does not distinguish match/mismatch, although the doc says this is being considered.

    2. Pileup does some of the SNP detection discussed in point 1, but I did not see any information in the pileup report about which specific query reads are associated with each SNP. If this information existed, it would be simple to use the result of pileup to annotate the SAM file. Can the read name(s) associated with each SNP be obtained from pileup or a similar tool?

    3. Thus far, I have been aligning the experimental reads against a ref library of cDNA sequences. Tools like pileup will output reports in which each reference sequence is a 'chromosome' and list the reads that align to it. For my purposes, I am more interested in the reverse of this: which ref sequences align against each experimental read. Is there an established approach to transpose the ref/query sequences in a SAM file? It doesn't seem like it would be very hard in perl, but I thought I'd check to see if something already exists. Theoretically, when I perform the alignment I could use my experimental data as the reference and align my cDNA library against it.

    Thank you for any help.
  • nilshomer
    Nils Homer
    • Nov 2008
    • 1283

    #2
    1. The MD tag, if specified, can help you identify the mismatches without using the reference.
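As a rough illustration of point 1, the sketch below walks an MD tag alongside the CIGAR string and read sequence and reports each mismatch in the G354C style asked about (reference base, 1-based reference position, read base). The function name `mismatches_from_md` is made up for this example, and it assumes the alignment has no `N` (skipped region) operations and that SEQ is stored as aligned in the SAM record.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
# MD grammar: a run of matches, a deletion (^ACG), or a single mismatched ref base
MD_RE = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def mismatches_from_md(pos, cigar, seq, md):
    """Return mismatches as strings like 'G354C'.

    pos   -- 1-based leftmost reference position (SAM column 4)
    cigar -- CIGAR string (SAM column 6); assumed to contain no N operations
    seq   -- read sequence (SAM column 10)
    md    -- value of the MD:Z: optional tag
    """
    # Keep only read bases aligned to a reference base (M, =, X).
    # Insertions and soft clips consume read bases but do not appear in MD.
    aligned = []
    i = 0
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":
            aligned.append(seq[i:i + n])
            i += n
        elif op in "IS":
            i += n
    aligned = "".join(aligned)

    out = []
    ref_pos = pos  # reference coordinate of the next base to examine
    read_i = 0     # index into the aligned read bases
    for num, deletion, mismatch in MD_RE.findall(md):
        if num:
            ref_pos += int(num)
            read_i += int(num)
        elif deletion:
            # Deleted reference bases advance the reference only.
            ref_pos += len(deletion)
        else:
            out.append(f"{mismatch.upper()}{ref_pos}{aligned[read_i].upper()}")
            ref_pos += 1
            read_i += 1
    return out


if __name__ == "__main__":
    print(mismatches_from_md(350, "10M", "AAAACAAAAA", "4G5"))
```

The number of mismatches for an alignment is then just the length of the returned list. If the aligner does not write MD, it has to be computed from the reference first.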

    2. You can annotate SAM entries that span or include the SNP. The pileup does encode all the reads (see the last few columns), but they cannot be exactly identified by read name.
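One way to find the SAM entries that span a SNP is to compute each alignment's reference span from POS and the CIGAR string and keep the reads that cover the position. This is only a sketch: `reference_span` and `reads_spanning` are hypothetical helper names, and the input is assumed to be tab-delimited SAM text lines.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_span(pos, cigar):
    """Return the (start, end) 1-based inclusive reference interval of an alignment."""
    # Only M, D, N, = and X consume reference bases.
    ref_len = sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")
    return pos, pos + ref_len - 1


def reads_spanning(sam_lines, ref_name, snp_pos):
    """Return names of reads aligned to ref_name whose span covers snp_pos."""
    names = []
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname = fields[0], int(fields[1]), fields[2]
        pos, cigar = int(fields[3]), fields[5]
        # Skip unmapped reads (flag 0x4), other references, and missing CIGARs.
        if flag & 0x4 or rname != ref_name or cigar == "*":
            continue
        start, end = reference_span(pos, cigar)
        if start <= snp_pos <= end:
            names.append(qname)
    return names
```

For example, `reads_spanning(open("aln.sam"), "seq1", 272)` would list the reads covering position 272 of seq1, where `aln.sam` is a placeholder file name.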

    3. Depending on which tool you used, there may be only one mapping per read. Otherwise, others may be able to suggest tools.
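If the aligner does report more than one hit per read, the "transpose" in point 3 amounts to grouping SAM records by read name rather than by reference. A minimal sketch, assuming tab-delimited SAM text and using the hypothetical function name `refs_per_read`; it also picks up the optional `NM:i:` tag (edit distance, which counts indels as well as mismatches) when present:

```python
from collections import defaultdict


def refs_per_read(sam_lines):
    """Map each read name to a list of (reference name, position, NM) hits."""
    hits = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        qname, flag, rname, pos = fields[0], int(fields[1]), fields[2], int(fields[3])
        if flag & 0x4:  # unmapped read
            continue
        nm = None
        for tag in fields[11:]:  # optional fields start at column 12
            if tag.startswith("NM:i:"):
                nm = int(tag[5:])
        hits[qname].append((rname, pos, nm))
    return dict(hits)
```

This keeps the cDNA library as the reference, so there is no need to redo the alignment with query and reference swapped.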


    • bbimber
      Member
      • Jan 2010
      • 12

      #3
      Thanks for the reply. One more question:

      An example pileup output might look like:

      seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&

      If I wanted to use this data to annotate a SAM file, I assume I simply find any lines where seq1 is the ref that span 272. What is the order of the characters ,.$.....,,.,.,...,,,.,..^+. relative to the lines of the SAM file? Will they match the order of whatever file was used to generate the pileup report (i.e. the second character refers to the second line in the SAM where the ref sequence is seq1)? If true, it seems easy enough to find the associated read names and annotate them.

      Are there pre-existing tools or scripts that work backwards from a pileup output? Is it common to work backwards from pileup output, or does the pileup usually contain all the information people need?


      • nilshomer
        Nils Homer
        • Nov 2008
        • 1283

        #4
        I don't know the order; you will have to consult the source code. Nonetheless, if two reads are aligned identically, you won't be able to tell them apart. It is not common to work backwards from the pileup alone, though going back into the SAM file is often necessary.
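For reference, the read-base column in the example can at least be decoded into one call per read. The sketch below assumes the usual samtools pileup conventions (`.` and `,` for a reference match on the forward and reverse strand, `^` followed by a mapping-quality character at a read start, `$` at a read end, and `+N`/`-N` followed by N bases for indels). `parse_pileup_bases` is a made-up name, and nothing here says which SAM line each call came from.

```python
import re


def parse_pileup_bases(bases):
    """Split a pileup read-base column into one base call per read."""
    calls = []
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read start: the next character encodes mapping quality, not a base.
            i += 2
        elif c == "$":
            # Read end marker attached to the previous call.
            i += 1
        elif c in "+-":
            # Indel: sign, decimal length, then that many inserted/deleted bases.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(digits) + int(digits)
        else:
            calls.append(c)
            i += 1
    return calls


if __name__ == "__main__":
    example = ",.$.....,,.,.,...,,,.,..^+."
    calls = parse_pileup_bases(example)
    print(len(calls), calls.count("."), calls.count(","))
```

On the example line this yields 24 calls, which agrees with the depth in the fourth column: 15 forward-strand matches and 9 reverse-strand matches.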
