We are building an analysis pipeline for 454 sequence based genotyping. We compare each 454 read against a reference cDNA library and want to ask which sequences the read aligns against, and how many mismatches occur for each hit.
I have currently approached this by aligning the query reads against a ref library various aligners including bowtie and blat. The result is converted to SAM using perl, unless the aligner already outputs SAM.
From this SAM file, I am interested in extracting which reference sequences each read aligned against, and the SNPs associated with these alignments.
I have a few questions about manipulating SAM data:
1. From an alignment in SAM format, is there an established approach to annotate each alignment with the number of mismatches, their positions and/or identities (ie. G354C)? The normal CIGAR string does not distinguish match/mismatch although the doc says this is being considered.
2. Pileup does some of SNP detection discussed in point 1, but I did not see any information in the pileup report about which specifies query reads are associated with each SNP. If this information existed, it would be simple to use the result of pileup to annotate the SAM file. Can the read name(s) associated with each SNP be obtained from pileup or a similar tool?
3. This far, I have been aligning the experimental reads against a ref library of cDNA sequences. Tools like pileup will output reports in which each reference sequence is a 'chromosome' and list the reads that align to it. For my purposes, I am more interested in the reverse of this: which ref sequences align against each experimental read. Is there an established approach to transpose the ref/query sequences in a SAM file? It doesnt seem like it would be very hard in perl, but i thought i'd check to see if something already exists. Theoretically when I perform the alignment I could use my experimental data as the reference and align my cDNA library against it.
Thank you for any help.
I have currently approached this by aligning the query reads against a ref library various aligners including bowtie and blat. The result is converted to SAM using perl, unless the aligner already outputs SAM.
From this SAM file, I am interested in extracting which reference sequences each read aligned against, and the SNPs associated with these alignments.
I have a few questions about manipulating SAM data:
1. From an alignment in SAM format, is there an established approach to annotate each alignment with the number of mismatches, their positions and/or identities (ie. G354C)? The normal CIGAR string does not distinguish match/mismatch although the doc says this is being considered.
2. Pileup does some of SNP detection discussed in point 1, but I did not see any information in the pileup report about which specifies query reads are associated with each SNP. If this information existed, it would be simple to use the result of pileup to annotate the SAM file. Can the read name(s) associated with each SNP be obtained from pileup or a similar tool?
3. This far, I have been aligning the experimental reads against a ref library of cDNA sequences. Tools like pileup will output reports in which each reference sequence is a 'chromosome' and list the reads that align to it. For my purposes, I am more interested in the reverse of this: which ref sequences align against each experimental read. Is there an established approach to transpose the ref/query sequences in a SAM file? It doesnt seem like it would be very hard in perl, but i thought i'd check to see if something already exists. Theoretically when I perform the alignment I could use my experimental data as the reference and align my cDNA library against it.
Thank you for any help.
Comment