Seqanswers Leaderboard Ad

**SNPsaurus** · 06-04-2015, 02:36 PM

use samtools mpileup to convert from a mapping of each read to a nucleotide by nucleotide report of the reference sequence and quality scores of each basecall at that reference position.

If you need the output in fastq that would not be the answer. If you want the information of fastq (basecall and quality) in a reference-oriented format, then it would be the answer!

**jmartin** · 06-04-2015, 02:51 PM

Wouldn't mpileup flatten my data into consensus REF and ALT alleles and just show me counts of the alleles at each position? I need to get the sequence from each read (the query) across the specific region of the reference to which it maps. I thought mpileup would give me a per reference-base VCF output, with alleles just counted.

Sorry if I'm being dense. Would mpileup let me get the read-based sequence across a specific position defined in terms of the reference that the read is mapped on? At the end I need 1 sequence per read that mapped, but just the piece of that read that mapped to a specific place in the reference.

**GenoMax** · 06-04-2015, 02:56 PM

@jmartin: Use the samtools view option posted by Devon in this thread: http://seqanswers.com/forums/showthread.php?t=39766

You can then use reformat.sh from BBMap to extract the reads from the SAM file. Reads may extend beyond the exact boundries you are setting.

Code:

$ reformat.sh in=region.sam out=reads.fq

**SNPsaurus** · 06-04-2015, 03:05 PM

The samtools view option from GenoMax will extract the reads. If you wanted to examine the basecalls across that region on a nucleotide by nucleotide basis, then the mpileup would be helpful, I think.

mpileup output is this:

chr1 276 G 22 ...T,,...,,...,,.. hhhhhhhdddhhhhh

This says at position 276 of chr1 there were 22 reads. Most matched the reference G in forward or reverse read orientation (. or ,) but one T basecall was also present. The last column gives the quality score of the basecalls.

**jmartin** · 06-04-2015, 03:08 PM

@GenoMax: In my case the issue is specifically that I need to produce, per query read, the read-based sequence that exactly covers a specific region in the reference. I am able to find the subset of reads that map to the region using samtools as you suggest.

My problem is that I need to only have the sequence within the boundaries that I've set (the read sequence mapping to the exact position of the feature of interest in my reference).

**jmartin** · 06-04-2015, 03:14 PM

@SNPsaurus: Is the 5th column of the output you illustrate (the "...T,,...,,...,,..") showing me in a consistent order, the state of each mapped read at that base in the reference? So if I have 1000 reads mapping onto a base, I would have 1000 characters in that string?

Assuming the order was consistent, I could use that to reconstruct each read. But I would need to know the id of the read.

That sounds close to what I need, but I believe that format would foil me when the read depth at a base position changes. The length of that "...T,,...,,...,,.." string would change when the number of reads covering a position change. And thus I couldn't assume that the 40th character in the 5th column string would always be from the same read.

**SNPsaurus** · 06-04-2015, 03:25 PM

Hmm, the mpileup specification has characters for read start and read end. I haven't had to worry about reconstructing the haplotype so I haven't investigated the behavior of how those are used.

You can cap the number of basecalls at a position, but also have it show all (I've done mpileups with read depths of 30,000 or more when looking for rare variants). You can also show only basecalls of a certain quality score but you wouldn't want to exclude any.

I would hope that without filtering, every start read symbol in a particular column would have an end read symbol in the same column (after accounting for these special characters) 100 bp away (or whatever the read length).

This still wouldn't get you back to the read ID, though.

**GenoMax** · 06-05-2015, 01:15 AM

Originally posted by jmartin View Post

@GenoMax: In my case the issue is specifically that I need to produce, per query read, the read-based sequence that exactly covers a specific region in the reference. I am able to find the subset of reads that map to the region using samtools as you suggest.

My problem is that I need to only have the sequence within the boundaries that I've set (the read sequence mapping to the exact position of the feature of interest in my reference).

If that is a hard requirement then that would almost certainly require custom scripting to cut the reads (extending outside the interval) down to the exact interval, if you want to retain fastq format. If you only need the sequence then perhaps doing a multiple sequence alignment with extracted reads (using fasta format) may suffice to get the exact region.

**jmartin** · 06-05-2015, 10:19 AM

@GenoMax: I'm afraid that a multiple alignment including all my reads that pile up over the feature + my reference will stretch and pull the individual read sequences to make them fit the new alignment. Thats a problem for me because my end goal is to look at the state of 106 variant loci within each read along the place they align to the feature in the reference. I am afraid that a multiple alignment would deform the alignments I have currently, which have been highly refined using GATK's indel realigner.

**GenoMax** · 06-05-2015, 11:48 AM

I don't know a way of keeping the realignment intact in addition to respecting those boundaries while doing read extraction. Perhaps someone else (who does this sort of thing regularly) will have other ideas.

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 18 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

Need to extract fastq of specific region in bam file

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News