**Bukowski** · 01-16-2012, 06:11 AM

I'm not quite sure what you're asking. BAM files contain reads (mapped and potentially unmapped) aligned to a reference sequence. You indicate you already have the reference, so is the question "How can I extract reads from individual chromosomes from a BAM file?" If not, then this answer might be of no use.

You could just subset the BAM file (http://www.1000genomes.org/faq/how-d...ction-bam-file) by chromosome and then extract the reads with bam2fastq (http://www.hudsonalpha.org/gsl/software/bam2fastq.php). I'm assuming you want FASTQ and not FASTA, because converting to FASTA loses the quality information. If you do want FASTA, the FASTQ->FASTA conversion is trivial and implemented in many forms.
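To illustrate how little the FASTQ->FASTA step involves, here is a minimal Python sketch. It assumes the common four-lines-per-record FASTQ layout (no wrapped sequences); the function name `fastq_to_fasta` is made up for this example and is not part of bam2fastq or any other tool mentioned here.

```python
def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy read names and sequences from FASTQ to FASTA, dropping qualities.

    Assumption: every record is exactly four lines (header, sequence,
    '+' separator, qualities). Returns the number of records written.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        if not header.startswith("@"):
            raise ValueError("expected a FASTQ header line, got: " + header)
        sequence = fastq_handle.readline().rstrip("\r\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, discarded on purpose
        fasta_handle.write(">" + header[1:] + "\n" + sequence + "\n")
        count += 1
    return count
```

It works on any pair of open text handles, e.g. `fastq_to_fasta(open("reads.fastq"), open("reads.fasta", "w"))`, where both file names are placeholders.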

**ce.log** · 01-16-2012, 06:25 AM

Well, maybe I am just not quite sure whether what I want to do really makes sense.

For instance, given reference Chromosome 1 (of a/the reference genome) and some reads of human NA12283 from the 1000 Genomes Project, I just ask myself: "What does Chromosome 1 of NA12283 look like?" I would assume (maybe this is my mistake?) that I can just overlay all the mapped reads onto the reference Chromosome 1 and then obtain "the" Chromosome 1 of NA12283, at least roughly.

If all this makes (biologically) sense, then I would be interested in the FASTA file of Chromosome 1 of NA12283 - computed from the reference chromosome and the mapped reads.

About your solution: Once converted to FASTQ - don't I lose all the mapping information? I really want to *combine* the reads and the corresponding reference chromosome, not only convert reads to reads.

**Bukowski** · 01-16-2012, 06:32 AM

Ah, so you want to create a consensus sequence from the mapped reads per chromosome for a given individual?

So you could subset your BAMs by chromosome and then do something like:

samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq

**lh3** · 01-16-2012, 06:35 AM

For 4X coverage, there is essentially no way to generate a good consensus.

**elfuser** · 04-10-2012, 08:55 AM

> Originally posted by Bukowski:
>
> Ah, so you want to create a consensus sequence from the mapped reads per chromosome for a given individual?
>
> So you could subset your BAMs by chromosome and then do something like:
>
> samtools pileup -cf ref.fa aln.bam | samtools.pl pileup2fq -D100 > cns.fastq

I tried to run this, but then I realized that pileup has been replaced by mpileup and -c is not supported. So I ran mpileup -uf, but I get many errors.

**bioinfosm** · 04-10-2012, 06:48 PM

Perhaps you can use the variant calls from this sample along with the reference sequence to sort of come up with 'the' sequence for this individual.

**Gabeloooooo** · 11-08-2012, 01:27 PM

I'm trying to do the exact same thing! Did you ever find a way to get a decent, complete chromosome FASTA from one of the 1000 Genomes samples?

My understanding is the reference they use is hs37d5.fa.

So, can you use a BAM/BAI file combo from patient HG00XXX and map the reads to hs37d5.fa to get the rough 'genome' of patient HG00XXX?

**kriikku** · 01-13-2013, 06:49 AM

See here for one way to get a .vcf file with SNPs and indels from the .bam file, or a consensus sequence:

Multisample SNP Calling

http://samtools.sourceforge.net/mpileup.shtml

The consensus sequence generated by this method has the problem that it only applies the SNPs to the reference sequence, but not the indels.
The .vcf file is better since it includes both SNPs and indels.

The .vcf file can be converted to a .fasta sequence using this tool:

https://www.broadinstitute.org/gatk/...Reference.html

However, note that this tool will only take into account indels of length up to 2 bases (as of January 2013). You may want to write your own script to insert all the indels (including the longer ones) from the .vcf into the .fasta.
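A rough idea of what such a script could look like in Python is sketched below. It is not the GATK tool's method, just one possible approach under loud assumptions: the chromosome fits in memory as a single string, every record in the .vcf is applied regardless of genotype or filter status (so the result is a haploid approximation), only the first ALT allele is used, symbolic alleles are skipped, and a variant that overlaps one already applied is dropped. The names `read_variants` and `apply_variants` are invented for this example.

```python
def read_variants(vcf_lines, chrom):
    """Yield (pos, ref, alt) for one chromosome from VCF text lines.

    pos is 1-based, as in the VCF. Only the first ALT allele is kept, and
    anything that is not a plain base string (".", "<DEL>", "*") is skipped.
    """
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        if fields[0] != chrom:
            continue
        pos, ref = int(fields[1]), fields[3].upper()
        alt = fields[4].split(",")[0].upper()
        if not alt or set(alt) - set("ACGTN"):
            continue
        yield pos, ref, alt


def apply_variants(ref_seq, variants):
    """Return ref_seq with SNPs and indels of any length substituted in.

    Variants are applied left to right while copying the untouched
    stretches in between, so reference coordinates stay valid even
    after insertions and deletions.
    """
    pieces = []
    cursor = 0  # 0-based index of the next reference base not yet copied
    for pos, ref, alt in sorted(variants, key=lambda v: v[0]):
        start = pos - 1
        end = start + len(ref)
        if start < cursor:
            continue  # overlaps a variant already applied; skipped (assumption)
        if ref_seq[start:end].upper() != ref.upper():
            raise ValueError("REF allele mismatch at position %d" % pos)
        pieces.append(ref_seq[cursor:start])
        pieces.append(alt)
        cursor = end
    pieces.append(ref_seq[cursor:])
    return "".join(pieces)
```

The REF check is there on purpose: if the .vcf was called against a different reference build than the .fasta being edited, it fails early instead of silently producing a wrong sequence.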

This method should get the whole sequence from the .bam file, however, I don't know how to extract individual chromosomes from it.

**Gabeloooooo** · 01-14-2013, 09:49 AM

The mpileup method seems to work... I've two questions.

1. Is there a way to know how many SNPs I should expect in this 'consensus' sequence? Say I want to know how many SNPs HGxxxx has in his chromosome 2, etc.

2. When you say the VCF file is better, is there another way to get a VCF from 1000 genomes, other than using the mpileup method?

I was under the impression the ftp site only contained BAM files?

Thanks a lot!

**kriikku** · 01-14-2013, 11:45 AM

1. There could be, but I don't know of one. I usually just check if the last position numbers in the .vcf file are close to the number of base pairs in the chromosome. (Note that if you use the whole genome, then the last positions will be the last positions in the last supercontig, not the whole genome.)

2. 1000 Genomes has some .vcf files in their /release folder (description of contents here: http://www.1000genomes.org/faq/what-bas-file). As for other methods to generate a .vcf file, there might be some, but I personally don't know of any.

I'm not sure what you mean by the 1000 Genomes FTP site only containing .bam files. You can see the link above for all the file types they have. But, sorry, I'm not sure what you're asking, maybe you can clarify?

I'd also like to mention that I found out that when you give mpileup a whole genome .bam file from the 1000 Genomes site, and generate a .vcf from it, then the first number of each line in the .vcf is the number of the chromosome (1...22, X, Y or the ID of a supercontig). Therefore, it is possible to write a simple script to split the .vcf into several .vcf-s, one for each chromosome, and by mapping the changes to individual chromosome fasta files, get the sequences of each chromosome separately.
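The splitting script described above could look roughly like this in Python. It is only a sketch: the function name and the one-file-per-chromosome naming (`<chrom>.vcf`) are made up for illustration, and it assumes the chromosome and supercontig IDs are safe to use as file names.

```python
import os


def split_vcf_by_chromosome(vcf_path, out_dir):
    """Write one VCF per chromosome/supercontig found in vcf_path.

    The header lines (those starting with '#') are copied into every
    output file. Returns the sorted list of chromosome names seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    header = []
    handles = {}  # chromosome name -> open output file
    try:
        with open(vcf_path) as vcf:
            for line in vcf:
                if line.startswith("#"):
                    header.append(line)
                    continue
                if not line.strip():
                    continue
                chrom = line.split("\t", 1)[0]  # first column is the chromosome
                if chrom not in handles:
                    out_path = os.path.join(out_dir, chrom + ".vcf")
                    handles[chrom] = open(out_path, "w")
                    handles[chrom].writelines(header)
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

One output file stays open per chromosome name, which should be fine for a human reference with a few dozen supercontigs but might hit the open-file limit for highly fragmented assemblies.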

**Gabeloooooo** · 01-14-2013, 02:04 PM

I mean, do the VCF files in the release folder contain all SNPs for a given sample (ex: all SNPs for HG00254)?

Say I want all the SNPs of a consensus chromosome 3 for HG00254. My understanding was that the easiest method is to get the BAM, run mpileup on it, and create a FASTA from that.

**kriikku** · 01-14-2013, 03:28 PM

I downloaded one of the .vcf files to see, and, as far as I can tell, they don't contain the SNPs of any samples, just the collective SNPs of all the samples, with the population frequencies of each SNP. Something like this for each SNP:
pos: 123456, ref: A, alt: T, african_freq: 0.12, american_freq: 0.01, european_freq: 0.5, asian_freq: 0.14

I think you can only get all SNPs and indels of a given sample by doing what you said, i.e. getting the .bam, running mpileup on it, and generating a .fasta from the .vcf.
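Once a per-sample .vcf exists, the earlier question of how many SNPs a given chromosome carries can at least be answered after the fact by counting records. The Python sketch below does that under simple assumptions: a record counts as a SNP when REF and every ALT allele are a single base, and everything else (indels, multi-base substitutions, symbolic alleles) is lumped together as "other". The function name `count_variants` is invented for this example.

```python
from collections import Counter


def count_variants(vcf_lines):
    """Count VCF records per (chromosome, kind), kind being 'snp' or 'other'."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\r\n").split("\t")
        chrom, ref, alts = fields[0], fields[3], fields[4].split(",")
        if alts == ["."]:
            continue  # site listed without an alternate allele
        is_snp = len(ref) == 1 and all(len(alt) == 1 for alt in alts)
        counts[(chrom, "snp" if is_snp else "other")] += 1
    return counts
```

For example, `count_variants(open("sample.vcf"))[("2", "snp")]` would give the number of SNP records on chromosome 2, with `sample.vcf` standing in for whatever file mpileup produced. Note that this counts what was called, not what one should expect to find.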

**Gabeloooooo** · 01-14-2013, 05:03 PM

Thanks a lot! Will do it this way.

The other thing I don't get is how they sometimes indicate in the 1000 genomes browser a certain genotype, yet I can't find it in the 1000 genomes data.

An example is rs1805007 for sample HG00108. It says T|C in the genome browser (see url below), yet I can't find any instances of a T in the 3 tracks provided by 1000 genomes.

URL in 1000 genomes browser:

http://www.ncbi.nlm.nih.gov/variation/tools/1000genomes/?chr=NC_000016.9&from=89985617&to=89986617&mk=89986117:89986117|rs1805007&gts=rs1805007

**kriikku** · 02-02-2013, 10:22 AM

Sorry, I don't have much experience with the genome browser. I see C in two of the three HG00108 tracks shown in the genome browser (the exome one is empty). The reference genome shows C|G. Where are you seeing the T|C? (I think linking to the genome browser doesn't work, by the way, so it would be better if you described it.)


Convert 1000-Genomes-proje BAM to FASTA (aligned to reference, grouped by chromosome)
