Unconfigured Ad

**wesb** · 07-15-2008, 09:32 AM

Reference

For Maq, I split the reference by chromosome and submit each Maq run for a given chromosome to a different node on my cluster. With Soap, I do the same thing. It also works for me to split the reference into 5 chunks for Soap runs. The cluster I was using couldn't handle Soap if I give it all my reads and all the chromosomes at once. I had memory problems. It works fine if you split it up in chunks, or split up your reads as well. I use the hg18 chromosomes from UCSC Genome Browser. The headers are simple chr1, chr2 ...

**lh3** · 07-17-2008, 12:08 AM

For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

If you feel maq is too slow, you can split reads into 2-million chunk. It will take 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.

**zee** · 07-17-2008, 06:34 AM

I agree with Ih3 by not splitting the reference genome up. With any tool if you do this you'll need to combine the results and sort them post-alignment.
I've been using maq, soap, eland and novoalign. Eland's quite fast if you want exact, 1 or 2 mismatches and you dont care about quality. Building the index for eland is also very quick.
There is a script with the maq package that will do the automated chunk mapping. If you use novoalign do some neatening up of the fasta headers in your genome, it will save downstream perl/sed jobs and indexing of a full human genome on at least 8Gb server will take about 4-5 min.

**mukatira** · 07-17-2008, 12:47 PM

Genome Viewer - Annotations

Hello,
I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

I am looking at data from whole genome sequencing effort using 454 FLX.

Cheers

Sm

**zee** · 07-21-2008, 12:26 AM

Originally posted by mukatira View Post

Hello,
I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

I am looking at data from whole genome sequencing effort using 454 FLX.

Cheers

Sm

Hi Mukatira,

I would recommend the UCSC genome browser or gbrowse. Gbrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load in the density - wiggle tracks (as developed by UCSC) into either one of these genome browsers along with your read annotations.
For Gbrowse you should be using mysql/postrgresql back-end databases for better performance over in-memory dbs.
Use DAS or upload public human gene, conservation, etc, public annotation tracks from public browsers if they are available.

Hope this helps.

**kmay** · 08-01-2008, 05:15 AM

Originally posted by mukatira View Post

Hello,
I would like to know if any of you are able to view annotations along with the assemblies. Any information about which Genome Viewer would be best for such viewing will be appreciated.

I am looking at data from whole genome sequencing effort using 454 FLX.

Cheers

Sm

Mukatira,

if you have your tags clustered and not more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner

Downside is, that is is not a public database. However, you can have unlimited free access for up to two weeks after registering
So think about the questions you want to ask beforehand and register then. After two weeks it will be gone...

With more than 10 mio regions, you would require the GGA on site, which delivers all data and programs locally.

Split human reference genome

We developed our own mapping. We do not split the reference genome for many good reasons. Most fast methods are hash tree based where a split would not make much sense. We pre-processed the genomes extensively, analysed for shortest unique information content and load "the genome" into main memory in it´s entirety ( 32 Gig memory required). Then we map the reads. Any tag length above 9bp, number of point mutations and indels allowed.

Pretty fast. 10 million Solid tags, perfect and unique in less than 30 minutes. Increasing number of allowed indels slows down considerably, but usually in a couple of hours it´s done.
For RNA-seq we use an additional reference: all potential splice junctions pre-calculated.

Cheers

Klaus

**zee** · 08-05-2008, 06:11 AM

Hmmm, 10 million reads in 30 minutes. So you're doing that with 1 CPU ? That's incredibly fast, about 5,555 reads/second. Whoah! I bet indexing the genome and not the reads means that performance will be linear.

Originally posted by kmay View Post

Mukatira,

if you have your tags clustered and not more than 10 million regions, you can upload them into ElDorado with a mouse click in RegionMiner

Downside is, that is is not a public database. However, you can have unlimited free access for up to two weeks after registering
So think about the questions you want to ask beforehand and register then. After two weeks it will be gone...

With more than 10 mio regions, you would require the GGA on site, which delivers all data and programs locally.

We developed our own mapping. We do not split the reference genome for many good reasons. Most fast methods are hash tree based where a split would not make much sense. We pre-processed the genomes extensively, analysed for shortest unique information content and load "the genome" into main memory in it´s entirety ( 32 Gig memory required). Then we map the reads. Any tag length above 9bp, number of point mutations and indels allowed.

Pretty fast. 10 million Solid tags, perfect and unique in less than 30 minutes. Increasing number of allowed indels slows down considerably, but usually in a couple of hours it´s done.
For RNA-seq we use an additional reference: all potential splice junctions pre-calculated.

Cheers

Klaus

**kmay** · 08-05-2008, 07:16 AM

Hmmm, 10 million reads in 30 minutes. So you're doing that with 1 CPU ? That's incredibly fast, about 5,555 reads/second. Whoah! I bet indexing the genome and not the reads means that performance will be linear.

Zee,

in fact a "best unique match" search is now is even faster. See the Helicos forum for some more detail on what that means. I will post some benchmarks in a couple of weeks.

Yes, we do it on one CPU. In fact, my earlier posting was outdated. The GMS has tody comes with 64 GB main memory. So... no desktop system

Yes, is scales linear.

Cheers

Klaus

**bioinfosm** · 09-15-2008, 07:37 AM

Originally posted by zee View Post

Hi Mukatira,

I would recommend the UCSC genome browser or gbrowse. Gbrowse will be much easier to set up. I would suggest that you align all your reads and calculate the read density to see which regions are significantly higher than others. Then you can load in the density - wiggle tracks (as developed by UCSC) into either one of these genome browsers along with your read annotations.
For Gbrowse you should be using mysql/postrgresql back-end databases for better performance over in-memory dbs.
Use DAS or upload public human gene, conservation, etc, public annotation tracks from public browsers if they are available.

Hope this helps.

Zee - is there more information on how one could go about setting up Gbrowse on aligned reads as a viz tool. ANy conversion tools from alignment, say MAQ map, to .bed or wiggle that will be useful to see?

**zee** · 09-15-2008, 07:53 AM

Hi,

It's not that difficult actually.

First you should convert your alignments to gene feature format (GFF). Then you should also have all your reference genome sequence definitions in GFF as well. There are some good templates

You will need mysql/postgresql setup with database create/file/read permissions. A gbrowse.conf file for your database will also need to exist. A good example is http://ftp.hapmap.org/jimwatsonseque...wsequence.conf

1) Create your gbrowse database in Mysql
2) Use the bp_load_gff.pl or bp_bulk_load_gff.pl to load GFF files of your read alignments, chromosome names, genes, etc
3) make sure you have the source and method columns for each GFF file specific enough to define each track. Have a look at the fgroup table in the gbrowse schema.
4)Edit the gbrows.conf file to reflect you track names (method:source)
5) Ensure your gbrowse.conf has the correct username/hostname/password to connect to the database containing your data.

If all the data was loaded correctly it should be straight forward to see the tracks. I usually edit an existing configuration file.

Some example files can be found on:

http://ftp.hapmap.org/jimwatsonsequence/gbrowse/gff/chromInfo_hg18.gff.gz

http://ftp.hapmap.org/jimwatsonsequence/gbrowse/gff/refgene_hg18.gff.gz

I have some scripts to do novoalign->gff conversion. For maq you could probably take a look at maqview (from the maq page) because it's quite a good browser and reads the .map file directly.

Originally posted by bioinfosm View Post

Zee - is there more information on how one could go about setting up Gbrowse on aligned reads as a viz tool. ANy conversion tools from alignment, say MAQ map, to .bed or wiggle that will be useful to see?

**Layla** · 06-09-2009, 01:41 AM

chunks

Originally posted by lh3 View Post

For maq, one should not split the reference genome; otherwise the mapping quality will be useless and you cannot run subsequent analyses such as mapmerge and assemble.

If you feel maq is too slow, you can split reads into 2-million chunk. It will take 5-7 CPU hours to align one chunk to the human reference, which is not fast but acceptable.

I know this thread is quite old but useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split this file into 2 million chunks do I need to ensure that the pairs stay together in each chunk, i.e. this 2 million chunk will have 1 million pairs or does this not matter?

Secondly, are you able to refer me to the command that can do this splitting within maq?

Thank you

L

**Jonathan** · 06-09-2009, 02:31 AM

From the maq manual:

Code:

maq fastq2bfq [-n nreads] in.read.fastq out.read.bfq|out.prefix

Convert reads in FASTQ format to Maq’s BFQ (binary FASTQ) format.

OPTIONS:
-n INT 	number of reads per file [not specified]

As both end'-readfiles are converted separately, I guess you won't do anything wrong if you are using the same 'n' value for both 'end'files.

Followups:
maq map steps for all bfq-parts (using PE-corresponding part files of course)
followed by
maq mapmerge out.aln.map in.aln1.map in.aln2.map [...]

see the maq-docu for this, too.

Best
-Jonathan

**zee** · 06-09-2009, 02:32 AM

Layla,

You should have each member of the pair in a separate file. In other words whatever you're going to be splitting needs to be done on each of these files containing mate 1 and mate 2.
The maq.pl easyrun command should take care of the process. I would recommend splitting into chunks of size 2 million, therefore it should do 5 iterations.

Assuming you have 1 side in reads1.fq and the other in reads2.fq

Something like:

maq.pl easyrun db.bfa reads1.fq reads2.fq

should work fine.

Originally posted by Layla View Post

I know this thread is quite old but useful for me nonetheless. I have 20 million reads (10 million pairs) to align. If I split this file into 2 million chunks do I need to ensure that the pairs stay together in each chunk, i.e. this 2 million chunk will have 1 million pairs or does this not matter?

Secondly, are you able to refer me to the command that can do this splitting within maq?

Thank you

L

**Layla** · 06-12-2009, 07:01 AM

Thankyou Jonathon and Zee,

I guess this should be good enough. Split both files into 2 million chunks and then run the following command 5 times..
./maq match out.map all_chrom.bfa 2million_5prime.bfq 2million_3prime.bfq

Any ideas how long this should take on a 32GB Mac OS?

Running the above command for all 20million reads in one go took 7 days!

Whilst on this subject, is anyone able to guide me with a specific data filtering issue before I use Batman?

Cheers again

L

Topics	Statistics	Last Post
Chemotherapy Leaves Detectable DNA Signatures in Childhood Tumors by SEQadmin2 Started by SEQadmin2, Today, 11:41 AM	0 responses 9 views 0 reactions	Last Post by SEQadmin2 Today, 11:41 AM
Single-Cell Atlases Skew Toward European Ancestry, Analysis Finds by SEQadmin2 Started by SEQadmin2, 07-20-2026, 11:10 AM	0 responses 23 views 0 reactions	Last Post by SEQadmin2 07-20-2026, 11:10 AM
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 36 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 45 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM

Unconfigured Ad

Human reference sequence and tools like MAQ

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News