**chadn737** · 02-23-2012, 09:29 PM

In order to get ANNOVAR to work with Arabidopsis I had to build my own database from scratch like you.

The fields are described on the ANNOVAR website:

For refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important things are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart, $txend (transcription start and end), $cdsstart, $cdsend (translation start and end; remember that there are 5'/3' UTRs in each transcript, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart, $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.

http://www.openbioinformatics.org/annovar/annovar_faq.html#othergenome
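To make the column layout concrete, here is a minimal Python sketch that splits one refGene-style line into the fields the description calls important and checks that the exon lists agree with $exoncount. The `RefGeneRecord` type and `parse_refgene_line` function are hypothetical names for illustration only; they are not part of ANNOVAR.

```python
from typing import List, NamedTuple


class RefGeneRecord(NamedTuple):
    """The columns that the description above calls important (hypothetical type)."""
    name: str
    chrom: str
    strand: str
    tx_start: int
    tx_end: int
    cds_start: int
    cds_end: int
    exon_starts: List[int]
    exon_ends: List[int]


def parse_refgene_line(line: str) -> RefGeneRecord:
    """Parse one 16-column, tab-delimited refGene-style line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 16:
        raise ValueError(f"expected 16 tab-delimited columns, got {len(fields)}")

    # Column 0 is $bin; the last five columns are filler in this use case.
    (_bin, name, chrom, strand, tx_start, tx_end,
     cds_start, cds_end, exon_count, exon_starts, exon_ends) = fields[:11]

    # Exon lists are comma-delimited with a trailing comma, so skip empty pieces.
    starts = [int(x) for x in exon_starts.split(",") if x]
    ends = [int(x) for x in exon_ends.split(",") if x]
    if not (len(starts) == len(ends) == int(exon_count)):
        raise ValueError(f"{name}: exon lists do not match exon count {exon_count}")

    return RefGeneRecord(name, chrom, strand, int(tx_start), int(tx_end),
                         int(cds_start), int(cds_end), starts, ends)
```

Running something like this over a whole file before handing it to ANNOVAR is a cheap way to catch lines with a wrong column count or mismatched exon lists.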

You can start by using the gff3ToGenePred or gtfToGenePred (found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/) on your GFF3 or GTF file. The $bin, $id, $name2, $cdsstartstat, $cdsendstat, and $exonframes are not critical for ANNOVAR function, but you will need something in those columns just as filler for it to work.

Here is a sample of what my Arabidopsis refgene file looks like:

Code:

1	AT5G01010.4	Chr5	-	1222	5061	1387	4924	16	1222,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4237,4467,4679,5061,	name	unk	unk	unk	unk
1	AT5G01010.1	Chr5	-	1250	5043	1387	4924	15	1250,1571,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
1	AT5G01010.2	Chr5	-	1278	4994	1387	4924	16	1278,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,4994,	name	unk	unk	unk	unk
1	AT5G01010.3	Chr5	-	1278	5043	1526	4924	14	1278,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
1	AT5G01015.1	Chr5	-	5255	5891	5334	5769	2	5255,5696,	5576,5891,	name	unk	unk	unk	unk
1	AT5G01015.2	Chr5	-	5366	5801	5515	5769	2	5366,5686,	5576,5801,	name	unk	unk	unk	unk

It's been a while since I made this, but I think I had to manually add the last 5 columns using sed or something similar.
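For anyone who prefers a script to sed, the padding step could look roughly like the sketch below. It assumes the converter output is plain genePred with 10 tab-delimited columns (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds); check your own output first, since that layout is an assumption here. The function name `genepred_to_refgene` is made up for illustration, and the filler values simply mirror the sample above.

```python
# Filler for $id, $name2, $cdsstartstat, $cdsendstat, $exonframes,
# copied from the sample file above (the values themselves are arbitrary).
FILLER = ["name", "unk", "unk", "unk", "unk"]


def genepred_to_refgene(line: str, bin_value: str = "1") -> str:
    """Turn one 10-column genePred line into a 16-column refGene-style line.

    Assumes the input has exactly 10 tab-delimited columns; a dummy $bin
    is prepended and five filler columns are appended.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 10:
        raise ValueError(f"expected 10 genePred columns, got {len(fields)}")
    return "\t".join([bin_value] + fields + FILLER)


def convert_file(src_path: str, dest_path: str) -> None:
    """Apply genepred_to_refgene to every non-empty line of src_path."""
    with open(src_path) as src, open(dest_path, "w") as dest:
        for line in src:
            if line.strip():
                dest.write(genepred_to_refgene(line) + "\n")
```

For example, `convert_file("TAIR10.genePred", "TAIR10_refGene.txt")` would write the padded file; both file names are placeholders.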

Also, it's important to know that you need to name your database files "hg18_refgene" and so on. Either that, or go into annotate_variation.pl and replace every instance of hg18 with the prefix of your database files. In my case I replaced hg18 with TAIR10. Otherwise ANNOVAR will complain that it cannot find the right files.
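The renaming step can be scripted as well. This is only a sketch: the `<prefix>_refGene.txt` pattern is an assumption based on how ANNOVAR's human database files are named, so check it against the files your ANNOVAR version actually looks for. `install_refgene` is a hypothetical helper, not an ANNOVAR command.

```python
import shutil
from pathlib import Path


def install_refgene(src: str, db_dir: str, build: str = "TAIR10") -> Path:
    """Copy a refGene-style file into db_dir under a build-prefixed name.

    The "<build>_refGene.txt" pattern is an assumption; adjust it to
    whatever file name your ANNOVAR version reports as missing.
    """
    dest_dir = Path(db_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / f"{build}_refGene.txt"
    shutil.copyfile(src, dest)
    return dest
```

Keeping the prefix in one argument makes it easy to match whatever string you substituted for hg18.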

**xujie** · 02-23-2012, 10:27 PM

Thank you so much for your reply. The information means a lot to me.
