  • xujie
    Member
    • Nov 2010
    • 11

    Questions about ANNOVAR

    Hello everyone,

    I would like to determine whether the SNPs I have called are in coding regions and whether they affect the protein sequence, so I am using ANNOVAR for annotation.
    However, my target species is maize, which does not even have a UCSC-type annotation database. So I think I should convert my GFF3 maize annotation file to a UCSC-type file. Could you give me any suggestions about the format of the UCSC-type file, or any other ideas for annotating maize SNPs?

    The file "hg18_refGene.txt" in the example database of ANNOVAR contains rows like this one:
    585 NR_028269 chr1 - 4224 7502 7502 7502 7 4224,4832,5658,6469,6719,7095,7468, 4692,4901,5810,6631,6918,7231,7502, 0 LOC100288778 unk unk -1,-1,-1,-1,-1,-1,-1,


    What is the meaning of each field in this row?



    Thank you in advance.
    Best wishes
    Xujie
  • chadn737
    Senior Member
    • Jan 2009
    • 392

    #2
    To get ANNOVAR to work with Arabidopsis, I had to build my own database from scratch, just as you will need to.

    The fields are described on the ANNOVAR website:

    For a refGene file, each line has 16 tab-delimited columns: $bin, $name, $chr, $dbstrand, $txstart, $txend, $cdsstart, $cdsend, $exoncount, $exonstart, $exonend, $id, $name2, $cdsstartstat, $cdsendstat, $exonframes. The only really important fields are $name (transcript name), $chr (chromosome), $dbstrand (strand of the transcript in the reference genome), $txstart, $txend (transcription start and end), $cdsstart, $cdsend (translation start and end; remember that each transcript has 5' and 3' UTRs, so $cdsstart is not the same as $txstart), $exoncount (number of exons), and $exonstart, $exonend (comma-delimited exon start and end sites). Remember that all start sites use zero-based coordinates.
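    To make the column layout concrete, here is a minimal Python sketch that splits one refGene line into the 16 named fields and checks whether a SNP falls in the coding part of an exon. The function names are hypothetical and this is not how ANNOVAR itself does it. It assumes the usual UCSC convention that starts are zero-based and ends are exclusive, and that SNP positions are given 1-based as in a VCF.

```python
# Column names in the order listed on the ANNOVAR website.
FIELDS = ["bin", "name", "chr", "dbstrand", "txstart", "txend",
          "cdsstart", "cdsend", "exoncount", "exonstart", "exonend",
          "id", "name2", "cdsstartstat", "cdsendstat", "exonframes"]


def parse_refgene_line(line):
    """Split one tab-delimited refGene line into a dict of named fields."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(FIELDS):
        raise ValueError("expected 16 columns, got %d" % len(values))
    rec = dict(zip(FIELDS, values))
    for key in ("txstart", "txend", "cdsstart", "cdsend", "exoncount"):
        rec[key] = int(rec[key])
    # Exon lists carry a trailing comma, so skip the empty last item.
    for key in ("exonstart", "exonend"):
        rec[key] = [int(x) for x in rec[key].split(",") if x]
    return rec


def snp_in_coding_exon(rec, pos_1based):
    """Return True if a 1-based position lies inside an exon and inside the CDS."""
    pos0 = pos_1based - 1  # convert to the zero-based system used for starts
    if not (rec["cdsstart"] <= pos0 < rec["cdsend"]):
        return False  # UTR, or a non-coding transcript (cdsstart == cdsend)
    return any(start <= pos0 < end
               for start, end in zip(rec["exonstart"], rec["exonend"]))


# One Arabidopsis row, built with explicit tabs.
row = "\t".join(["1", "AT5G01015.1", "Chr5", "-", "5255", "5891", "5334",
                 "5769", "2", "5255,5696,", "5576,5891,",
                 "name", "unk", "unk", "unk", "unk"])
rec = parse_refgene_line(row)
print(snp_in_coding_exon(rec, 5400))  # True: inside exon 1 and inside the CDS
print(snp_in_coding_exon(rec, 5600))  # False: in the intron between the exons
print(snp_in_coding_exon(rec, 5300))  # False: exonic, but before cdsstart (UTR)
```

    A transcript whose $cdsstart equals $cdsend, like the NR_028269 row in the question, has an empty coding range, so the check returns False for every position.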




    You can start by running gff3ToGenePred or gtfToGenePred (found here: http://hgdownload.cse.ucsc.edu/admin/exe/linux.x86_64/) on your GFF3 or GTF file. The $bin, $id, $name2, $cdsstartstat, $cdsendstat, and $exonframes columns are not critical for ANNOVAR to function, but each of them needs some filler value for it to work.

    Here is a sample of what my Arabidopsis refgene file looks like:

    Code:
    1	AT5G01010.4	Chr5	-	1222	5061	1387	4924	16	1222,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4237,4467,4679,5061,	name	unk	unk	unk	unk
    1	AT5G01010.1	Chr5	-	1250	5043	1387	4924	15	1250,1571,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
    1	AT5G01010.2	Chr5	-	1278	4994	1387	4924	16	1278,1571,1744,1913,2104,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1459,1646,1780,2007,2181,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,4994,	name	unk	unk	unk	unk
    1	AT5G01010.3	Chr5	-	1278	5043	1526	4924	14	1278,1744,1913,2434,2747,2871,3302,3542,3761,3926,4101,4334,4551,4764,	1646,1780,1961,2509,2799,2934,3383,3659,3802,4005,4258,4467,4679,5043,	name	unk	unk	unk	unk
    1	AT5G01015.1	Chr5	-	5255	5891	5334	5769	2	5255,5696,	5576,5891,	name	unk	unk	unk	unk
    1	AT5G01015.2	Chr5	-	5366	5801	5515	5769	2	5366,5686,	5576,5801,	name	unk	unk	unk	unk
    It's been a while since I made this, but I think I had to add the last 5 columns manually using sed or something similar.
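    As an alternative to sed, the padding step could be sketched in Python roughly as follows. This assumes the converter wrote the basic 10-column genePred layout (name through exonEnds), which may not match your output if you passed extra options. The function name and the default filler values are hypothetical; the defaults simply copy the placeholders in the sample above.

```python
def genepred_to_refgene(line, bin_value="1",
                        filler=("name", "unk", "unk", "unk", "unk")):
    """Pad a 10-column genePred line to the 16-column refGene layout.

    A bin column is added in front, and filler values are appended for
    id, name2, cdsstartstat, cdsendstat and exonframes.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 genePred columns, got %d" % len(cols))
    return "\t".join([bin_value] + cols + list(filler))


# Example genePred line (assumed converter output for AT5G01015.1).
genepred = "\t".join(["AT5G01015.1", "Chr5", "-", "5255", "5891", "5334",
                      "5769", "2", "5255,5696,", "5576,5891,"])
refgene = genepred_to_refgene(genepred)
print(len(refgene.split("\t")))  # 16
```

    Applied to every line of the converter output, this should give a file with the same shape as the sample rows.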

    It's also important to know that you need to name your database files "hg18_refGene.txt" and so on. Either that, or go into annotate_variation.pl and replace every instance of hg18 with the name of your own database build. In my case I replaced hg18 with TAIR10. Otherwise ANNOVAR will complain that it cannot find the right files.
    Last edited by chadn737; 02-23-2012, 09:37 PM.


    • xujie
      Member
      • Nov 2010
      • 11

      #3
      Thank you so much for your reply. The information means a lot to me.
