Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • Annotating VCF files from GFF or .gbk files

    Hello all,

    I am trying to annotate VCF files using snpEff. I'm using a virus that doesn't have good gene annotations available so I had to build a custom database. That seemed to work well but when I actually try to do the annotation I'm not getting the amino acid changes called. Here is the output
    weger@IDANGS01:~/snpEff$ java -jar snpEff.jar dengue_DR59_2 /data/ebel_lab/Weger/Experiment_27/36_Exp/test/323-minority.vcf > 323-minority_ann.vcf -noGenome
    Error: Error while processing VCF entry (line 59) :
            gb:AB122022|Organism:Dengue     3152    .       C       CA      36.83   PASS    SN=0;STA=3152;STO=3152;TYP=INS;R1P=198;R1M=195;R2P=0;R2M=0;PPC=0;LS=54650;MQS=15536;MQM=42;BQS=13814;BQM=36;EDS=15658;EDM=75;IDS=385516;IDM=992;COV=18847;MCOV=-1;CED=601;HMP=0;DP=18847;AF=0.0209;RAF=0.0209;SB=1.0000;DP4=9226,9228,198,195      GT:DP:AD:AF:RAF:SB:SC:PF        1:18847:393:0.0209:0.0209:1.0000:36.83:PASS
    Error: Error while processing VCF entry (line 65) :
            gb:AB122022|Organism:Dengue     5915    .       C       CA      37.11   PASS    SN=0;STA=5915;STO=5915;TYP=INS;R1P=228;R1M=211;R2P=0;R2M=0;PPC=0;LS=61291;MQS=17579;MQM=42;BQS=15431;BQM=36;EDS=16893;EDM=75;IDS=431762;IDM=992;COV=7479;MCOV=-1;CED=601;HMP=0;DP=7479;AF=0.0587;RAF=0.0587;SB=0.9999;DP4=3512,3528,228,211        GT:DP:AD:AF:RAF:SB:SC:PF        1:7479:439:0.0587:0.0587:0.9999:37.11:PASS
    Error: Error while processing VCF entry (line 75) :
            gb:AB122022|Organism:Dengue     8469    .       C       CA      36.46   PASS    SN=0;STA=8469;STO=8469;TYP=INS;R1P=59;R1M=56;R2P=0;R2M=0;PPC=0;LS=16215;MQS=4113;MQM=41;BQS=4059;BQM=36;EDS=5218;EDM=75;IDS=111026;IDM=992;COV=5471;MCOV=-1;CED=601;HMP=0;DP=5471;AF=0.0210;RAF=0.0210;SB=0.9999;DP4=2677,2679,59,56       GT:DP:AD:AF:RAF:SB:SC:PF        1:5471:115:0.0210:0.0210:0.9999:36.46:PASS
    3 errors.
    The problem seems to be java.lang.NullPointerException. However, I'm not sure what this means.

    I have tried using different variant callers (lofreq and I have also tried annotating both a .gbk and a .gff3 file with the same result.

    Any thoughts?
    Last edited by GenoMax; 05-31-2017, 10:24 AM. Reason: Adding [CODE] tags

  • #2
    Additional info:

    I reran using the -v option

    The program is not detecting the coding region which is clearly in the .gbk file.
    # Genome name : 'Dengue virus'
    # Genome version : 'dengue_DR59'
    # Genome ID : 'dengue_DR59[0]'
    # Has protein coding info : false
    # Has Tr. Support Level info : false
    # Genes : 0
    # Protein coding genes : 0
    # Transcripts : 0
    # Avg. transcripts per gene : NaN
    # Checked transcripts :
    # Protein coding transcripts : 0
    # Cds : 0
    # Exons : 0
    # Exons with sequence : 0
    # Exons without sequence : 0
    # Avg. exons per transcript : NaN
    # Number of chromosomes : 0
    # Chromosomes : Format 'chromo_name size codon_table'

    # WARNING! : No protein coding transcripts found.


    • #3
      Have you built a snpEff database from your genbank file (option 4 on this page)?


      • #4
        @GenoMax, yep, from both genbank and gff3, they give the same error.


        • #5
          Were any errors produced when you built the database using GBK files? Is your "chromosome name" matching in your alignments/VCF/annotation files?


          • #6
            So I rebuilt the database, here is the output

            weger@IDANGS01:~/snpEff$ java -jar snpEff.jar build -genbank -v dengue_DR59_2
            00:00:00 SnpEff version SnpEff 4.3o (build 2017-05-21 09:40), by Pablo Cingolani
            00:00:00 Command: 'build'
            00:00:00 Building database for 'dengue_DR59_2'
            00:00:00 Reading configuration file 'snpEff.config'. Genome: 'dengue_DR59_2'
            00:00:00 Reading config file: /home/weger/snpEff/snpEff.config
            00:00:00 done
            Chromosome: 'AB122022' length: 10723

            Create exons from CDS (if needed): .
            Exons created for 1 transcripts.

            Deleting redundant exons (if needed):
            Total transcripts with deleted exons: 0

            Collapsing zero length introns (if needed):
            Total collapsed transcripts: 0
            Adding genomic sequences to exons: Done (1 sequences added, 0 ignored).

            Adjusting transcripts:
            Adjusting genes:
            Adjusting chromosomes lengths:
            Ranking exons:
            Create UTRs from CDS (if needed):
            Remove empty chromosomes:

            Marking as 'coding' from CDS information:
            Done: 0 transcripts marked
            00:00:00 Caracterizing exons by splicing (stage 1) :

            00:00:00 Caracterizing exons by splicing (stage 2) :
            00:00:00 done.
            00:00:00 [Optional] Rare amino acid annotations
            00:00:00 Warning: Cannot read optional protein sequence file '/home/weger/snpEff/./data/dengue_DR59_2/protein.fa', nothing done.
            00:00:00 Protein check file: '/home/weger/snpEff/./data/dengue_DR59_2/genes.gbk'

            00:00:00 Checking database using protein sequences
            00:00:00 Reading proteins from file '/home/weger/snpEff/./data/dengue_DR59_2/genes.gbk'...
            00:00:00 done (1 Proteins).
            00:00:00 Comparing Proteins...
            '+' : OK
            '.' : Missing
            '*' : Error

            Protein check: dengue_DR59_2 OK: 1 Not found: 0 Errors: 0 Error percentage: 0.0%
            00:00:00 Saving database
            00:00:00 [Optional] Reading regulation elements: GFF
            00:00:00 Warning: Cannot read optional regulation file '/home/weger/snpEff/./data/dengue_DR59_2/regulation.gff', nothing done.
            00:00:00 [Optional] Reading regulation elements: BED
            00:00:00 Cannot find optional regulation dir '/home/weger/snpEff/./data/dengue_DR59_2/regulation.bed/', nothing done.
            00:00:00 [Optional] Reading motifs: GFF
            00:00:00 Warning: Cannot open PWMs file /home/weger/snpEff/./data/dengue_DR59_2/pwms.bin. Nothing done
            00:00:00 Done
            00:00:00 Logging
            00:00:01 Checking for updates...
            00:00:02 Done.

            At first it says it finds one transcript and then it says 0 are marked.

            Have you seen this before?


            • #7
              I have not used snpEff in this particular fashion.

              I still have a feeling that there is a mismatch between your files (gb:AB122022|Organism: Dengue seems to be in your VCF (and BAM?) where as snpEff probably is using just AB122022.


              • #8
                @GenoMax, (gb:AB122022|Organism: Dengue is the header for the fasta file. I will try to do a new alignment and variant calling with a fasta that has the exact same header as the .gbk. I'll let you know what happens. Thanks!


                • #9
                  Unfortunately it does not change the result.
                  weger@IDANGS01:~/snpEff$ java -Xmx4g -jar snpEff.jar dengue_DR59_2 /data/ebel_lab/Weger/Experiment_27/36_Exp/test/323.vcf > 323_ann.vcf -noGenome
                  Error: Error while processing VCF entry (line 93) :
                          AB122022        3152    .       C       CA      36.96   PASS    SN=0;STA=3152;STO=3152;TYP=INS;R1P=126;R1M=101;R2P=0;R2M=0;PPC=0;LS=31259;MQS=8723;MQM=42;BQS=7920;BQM=36;EDS=8773;EDM=75;IDS=221837;IDM=992;COV=3602;MCOV=-1;CED=601;HMP=0;DP=3602;AF=0.0630;RAF=0.0630;SB=0.9996;DP4=1675,1700,126,101    GT:DP:AD:AF:RAF:SB:SC:PF        1:3602:227:0.0630:0.0630:0.9996:36.96:PASS
                  Error: Error while processing VCF entry (line 138) :
                          AB122022        5915    .       C       CA      37.37   PASS    SN=0;STA=5915;STO=5915;TYP=INS;R1P=128;R1M=115;R2P=0;R2M=0;PPC=0;LS=33701;MQS=9517;MQM=42;BQS=8444;BQM=36;EDS=9499;EDM=75;IDS=238270;IDM=992;COV=2101;MCOV=-1;CED=601;HMP=0;DP=2101;AF=0.1157;RAF=0.1157;SB=0.9998;DP4=923,935,128,115      GT:DP:AD:AF:RAF:SB:SC:PF        1:2101:243:0.1157:0.1157:0.9998:37.37:PASS
                  Error: Error while processing VCF entry (line 157) :
                          AB122022        6711    .       G       GT      36.58   PASS    SN=0;STA=6711;STO=6711;TYP=INS;R1P=73;R1M=102;R2P=0;R2M=0;PPC=0;LS=24461;MQS=6085;MQM=42;BQS=5861;BQM=36;EDS=7838;EDM=75;IDS=167900;IDM=992;COV=3686;MCOV=-1;CED=601;HMP=0;DP=3686;AF=0.0475;RAF=0.0475;SB=0.9994;DP4=1770,1741,73,102      GT:DP:AD:AF:RAF:SB:SC:PF        1:3686:175:0.0475:0.0475:0.9994:36.58:PASS
                  Error: Error while processing VCF entry (line 197) :
                          AB122022        10389   .       C       CA      43.47   PASS    SN=0;STA=10389;STO=10389;TYP=INS;R1P=207;R1M=213;R2P=0;R2M=0;PPC=0;LS=59099;MQS=13515;MQM=42;BQS=14722;BQM=36;EDS=17184;EDM=75;IDS=399426;IDM=992;COV=526;MCOV=-1;CED=334;HMP=0;DP=526;AF=0.7985;RAF=0.7985;SB=0.9999;DP4=56,50,207,213     GT:DP:AD:AF:RAF:SB:SC:PF        1:526:420:0.7985:0.7985:0.9999:43.47:PASS
                  4 errors.
                  Last edited by GenoMax; 06-01-2017, 04:49 AM.


                  • #10
                    Are you using Java 1.8?


                    • #11
                      Hmm, nope its 1.7.0_25. I wasn't aware that snpEff needed 1.8. I will update and get back to you with results. Thanks!


                      • #12
                        @Genomax, awesome, with those changes it worked! I really appreciate your help.


                        Latest Articles


                        • seqadmin
                          Exploring the Dynamics of the Tumor Microenvironment
                          by seqadmin

                          The complexity of cancer is clearly demonstrated in the diverse ecosystem of the tumor microenvironment (TME). The TME is made up of numerous cell types and its development begins with the changes that happen during oncogenesis. “Genomic mutations, copy number changes, epigenetic alterations, and alternative gene expression occur to varying degrees within the affected tumor cells,” explained Andrea O’Hara, Ph.D., Strategic Technical Specialist at Azenta. “As...
                          07-08-2024, 03:19 PM
                        • seqadmin
                          Exploring Human Diversity Through Large-Scale Omics
                          by seqadmin

                          In 2003, researchers from the Human Genome Project (HGP) announced the most comprehensive genome to date1. Although the genome wasn’t fully completed until nearly 20 years later2, numerous large-scale projects, such as the International HapMap Project and 1000 Genomes Project, continued the HGP's work, capturing extensive variation and genomic diversity within humans. Recently, newer initiatives have significantly increased in scale and expanded beyond genomics, offering a more detailed...
                          06-25-2024, 06:43 AM





                        Topics Statistics Last Post
                        Started by seqadmin, 07-10-2024, 07:30 AM
                        0 responses
                        Last Post seqadmin  
                        Started by seqadmin, 07-03-2024, 09:45 AM
                        0 responses
                        Last Post seqadmin  
                        Started by seqadmin, 07-03-2024, 08:54 AM
                        0 responses
                        Last Post seqadmin  
                        Started by seqadmin, 07-02-2024, 03:00 PM
                        0 responses
                        Last Post seqadmin