Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • GATK - GenomeAnalysisTK - option -R

    Hello,

    I am using GATK for the first time. I try to use SomaticIndelDetectorWalker. I have difficulties with the option -R ref.fasta.

    I read:
    --refseq String NA Name of RefSeq transcript annotation file. If specified, indels will be annotated with GENOMIC/UTR/INTRON/CODING and with the gene name
    From this, I understand that this option is not mandatory. When I try:
    java -Xmx2g -jar GenomeAnalysisTK.jar -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
    I have the error:
    ##### ERROR ------------------------------------------------------------------------------------------
    ##### ERROR A USER ERROR has occurred (version 1.4-15-gcd43f01):
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
    ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
    ##### ERROR
    ##### ERROR MESSAGE: Walker requires a reference but none was provided.
    ##### ERROR ------------------------------------------------------------------------------------------
    It seems that I should provide a reference fasta file. Then, I tried:

    java -Xmx2g -jar GenomeAnalysisTK.jar -R ~/hg19.fasta -T SomaticIndelDetector -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
    but I have this error:
    ##### ERROR A USER ERROR has occurred (version 1.4-15-gcd43f01):
    ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ##### ERROR Please do not post this error to the GATK forum
    ##### ERROR
    ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ##### ERROR Visit our wiki for extensive documentation http://www.broadinstitute.org/gsa/wiki
    ##### ERROR Visit our forum to view answers to commonly asked questions http://getsatisfaction.com/gsa
    ##### ERROR
    ##### ERROR MESSAGE: Invalid command line: Failed to load reference dictionary
    I think the problem is my fasta file. The one that I provide contains only:
    Code:
    chr1	249250621	6	50	51	
    chr2	243199373	254235646	50	51	
    chr3	198022430	502299013	50	51	
    chr4	191154276	704281898	50	51	
    chr5	180915260	899259266	50	51	
    chr6	171115067	1083792838	50	51	
    chr7	159138663	1258330213	50	51	
    chr8	146364022	1420651656	50	51	
    chr9	141213431	1569942965	50	51	
    chr10	135534747	1713980672	50	51	
    chr11	135006516	1852226121	50	51	
    chr12	133851895	1989932775	50	51	
    chr13	115169878	2126461715	50	51	
    chr14	107349540	2243934998	50	51	
    chr15	102531392	2353431536	50	51	
    chr16	90354753	2458013563	50	51	
    chr17	81195210	2550175419	50	51	
    chr18	78077248	2632994541	50	51	
    chr19	59128983	2712633341	50	51	
    chr20	63025520	2772944911	50	51	
    chr21	48129895	2837230949	50	51	
    chr22	51304566	2886323449	50	51	
    chrX	155270560	2938654113	50	51	
    chrY	59373566	3097030091	50	51	
    chrMt	16571	3157591135	50	51
    Could someone explains me what is this refseq.fasta file precisely?
    Thanks for any help

  • #2
    I would like to add this question: I intend to use only the nucleotides which have a PHRED over 30. I haven't found an option to do that. Is it possible?

    Comment


    • #3
      Any idea?
      I compared my refSeq.fasta file to the exampleFASTA.fasta file. I've seen that the example file contains the nucleotides of the chromosome 3. Which is definitively not the format that I've provided...
      In my data, I've a folder containing one file per chromosome. Each file contains the nucleotide for the given chromosome. Should I merge those files to get the one I need?
      Thank you

      Comment


      • #4
        The "-R" option is for your reference genome fasta file. This is not to be confused with the refseq ROD for GATK (http://www.broadinstitute.org/gsa/wiki/index.php/RefSeq)

        I think you will need to generate a sequence dictionary using Picard that has the name
        ~/hg19.dict.

        java -jar $PICARD/CreateSequenceDictionary R=~/hg19.fasta O=~/hg19.dict

        This may solve it, however later versions of GATK do it on-the-fly.

        The help desk at http://getsatisfaction.com/gsa are very responsive.

        Comment


        • #5
          So...
          It seems the file you provided os no fasta file, as fasta files should look like
          Code:
          >header
          ACGTACGACTCGACTACAGCAC
          >header2
          ACTAGCTGCATGCATC
          and so on. You provided the Dictionary. You may download the fasta files for the chromosomes here:


          And then unzip them and merge them into one fasta file (GATK is picky about the order of the fasta file, it should be chr1, chr2 ,chr3, ...chrX, chrY), for example using the cat command in Linux; You may also try to download the hg19.2bit file and extract the sequence using the twoBitToFa binary.

          Another source for a human reference genome would be the 1000genomes project:
          ftp://ftp.1000genomes.ebi.ac.uk/vol1...cal/reference/

          Hope that helps,
          Peter

          Comment


          • #6
            Thanks for the answers.
            I downloaded chr1.fa.gz from http://hgdownload.cse.ucsc.edu/golde...9/chromosomes/, I unzipped it. I tried to rerun both:

            java -Xmx2g -jar GenomeAnalysisTK.jar --refseq ~/chr1.fasta -T SomaticIndelDetector --minCoverage 10 -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
            java -Xmx2g -jar GenomeAnalysisTK.jar --refseq ~/chr1.fa -T SomaticIndelDetector --minCoverage 10 -o indels.vcf -verbose indels.txt -I:normal ../../../../data/patient1/s_garma-fibros_converted.bam -I:tumor ../../../../data/patient1/s_garma-296_converted_sorted.bam
            but I still have the same error message... Isn't it suppose to work, even with a part of the genome?

            Comment


            • #7
              Ok, so I made some progress! I changed the --refseq into -R option and I changed my s_garma-fibros_converted.bam file into s_garma-fibros_converted_sorted.bam file which is the sorted bam file.
              Now, I have a new error:

              ##### ERROR MESSAGE: SAM/BAM file ../../../../data/patient1/s_garma-fibros_converted_sorted.bam is malformed: SAM file doesn't have any read groups defined in the header. The GATK no longer supports SAM files without read groups
              I got the BAM file this way:
              samtools view -b -S -t /home/m/hg19.fa s_garma-fibros_converted.sam > s_garma-fibros_converted.bam
              samtools sort s_garma-fibros_converted.bam s_garma-fibros_converted_sorted
              samtools index s_garma-fibros_converted_sorted.bam
              Here http://www.broadinstitute.org/gsa/wi...s_for_the_GATK, I read that:
              The file must be binary (.bam).
              The file must be indexed.
              The file must be sorted in coordinate order with respect to the reference (i.e. the contig ordering in your bam must exactly match that of the reference you are using).
              The file must have a proper bam header with read groups. Each read group must contain the platform (PL) and sample (SM) tags. For the platform value, we currently support 454, LS454, Illumina, Solid, ABI_Solid, and CG (all case-insensitive).
              Each read in the file must be associated with exactly one read group.

              The first 2 steps are ok. For the index, I used the default option, it was enough to visualize the files in IGV but I don't know if it is the right index for this use.
              And I haven't done the last 2 steps: I don"t know if the file has a proper bam header with read groups and if each read in the file is associated with exactly one read group, because I cannot open a bam file:

              Code:
              more s_garma-fibros_converted_sorted.bam
              \�@��qrF�!���m>(Lj��m�c� 7�,7�cr"�1e��9���1X3�8�
                                                                   ��5�PCY�N(ǒ���,6��^L�%D�<CN8ψ�{�/vϘ!,�.�g*–��284��yf^L�Xa<s�S
              
              [m@U1009-PCJane patient1]$ more s_garma-fibros_converted.sam
              @PG	ID:illumina_export2sam.pl	VN:2.0.0	CL:/usr/local/bin/illumina_export2sam.pl --read1=s_gar
              ma-fibros_1_export.txt --read2=s_garma-fibros_2_export.txt
              HWI-ST584_81:4:1101:1198:2065	89	chr7	156837088	28	76M	*	0	0	GGCTTG
              AACAACGGAAATGTGTCAAATGTGTCAGCTCCCAGCTCAGAGACTGGGAGACCAGGCCGAGGCGCCGGCN	######################################
              ######################################	BC:Z:AGTCAA	XD:Z:75A	SM:i:28	AS:i:0
              HWI-ST584_81:4:1101:1198:2065	165	*	0	0	*	chr7	156837088	0	GGAGCC
              CTGCTGCGTAGTNNNNNNNNNNACACGGTGTATTATTACTTTCCCAGGACCACCGTAACAAAGTAGCACA	CCCFFFFFHHHHGJHGIH####################
              ######################################	BC:Z:AGTCAA
              HWI-ST584_81:4:1101:1174:2078	73	chr5	41048452	254	76M	*	0	0	AAACGT
              GTTTTCCATAGGTCTACCAATTTTGGGTGAATTATCTCAGGCAGTATCTTCAAAAGCCCTATTGCACCAG	CCCFFFFDHHHHHIIIJJIJJIJJJJJJJJJJJGHGJJ
              JIJIJJJJIJJJIIJJJJJJJIIJGIJJJIIIJJHHHA	BC:Z:AGTCAA	XD:Z:76	SM:i:359	AS:i:0
              I can only open the SAM file. How can I check if my bam file is correctly formated if I cannot open it?

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Strategies for Sequencing Challenging Samples
                by seqadmin


                Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                03-22-2024, 06:39 AM
              • seqadmin
                Techniques and Challenges in Conservation Genomics
                by seqadmin



                The field of conservation genomics centers on applying genomics technologies in support of conservation efforts and the preservation of biodiversity. This article features interviews with two researchers who showcase their innovative work and highlight the current state and future of conservation genomics.

                Avian Conservation
                Matthew DeSaix, a recent doctoral graduate from Kristen Ruegg’s lab at The University of Colorado, shared that most of his research...
                03-08-2024, 10:41 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 03-27-2024, 06:37 PM
              0 responses
              12 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-27-2024, 06:07 PM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-22-2024, 10:03 AM
              0 responses
              52 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 03-21-2024, 07:32 AM
              0 responses
              68 views
              0 likes
              Last Post seqadmin  
              Working...
              X