

Picard Interval File




  • Picard Interval File


    I have some Illumina TruSeq exome data and I want to use the Picard tool CalculateHsMetrics.jar to look at the hybrid selection. I downloaded the TruSeq BED file from http://www.illumina.com/support/sequ...downloads.ilmn and the reference file for mapping from ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz

    My question is about the format of the interval file. I've looked at http://www.broadinstitute.org/gsa/wi...s_for_the_GATK and used that as a template for my header, but Picard complains about this header and says that the sequence dictionaries are not the same size.

    Here's what my interval_list file looks like:

    @HD VN:1.0 SO:coordinate
    @SQ SN:1 LN:249250621 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:1b22b98cdeb4a9304cb5d48026a85128
    SP:Homo Sapiens
    @SQ SN:2 LN:243199373 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:a0d9851da00400dec1098a9255ac712e
    SP:Homo Sapiens
    @SQ SN:3 LN:198022430 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:fdfd811849cc2fadebc929bb925902e5
    SP:Homo Sapiens
    @SQ SN:4 LN:191154276 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:23dccd106897542ad87d2765d28a19a1
    SP:Homo Sapiens
    @SQ SN:5 LN:180915260 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:0740173db9ffd264d728f32784845cd7
    SP:Homo Sapiens
    @SQ SN:6 LN:171115067 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:1d3a93a248d92a729ee764823acbbc6b
    SP:Homo Sapiens
    @SQ SN:7 LN:159138663 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:618366e953d6aaad97dbe4777c29375e
    SP:Homo Sapiens
    @SQ SN:8 LN:146364022 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:96f514a9929e410c6651697bded59aec
    SP:Homo Sapiens
    @SQ SN:9 LN:141213431 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:3e273117f15e0a400f01055d9f393768
    SP:Homo Sapiens
    @SQ SN:10 LN:135534747 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:988c28e000e84c26d552359af1ea2e1d
    SP:Homo Sapiens
    @SQ SN:11 LN:135006516 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:98c59049a2df285c76ffb1c6db8f8b96
    SP:Homo Sapiens
    @SQ SN:12 LN:133851895 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:51851ac0e1a115847ad36449b0015864
    SP:Homo Sapiens
    @SQ SN:13 LN:115169878 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:283f8d7892baa81b510a015719ca7b0b
    SP:Homo Sapiens
    @SQ SN:14 LN:107349540 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:98f3cae32b2a2e9524bc19813927542e
    SP:Homo Sapiens
    @SQ SN:15 LN:102531392 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:e5645a794a8238215b2cd77acb95a078
    SP:Homo Sapiens
    @SQ SN:16 LN:90354753 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:fc9b1a7b42b97a864f56b348b06095e6
    SP:Homo Sapiens
    @SQ SN:17 LN:81195210 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:351f64d4f4f9ddd45b35336ad97aa6de
    SP:Homo Sapiens
    @SQ SN:18 LN:78077248 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:b15d4b2d29dde9d3e4f93d1d0f2cbc9c
    SP:Homo Sapiens
    @SQ SN:19 LN:59128983 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:1aacd71f30db8e561810913e0b72636d
    SP:Homo Sapiens
    @SQ SN:20 LN:63025520 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:0dec9660ec1efaaf33281c0d5ea2560f
    SP:Homo Sapiens
    @SQ SN:21 LN:48129895 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:2979a6085bfe28e3ad6f552f361ed74d
    SP:Homo Sapiens
    @SQ SN:22 LN:51304566 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:a718acaa6135fdca8357d5bfe94211dd
    SP:Homo Sapiens
    @SQ SN:X LN:155270560 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:7e0e2e580297b7764e31dbc80c2540dd
    SP:Homo Sapiens
    @SQ SN:Y LN:59373566 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:1fa3474750af0948bdf97d5a0ee52e51
    SP:Homo Sapiens
    @SQ SN:MT LN:16569 AS:GRCh37 UR:ftp://ftp.sanger.ac.uk/pub/1000genom...k_v37.fasta.gz M5:c68f52674c9fb33aef52dcf399755519
    SP:Homo Sapiens
    1 14362 14829 + chr1:14363-14829:WASH5P
    1 14969 15038 + chr1:14970-15038:WASH5P
    1 15795 15947 + chr1:15796-15947:WASH5P
    1 16606 16765 + chr1:16607-16765:WASH5P
    1 16857 17055 + chr1:16858-17055:WASH5P
    1 17232 17368 + chr1:17233-17368:WASH5P
    1 17605 17742 + chr1:17606-17742:WASH5P
    1 69090 70008 + chr1:69091-70008:OR4F5

    I guess I'm stuck and any help would be appreciated. Thanks.

  • #2
    Check the chromosome names in the two files. The Sanger reference file uses (1, 2, 3, X, Y), while most Illumina files follow the UCSC convention (chr1, chr2, chr3, chrX, chrY). So you might need to remove the "chr" prefix from the BED file. This is a fairly common issue when people mix annotation sources.
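
    A minimal sketch of that renaming step, assuming the BED file is tab-delimited. The function name and the output file name are made up for illustration. Note that it only strips the prefix: a UCSC-style "chrM" would become "M", not the "MT" used by the reference in this thread, so mitochondrial targets would need separate handling.

```shell
# Hypothetical helper (not from the thread): print a BED file with any
# leading "chr" removed from the chromosome column.
# Header-style lines (track, browser, #) are passed through untouched.
strip_chr_prefix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^(track|browser|#)/ { print; next }
         { sub(/^chr/, "", $1); print }' "$1"
}

# Example (output name is a placeholder):
#   strip_chr_prefix TruSeq-Exome-Targeted-Regions.bed > targets.nochr.bed
```

    Only the first column is touched, so names such as chr1:14363-14829:WASH5P in later columns stay as they are.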


    • #3
      Thanks, I made sure to check that, and the names are the same in both files, but the error is still there.


      • #4
        Try removing the first line of the header. I had to do this for Picard's CollectRnaSeqMetrics application, which uses the same style of list:

        @HD VN:1.0 SO:coordinate


        • #5
          Unfortunately, that did not work either. Thank you for your help.


          • #6
            Originally posted by nexgengirl
            Unfortunately, that did not work either. Thank you for your help.
            Did you ever find a solution for this?

            I am still getting a similar error: Picard complains that my interval list does not have a header, even though I have followed all the instructions I could find to build it properly.

            I've tried it with and without the @HD line as well.


            • #7
              Yes, I did get this to work. The file is too large to attach so I'll show the header of the file below.

              Here's the command I used:

              java -Xmx10g -jar /path/to/picard-tools-1.54/CalculateHsMetrics.jar BAIT_INTERVALS=Truseq_for_picard_hs.bed TARGET_INTERVALS=Truseq_for_picard_hs.bed INPUT=sample.bam OUTPUT=sample.hybrid.stats.txt REFERENCE_SEQUENCE=/path/to/human_g1k_v37.fasta PER_TARGET_COVERAGE=sample.per.target.coverage.txt VALIDATION_STRINGENCY=LENIENT

              #head of the file
              head -130 Truseq_for_picard_hs.bed

              @SQ SN:1 LN:249250621
              @SQ SN:2 LN:243199373
              @SQ SN:3 LN:198022430
              @SQ SN:4 LN:191154276
              @SQ SN:5 LN:180915260
              @SQ SN:6 LN:171115067
              @SQ SN:7 LN:159138663
              @SQ SN:8 LN:146364022
              @SQ SN:9 LN:141213431
              @SQ SN:10 LN:135534747
              @SQ SN:11 LN:135006516
              @SQ SN:12 LN:133851895
              @SQ SN:13 LN:115169878
              @SQ SN:14 LN:107349540
              @SQ SN:15 LN:102531392
              @SQ SN:16 LN:90354753
              @SQ SN:17 LN:81195210
              @SQ SN:18 LN:78077248
              @SQ SN:19 LN:59128983
              @SQ SN:20 LN:63025520
              @SQ SN:21 LN:48129895
              @SQ SN:22 LN:51304566
              @SQ SN:X LN:155270560
              @SQ SN:Y LN:59373566
              @SQ SN:MT LN:16569
              @SQ SN:GL000207.1 LN:4262
              @SQ SN:GL000226.1 LN:15008
              @SQ SN:GL000229.1 LN:19913
              @SQ SN:GL000231.1 LN:27386
              @SQ SN:GL000210.1 LN:27682
              @SQ SN:GL000239.1 LN:33824
              @SQ SN:GL000235.1 LN:34474
              @SQ SN:GL000201.1 LN:36148
              @SQ SN:GL000247.1 LN:36422
              @SQ SN:GL000245.1 LN:36651
              @SQ SN:GL000197.1 LN:37175
              @SQ SN:GL000203.1 LN:37498
              @SQ SN:GL000246.1 LN:38154
              @SQ SN:GL000249.1 LN:38502
              @SQ SN:GL000196.1 LN:38914
              @SQ SN:GL000248.1 LN:39786
              @SQ SN:GL000244.1 LN:39929
              @SQ SN:GL000238.1 LN:39939
              @SQ SN:GL000202.1 LN:40103
              @SQ SN:GL000234.1 LN:40531
              @SQ SN:GL000232.1 LN:40652
              @SQ SN:GL000206.1 LN:41001
              @SQ SN:GL000240.1 LN:41933
              @SQ SN:GL000236.1 LN:41934
              @SQ SN:GL000241.1 LN:42152
              @SQ SN:GL000243.1 LN:43341
              @SQ SN:GL000242.1 LN:43523
              @SQ SN:GL000230.1 LN:43691
              @SQ SN:GL000237.1 LN:45867
              @SQ SN:GL000233.1 LN:45941
              @SQ SN:GL000204.1 LN:81310
              @SQ SN:GL000198.1 LN:90085
              @SQ SN:GL000208.1 LN:92689
              @SQ SN:GL000191.1 LN:106433
              @SQ SN:GL000227.1 LN:128374
              @SQ SN:GL000228.1 LN:129120
              @SQ SN:GL000214.1 LN:137718
              @SQ SN:GL000221.1 LN:155397
              @SQ SN:GL000209.1 LN:159169
              @SQ SN:GL000218.1 LN:161147
              @SQ SN:GL000220.1 LN:161802
              @SQ SN:GL000213.1 LN:164239
              @SQ SN:GL000211.1 LN:166566
              @SQ SN:GL000199.1 LN:169874
              @SQ SN:GL000217.1 LN:172149
              @SQ SN:GL000216.1 LN:172294
              @SQ SN:GL000215.1 LN:172545
              @SQ SN:GL000205.1 LN:174588
              @SQ SN:GL000219.1 LN:179198
              @SQ SN:GL000224.1 LN:179693
              @SQ SN:GL000223.1 LN:180455
              @SQ SN:GL000195.1 LN:182896
              @SQ SN:GL000212.1 LN:186858
              @SQ SN:GL000222.1 LN:186861
              @SQ SN:GL000200.1 LN:187035
              @SQ SN:GL000193.1 LN:189789
              @SQ SN:GL000194.1 LN:191469
              @SQ SN:GL000225.1 LN:211173
              @SQ SN:GL000192.1 LN:547496
              1 14362 14829 + chr1:14363-14829:WASH5P
              1 14969 15038 + chr1:14970-15038:WASH5P
              1 15795 15947 + chr1:15796-15947:WASH5P
              1 16606 16765 + chr1:16607-16765:WASH5P
              1 16857 17055 + chr1:16858-17055:WASH5P
              1 17232 17368 + chr1:17233-17368:WASH5P
              1 17605 17742 + chr1:17606-17742:WASH5P
              1 69090 70008 + chr1:69091-70008:OR4F5
              1 661139 665184 + chr1:661140-665184:LOC100133331
              1 761586 762902 + chr1:761587-762902:NCRNA00115
              1 763063 763155 + chr1:763064-763155:LOC643837
              1 783033 783186 + chr1:783034-783186:LOC643837
              1 787306 787490 + chr1:787307-787490:LOC643837
              1 788050 788146 + chr1:788051-788146:LOC643837
              1 788770 788902 + chr1:788771-788902:LOC643837
              1 788956 789740 + chr1:788957-789740:LOC643837
              1 803452 804055 + chr1:803453-804055:FAM41C
              1 809491 810535 + chr1:809492-810535:FAM41C
              1 812125 812182 + chr1:812126-812182:FAM41C
              1 852952 853100 + chr1:852953-853100:FLJ39609
              1 853401 853555 + chr1:853402-853555:FLJ39609
              1 854204 854295 + chr1:854205-854295:FLJ39609
              1 854714 854817 + chr1:854715-854817:FLJ39609
              1 861120 861180 + chr1:861121-861180:SAMD11
              1 861301 861393 + chr1:861302-861393:SAMD11
              1 865534 865716 + chr1:865535-865716:SAMD11
              1 866418 866469 + chr1:866419-866469:SAMD11
              1 871151 871276 + chr1:871152-871276:SAMD11
              1 874419 874509 + chr1:874420-874509:SAMD11
              1 874654 874840 + chr1:874655-874840:SAMD11
              1 876523 876686 + chr1:876524-876686:SAMD11
              1 877515 877631 + chr1:877516-877631:SAMD11
              1 877789 877868 + chr1:877790-877868:SAMD11
              1 877938 878438 + chr1:877939-878438:SAMD11
              1 878632 878757 + chr1:878633-878757:SAMD11
              1 879077 879188 + chr1:879078-879188:SAMD11
              1 879287 879583 + chr1:879288-879583:SAMD11
              1 879961 880180 + chr1:879962-880180:NOC2L
              1 880897 881033 + chr1:880898-881033:NOC2L
              1 881552 881666 + chr1:881553-881666:NOC2L
              1 881781 881925 + chr1:881782-881925:NOC2L
              1 883510 883612 + chr1:883511-883612:NOC2L
              1 883869 883983 + chr1:883870-883983:NOC2L
              1 886506 886618 + chr1:886507-886618:NOC2L
              1 887379 887519 + chr1:887380-887519:NOC2L
              1 887791 887980 + chr1:887792-887980:NOC2L
              Last edited by nexgengirl; 09-08-2012, 09:27 AM.
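
              The header above lists every sequence in the reference, including the GL contigs, whereas the header in the first post stopped at MT. A rough way to check that an interval file and a BAM header (for example, one saved with samtools view -H) describe the same dictionary is sketched below. It assumes both files are tab-delimited SAM-style headers, and it assumes that names, lengths and order all have to match; the function names are hypothetical.

```shell
# Print "name<TAB>length" for every @SQ line of a SAM-style header.
# Assumes tab-delimited fields; other tags (AS, UR, M5, SP) are ignored.
sq_names_and_lengths() {
    awk -F'\t' '$1 == "@SQ" {
        name = ""; len = ""
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SN:/) name = substr($i, 4)
            if ($i ~ /^LN:/) len = substr($i, 4)
        }
        print name "\t" len
    }' "$1"
}

# Exit status 0 only if both files contain @SQ lines and list the same
# sequence names and lengths in the same order.
same_dictionary() {
    left=$(sq_names_and_lengths "$1")
    right=$(sq_names_and_lengths "$2")
    [ -n "$left" ] && [ "$left" = "$right" ]
}

# Example (file names are placeholders):
#   samtools view -H sample.bam > bam_header.txt
#   same_dictionary Truseq_for_picard_hs.bed bam_header.txt && echo "dictionaries match"
```

              If the check fails, diffing the two outputs of sq_names_and_lengths should show which contigs are missing or out of order.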


              • #8
                Since showing the file on here doesn't look so good I have also attached the first 130 lines as a file so you can see how it looks in the terminal.


                • #9
                  I had the same problem, and I came across this small 2-liner (by a colleague) which worked for me:

                  samtools view -H input.bam > TruSeq-for-Picard.bed
                  gawk 'BEGIN {  OFS="\t"} {print $1,$2,$3,$6,$4 }' TruSeq-Exome-Targeted-Regions.bed >> TruSeq-for-Picard.bed
                  where TruSeq-Exome-Targeted-Regions.bed is the bed file downloaded off the Illumina website.


                  • #10
                    This two-liner is extremely helpful, many thanks for that.


                    • #11
                      Converting from BED to Picard's interval list format is a bit more complicated than the two-liner that was posted.

                      The BED format specification states that BED files are first-base-0 and the interval is exclusive of the last base, whereas the Picard interval list is first-base-1 and last-base inclusive.

                      So a region defined in a BED file as:
                      1 14362 14829

                      Needs to become the following in a Picard interval list:
                      1 14363 14829


                      • #12
                        Like mducar (no relation) said, the positions need to be adjusted for the different numbering schemes, so change $2 to $2+1 in your awk line. Also, if your BED file has a "track" line, omit that from your intervals file. The revised two-liner would be:

                        samtools view -H my.bam > my.1based.intervals

                        gawk 'BEGIN { OFS="\t"} {print $1,$2+1,$3,$6,$4 }' my.bed | grep -v ^track >> my.1based.intervals

                        to verify the results:
                        head my.1based.intervals
                        cat my.1based.intervals | grep -v ^@ | head
                        head my.bed
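
                        Beyond eyeballing the first few lines, one extra sanity check is to count the bases covered by each file: a 0-based, end-exclusive BED interval spans end - start bases, while a 1-based, inclusive interval spans end - start + 1, so the two totals should agree after a correct conversion. This is only a sketch with hypothetical function names, and it assumes tab-delimited files.

```shell
# Total bases covered by a BED file (0-based start, end-exclusive).
# Skips track/browser/comment lines.
bed_total_bases() {
    awk -F'\t' '!/^(track|browser|#)/ && NF >= 3 { n += $3 - $2 }
                END { print n + 0 }' "$1"
}

# Total bases covered by an interval list (1-based start, end-inclusive).
# Skips the @HD/@SQ header lines.
interval_total_bases() {
    awk -F'\t' '!/^@/ && NF >= 3 { n += $3 - $2 + 1 }
                END { print n + 0 }' "$1"
}

# Example, using the file names from the two-liner above:
#   [ "$(bed_total_bases my.bed)" = "$(interval_total_bases my.1based.intervals)" ] \
#       && echo "coordinate conversion looks consistent"
```

                        If the start column was not shifted, the interval list total comes out larger than the BED total by exactly one base per interval. Overlapping intervals are counted twice in both totals, so the comparison still holds.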

