  • Where to obtain HumanNCBI37_UCSC reference sequence?

    Hi.

    Please can somebody let me know where I can obtain the HumanNCBI37_UCSC reference sequence? It is hg19 but standard hg19 reference files have incompatible dictionaries compared with BAM files aligned with this reference. Please tell me where I can download this reference file from. I've looked everywhere (including both NCBI and UCSC) and can't find it. Thanks for your help.

    Regards

    - Dave Curtis

  • #2
    Originally posted by davecurtis View Post
    Hi.

    Please can somebody let me know where I can obtain the HumanNCBI37_UCSC reference sequence? It is hg19 but standard hg19 reference files have incompatible dictionaries compared with BAM files aligned with this reference. Please tell me where I can download this reference file from. I've looked everywhere (including both NCBI and UCSC) and can't find it. Thanks for your help.

    Regards

    - Dave Curtis
    What does that (bold above) mean? All hg19 files are in this directory at UCSC: http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/



    • #3
      Thanks. I don't see the file I am looking for in that folder.

      I have a set of BAM files which have been aligned using this file:
      samtoolsRefFile=/illumina/scratch/services/Genomes/FASTA_UCSC/HumanNCBI37_UCSC/HumanNCBI37_UCSC_XX.fa

      I have a reference file called hg19_UCSC.fa, and for most chromosomes HaplotypeCaller runs fine using this reference sequence. However, when I run HaplotypeCaller on chromosomes 19, 21 and 22, I get this error message:
      WARN 08:38:22,963 SequenceDictionaryUtils - Input files reads and reference have incompatible contigs: The following contigs included in the intervals to process have different indices in the sequence dictionaries for the reads vs. the reference: [chr22]. As a result, the GATK engine will not correctly process reads from these contigs. You should either fix the sequence dictionaries for your reads so that these contigs have the same indices as in the sequence dictionary for your reference, or exclude these contigs from your intervals. This error can be disabled via -U ALLOW_SEQ_DICT_INCOMPATIBILITY, however this is not recommended as the GATK engine will not behave correctly..

      In fact, even if I set ALLOW_SEQ_DICT_INCOMPATIBILITY I still get the error and I don't get any calls for these chromosomes.

      It seems that there is some incompatibility in the dictionaries of the BAM and reference files which I have not been able to fix.

      Searching on Google, I have seen other people refer to the HumanNCBI37_UCSC reference sequence, so I assume it is a standard reference for hg19, presumably with a slightly different dictionary from the file called hg19_UCSC.fa.
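One way to see exactly which contigs GATK is complaining about is to compare the order of the `@SQ` lines in the BAM header with the order of the contigs in the reference index. The following is a minimal sketch, not GATK's own check. It assumes the header text has been obtained separately (for example with `samtools view -H`) and that the reference has a `.fai` index (for example from `samtools faidx`). The contig names, lengths and offsets in the toy inputs are made up for illustration.

```python
def contigs_from_sam_header(header_text):
    """Return contig names in the order of the @SQ lines of a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        if line.startswith("@SQ"):
            # Each field after the record type looks like TAG:value.
            fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
            names.append(fields["SN"])
    return names


def contigs_from_fai(fai_text):
    """Return contig names in the order they appear in a .fai index."""
    return [line.split("\t")[0] for line in fai_text.splitlines() if line.strip()]


def index_mismatches(read_contigs, ref_contigs):
    """Map contig name -> (index in reads, index in reference) where they differ.

    A contig that is absent from the reference gets None as its reference index.
    """
    ref_index = {name: i for i, name in enumerate(ref_contigs)}
    mismatches = {}
    for i, name in enumerate(read_contigs):
        j = ref_index.get(name)
        if j != i:
            mismatches[name] = (i, j)
    return mismatches


# Toy inputs (hypothetical lengths and offsets, for illustration only).
bam_header = (
    "@HD\tVN:1.0\tSO:coordinate\n"
    "@SQ\tSN:chr20\tLN:300\n"
    "@SQ\tSN:chr19\tLN:200\n"
    "@SQ\tSN:chr22\tLN:100\n"
)
ref_fai = (
    "chr20\t300\t7\t60\t61\n"
    "chrY\t250\t320\t60\t61\n"
    "chr19\t200\t582\t60\t61\n"
    "chr22\t100\t793\t60\t61\n"
)

mismatches = index_mismatches(
    contigs_from_sam_header(bam_header), contigs_from_fai(ref_fai)
)
for name, (reads_idx, ref_idx) in mismatches.items():
    print(f"{name}: index {reads_idx} in reads, {ref_idx} in reference")
```

In this toy case the reference has one extra contig in the middle of the list, so every contig listed after it ends up with a different index in the two dictionaries, even though the sequences themselves may be identical.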



      • #4
        Perhaps someone else will have a better answer ...

        You may have to ask whoever aligned those files in the first place where they got their reference from. With patches and releases, it may be difficult to pin down the exact provenance of a file that claims to be the HumanNCBI37_UCSC reference unless you know that it was obtained from the UCSC directory I posted above.



        • #5
          Thanks. I think I've worked it out. The BAM files I have were prepared with two different references, one with the Y chromosome and one without, and this threw out the indexing for the chromosomes listed after it.
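A mixed batch like this can be spotted by grouping the BAM files by the contig order in their headers. The sketch below assumes the contig lists have already been extracted from each header. The file names and the contig order are hypothetical and are not taken from the actual data in this thread.

```python
from collections import defaultdict


def group_by_dictionary(headers):
    """Group file names by their exact contig order.

    headers: mapping of file name -> list of contig names in @SQ order.
    Returns a dict keyed by the contig order (as a tuple).
    """
    groups = defaultdict(list)
    for file_name, contigs in headers.items():
        groups[tuple(contigs)].append(file_name)
    return dict(groups)


def contigs_not_shared(groups):
    """Return contigs present in some dictionaries but not in all of them."""
    orders = [set(order) for order in groups]
    if not orders:
        return []
    common = set.intersection(*orders)
    union = set.union(*orders)
    return sorted(union - common)


# Hypothetical batch: two files aligned with chrY in the reference, one without.
headers = {
    "sampleA.bam": ["chr20", "chrY", "chr19", "chr22", "chr21"],
    "sampleB.bam": ["chr20", "chr19", "chr22", "chr21"],
    "sampleC.bam": ["chr20", "chrY", "chr19", "chr22", "chr21"],
}

groups = group_by_dictionary(headers)
for order, files in groups.items():
    print(len(order), "contigs:", ", ".join(files))
print("not shared by all files:", contigs_not_shared(groups))
```

If more than one group comes back, each group needs to be processed against the reference whose contig order matches it, rather than one reference for the whole batch.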
