  • Mucki0815
    Junior Member
    • Sep 2011
    • 2

    Human Reference Genome

    Hello everyone!

    Sorry to bother the more experienced people with a beginner's question like this, but which human reference genome should one use nowadays, and where can I download it in FASTA format? Are there different reference genomes that yield differing results when aligning data?

    When browsing the NCBI homepage I found a remark somewhere that one should use the same reference genome as the 1000 Genomes Project, but the links only led me to an FTP server page (ftp://ftp-trace.ncbi.nih.gov/1000gen...cal/reference/) that looks like this:

    Oct 08 2009 00:00 579 README.human_g1k_v37.fasta.txt
    Aug 27 2009 00:00 136 README_gencode_gtf_format
    Aug 13 2009 00:00 4313 SNPChrPosAllele_b129.README
    Aug 13 2009 00:00 189073716 SNPChrPosAllele_b129.txt.gz
    Oct 29 2010 00:00 Directory ancestral_alignments
    Nov 03 2010 00:00 398589572 dbsnp132_20101103.vcf.gz
    Oct 13 2011 02:31 Directory exome_pull_down_targets
    Jul 22 2010 00:00 8930799 gencode.v4.pc_translations.fa.gz
    Jul 22 2010 00:00 594881 gencode.v4.polyAs.GRCh37.gtf.gz
    Jul 22 2010 00:00 15059 gencode.v4.tRNAs.GRCh37.gtf.gz
    Jul 02 2010 00:00 21227244 gencode_v4.annotation.GRCh37.gtf.gz
    Oct 27 2010 00:00 1396 human_ancestor_GRCh37_e59.README
    Oct 27 2010 00:00 794022511 human_ancestor_GRCh37_e59.tar.bz2
    May 17 2010 00:00 2746 human_g1k_v37.fasta.fai
    May 17 2010 00:00 892331003 human_g1k_v37.fasta.gz
    Nov 01 2010 00:00 33054817 merge_rs_b129_b132.txt.gz
    Sep 23 2011 02:32 Directory phase2_mapping_resources
    Jul 13 2011 02:34 Directory phase2_reference_assembly_sequence
    Jul 13 2011 02:34 Directory reference_assembly_sequence
    Feb 24 2010 00:00 22291 sample_genders.csv
    Nov 03 2010 00:00 33280 snp_info_tags_b132.xls

    There is no further explanation.

    What do all these abbreviations mean? What's the difference between a fasta.fai and a fasta.gz file?

    The README.human_g1k_v37.fasta.txt file tells me to:

    1. Download individual chrs from ensembl ftp

    ftp://ftp.ensembl.org/pub/current_fa...o_sapiens/dna/

    2. Download the newer version of the MT (NC_012920) from:



    3. Create a reference with chrs1-22, X, Y, NC_012920 MT, and include the non-chromosomal supercontigs. The new single fasta is posted:

    ftp://ftp.sanger.ac.uk/pub/1000genom...ect_reference/

    The sanger homepage then shows me these files:

    Parent Directory

    Oct 07 2009 00:00 579 README
    Oct 08 2009 00:00 2746 human_g1k_v37.fasta.fai
    Oct 08 2009 00:00 67 human_g1k_v37.fasta.fai.md5
    Oct 07 2009 00:00 869925027 human_g1k_v37.fasta.gz
    Oct 07 2009 00:00 57 human_g1k_v37.fasta.gz.md5
    Oct 07 2009 00:00 Directory old

    So are "human_g1k_v37.fasta.fai" and "human_g1k_v37.fasta.gz" the complete reference genomes? What does the ending ".md5" mean?

    How can I merge several FASTA files into one big file?


    Thanks in advance for your help.

    Greetings,

    Alexander
  • colindaven
    Senior Member
    • Oct 2008
    • 417

    #2
    Hi,

    For md5, google "md5 sum": an .md5 file holds a checksum that lets you verify a download is intact.
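To illustrate how the .md5 files next to the downloads are used: recompute the checksum of the file you downloaded and compare it with the one published on the server. This is only a minimal Python sketch. It assumes the .md5 file follows the usual md5sum layout (hex digest first, then the file name), and the function names are made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's digest with the first field of its .md5 file (assumed layout)."""
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

For the files listed above this would be called as `matches_md5_file("human_g1k_v37.fasta.gz", "human_g1k_v37.fasta.gz.md5")`; a result of `False` suggests the download is corrupted or incomplete.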

    The human genome should be around 3 - 3.2 Gb, depending, as you say, on whether you include the extra contigs.

    You're partially right: human_g1k_v37.fasta.gz looks to me like the complete reference from this source.

    The .fai file is a FASTA index, not the genome itself; it can be generated with Samtools.
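To make the index a bit more concrete: a .fai file is a small tab-separated text table with one line per sequence (name, length, byte offset, bases per line, bytes per line), which is why it is only a few kilobytes while the .fasta.gz is hundreds of megabytes. Below is a minimal sketch that reads one and returns the sequence lengths; the function name is invented for this example.

```python
def read_fai(path):
    """Return a dict mapping sequence name -> length from a FASTA index (.fai) file."""
    lengths = {}
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 2:
                continue  # skip blank or malformed lines
            # Column 1 is the sequence name, column 2 its length in bases.
            lengths[fields[0]] = int(fields[1])
    return lengths
```

Summing the values, e.g. `sum(read_fai("human_g1k_v37.fasta.fai").values())`, gives the total number of bases, which is a quick way to check a reference against the 3 - 3.2 Gb figure. The index itself is normally created from the uncompressed FASTA with `samtools faidx`.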

    Most people seem to build a complete genome from the individual contigs.

    See the first post in
    Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

    for a nice manual on how to build your own human genome with "cat".


    • rskr
      Senior Member
      • Oct 2010
      • 249

      #3
      Not a trivial question. It depends on what you want to do with it. Many people simply can't deal with the variation in regions such as HLA on chromosome 6 or the VDJ regions, so they choose to ignore them. That is a bit sad, because most people working with the human genome are in medicine and should be very interested in HLA, as it is crucial to immune system function.


      • vcguy
        Junior Member
        • Mar 2011
        • 1

        #4
        The reference genomes for human, mouse and zebrafish are improved, maintained and released by the Genome Reference Consortium (GRC).



        The last major release was GRCh37, which is what you see in most of the browsers. Since that release, however, there have been regional fixes in the form of "patches"; the latest assembly at the moment is GRCh37.p5. You can download the latest data from the above website. Other information, including problematic regions and fixes, is also displayed on the website.

        Hope that helps.


        • rosa_dentellare
          Member
          • Sep 2011
          • 10

          #5
          Hi,

          I need help from the sequencing community.

          I've downloaded all of the GRCh37 assembled reference from ftp://ftp.ncbi.nlm.nih.gov/genbank/g...mosomes/FASTA/.

          But what I got was 48 files of individual chromosomes. I was thinking of merging all the files together, but there are two types of file for each chromosome:
          1) chr*.fa.gz
          2) chr*.rm.out.gz

          Would it be OK to merge them together with the RepeatMasker output (.rm.out.gz) files to build my reference?

          Also, does anyone know how to mask out the PAR in the reference?

          • dpryan
            Devon Ryan
            • Jul 2011
            • 3478

            #6
            I expect merging the regular fasta files with the repeat masked files is not what you want to do, at least if you plan to use the resulting file for mapping or anything else that's standard. Just concatenate the various chr*.fa.gz files together.
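For anyone who wants to script the concatenation, here is a rough Python sketch rather than a definitive recipe. It assumes the files sit in one directory and are named like chr1.fa.gz, chr2.fa.gz, ... chrX.fa.gz (the chr*.fa.gz pattern from the question); the function names and the sort order (numbered chromosomes first, then X, Y, M/MT, then everything else) are choices made for this example.

```python
import glob
import gzip
import os
import re


def chrom_sort_key(path):
    """Order chr1..chr22 numerically, then X, Y, M/MT, then anything else by name."""
    name = os.path.basename(path)
    match = re.match(r"chr([^.]+)", name)
    label = match.group(1) if match else name
    if label.isdigit():
        return (0, int(label), "")
    special = {"X": 0, "Y": 1, "M": 2, "MT": 2}
    if label in special:
        return (1, special[label], "")
    return (2, 0, label)


def concatenate_fasta_gz(input_dir, output_path):
    """Decompress every chr*.fa.gz in input_dir into one plain FASTA file."""
    pattern = os.path.join(input_dir, "chr*.fa.gz")
    paths = sorted(glob.glob(pattern), key=chrom_sort_key)
    with open(output_path, "wb") as out:
        for path in paths:
            last = b"\n"
            with gzip.open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            # Make sure the next header starts on its own line.
            if last != b"\n":
                out.write(b"\n")
    return paths
```

The chr*.fa.gz pattern does not match the chr*.rm.out.gz files, so the RepeatMasker output is left out automatically. Whatever order you choose, keep it fixed, since index files and downstream tools are tied to the exact reference file they were built from.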


            • rosa_dentellare
              Member
              • Sep 2011
              • 10

              #7
              Thanks for the input, dpryan. I appreciate it.

              I'm a bit confused. What are the *.rm.out.gz files for, if I may ask?

              • dpryan
                Devon Ryan
                • Jul 2011
                • 3478

                #8
                They're the output from RepeatMasker, listing which regions are repeats and of what type (LINEs, SINEs, LTRs, etc.). They aren't FASTA files.
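If you want to look inside one of these files, here is a small sketch under some assumptions: that the file uses the standard RepeatMasker .out layout (a few header lines, then whitespace-separated rows where columns 5 to 7 are the sequence name, start and end, and columns 10 and 11 are the repeat name and class/family), and that coordinates are 1-based and inclusive. The function names are invented for this example.

```python
import gzip
from collections import Counter


def iter_repeats(path):
    """Yield (sequence, start, end, repeat_name, repeat_class) from a .out file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for line in handle:
            fields = line.split()
            # Data rows start with an integer score; header and blank lines do not.
            if len(fields) < 11 or not fields[0].isdigit():
                continue
            yield fields[4], int(fields[5]), int(fields[6]), fields[9], fields[10]


def annotated_bases_by_class(path):
    """Total annotated bases per repeat class/family (overlapping hits count twice)."""
    totals = Counter()
    for _seq, start, end, _name, repeat_class in iter_repeats(path):
        totals[repeat_class] += end - start + 1
    return totals
```

Calling `annotated_bases_by_class("chr1.rm.out.gz")` (a hypothetical file name) returns a per-class tally, which shows why these files are annotation tables rather than sequence that belongs in the reference.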


                • rosa_dentellare
                  Member
                  • Sep 2011
                  • 10

                  #9
                  OK, got it now. Thank you, dpryan =) You've been a help.

                  • rosa_dentellare
                    Member
                    • Sep 2011
                    • 10

                    #10
                    Oh, another question came to mind: how do I remove the PAR from the reference? Or has it already been removed from the .fa files?