  • Haplotype and "random" chromosomes

    Hi all!

    I'm trying to make sense of some sequence data (50-base reads, Illumina), and I noticed a coverage gap on chromosome 6, around the region that aligns very well to the haplotype chromosomes. Some reads mapped to the haplotype chromosomes (e.g. chr6_cox_hap1), but not enough to explain the dip in coverage. I am worried that, because of the high homology between the chromosomes, the "missing" reads might be "hiding" as ##:##:## (i.e., flagged as not mappable because they map equally well to more than one locus).

    So basically what I am wondering (and I apologize if this is a very basic question) is: do you align your reads to all the available chromosomes, or do you omit the "haplotype" and "random" ones from your build? And if you are using all of the chromosomes, do you observe the same dip in coverage?

    I would be very grateful for any advice you might have...
    Thanks!!
    Popto

  • #2
    I always omit the haplotype sequences from the reference index, for precisely the reason you mention.

    Simon

    • #3
      Thank you, Simon, this is very helpful.

      • #4
        Originally posted by Simon Anders
        I always omit the haplotype sequences from the reference index, for precisely the reason you mention.

        Simon
        How do you determine which region is haplotype sequence?

        • #5
          I took my reference from Ensembl: ftp://ftp.ensembl.org/pub/current_fa...o_sapiens/dna/

          All the files with "HSCHR" in the file name are haplotype variants, e.g., the "HSCHR6_MHC" files contain variants of the MHC region of chromosome 6. I suggest simply not including these files when building the reference (unless, of course, you are specifically interested in them, but then you need to do some additional tweaking).

          The "nonchromosomal" file contains the "random" contigs. I usually include them, but these contigs are so short that it does not really matter.

          By the way, do not take the repeat-masked sequences ("rm" in the file name). You should leave checking for repeats to the aligner.

          Simon
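
The file selection described in this post can be sketched in a few lines of Python. This is only a minimal illustration, not part of the original reply: the file names in the example are hypothetical placeholders, and the check for "dna_rm" is an assumption about how the "rm" marker appears in the file name.

```python
def keep_for_index(filename: str) -> bool:
    """Return True if a FASTA file should go into the reference index."""
    name = filename.rsplit("/", 1)[-1]  # drop any directory part
    if "HSCHR" in name:
        return False  # haplotype variant, e.g. the HSCHR6_MHC files
    if "dna_rm" in name:
        return False  # assumed marker for repeat-masked sequence
    return True       # primary chromosomes and the "nonchromosomal" file


def select_reference_files(filenames):
    """Keep only the files that should be used to build the index."""
    return [f for f in filenames if keep_for_index(f)]


# Hypothetical file names, for illustration only.
example = [
    "Homo_sapiens.EXAMPLE.dna.chromosome.6.fa.gz",
    "Homo_sapiens.EXAMPLE.dna.chromosome.HSCHR6_MHC_COX.fa.gz",
    "Homo_sapiens.EXAMPLE.dna.nonchromosomal.fa.gz",
    "Homo_sapiens.EXAMPLE.dna_rm.chromosome.6.fa.gz",
]
selected = select_reference_files(example)
for f in selected:
    print(f)
```

Only the primary chromosome 6 file and the "nonchromosomal" file remain in `selected`; these would then be passed to whatever index-building command your aligner provides.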

          • #6
            Simon,

            I presume that if you do exclude the haplotypes in the index then you remove those chromosomes from the GTF annotation file as well? Right?

            So, if I understand the reason correctly, Simon, you remove these haplotypes because the high similarity between the two chromosomes causes an alignment problem, and you may get false mappings to a chromosome?

            Thanks,

            • #7
              Originally posted by pcg
              I presume that if you do exclude the haplotypes in the index then you remove those chromosomes from the GTF annotation file as well? Right?
              Actually, no. The aligner does not need a GTF file, and when counting later (e.g. with my htseq-count script), a feature in the GTF file with a chromosome name that does not appear in the SAM file will not collect any counts anyway.

              So, if I understand the reason correctly, Simon, you remove these haplotypes because the high similarity between the two chromosomes causes an alignment problem, and you may get false mappings to a chromosome?
              Especially when looking for differential expression, it is a good idea to discount all non-unique alignments. Now, if the aligner sees several versions of, e.g., the MHC, it does not know that these are all variants of the same region but rather treats them as paralogs at different places. So, if a read maps there, the aligner will think that there are multiple mappings and flag the read accordingly, and you will exclude it. You end up with no signal at all at the variant regions, even (or: especially) at the parts of the variant region that are actually conserved and would hence have posed no problem for mapping.

              Simon
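
The effect described in this post can be reproduced with a toy model. The sketch below is not a real aligner: it uses exact string matching on made-up sequences, and the names `chr6` and `chr6_hap` are illustrative. It only shows how discarding non-unique hits removes the reads from the conserved part once a haplotype copy is added to the reference.

```python
def find_hits(read, reference):
    """Return (sequence_name, position) for every exact match of the read."""
    hits = []
    for name, seq in reference.items():
        start = seq.find(read)
        while start != -1:
            hits.append((name, start))
            start = seq.find(read, start + 1)
    return hits


def count_unique_reads(reads, reference):
    """Count reads per sequence, discarding reads with more than one hit."""
    counts = {name: 0 for name in reference}
    for read in reads:
        hits = find_hits(read, reference)
        if len(hits) == 1:  # keep uniquely mapping reads only
            counts[hits[0][0]] += 1
    return counts


# Made-up sequences: the haplotype copy shares a conserved stretch with
# chr6 and differs from it by a single base further along.
chr6 = "ACGTTGCAAGGCTTACCGATAGGCTA"
chr6_hap = "TTGCAAGGCTTACCGTTAGG"

reads = [
    "TTGCAAGG",  # conserved: present in both chr6 and chr6_hap
    "ACCGATAG",  # covers the variant base: present in chr6 only
    "TTGCAAGG",  # conserved again
]

primary_only = count_unique_reads(reads, {"chr6": chr6})
with_haplotype = count_unique_reads(reads, {"chr6": chr6, "chr6_hap": chr6_hap})
print(primary_only)
print(with_haplotype)
```

With the primary sequence alone, all three reads are counted. With the haplotype copy included, the two conserved reads become ambiguous and are dropped, so only the read covering the variant base is still counted.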

              • #8
                Thanks Simon for your reply.

                As you rightly point out, you do not need a GTF for alignment. But if you want to run a Cufflinks analysis on the alignment and only want expression for what is currently annotated (in the GTF), then unless you remove those haplotypes from the GTF file, won't you still see hits to them and expression values?

                Thanks in advance,
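
For reference, removing the haplotype records from a GTF file could be sketched as below. Whether Cufflinks actually needs this is the open question in this post; the sketch only shows the filtering step. The function name, the toy records, and the set of chromosome names to keep are all illustrative assumptions. The first tab-separated column of a GTF record is the sequence name.

```python
def filter_gtf_lines(lines, keep_chromosomes):
    """Keep comment lines and records whose sequence name is in keep_chromosomes."""
    kept = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)  # pass comments and blank lines through
            continue
        chrom = line.split("\t", 1)[0]  # column 1 of a GTF record
        if chrom in keep_chromosomes:
            kept.append(line)
    return kept


# Toy GTF records, for illustration only.
gtf = [
    "#!example header",
    '6\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "geneA";',
    'HSCHR6_MHC_COX\ttoy\texon\t100\t200\t.\t+\t.\tgene_id "geneA_hap";',
]
filtered = filter_gtf_lines(gtf, keep_chromosomes={"6"})
for line in filtered:
    print(line)
```

In practice, the set of names to keep would be taken from the sequences that actually went into the reference index, so that annotation and index stay consistent.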
