  • Strand specificity of genome fasta files (hg19)

    Hello everybody,

    do genomic reference FASTA files, such as the well-known hg19.fa,
    usually contain
    one continuous physical strand, meaning that sense (plus) and antisense (minus) sections are shown in their natural order,
    or a virtual concatenation of only the sense (plus) sections, which physically alternate between the two single strands?

    Regards,
    SquirrelSeq

  • #2
    Reference fasta files contain (or rather, define) only the plus strand of a genome. Sequences are concatenated only where there is evidence that they are physically joined, so for example hg19.fa contains only 25 main sequences: one for each of the 22 autosomes, plus X, Y, and the mitochondrial genome. There can also be some shorter additional sequences, but those represent human variation or imperfections in the assembled genome and can be safely ignored for the purpose of this discussion.
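
One way to check the sequence count yourself is to list the header lines of the file. This is only a minimal sketch: the function names are made up for illustration, the input is a toy stand-in rather than the real hg19.fa, and the underscore filter assumes UCSC-style names in which the additional sequences contain an underscore (for example chr1_gl000191_random).

```python
import io

def fasta_sequence_names(handle):
    """Return the first word of every '>' header line, in file order."""
    names = []
    for line in handle:
        if line.startswith(">"):
            words = line[1:].split()
            names.append(words[0] if words else "")
    return names

def main_sequences(names):
    """Keep names without an underscore (assumed UCSC-style naming)."""
    return [name for name in names if "_" not in name]

# Toy stand-in for a genome FASTA; a real run would pass open("hg19.fa").
toy = io.StringIO(
    ">chr1\nACGTACGT\n"
    ">chr1_gl000191_random\nGGCC\n"
    ">chrM\nTTAA\n"
)
names = fasta_sequence_names(toy)
print(len(names), "records;", "main sequences:", main_sequences(names))
```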

    The direction in which genes are read is totally unrelated to how a genome fasta file is laid out; there will be some genes on the plus strand and some on the minus strand, and of course (for human) the majority of the genome is non-coding anyway. A transcriptome fasta is different, though: it should have one sequence per gene or gene isoform, written in the sense in which it is read, rather than however it appears in the genome.
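
To make the genome/transcriptome difference concrete, here is a rough sketch of how a gene's sense sequence could be pulled out of a plus-strand reference. The helper names, the toy reference string, and the 0-based end-exclusive coordinates are all assumptions for illustration; splicing is ignored entirely.

```python
# Translation table for complementing DNA (N stays N, case is preserved).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def sense_sequence(reference, start, end, strand):
    """Return the region in the orientation in which the gene is read.

    start/end are 0-based, end-exclusive coordinates on the reference
    (plus) strand; strand is '+' or '-' as given by an annotation.
    """
    region = reference[start:end]
    if strand == "+":
        return region
    if strand == "-":
        return reverse_complement(region)
    raise ValueError("strand must be '+' or '-'")

# Toy reference: a plus-strand "gene" at 0..6 and a minus-strand one at 6..12.
reference = "ATGGCCTTTCAT"
plus_gene = sense_sequence(reference, 0, 6, "+")
minus_gene = sense_sequence(reference, 6, 12, "-")
print(plus_gene, minus_gene)
```

Both toy genes come out starting with ATG, even though only the first one reads that way in the reference string itself.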
    Last edited by Brian Bushnell; 07-21-2014, 09:06 AM.


    • #3
      Hello Brian,

      thank you for the answer.

      The whole confusion is caused by mixing up two different definitions of "plus"/"+" and "minus"/"-": one used when people talk about single genes, the other when they talk about genes on whole-chromosome fasta references.

      My concatenation scenario did of course not mean joining physically unjoined sequences, as I know the typical per-chromosome structure of FASTA files very well. Instead, assuming that the "+"/"-" terminology was used consistently, I imagined a concatenation of adjacent sequence regions in sense (protein-coding) orientation, following the alternation of strand usage for coding along the chromosome, as a complicated but consistent solution. This would of course mean an arbitrary definition of which strand is "+" and which is "-" in noncoding regions, and furthermore positional jumping with respect to the dsDNA molecule.

      From the perspective of FASTA files and gene annotation practices, your explanation is useful. However, since it is not consistent with the independent definition of "plus" and "minus", people get confused and misunderstand each other, as I have seen in many cases.

      Therefore, some facts:

      1. "In genomic FASTA reference files, all lines are from the same strand".

      Which molecular strand is selected for the FASTA reference of a double-stranded chromosome is fully arbitrary and has nothing to do with "+" or "-".

      2. However, the only way to give the published strand of the assembly an identity is to refer to the GENES that are coded in "sense" on this selected strand, meaning that...

      3. if gene annotation tools specify a gene as coded on the "-" strand, it solely means that this gene is coded on the strand antisense to the arbitrarily published one. This is not to be confused with the reference-independent definition of "+" and "-" strand, which is...

      4. "Molecular biologists call a single strand of DNA sense (or positive (+)) if an RNA version of the same sequence is translated or translatable into protein."
      "The two complementary strands of double-stranded DNA (dsDNA) are usually differentiated as the "sense" strand and the "antisense" strand. The DNA sense strand looks like the messenger RNA (mRNA) and can be used to read the expected protein code by human eyes (e.g. ATG codon = Methionine amino acid)."
      http://en.wikipedia.org/wiki/Sense_(molecular_biology)
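
Point 3 can be illustrated with an annotation line. In GFF3 the strand is the seventh tab-separated column; the sketch below just reads that column and reports whether the gene's sense strand is the one written in the FASTA. The function names and the example line (source, coordinates, ID) are invented for illustration and do not come from a real annotation.

```python
def feature_strand(gff_line):
    """Return the strand column ('+', '-', '.' or '?') of one GFF3 line."""
    fields = gff_line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated GFF3 columns")
    return fields[6]

def sense_is_reference_strand(gff_line):
    """True if the gene reads like the FASTA, False if it reads like the
    reverse complement, None if the annotation gives no strand."""
    strand = feature_strand(gff_line)
    if strand == "+":
        return True
    if strand == "-":
        return False
    return None

# Made-up annotation line, for illustration only.
line = "chr18\texample\tgene\t100\t200\t.\t-\t.\tID=geneB"
print(feature_strand(line), sense_is_reference_strand(line))
```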

      I hope this helps with future questions on this topic.

      Best regards,
      SquirrelSeq


      • #4
        Hmm, I guess "+" and "-" are overloaded terms. When dealing with the human genome, the people I worked with generally talked about reads mapping to the plus strand, or having the 'A' allele of a SNP on the plus strand, with the assumption that plus meant the strand represented in the fasta file, NOT any gene that happened to be at that location. Because a majority of the human genome is noncoding, most of it cannot be described as plus or minus using a gene-centric definition, but the strands still need to be described somehow for clarity.
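
This reference-centric meaning of plus/minus is also what read aligners report. In the SAM format, bit 0x10 of the FLAG field says the read was aligned as the reverse complement of the reference, and bit 0x4 marks an unmapped read. A small sketch (the function name is my own, not part of any library):

```python
UNMAPPED = 0x4   # SAM FLAG bit: read is unmapped
REVERSE = 0x10   # SAM FLAG bit: read aligned as reverse complement

def read_strand(flag):
    """Return '+' or '-' relative to the reference FASTA, or None if unmapped."""
    if flag & UNMAPPED:
        return None
    return "-" if flag & REVERSE else "+"

for flag in (0, 16, 83, 99, 4):
    print(flag, read_strand(flag))
```

Note that nothing here looks at genes at all: the strand is defined purely against the sequence stored in the reference file.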


        • #5
          Originally posted by SquirrelSeq
          Hello everybody,

          do genomic reference FASTA files, such as the well-known hg19.fa,
          usually contain
          one continuous physical strand, meaning that sense (plus) and antisense (minus) sections are shown in their natural order,
          or a virtual concatenation of only the sense (plus) sections, which physically alternate between the two single strands?

          Regards,
          SquirrelSeq



          This question confused me for quite some time. Thanks for the discussion above. I want to share a little experiment that convinced me that, in the reference FASTA, all sequences are from the same strand.



          I did a few experiments.

          I picked a sense gene in GRCm38 (mm10), Apc, which is located on chromosome 18, and took a random piece of its reference sequence, say "TCCAGATAGTCCTGGGCAGACCATGCCACCAA".

          You can see it in the fasta file, which verifies that it really comes from the reference FASTA:
          [Screenshot: Screen Shot 2023-11-06 at 10.24.05 AM.png]
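
If you want to repeat this check without eyeballing the file, keep in mind that FASTA sequences are usually wrapped over many lines, so a piece can straddle a line break and a plain line-by-line search can miss it. Below is a rough sketch that joins the lines of each record before searching; the function name and the toy record are made up, and the search is case-insensitive on the assumption that the file may use lowercase for soft-masked repeats.

```python
import io

def find_in_fasta(handle, query):
    """Return (record_name, 0-based offset) for the first hit in each record."""
    query = query.upper()
    hits = []
    name, chunks = None, []

    def search(record_name, record_chunks):
        # Join the wrapped lines so matches across line breaks are found.
        pos = "".join(record_chunks).upper().find(query)
        if record_name is not None and pos != -1:
            hits.append((record_name, pos))

    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            search(name, chunks)
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif line:
            chunks.append(line)
    search(name, chunks)
    return hits

piece = "TCCAGATAGTCCTGGGCAGACCATGCCACCAA"
# Toy record in which the piece is split across two lines on purpose.
toy = io.StringIO(">chrToy\nGATTACA" + piece[:12] + "\n" + piece[12:] + "\nGGTT\n")
hits = find_in_fasta(toy, piece)
print(hits)
```

This reads one whole record into memory at a time, which should be acceptable for a single chromosome but is not tuned for speed.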


          Now I BLAST it:

          [Screenshot: Screen Shot 2023-11-06 at 10.24.31 AM.png]
          I got a result and can view it in the "Graphics" mode:


          [Screenshot: Screen Shot 2023-11-06 at 10.25.09 AM.png]

          As you can see, our random piece "TCCAGATAGTCCTGGGCAGACCATGCCACCAA" appears on the top strand (5' -> 3'), and the green track marks the sense gene Apc.


          What about an antisense gene on the same chromosome? Let's say Smad4.

          Again I took a random piece, "CCCCACCTTGTCTATGACACATCAAACTAT", from the region of chromosome 18 where Smad4 is located.

          To verify, let's see:

          [Screenshot: Screen Shot 2023-11-06 at 11.08.36 AM.png]



          And this random piece is still seen on the top strand:

          [Screenshot: Screen Shot 2023-11-06 at 11.07.59 AM.png]


          So my conclusion is that in both scenarios the reference is the same strand, written 5'-3'.
          For a sense gene (which uses the other DNA strand, 3'-5', as the template in transcription), the mRNA carries the same information as the reference DNA (except for the T/U difference), and the reference sequence is called the coding strand.
          For an antisense gene, the reference sequence is the template, so the corresponding mRNA carries the same information as the other strand, and the reference sequence is the template strand.
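
The same experiment can be mimicked offline with a plain string search on both strands. This is only a sketch under stated assumptions: the function names are mine, and the "reference" is a toy string built from the two pieces above plus arbitrary filler, not the real chromosome 18.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]

def locate(reference, query):
    """Return (0-based position, strand) of query relative to the reference.

    '+' means the query appears as written in the reference; '-' means only
    its reverse complement does. Returns None if neither is found.
    """
    pos = reference.find(query)
    if pos != -1:
        return pos, "+"
    pos = reference.find(reverse_complement(query))
    if pos != -1:
        return pos, "-"
    return None

apc_piece = "TCCAGATAGTCCTGGGCAGACCATGCCACCAA"    # from the Apc region
smad4_piece = "CCCCACCTTGTCTATGACACATCAAACTAT"    # from the Smad4 region

# Toy reference: both pieces as copied from the FASTA, with filler around them.
toy_reference = "GATTACA" + apc_piece + "NNNN" + smad4_piece + "TTGACC"

print(locate(toy_reference, apc_piece))
print(locate(toy_reference, smad4_piece))
print(locate(toy_reference, reverse_complement(smad4_piece)))
```

Both pieces copied from the FASTA land on "+" regardless of how the gene is annotated; only a sequence written in the other orientation, such as the reverse complement of the Smad4 piece, comes back as "-".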


          I hope this helped.




