  • Naive question about read mapping: where are the introns in genome.fa data?

    Dear all.

    I have a question. I found that the genome FASTA data contains only sequence ("ATCGG..."). How does mapping software, such as TopHat, decide where the introns and exons are?

  • #2
    Any help????

    • #3
      Genome FASTA files are generally not annotated with which sequence is intron and which is exon. You need another file that says where the introns and exons are, such as a GFF.
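To make the annotation point concrete, here is a minimal Python sketch of how exon records in a GFF-style file imply intron positions. The three GFF3-like lines, the transcript name `tx1`, and both helper functions are made up for illustration; real annotation files carry more feature types and attributes.

```python
from collections import defaultdict

# Hypothetical GFF3-style lines (tab-separated); coordinates are invented.
GFF_TEXT = "\n".join([
    "chr1\texample\texon\t101\t200\t.\t+\t.\tParent=tx1",
    "chr1\texample\texon\t501\t650\t.\t+\t.\tParent=tx1",
    "chr1\texample\texon\t901\t1000\t.\t+\t.\tParent=tx1",
])


def exons_by_transcript(gff_text):
    """Collect (start, end) exon intervals per transcript from GFF3-style text."""
    exons = defaultdict(list)
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon features matter for this sketch
        attrs = dict(kv.split("=", 1) for kv in fields[8].split(";") if "=" in kv)
        exons[attrs.get("Parent", "unknown")].append((int(fields[3]), int(fields[4])))
    return exons


def introns_from_exons(exon_list):
    """Introns are the gaps between consecutive exons (1-based, inclusive)."""
    ordered = sorted(exon_list)
    return [(prev_end + 1, next_start - 1)
            for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])]


for tx, exon_list in exons_by_transcript(GFF_TEXT).items():
    print(tx, "exons:", sorted(exon_list), "introns:", introns_from_exons(exon_list))
```

The genome sequence itself never appears here: the intron and exon boundaries come entirely from the coordinates in the annotation.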

      • #4
        I agree with swbarnes2 that genome FASTA files are not annotated and that you generally need a GFF file for proper, verified splice sites. But since you asked about TopHat, and presumably about de novo detection of junctions, I quote from the TopHat manual:

        TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against these junctions to confirm them.
        Look at the manual for more help.

        • #5
          Originally posted by westerman
          Look at the manual for more help.
          http://tophat.cbcb.umd.edu/manual.html
          Thank you. That's exactly my question. Why can reads that align contiguously to the genome define an exon? And how is "contiguous" defined?

          • #6
            I believe that you have the meaning of the word 'contiguous' correct -- that is, the read has to match the genome in one unbroken stretch, essentially exactly, without being split by an intron.

            As I said, look at the manual for more help. The part I quoted was just a short introduction to how TopHat works. I am far from being a TopHat expert, but basically the idea is to:

            a) Split up the reads into small segments, say 40 bases each.

            b) Align these splits contiguously (i.e., essentially exactly, with no gaps) to the genome; many will align, but many will not because they span junctions.

            c) Where many reads align, consider this an 'island' that represents correct alignments. An island will not contain a junction, because otherwise the splits would not align.

            d) Stitch these islands together to cover junctions. The strongest evidence of a junction is a read that has two different 'splits' in two different islands. In other words, the only way a read could be in two islands is if the read spans a junction. There are other avenues of evidence as well (e.g., you can slowly build out from each island by adding parts of non-island reads to the island until a junction border is reached).

            As I said, I am far from a TopHat expert. If someone with a better understanding can chime in, that would be great. In the meantime, studying the manual is your (and my) best option.
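Steps a) to d) can be sketched on a toy example. This is only an illustration of the idea described above, not TopHat's actual implementation: the sequences, the 8-base segment size, and the function names are all assumptions, and exact string search stands in for a real aligner.

```python
# Toy genome: two exons separated by a 30-base intron (all sequences invented).
EXON1 = "ACGTTGCAAGCTTAGGCTAA"
INTRON = "GTAAGT" + "C" * 18 + "TTTCAG"
EXON2 = "TTGACCGATACGGATCCATG"
GENOME = EXON1 + INTRON + EXON2

# A spliced read: the last 12 bases of exon 1 joined to the first 12 of exon 2.
READ = EXON1[-12:] + EXON2[:12]


def split_read(read, k):
    """Step a: cut the read into consecutive, non-overlapping k-base segments."""
    return [read[i:i + k] for i in range(0, len(read) - k + 1, k)]


def map_segments(read, genome, k):
    """Step b: exact-match each segment; return (read_offset, genome_pos) hits."""
    hits = []
    for i, seg in enumerate(split_read(read, k)):
        pos = genome.find(seg)  # -1 when the segment spans a junction
        if pos != -1:
            hits.append((i * k, pos))
    return hits


def infer_intron(read, genome, k):
    """Steps c/d: if the outer segments land too far apart, report the gap.

    Returns a 0-based, half-open (start, end) interval, or None.
    """
    hits = map_segments(read, genome, k)
    if len(hits) < 2:
        return None
    (off_a, pos_a), (off_b, pos_b) = hits[0], hits[-1]
    gap = (pos_b - pos_a) - (off_b - off_a)
    if gap <= 0:
        return None  # segments are collinear with the genome: no junction
    # Build out from the left segment base by base until the read stops matching.
    shift = pos_a - off_a
    j = off_a
    while j < off_b and read[j] == genome[shift + j]:
        j += 1
    intron_start = shift + j
    return (intron_start, intron_start + gap)


print("segment hits:", map_segments(READ, GENOME, 8))
print("inferred intron:", infer_intron(READ, GENOME, 8))
print("unspliced read:", infer_intron(EXON1[2:18], GENOME, 8))
```

The middle segment of the spliced read finds no exact match because it straddles the junction, while the two outer segments land in different "islands"; the extra genomic distance between them is the inferred intron. In real data the exact boundary can be ambiguous when bases on either side happen to match, which this sketch ignores.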

            • #7
              Thank you very much! That's pretty clear. I also have another question. For paired-end reads, what if one read maps to one exon and the other read maps to a different exon? How is this kind of alignment defined: is it a proper pair or not?

              • #8
                Yes, the pairs would be correct. In fact, such information can be used to determine junctions. In other words, if the two reads of a pair map to parts of the genome that are, say, 5 kb away from each other, but you know that the ends should be within 200 bases of each other, give or take 100 bases, then that pair must be spanning a junction.

                Once again, I am not a TopHat expert, nor do I know the internals of TopHat, but I believe that TopHat uses the above reasoning as part of its junction-finding strategy.
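The distance reasoning above can be written down as a small check. This is a sketch of the arithmetic only, not of what TopHat does internally; the function name, its parameters, and the coordinates are hypothetical, with the 200 ± 100 base expectation taken from the example in the post.

```python
def classify_pair(left_start, right_end, expected_fragment=200, tolerance=100):
    """Compare a pair's span on the genome with the expected fragment size.

    Returns (label, implied_gap), where implied_gap is the extra genomic
    distance that an intron would have to account for (0 if none is needed).
    """
    genomic_span = right_end - left_start
    if abs(genomic_span - expected_fragment) <= tolerance:
        return ("fits expected fragment size", 0)
    if genomic_span > expected_fragment + tolerance:
        return ("likely spans an intron", genomic_span - expected_fragment)
    return ("shorter than expected", 0)


# Mates about 5 kb apart on the genome, as in the example above.
print(classify_pair(10_000, 15_000))
# Mates about 250 bases apart: no intron needed to explain the distance.
print(classify_pair(10_000, 10_250))
```

The fragment size is measured on the spliced transcript, so a pair whose mates sit in two exons looks far too long on the genome, and the excess is roughly the intron length.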

                • #9
                  Thank you. One thing that confuses me is the definition of a proper pair. I understand it as a read pair aligned within the defined distance (I believe this is the fragment size). However, if the two reads are aligned to two different exons, their distance on the genome includes the intron length, so it must be much larger than the predefined fragment size. How does TopHat identify such reads as a proper pair?

                  • #10
                    Does TopHat use the term 'proper pair' anywhere? If so, could you please give a reference to its use?

                    Samtools does have a notion of a "proper pair". If that is what you mean, then I suspect that TopHat marks reads as "proper pair" in the BAM file if the pair does indeed span a junction. That is, pairs that contribute to a junction call are good and thus "proper".

                    As far as I know, there is no single definition of a "proper pair" in BAM/SAM. A pair is "proper" if the program that writes the BAM/SAM file deems the pair proper.

                    Once again, my usual disclaimer applies: I am not a TopHat expert.
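For reference, "proper pair" in SAM/BAM is a single bit in the FLAG field (0x2), which the SAM specification describes as set when each segment is properly aligned according to the aligner. A minimal Python sketch for checking it on a SAM line follows; the two example lines are invented and say nothing about how any particular aligner decides to set the bit.

```python
# FLAG bits defined by the SAM specification.
PAIRED = 0x1        # template has multiple segments (read is paired)
PROPER_PAIR = 0x2   # each segment properly aligned according to the aligner


def flag_of(sam_line):
    """Extract the integer FLAG (second tab-separated column) from a SAM line."""
    return int(sam_line.split("\t")[1])


def is_proper_pair(flag):
    """True when the aligner marked this paired read as part of a proper pair."""
    return bool(flag & PAIRED) and bool(flag & PROPER_PAIR)


# Hypothetical alignment lines; only the FLAG column matters here.
EXAMPLE_LINES = [
    "read1\t99\tchr1\t1000\t50\t50M\t=\t6000\t5050\t*\t*",  # 99 = 1 + 2 + 32 + 64
    "read2\t65\tchr1\t1000\t50\t50M\t=\t6000\t5050\t*\t*",  # 65 = 1 + 64
]

for line in EXAMPLE_LINES:
    flag = flag_of(line)
    print(line.split("\t")[0], "flag", flag, "proper pair:", is_proper_pair(flag))
```

Both example reads have the same mate distance; whether the bit is set depends entirely on the program that wrote the file, which matches the point made above.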
