  • Naive question about read mapping, where is intron in genome.fa data

    Dear all.

    I have a question. The genome FASTA file contains only sequence ("ATCGG..."). How does mapping software such as TopHat decide where the introns and exons are?

  • #2
    Any help????


    • #3
      Genome FASTA files are generally not annotated with which sequence is intron and which is exon. You need a separate file that says where the introns and exons are, such as a GFF.
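
To illustrate what such an annotation file adds, here is a minimal Python sketch that reads exon lines from GFF3-style text and derives the introns as the gaps between consecutive exons. The gene, transcript ID, and coordinates are invented for illustration, and the sketch assumes all exons belong to a single transcript.

```python
# Toy GFF3 text; the IDs and coordinates are made up for illustration only.
GFF_TEXT = """\
##gff-version 3
chr1\tdemo\tgene\t101\t900\t.\t+\t.\tID=gene1
chr1\tdemo\texon\t101\t300\t.\t+\t.\tParent=tx1
chr1\tdemo\texon\t601\t900\t.\t+\t.\tParent=tx1
"""

def exon_intervals(gff_text):
    """Return sorted (seqid, start, end) tuples for every 'exon' feature.

    GFF coordinates are 1-based and inclusive.
    """
    exons = []
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and header/comment lines
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # keep only well-formed exon records
        exons.append((cols[0], int(cols[3]), int(cols[4])))
    return sorted(exons)

def introns_between(exons):
    """Introns are the gaps between consecutive exons on the same sequence.

    Assumes all exons come from one transcript (a simplification).
    """
    introns = []
    for (chrom_a, _, end_a), (chrom_b, start_b, _) in zip(exons, exons[1:]):
        if chrom_a == chrom_b and start_b > end_a + 1:
            introns.append((chrom_a, end_a + 1, start_b - 1))
    return introns

exons = exon_intervals(GFF_TEXT)
introns = introns_between(exons)
print(exons)    # [('chr1', 101, 300), ('chr1', 601, 900)]
print(introns)  # [('chr1', 301, 600)]
```

The FASTA file supplies only the bases; the coordinates in the annotation are what label positions 301 to 600 as intronic.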


      • #4
        I agree with swbarnes2 that genome FASTA files are not annotated and that you generally need a GFF file for proper, verified splice sites. But since you asked about TopHat, and presumably about de novo detection of junctions, I quote from the TopHat manual:

        TopHat finds splice junctions without a reference annotation. By first mapping RNA-Seq reads to the genome, TopHat identifies potential exons, since many RNA-Seq reads will contiguously align to the genome. Using this initial mapping, TopHat builds a database of possible splice junctions, and then maps the reads against these junctions to confirm them.
        Look at the manual for more help.


        • #5
          Originally posted by westerman
          Look at the manual for more help.
          http://tophat.cbcb.umd.edu/manual.html
          Thank you. That's exactly my question. Why can reads that align contiguously to the genome define an exon? How is "contiguous" defined?


          • #6
            I believe that you have the meaning of the word 'contiguous' correct: the read has to match the genome in one unbroken stretch, with no gap in the middle.

            As I said, look at the manual for more help. The part I quoted was just a small introduction to how Tophat works. Now I am far away from being a Tophat expert but basically the idea is to:

            a) Split up the reads into small segments ... say 40 bases.

            b) Align these splits contiguously (i.e., end to end with no gap) to the genome; many will align, but many will not because they span junctions.

            c) Where many reads align, consider the region an 'island' of correct alignments. An island will not contain a junction, because a split that crossed one would not have aligned.

            d) Stitch these islands together to cover junctions. The strongest evidence of a junction is where a read has two different 'splits' in two different islands. In other words the only way a read could be in two islands is if the read spans a junction. There are other avenues of evidence as well (e.g., you can slowly build out from each island via adding parts of non-island reads to the island until a junction border is reached.)
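
Steps a) to d) can be sketched on a toy example. This is not TopHat's actual code: the sequences, the 8-base segment length, plain exact string search in place of a real aligner, and the requirement of a GT...AG intron are all simplifying assumptions made for illustration.

```python
SEG = 8  # toy segment length; real reads and segments are much longer

# Invented toy genome: two exons separated by an intron that starts GT and ends AG.
exon1 = "ACGTTGCAAGCTTAGGCTAA"
intron = "GTAAGT" + "C" * 10 + "T" * 9 + "CAG"
exon2 = "GGATCCATCGATTACGGCAT"
genome = exon1 + intron + exon2

# A read from the spliced transcript: last 12 bases of exon 1 + first 12 of exon 2.
read = exon1[-12:] + exon2[:12]

def split_read(read, seg_len):
    """Step a): cut the read into consecutive fixed-length segments."""
    return [read[i:i + seg_len] for i in range(0, len(read), seg_len)]

def find_junction(genome, read, seg_len=SEG):
    """Steps b) to d) for a three-segment read whose middle segment spans a junction."""
    segs = split_read(read, seg_len)
    # Step b): exact, contiguous matching; -1 means the segment did not align.
    hits = [genome.find(s) for s in segs]
    if len(segs) != 3 or hits[0] < 0 or hits[1] >= 0 or hits[2] < 0:
        return hits, None
    left_end = hits[0] + seg_len   # first genome position after the left segment
    right_start = hits[2]          # genome position where the right segment starts
    middle = segs[1]
    # Step d): try every way of splitting the unaligned middle segment so that
    # its prefix extends the left hit and its suffix leads into the right hit.
    for k in range(1, seg_len):
        donor = left_end + k                     # first intron base (0-based)
        acceptor = right_start - (seg_len - k)   # first exon base after the intron
        if (genome[left_end:donor] == middle[:k]
                and genome[acceptor:right_start] == middle[k:]
                and genome[donor:donor + 2] == "GT"          # assumed canonical motif
                and genome[acceptor - 2:acceptor] == "AG"):
            return hits, (donor, acceptor)
    return hits, None

hits, junction = find_junction(genome, read)
print(hits)      # [8, -1, 52]
print(junction)  # (20, 48)
```

The outer segments land in two different places in the genome and the middle one does not align at all, which is the "read in two islands" evidence described in d). In this toy, the split point one base to the right would also reproduce the read, and the GT...AG check is what picks the intended boundary.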

            Now, as I said, I am far from a TopHat expert. If someone with a better understanding can chime in, that would be great. In the meantime, studying the manual is your (and my) best option.


            • #7
              Thank you very much! That's pretty clear. I have another question. For paired-end reads, what if one read maps to one exon and its mate maps to another exon? How is this kind of alignment classified: is it a proper pair or not?


              • #8
                Yes, the pairs would be correct. In fact, such information can be used to determine junctions. In other words, if the two reads of a pair map to parts of the genome that are, say, 5 kb apart, but you know that the ends should be within 200 bases of each other, give or take 100 bases, then that pair most likely spans a junction.

                Once again, I am not a TopHat expert, nor do I know the internals of TopHat, but I believe that TopHat uses the above reasoning as part of its junction-finding strategy.
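
The arithmetic behind that reasoning, as a small Python sketch. The function name, the default values (200 bases, give or take 100), and the labels are illustrative assumptions, not TopHat parameters or output.

```python
def classify_pair(left_start, right_end, expected=200, tolerance=100):
    """Compare the genomic span of a read pair with the expected fragment size.

    left_start: leftmost genome position of the pair.
    right_end:  rightmost genome position of the pair.
    Returns a label and a rough estimate of the intronic sequence skipped.
    """
    span = right_end - left_start
    if span <= expected + tolerance:
        # The pair fits on the genome as one unspliced fragment.
        return "consistent with an unspliced fragment", 0
    # The genomic span is the fragment length plus whatever intron(s) lie between.
    return "likely spans an intron", span - expected

print(classify_pair(1000, 1250))  # ('consistent with an unspliced fragment', 0)
print(classify_pair(1000, 6000))  # ('likely spans an intron', 4800)
```

A pair 5 kb apart on the genome with a 200-base expected fragment implies roughly 4.8 kb of intron between the mates, give or take the fragment-size tolerance.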


                • #9
                  Thank you. One thing confusing me is the definition of a proper pair. It is a read pair aligned within the defined distance (I believe that is the fragment size). However, if the two reads align to two different exons, the distance between them also includes the intron length, so it must be much larger than the predefined fragment size. How does TopHat identify such reads as a proper pair?


                  • #10
                    Does tophat use the term 'proper pair' anywhere? If so could you please give a reference to its use.

                    In samtools there is "proper pair". If that is what you mean, then I suspect that TopHat marks reads as "proper pair" in the BAM file if the pair does indeed span a junction. That is, pairs that contribute to a junction call are good and thus "proper".

                    As far as I know, there is no single definition of a "proper pair" in BAM/SAM. A pair is "proper" if the program that writes the BAM/SAM file deems it proper.
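
For reference, "proper pair" is stored as bit 0x2 of the FLAG field in each SAM/BAM record, and the aligner decides when to set it. A minimal Python sketch for decoding a FLAG value follows; the bit values come from the SAM specification, while the function and label names are my own.

```python
# FLAG bits as defined in the SAM specification; the label strings are illustrative.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",       # set at the aligner's discretion
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]

def is_proper_pair(flag):
    """True if the read is paired and the aligner marked the pair as proper."""
    return bool(flag & 0x1) and bool(flag & 0x2)

print(decode_flag(99))     # ['paired', 'proper_pair', 'mate_reverse_strand', 'first_in_pair']
print(is_proper_pair(99))  # True
print(is_proper_pair(97))  # False
```

Two records with identical coordinates can therefore differ in this bit depending on which program produced the file and how it was configured.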

                    Once again I put my normal disclaimers about not being a Tophat expert.
