Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • jp.
    replied
    Hi kenietz
    I have read you post and like to ask you about tophat2 commands. Actually, my problem is that I am using mouse RNAseq and run the tophat2 and cufflinks but when I analysed the data in cummeRbund it gives me :
    CuffSet instance with:
    4 samples
    39098 genes
    0 isoforms
    0 TSS
    0 CDS
    0 promoters
    0 splicing
    0 relCDS

    I think my problem is not considering the read length (101 x 2) and insert size 80-380(mean 150).
    How can I solve this problem ? I read your post and saw that you have solved it. May you please let me the solution.
    Thank you very much in advance



    Originally posted by kenietz View Post
    Hi,
    i have a question on that topic.
    I am trying to map some 76x2 Illumina data with tophat-2.0.4. My reads are 76bp and the fragment size is around 280bp. So i choose '-r 128 --mate-std-dev 40' as i have a pdf with graph for the fragment size. I took only 1000 reads and mapped them with tophat. From samtools flagstat i get:

    1622 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    1622 + 0 mapped (100.00%:-nan%)
    1622 + 0 paired in sequencing
    763 + 0 read1
    859 + 0 read2
    656 + 0 properly paired (40.44%:-nan%)
    676 + 0 with itself and mate mapped
    946 + 0 singletons (58.32%:-nan%)
    16 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    properly paired only 40%. i ran cufflinks on the accepted.bam and see this:

    Warning: Could not connect to update server to verify current version. Please check at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
    [15:16:58] Inspecting reads and determining fragment length distribution.
    > Processed 86 loci. [*************************] 100%
    Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
    > Map Properties:
    > Normalized Map Mass: 225.85
    > Raw Map Mass: 225.85
    > Fragment Length Distribution: Truncated Gaussian (default)
    > Default Mean: 200
    > Default Std Dev: 80
    [15:16:58] Assembling transcripts and estimating abundances.
    > Processed 86 loci. [*************************] 100%


    Fragment length mean 200 and Std.Dev 80.

    Well i dont get it. Does that mean that my fragment length is actually around 200 and not around 280 as i am told by the graph? And should i use that value for mapping with tophat?

    Because of that result i tried mapping with tophat with -r 48 (200 - 2*76) but that yeilded 31.7% properly paired reads.

    Im a bit confused. Can someone please explain to me what is going on?

    Thank you in advance.

    Leave a comment:


  • csmatyi
    replied
    Hello, check the appropriate sra file at ncbi, they will give you the length of the left and right reads and the insert mean and std!

    Leave a comment:


  • kenietz
    replied
    Hi,
    i have a question on that topic.
    I am trying to map some 76x2 Illumina data with tophat-2.0.4. My reads are 76bp and the fragment size is around 280bp. So i choose '-r 128 --mate-std-dev 40' as i have a pdf with graph for the fragment size. I took only 1000 reads and mapped them with tophat. From samtools flagstat i get:

    1622 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    1622 + 0 mapped (100.00%:-nan%)
    1622 + 0 paired in sequencing
    763 + 0 read1
    859 + 0 read2
    656 + 0 properly paired (40.44%:-nan%)
    676 + 0 with itself and mate mapped
    946 + 0 singletons (58.32%:-nan%)
    16 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    properly paired only 40%. i ran cufflinks on the accepted.bam and see this:

    Warning: Could not connect to update server to verify current version. Please check at the Cufflinks website (http://cufflinks.cbcb.umd.edu).
    [15:16:58] Inspecting reads and determining fragment length distribution.
    > Processed 86 loci. [*************************] 100%
    Warning: Using default Gaussian distribution due to insufficient paired-end reads in open ranges. It is recommended that correct parameters (--frag-len-mean and --frag-len-std-dev) be provided.
    > Map Properties:
    > Normalized Map Mass: 225.85
    > Raw Map Mass: 225.85
    > Fragment Length Distribution: Truncated Gaussian (default)
    > Default Mean: 200
    > Default Std Dev: 80
    [15:16:58] Assembling transcripts and estimating abundances.
    > Processed 86 loci. [*************************] 100%


    Fragment length mean 200 and Std.Dev 80.

    Well i dont get it. Does that mean that my fragment length is actually around 200 and not around 280 as i am told by the graph? And should i use that value for mapping with tophat?

    Because of that result i tried mapping with tophat with -r 48 (200 - 2*76) but that yeilded 31.7% properly paired reads.

    Im a bit confused. Can someone please explain to me what is going on?

    Thank you in advance.

    Leave a comment:


  • jpitt
    replied
    Originally posted by qqtwee View Post
    Hello, Do you know how can I know the insert size? I see you say can use cufflinks, could you tell that how does cufflinks to estimate the distribution of fragment lengths in detail? Thank you very much!
    I cannot speak for Gjin, but I assume what she/he meant is that you can use Tophat to align a subset of your reads. Then after running the resulting bam file in Cufflinks you can obtain an estimate of your insert size as Cufflinks provides the following output:

    > Fragment Length Distribution: Empirical (learned)
    > Estimated Mean: 202.50
    > Estimated Std Dev: 69.92

    Then re-run Tophat using all of your reads and use your estimated mean for -r and your estimated standard deviation for --mate-std-dev

    Although, I do think it's advisable to read other threads on SEQanswers discussing this topic as there are differing opinions of how much these parameters matter (in terms of successful mapping) as well as how much they make sense (i.e. if you are mapping to the genome vs transcriptome).

    Leave a comment:


  • qqtwee
    replied
    how does cufflinks to estimate the distribution of fragment lengths?

    Originally posted by GJin View Post
    Maybe neither 500 nor 300 are correct. It's better to try to use cufflinks to estimate the distribution of fragment lengths.
    Hello, Do you know how can I know the insert size? I see you say can use cufflinks, could you tell that how does cufflinks to estimate the distribution of fragment lengths in detail? Thank you very much!

    Leave a comment:


  • GJin
    replied
    Maybe neither 500 nor 300 are correct. It's better to try to use cufflinks to estimate the distribution of fragment lengths.
    Last edited by GJin; 04-30-2012, 11:12 AM.

    Leave a comment:


  • lk0515
    replied
    Hello everyone!

    I also met a similar problem when dealling the insert size using TopHat, I am not sure the reason. Wish you could help.

    (a). Before I got my solexa paired end insert size is about 300 bp, I mistook it for 500 bp. And my pair-end reads lengths were 80 bp, so at first I chose the -r at 340 (=500-80*2), after running the TopHat with the commend line:
    tophat -o /output/ -i 30 --solexa1.3-quals -p 10 -r 140 --closure-search --coverage-search --microexon-search --butterfly-search new_scaffold TR_1.fq TR_2.fq >tophat.log

    I got the bam information there was about 74.82% paried properly :
    36749126 in total
    0 QC failure
    0 duplicates
    36749126 mapped (100.00%)
    36749126 paired in sequencing
    18449324 read1
    18299802 read2
    27494744 properly paired (74.82%)
    32515892 with itself and mate mapped
    4233234 singletons (11.52%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)


    (b) When I realized that I used the wrong insert size parameter, which should be 140 bp (=300-80*2), I run the TopHat with the same other options for the second time,
    but the information was that only 47.01% reads were properly paired. But other information was similar compared with (a).

    36520363 in total
    0 QC failure
    0 duplicates
    36520363 mapped (100.00%)
    36520363 paired in sequencing
    18334751 read1
    18185612 read2
    17167904 properly paired (47.01%)
    32285318 with itself and mate mapped
    4235045 singletons (11.60%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)

    I don't know why when I give the right insert size I would got the worse mapping result for overall paired reads.

    Thank you very much !

    Leave a comment:


  • ozs2006
    replied
    It seems very strange, I ran tophat on other lanes of my experiment, and it ran well (more than 50% of properly paired reads according to samtools flagstat). I still don't know why lane 1 running went wrong.
    1. Do you have any suggestion on this kind of problem?
    2. Do I need in the downstream analysis take in account only properly paired read?, or in different words: I want to know how tophat assigns this flag in order to make the proper decision.

    Leave a comment:


  • ozs2006
    replied
    Thanks for your replay :-)
    I use tophat v1.1.0 .
    Maybe this problem happen because I took the s_1_1_sequence.txt s_1_2_sequence.txt and changed their file extension to s_1_1.fq s_1_2.fq . As far as I know (but maybe I wrong) these file have the same format as any other file in FastQ.
    Do I need to take the data straight from the .qseq files, and convert them to fastq, or to do some text manipulations in order to use the s_1_1_sequence.txt s_1_2_sequence.txt ?
    (When I checked earlier it seems to me that tophat can't handle the _sequence.txt file without changing their file extension).*

    Another curious question about tophat: Is there a simple way to know which reads in the .bam file support the junctions in the bed file, or I need to write a script that check this property (I know this script is also simple).

    *If this is the reason for my odd output I'm sorry I bother all the members of seqanswers with my stupid questions.

    Leave a comment:


  • GKM
    replied
    1. Size selection (or, more accurately, the intended size selected for) and actual library fragment length distribution are often very different things; you should always check it yourself. Which it seems you did, so so far so good

    2. The reason this parameter has to be specified is that TopHat produces a probabilistic estimate for the best mapping of your reads as a pair based on a number of criteria, one of which is the fragment length distribution. There are situations where it matter very much whether the fragment length is 200 or 600 with respect to proper assignment of read pairs to the genome.

    3. You problem seems to be that your fastq files can't be matched by TopHat, which is extremely odd, even more so given that it seems like both ends are present with the correct IDs in what you posted, and it is not as if your are mixing lanes. Which version of TopHat are you using? Is your data stranded?

    Leave a comment:


  • ozs2006
    replied
    The command I used:
    tophat -r 83 -p 8 --solexa1.3-quals /path_to_bowtie_index/bowtie-0.12.5/indexes/h_sapiens_hg18 s_1_1.fq s_1_2.fq

    I have paired-end reads with 76 cycles for each side.

    1. The other side of the pair is mapped to the reference genome (5' to 3' match to + strand).

    2. I assigned the distance between ends to be 83bp (-r 83).
    Prior to tophat I used eland and aligned these reads. So the summary.htm for this lane detected the insert to be 235bp.
    I will check what the deviation of the cut from the gel.

    3. Another related question: I read tophat manual and paper, and as I understand it,
    tophat shows all results from bowtie alignment in the .bam file + all the reads that align to junctions (its unique property).
    I checked this by running bowtie separately from tophat and compare their output.
    I find that the same reads align to the same place, but have different samtools flag
    (actually, in bowtie I have got "better" flags)
    I can't understand why when I align with bowtie without any inner-distance as input,
    I get results better than tophat. Why do we need this inner-distance in order to run tophat on paired-end?

    From bowtie:
    HWUSI-EAS753_0001:1:4:93:588#0 99 chr2 38562474 255 76M = 38562580 182
    8@>:@8==ACBA@B??3B@BBB?9B79@69@A<?@;793679717;8'5151:;;1B@;=<@=0:@;=9=454? XA:i:0 MD:Z:76 NM:i:0
    HWUSI-EAS753_0001:1:4:93:588#0 147 chr2 38562580 255 76M = 38562474 -182 CATCGTGAGAGCAGACCATGTGGGCCCCAAGCAGATGCAGCAGATCCGCATGTCCCTTCGCGGGAAGGCTGTGGTG ###############################<<:?39=>?=8A<?6=??8+/=?2::4A5>9<;15?=>?96B?>B XA:i:2 MD:Z:7G8A8T42T0C6 NM:i:5

    From tophat:
    HWUSI-EAS753_0001:1:4:93:588#0 129 chr2 38562474 3 76M = 38562474 0 GCAATCCCTGACGCACCGCCGTGATGCCCAGGGAAGACAGGGCGACCTGGAAGTCCAACTACTTCCTTAAGATCAT 8@>:@8==ACBA@B??3B@BBB?9B79@69@A<?@;793679717;8'5151:;;1B@;=<@=0:@;=9=454? NM:i:0 NH:i:2
    HWUSI-EAS753_0001:1:4:93:588#0 65 chr2 38562474 3 76M = 38562474 0 GCAATCCCTGACGCACCGCCGTGATGCCCAGGGAAGACAGGGCGACCTGGAAGTCCAACTACTTCCTTAAGATCAT 8@>:@8==ACBA@B??3B@BBB?9B79@69@A<?@;793679717;8'5151:;;1B@;=<@=0:@;=9=454? NM:i:0 NH:i:2

    Thanks and sorry for this long letter,
    Oz
    Last edited by ozs2006; 11-25-2010, 02:46 PM. Reason: Misspellings

    Leave a comment:


  • kopi-o
    replied
    Do you have sizable adapters on your fragments? (Still, even after accounting for adapter sequence, you should get more than zero properly paired read pairs, so I guess there is some more fundamental issue ...)

    Leave a comment:


  • shurjo
    replied
    Originally posted by ozs2006 View Post
    Thanks for the quick reply.

    1. here is samtools flagstat output:

    23722180 in total
    0 QC failure
    0 duplicates
    23722180 mapped (100.00%)
    23722180 paired in sequencing
    11861090 read1
    11861090 read2
    0 properly paired (0.00%)
    23722180 with itself and mate mapped
    0 singletons (0.00%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)

    2. And about the stdev. of inserts from the manual I saw that the default value is 20bp,
    and it seems very far from the band size in bp we used.
    Do I need to use the band size?
    So you actually have zero reads properly paired, which seems very strange. A few very fundamental questions to consider:
    1. Were the input fastq files in the correct orientation (i.e. read2 was in the 5'-3' orientation and the sequence in the fastq file would match the plus strand in the reference genome where it mapped)?
    2. What was your syntax for running tophat?
    3. What value did you use for -r (inner distance between mate pairs) and what was the size range of the band excised during library prep?


    Also, --mate-std-dev can be estimated from the width of the band, not the size. So, if you cut a band at 300+/-25bp on the gel, you have an expected std. dev. of 25.

    Leave a comment:


  • ozs2006
    replied
    Thanks for the quick reply.

    1. here is samtools flagstat output:

    23722180 in total
    0 QC failure
    0 duplicates
    23722180 mapped (100.00%)
    23722180 paired in sequencing
    11861090 read1
    11861090 read2
    0 properly paired (0.00%)
    23722180 with itself and mate mapped
    0 singletons (0.00%)
    0 with mate mapped to a different chr
    0 with mate mapped to a different chr (mapQ>=5)

    2. And about the stdev. of inserts from the manual I saw that the default value is 20bp,
    and it seems very far from the band size in bp we used.
    Do I need to use the band size?

    Leave a comment:


  • shurjo
    replied
    Originally posted by ozs2006 View Post
    I used tophat to align paired-end 76bp RNA-seq.
    The fragments selected at 300bp.
    As the example in tophat manual shows- I used: tophat -r 148 ...
    When I inspected the .bam with:
    samtools view accepted_hits.bam
    or- samtools faststat accepted_hits.bam
    I relized that none of the reads was align with correct insert size.
    1. What to put as insert?
    2.what is the best size of the std. to put in --mate-std-dev ?
    Thanks in advance,
    Oz
    Can you reproduce the samtools flagstat output here? That would help diagnose a few things with your run.
    Also, --mate-std-dev could roughly be thought of as the width in bp of the band that was excised after adapter ligation during library prep.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Recent Advances in Sequencing Technologies
    by seqadmin







    Innovations in next-generation sequencing technologies and techniques are driving more precise and comprehensive exploration of complex biological systems. Current advancements include improved accessibility for long-read sequencing and significant progress in single-cell and 3D genomics. This article explores some of the most impactful developments in the field over the past year.

    Long-Read Sequencing
    Long-read sequencing has...
    12-02-2024, 01:49 PM
  • seqadmin
    Genetic Variation in Immunogenetics and Antibody Diversity
    by seqadmin



    The field of immunogenetics explores how genetic variations influence immune responses and susceptibility to disease. In a recent SEQanswers webinar, Oscar Rodriguez, Ph.D., Postdoctoral Researcher at the University of Louisville, and Ruben Martínez Barricarte, Ph.D., Assistant Professor of Medicine at Vanderbilt University, shared recent advancements in immunogenetics. This article discusses their research on genetic variation in antibody loci, antibody production processes,...
    11-06-2024, 07:24 PM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 12-02-2024, 09:29 AM
0 responses
140 views
0 likes
Last Post seqadmin  
Started by seqadmin, 12-02-2024, 09:06 AM
0 responses
50 views
0 likes
Last Post seqadmin  
Started by seqadmin, 12-02-2024, 08:03 AM
0 responses
38 views
0 likes
Last Post seqadmin  
Started by seqadmin, 11-22-2024, 07:36 AM
0 responses
70 views
0 likes
Last Post seqadmin  
Working...
X