Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • cufflinks incorrectly specifies read type as single end

    Hi,

    I am using cufflinks to analyze a samfile produced using tophat. The data are 75bp paired-end reads and the samfile includes information about location of mate pairs etc. But cufflinks appears to be confused about the nature of the data. It prints that the read type is single-end, yet still estimates a fragment length distribution (see below). Anyone know what the problem is? Should I just ignore the fact that it says the reads are single-end?

    > Processed 259464 loci. [*************************] 100%
    > Map Properties:
    > Total Map Mass: 286777666.76
    > Read Type: 75bp single-end
    > Fragment Length Distribution: Gaussian (default)
    > Estimated Mean: 209.55
    > Estimated Std Dev: 70.54
    Lindy McBride - Rockefeller University

  • #2
    Hi Lindy,

    Can you post a few lines of the input SAM file you're feeding Cufflinks? You can retrieve it from a BAM file with:

    Code:
    samtools view accepted_hits.bam | tail -n +25 | head -n 10
    It may be that the read names in the alignment are not matching up, or are not being properly reported as paired.


    Andrew

    Comment


    • #3
      Hi Andrew,
      Thanks so much for your response. Here are the first 15 lines from one of the many samfiles that give me this problem. It might easier to read if you cut and paste into a text editor. It looks to me like the 5th and 7th reads are paired

      Thanks!
      Lindy
      @HD VN:1.0 SO:sorted
      @PG ID:TopHat VN:1.0.14 CL:/usr/local/genome/bin/tophat -p 8 -a 12 -m 1 -r 56 -o /mnt/analyses/LVPant/ AaegL1 LVPa2_R1_fastq_prefilter.txt,LVPa2_B_read_1_fastq_prefilter.txt,LVPa3_A_read_1_fastq_prefilter.txt,LVPa3_B_read_1_fastq_prefilter.txt,LVPa4_A_read_1_fastq_prefilter.txt,LVPa4_B_read_1_fastq_prefilter.txt,LVPa5_A_read_1_fastq_prefilter.txt,LVPa5_B_read_1_fastq_prefilter.txt LVPa2_R2_fastq_prefilter.txt,LVPa2_B_read_2_fastq_prefilter.txt,LVPa3_A_read_2_fastq_prefilter.txt,LVPa3_B_read_2_fastq_prefilter.txt,LVPa4_A_read_2_fastq_prefilter.txt,LVPa4_B_read_2_fastq_prefilter.txt,LVPa5_A_read_2_fastq_prefilter.txt,LVPa5_B_read_2_fastq_prefilter.txt
      ILLUMINA-300160:LVPa4_B_read_2:1:6:78:12045:15339#0 137 CH477186.1 702 1 76M * 0 0 CCAAAAAGAGATCAAGATTTATCAAAATGTTATTGTGCAATGGTACAGCCGTTTTGAGAAAGAATTGTTTGGCATC hhhhfffhhhhhhhchhhhhdhhffhhhghhchhehhhghhhfghgchhchhhhhhcfah_fW]chhhhehahhhg NM:i:2 NH:i:4 CC:Z:CH477219.1 CP:i:2212927
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:1:8:21:9291:2197#0 137 CH477186.1 934 1 75M * 0 0 GGAACGTCAGCGTAGCAGCTCACTAAACAGCGGTGATCACTCCAGTGCTCAGTGTACCGGACAACATAGTGCCGG bdfchdaefffgeghhghcfhfhghhfhhhhghghhhhghghhhhhhhhhhhhhhhghhhhhhhhhghhhhhhhh NM:i:1 NH:i:3 CC:Z:CH477299.1 CP:i:2381918
      ILLUMINA-300160:LVPa3_A_read_2:1:2:114:7815:7323#0 137 CH477186.1 1102 3 76M * 0 0 GCACGACGATTGGAATCGACCTCAACAAAGTGAACAATTTCGGTCACAGAGTGTGGCCAAGATTCGGTGCCCGCCC d`^`R`feedd[a]c[bb]b]ccc]_fdadaffdfd_Q^a_fdbb]fdWfa[dchafffdaafahghh`_gfhefh NM:i:2 NH:i:2 CC:Z:CH477284.1 CP:i:705773
      ILLUMINA-300160:LVPa3_A_read_2:1:2:5:9651:17717#0 163 CH477186.1 2520 255 76M = 2641 0 TGCAACTGTAGTAGCTATCCCCAAGGCCAAAAAAGACGTAACGCTGCCTACAAACTACCGTCCCATCAGCCTGCTA gfhhhhhghfh_fhhhgahhhhhhhchghhgahhhhchhhhhhhhdhchhfgfgge_ce[bdfdcahgaga]aaaa NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_1:1:8:116:1810:15456#0 99 CH477186.1 2567 255 76M = 2708 0 CTACAAACTACCGTCCCATCAGCCTGCTAAGTAGCCTGAGCAAAGTGTTTGAACGAATCATCTTAAACCTGTTTCA gggggggggggggggggaggdcfffgggggggggggggggfffaeaebefgggggcgggegggdgdfJeddadefg NM:i:1 NH:i:1
      ILLUMINA-300160:LVPa3_A_read_1:1:2:5:9651:17717#0 83 CH477186.1 2641 255 76M = 2520 0 CAAAGCCGGTCACTCCACCAACCACCAGTTAGCACGGATAACCAAAATCGTAAAGGACGGTTTCTCTGCAAGAAAA dhcgegdfddfdfcfce]hfehhghhdgghhhfhhhghhhhhhghhhfhhhhhhhhghhhhhhhhhhh_hhhhhhh NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_2:1:8:116:1810:15456#0 147 CH477186.1 2708 255 76M = 2567 0 GCAAGAAAATCGACTGGTATGATTATGCTTGACGTCGAAAAGGCATATAATTCCGTCTGGCAGGACGCGATCATCT dbbghggdadghadghgggefdfdfc_dhhhdhhceghhhhhhhhfgchhhcfhhhhfhhhhhhhhhhhhghhhhh NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_2:1:8:10:14657:2302#0 163 CH477186.1 2743 255 76M = 2877 0 CGAAAAGGCATATAATTCCGTCTGGCAGGACGCGATCATCTACAAACTACGTAAATGTAACTGTCCTTTCTACATT hhfhhhhhhhhhhhhhhhhhhhhhhhghhfheghhhhchhhhhhh_hghghhhdfhgfhggghdfahhgghhdcgh NM:i:0 NH:i:1
      ILLUMINA-300160:LVPa5_B_read_1:1:8:10:14657:2302#0 83 CH477186.1 2877 255 76M = 2743 0 CTGTATCTGGGAAATTCGACATCCCTTGTGGCGTCCCACAGCGATCAGTGTTGAGCCCAACACTCTACAATGTTTT ggdggdgghhhgggWhfhfhhggcgehhghghhffcaggdhhhchhfhhhghhghgghhhgghhhhhhhghhhhhh NM:i:2 NH:i:1
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:1:8:3:14836:3337#0 163 CH477186.1 4492 0 75M = 4603 0 ATTCGCCCCATTGCGGGGTGAATAGAAACATACAACATTTTCAATAATATTTGTACGATATAATTTTTATTTTTT hhhhhhhhhhhhhhhhhcbh`ffdehhhhhfhhhhhhhhhhhehhhhhhhhhhgghhghhhhghhhhhghhghhe NM:i:2 NH:i:5 CC:Z:CH477198.1 CP:i:431872
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:1:8:47:7635:10354#0 163 CH477186.1 4495 0 75M = 4606 0 TGCCCCATTGCGGGGTGAATAGAAACATACAACATTTTCATAAATATTTGTACGATATAATTTTTATTTTTTAAC hhhhhhhhhhhhhhh`f]dffffffhhhhfhhhhghhhhhhhhdhhhhhhghhhhhghhghhhhhghhhhhhhhh NM:i:1 NH:i:5 CC:Z:CH477198.1 CP:i:431869
      ILLUMINA-300160:LVPa2_B_read_2:0:1:53:19937:17604#0 137 CH477186.1 4588 0 76M * 0 0 AAAGGAACACGCGACCCATATTTAAGACTAATAAAATTGGATTTTATCCACACGGTTTAAGTTGTACCCATTCAAA hhfhhZ]GSWRXWXZhhagfhhhhcggfghhhhaeegd]e_afffhgghhahhchhhfedcacfa[VaZZ[[[[Q[ NM:i:2 NH:i:19 CC:Z:CH477198.1 CP:i:431775
      HWUSI-EAS1600:LVPa2_R2:10_15_2010:0:8:91:10481:4293#0 137 CH477186.1 4596 0 75M * 0 0 ACGCGACCCATATTTAAGACTAATAAAATTGGATTTTATCCACACGGTTTAAGTTGTACACATTCAAACATGTCG f_[]feae`d^Qb]]c^[caZZZZZZWMWY\aYY\Z_ZUOed]]`VUQTV``WX]d`^Wdb`d^_WUU[QbdZd` NM:i:0 NH:i:16 CC:Z:CH477198.1 CP:i:431768
      Lindy McBride - Rockefeller University

      Comment


      • #4
        From http://samtools.sourceforge.net/SAM1.pdf, the name of both mates in a pair must be the same. So while the two reads you mentioned are certainly paired, they aren't matched up in Cufflinks because the names differ by the number marked with a carrot below. If you preprocess the SAM files to eliminate the mate field given by that number (maybe turning it into an X or something?), the library should be recognized as paired-end.

        Code:
        ILLUMINA-300160:LVPa5_B_read_1:1:8:116:1810:15456#0 99 CH477186.1 2567 255 76M =
                                     ^                                                 
        ILLUMINA-300160:LVPa5_B_read_2:1:8:116:1810:15456#0 147 CH477186.1 2708 255 76M =
                                     ^
        Lemme know how it goes!


        Andrew
        Last edited by polyatail; 12-16-2010, 10:11 PM. Reason: misinfo re: sam specification

        Comment


        • #5
          I have only just now come back to this issue, and your suggestion worked - thanks! I changed both read_1 and read_2 in the read names in my samfile to read_X, and cufflinks now reports my library as paired end.

          I have a second issue now however - less significant than the first. When I run cufflinks with a reference annotation it correctly reports the mean fragment length as ~200 bp. But when I run it without the reference annotation it reports the fragment length as 0 bp (see two outputs below). I assume this is because my samfile was generated by a tophat run which used known junctions from the same reference annotation? It seems like the most likley answer, though I see no reason why cufflinks would not be able to parse the mapping coordinates of each read in a pair and report the correct fragment length distribution regardless of whether a reference annotation is used...

          > cufflinks -G aaegypti.BASEFEATURES_Liverpool-AaegL1.2.gff3 LVPa2_abbr.RX.bam
          [21:13:01] Inspecting reads and determining fragment length distribution.
          > Processed 16360 loci. [*************************] 100%
          > Map Properties:
          > Total Map Mass: 123784.26
          > Read Type: 75bp paired-end
          > Fragment Length Distribution: Empirical (learned)
          > Estimated Mean: 197.41
          > Estimated Std Dev: 13.71
          [21:13:04] Calculating estimated abundances.
          > Processed 16360 loci. [*************************] 100%


          > cufflinks LVPa2_abbr.RX.bam
          [21:05:59] Inspecting reads and determining fragment length distribution.
          > Processed 1389 loci. [*************************] 100%
          > Map Properties:
          > Total Map Mass: 150829.93
          > Read Type: 75bp paired-end
          > Fragment Length Distribution: Gaussian (default)
          > Estimated Mean: 0.02
          > Estimated Std Dev: 2.17
          [21:06:01] Assembling transcripts and estimating abundances.
          > Processed 1389 loci. [*************************] 100%
          Lindy McBride - Rockefeller University

          Comment


          • #6
            It looks like when you didn't provide an annotation, there was not enough available information to determine the fragment length distribution (see http://cufflinks.cbcb.umd.edu/howitworks.html#hdis). I'm not sure, however, why it defaulted to such low values. Try specifying the mean with -m and the std-dev with -s.

            -Adam

            Comment


            • #7
              Hi Adam - Thanks. I will try that. I guess I can run it with the reference to get the values and then rerun it without the reference specifying the values estimated in the first run. I had not seen the "how cufflinks works" page. That is helpful.
              Lindy
              Lindy McBride - Rockefeller University

              Comment

              Latest Articles

              Collapse

              • seqadmin
                Non-Coding RNA Research and Technologies
                by seqadmin




                Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                Nobel Prize for MicroRNA Discovery
                This week,...
                10-07-2024, 08:07 AM
              • seqadmin
                Recent Developments in Metagenomics
                by seqadmin





                Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                09-23-2024, 06:35 AM

              ad_right_rmr

              Collapse

              News

              Collapse

              Topics Statistics Last Post
              Started by seqadmin, 10-11-2024, 06:55 AM
              0 responses
              11 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 10-02-2024, 04:51 AM
              0 responses
              110 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 10-01-2024, 07:10 AM
              0 responses
              114 views
              0 likes
              Last Post seqadmin  
              Started by seqadmin, 09-30-2024, 08:33 AM
              1 response
              119 views
              0 likes
              Last Post EmiTom
              by EmiTom
               
              Working...
              X