Unconfigured Ad

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts
  • dnusol
    Senior Member
    • Jul 2009
    • 136

    cufflinks detecting single-end when paired-end used

    Dear users,

    I am running the same commands for a set of four paired-end lanes in tophat and cufflinks. But when running cufflinks on the tophat .bam output, some of the lanes are detected as single-end while others are correctly detected as paired-end. How could this be? I show you the commands for the first lane which is wrongly detected with cufflinks stdout

    $ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s1 -g 1 a_thaliana /Data/s_1_1_QC_sequence.fq /Data/s_1_2_QC_sequence.fq

    $ cufflinks -o ./tophat_s1/cufflinks -p 8 -N --library-type fr-unstranded ./tophat_s1/accepted_hits.bam
    [12:02:53] Inspecting reads and determining fragment length distribution.
    > Processed 35625 loci. [*************************] 100%
    > Map Properties:
    > Upper Quartile Mass: 444119.98
    > Read Type: 105bp single-end
    > Fragment Length Distribution: Gaussian (default)
    > Estimated Mean: 217.34
    > Estimated Std Dev: 65.65
    [12:03:32] Assembling transcripts and estimating abundances.
    > Processed 35657 loci. [*************************] 100%

    Your comments are greatly appreciated.
    Thanks

    D.
    Edit: my program versions:
    TopHat (v1.1.4)
    Bowtie version: 0.12.7.0
    Samtools version: 0.1.11.0
    Cufflinks version: 0.9.2
    Last edited by dnusol; 01-31-2011, 04:21 AM. Reason: include version info
  • adarob
    Member
    • Jul 2010
    • 71

    #2
    Are you sure that all of your paired fastq files have the same name for both reads?

    Comment

    • Jon_Keats
      Senior Member
      • Mar 2010
      • 279

      #3
      From manual
      When running TopHat with paired ends, it is critical that the *_1 files an the *_2 files appear in separate comma separated lists, and that the order of the files in the two lists is the same.
      Thus your file names must end with myfile_1.fq and myfile_2.fq

      Comment

      • dnusol
        Senior Member
        • Jul 2009
        • 136

        #4
        Thanks adarob and John for your answers but that does not solve the error I am finding. I have renamed the files so that they are called s_1_1.fq and s_1_2.fq and I still get that cufflinks reports the reads as single-end. Furthermore this issue does not happen all the time because for other lanes with the same naming convention, i.e. s_N_1_QC_sequence.fq and s_N_"_QC_sequence.fq, cufflinks reports them as paired-end

        So this is a new try modifying the name to follow "convention":
        $ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s1 -g 1 a_thaliana /Data/s_1_1.fq /Data/s_1_2.fq

        $ cufflinks -o ./tophat_s1/cufflinks -p 8 -N --library-type fr-unstranded ./tophat_s1/accepted_hits.bam
        [11:57:53] Inspecting reads and determining fragment length distribution.
        > Processed 35812 loci. [*************************] 100%
        > Map Properties:
        > Upper Quartile Mass: 445776.00
        > Read Type: 105bp single-end
        > Fragment Length Distribution: Gaussian (default)
        > Estimated Mean: 217.34
        > Estimated Std Dev: 65.65
        [11:58:32] Assembling transcripts and estimating abundances.
        > Processed 35845 loci.

        And this is an old try with other lane that gives good results

        $ tophat -r 200 -p 8 --mate-std-dev 40 --solexa1.3-quals --library-type fr-unstranded -o ./tophat_s3 -g 1 a_thaliana /Data/s_3_1_QC_sequence.fq /Data/s_3_2_QC_sequence.fq

        $cufflinks-0.9.2/cufflinks -o ./tophat_s3/cufflinks -I 11000 -p 8 -N --library-type fr-unstranded ./tophat_s3/accepted_hits.bam
        [12:05:41] Inspecting reads and determining fragment length distribution.
        > Processed 34538 loci. [*************************] 100%
        > Map Properties:
        > Upper Quartile Mass: 498434.94
        > Read Type: 105bp paired-end
        > Fragment Length Distribution: Gaussian (default)
        > Estimated Mean: 217.34
        > Estimated Std Dev: 65.65
        [12:06:24] Assembling transcripts and estimating abundances.
        > Processed 34563 loci. [*************************] 100%


        Can you suggest anything else I can try?

        D.

        Comment

        • Jon_Keats
          Senior Member
          • Mar 2010
          • 279

          #5
          The only suggestions I'd have left are:

          1) update to the most recent versions of tophat, cufflinks, bowtie
          2) double check the integrity of the input files that are failing.
          - Are they the same size in bytes?
          - Are they in the same order, run head and tail on both files and compare
          - Does your core generate MD5 sums on the original .fq if so run a checksum
          3) though things are not working I also be worried that the mean and STD for both runs are identical, that seems hard to believe.

          Comment

          • adarob
            Member
            • Jul 2010
            • 71

            #6
            Can you paste the first few reads from both ends of the fastq files that are giving you problems?

            Comment

            • dnusol
              Senior Member
              • Jul 2009
              • 136

              #7
              Thanks for your suggestions. I think I see where you are aming at. So the problem could be that after doing QC and removing reads with low quality values on each file separately, the two files from a lane ended up with different number of reads and this is not correct . So initially both s_1_1_sequence.fq and s_1_2_sequence.fq had 32193587 reads and after QC, the numbers are as follow:
              s_1_1_sequence.fq -- 15590194 reads
              s_1_2_sequence.fq -- 16385365 reads

              So I guess the read-pairing is lost for some reads. And these are the first and last few reads for both files

              $ head s_1_1_QC_sequence.fq
              @ILLUMINA-GA_191110:1:1:4938:1104#0/1
              CGACAATCCTGTAGGTAAGGGCATGATTACCATATTTGATGAGATTTTCCTTATCAATAAGGAAATCATGGTCGCCATCGAGTTCCCAAAATCTGCAGTATATCA
              +ILLUMINA-GA_191110:1:1:4938:1104#0/1
              hhhhhhhhhhgehhhWeeedffgchghhhchggfhhhhhehhgghhhhaghh_hhhhhhfhcghhcehbgdaceahdaedac__acfcdceaaecaac\aac_aa
              @ILLUMINA-GA_191110:1:1:5113:1105#0/1
              TTCACTTTTCTTTGATTCCCTTCTAGACTGTTGTCAATAATGACTCTGATGAATAAATCTTCGTCTAGCTCTGATCCTATGGATTGCCATCTATGAACAAGCTTG
              +ILLUMINA-GA_191110:1:1:5113:1105#0/1
              hhhhhhhghghehhhgghehhgddhhfhghhfcfhghfhfggdhghdh]ghghhfhh`ehhhgfgfWghdgfeWfeg_daadRbdabdd_[dadaaba[b]_BBB
              @ILLUMINA-GA_191110:1:1:5938:1104#0/1
              GGAGAACTATGTGAATAGCCGTTCCTTGAAGCATGCTCGTGACATATATAGGCAAATTCGTGAACATGTTGAACAGATAGGCTTTAATGTGTCTTCATGCGGAAA

              $ head s_1_2_QC_sequence.fq
              @ILLUMINA-GA_191110:1:1:5233:1161#0/2
              ATGCTAATGAGGTCATTGCTAATAGAGCAGCTGAGANTCTTGGTCGCAAACGTGGTGAGAAATGTGTGCACCCAAATGACCATGTGAACAGATCACAATCTTCTA
              +ILLUMINA-GA_191110:1:1:5233:1161#0/2
              fhhhhhchhahdehhhhhhhhhhhehdfcgddfaf_F^a_\^]Z]X[YXYZ[YZ\\daaa``_`[daebade_edaa_a]W^]_Sa_a_ad``Z__b___aaaa]
              @ILLUMINA-GA_191110:1:1:6770:1157#0/2
              TGACGTCTGGTGAAGCTGTCAAGTACAAGAGTTCTTNGGACGCCTTCAAGCAGATCCTCANCCTTCAAGAGCTCCCTCTGGTGCCTGGATCTCAACCATGTAANC
              +ILLUMINA-GA_191110:1:1:6770:1157#0/2
              ggaggggggeggagggggeg_dfdaff[facafcfdEdd^bbbdbgg]Rcebe_W]^\^WGYa^^_[^^Zc[]dbc_aaabMdba_aa\Yaaca]baabd]X]G^
              @ILLUMINA-GA_191110:1:1:8784:1163#0/2
              GCTTGTTCTTGCGTAGCCACATTAATGAACCTAGTGNAGCTAGAGATATCGTGAACACCAGATTTGCATTGGCTTCAAGCCCTTGCAAAACATAATCCCAAGTGA


              $ tail s_1_1_QC_sequence.fq
              +ILLUMINA-GA_191110:1:100:18210:21435#0/1
              hghhhhhchheghhhhghhhhhhhffhhfhhhgfhchhhhhhhhhhfghghhhhhhhfgehaghhghggghhgdgeghgdhgcgdfgfggeggghfffgcffggg
              @ILLUMINA-GA_191110:1:100:18834:21433#0/1
              AGAGGATTATTGCTGAGAGGTTTGGGAAGAAGAAAGGTGGGAGTAGTGTTAAGAAGACGAGTTCGGTTAGGAAGAAGAAGAAGCGTGTTTCTGATGAGTCTGATT
              +ILLUMINA-GA_191110:1:100:18834:21433#0/1
              hcfhhhhhghhhhfhchhgghhhhhgfffahg]hgfdVff_\e^dfbfdfhhghg]cffcf_cfchcdhecehgfd]hhedceffTb^bbaaaR[a_aZabbabd
              @ILLUMINA-GA_191110:1:100:19124:21432#0/1
              GTAAATGATAAGGAAATCATTCATTTTTGTCATGAAGAACATAATCTAAGACTTGGTGGTGATGATGATGTTACCCGTTATGAAAAGATGCTGTGCGACGCTTGC
              +ILLUMINA-GA_191110:1:100:19124:21432#0/1
              f`fdfdff[f^faffdffcffffcff^ffVfffdffafffaafffgfgfggggg_f_faJa_W_[K^^Z\YU\aVQKWUUW^Z^^[^WI^YYWU\decebeabac

              $ tail s_1_2_QC_sequence.fq
              +ILLUMINA-GA_191110:1:100:18069:21438#0/2
              hhhhhhhhhghhhghhghhhhhfhghhghhhfhhhhhghhhghhhhhhhhfehghehhghhhhhggfhfhfgghghhhhhgehhhggghghggehgggfhhggga
              @ILLUMINA-GA_191110:1:100:18210:21435#0/2
              TGCCAGTCTTAGCTTGCATAGATTTGATTGTTTCACCTCCTTTACCAATTATCAAACCAACCTTGTTATTCGGAATTTTCATAACAAATTGATCAGCCCCTGCTT
              +ILLUMINA-GA_191110:1:100:18210:21435#0/2
              hghhhhhhhhaffffhhhhggehhhhdhghhhghhhhhhhhhghhfhhhhhhhhhghhhhhhhhhhhhhhdhahghhfhhhhcggghhhhfhhggghhchgghh]
              @ILLUMINA-GA_191110:1:100:18834:21433#0/2
              CAGACTCATCAGAATCATCTTCATCTCTCTTTCTTCCTCTCCTCTCCTTTCTTCTCTTACTCCTTCCCTCTTCCTCATCCTCAGACTCACTCAAACTCCTTCTCT
              +ILLUMINA-GA_191110:1:100:18834:21433#0/2
              hhchhahhghhhghhhgghghhgdhhggfhgfgeggegcddfd_aedcdde_ebdggdbddffbdcdgddcfffcda]_d[^ZXYa\^Z_ad_^bc\_`VZ^^^^

              I understand that having different reads may affect the analysis but remember that my analysis pipeline is the same for other lanes and other lanes are detected as paired-ends!
              For s_3_1_sequence.fq and s_3_2_sequence.fq had 36272340 reads and after QC, the numbers are as follow:
              s_3_1_sequence.fq -- 15265617 reads
              s_3_2_sequence.fq -- 19079615 reads

              and this time they are detected as paired-end reads by cufflinks.

              As an aside, coming from this discussion, do you know of any piece of software that may do QC filtering on paired-end files maintaining the read-pairing?

              Comment

              • Jon_Keats
                Senior Member
                • Mar 2010
                • 279

                #8
                I'm guessing you found the problem. I'm not sure how much more help I can be as I just use the fastq files from our core were each file is identical. How are you generating these files? I'm a bit ignorant of the on machine steps, but my understanding was that the illumina qseq files are the unfiltered files and the "s_1_sequence.txt" files are those passing illumina filtering and the reads are paired properly. If you are filtering the "s_1_sequence.txt" files this may not be necessary and I doubt it is a common step (not that it would be incorrect to do so).

                In regards to the one that is working, I still don't have a good answer but perhaps it keeps the correct order for a part of the file?

                I'd suggest getting the input files to be the same number of reads with matching read IDs (ie. the s_1_sequence.txt files) renaming to the expected tophat convention (s_1_sequence_1.fq and s_1_sequence_2.fq) and that should work as expected.

                Comment

                • dnusol
                  Senior Member
                  • Jul 2009
                  • 136

                  #9
                  Hi Jon, thanks for your input.
                  If I am not wrong, illumina filtering step is different from fastx QC filtering. Actually, if you have a look at the Phred scores for the raw _sequence.txt files in my experiment, supposing they are already quality filtered by Illumina, I shouldn't see the gradual drop in QC values down to below Q20 that I am actually seeing. This is why I filtered out some reads.

                  In this link


                  you could see in the first image an example of the kind of graph I am getting. They suggest trimming but I rather filter complete reads below a threshold this time.

                  I am now trying with the complete raw _sequence.txt file from illumina without filtering to see if the error comes up again.

                  Thanks

                  Dave

                  Comment

                  Latest Articles

                  Collapse

                  • SEQadmin2
                    From Collection to Sequencing: Why Sample Preparation and Preservation Define Sequencing Data
                    by SEQadmin2


                    Data variability is still an issue in sequencing technologies despite the advances in reproducibility and accuracy of these platforms. But the problem does not originate in the sequencing itself, but in the previous steps, before the sample reaches the sequencer.


                    The first step is collection, followed by preservation and sample preparation for analysis. Most scientists overlook those steps, but not being careful might just be skewing the experiment’s results.
                    ...
                    Yesterday, 10:05 AM
                  • SEQadmin2
                    Single-Cell Sequencing at an Inflection Point: Early Impacts of New Platforms and Emerging Trends
                    by SEQadmin2


                    With the launch of new single-cell sequencing platforms in 2026, the field stands at an exciting inflection point. This article surveys the most impactful advances in the field and discusses how they’re reshaping research in cancer, immunology, and beyond.


                    Introduction

                    Single-cell sequencing technologies have undergone remarkable advances over the past decade, transitioning from low-throughput experimental approaches to highly scalable platforms capable of...
                    05-22-2026, 06:42 AM
                  • SEQadmin2
                    Environmental Genomics in the Age of NGS: From Microbes to Conservation Strategies
                    by SEQadmin2

                    Studying ecosystems means dealing with complex, multi-species communities that are hard to observe at scale. This complexity, however, hides many important questions to be answered, from how biogeochemical cycles work and how climate change can affect species distribution to how conservation strategies can work best.


                    Genomics, particularly since the expansion of NGS, has transformed ecosystem ecology. By sequencing environmental DNA, we can now assess biodiversity without direct...
                    05-06-2026, 09:04 AM

                  ad_right_rmr

                  Collapse

                  News

                  Collapse

                  Topics Statistics Last Post
                  Started by SEQadmin2, Yesterday, 12:03 PM
                  0 responses
                  19 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, Yesterday, 11:40 AM
                  0 responses
                  14 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-28-2026, 11:40 AM
                  0 responses
                  29 views
                  0 reactions
                  Last Post SEQadmin2  
                  Started by SEQadmin2, 05-26-2026, 10:12 AM
                  0 responses
                  31 views
                  0 reactions
                  Last Post SEQadmin2  
                  Working...