  • Bobbieshaban
    Junior Member
    • Sep 2012
    • 1

    Split fastq files for tophat analysis

    Hi,

    Does anyone see anything wrong with splitting FASTQ files, aligning each part with TopHat, and then merging the results afterwards?

    The reason why I want to split them is to be able to make greater use of the cluster we have available.

    I am able to split the FASTQ files using a Perl script I wrote, and merging the resulting files seems to work. However, a few reads are missing when I compare the merged output from the split FASTQ files with the output from running the whole file through TopHat.

    For example, the split paired-end TopHat run produces the following samtools flagstat output:

    $ samtools flagstat merged_accepted_hits.bam
    37716745 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    37716745 + 0 mapped (100.00%:nan%)
    37716745 + 0 paired in sequencing
    19017603 + 0 read1
    18699142 + 0 read2
    35853292 + 0 properly paired (95.06%:nan%)
    35974826 + 0 with itself and mate mapped
    1741919 + 0 singletons (4.62%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    The paired-end TopHat run on the full FASTQ files produces:

    $ samtools flagstat accepted_hits.bam
    37739551 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    37739551 + 0 mapped (100.00%:nan%)
    37739551 + 0 paired in sequencing
    19028732 + 0 read1
    18710819 + 0 read2
    35896074 + 0 properly paired (95.12%:nan%)
    36017796 + 0 with itself and mate mapped
    1721755 + 0 singletons (4.56%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)
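The gap between the two reports can be put in absolute numbers with a little arithmetic. The sketch below only reuses the counts quoted above; the helper name `shortfall` is hypothetical. Note that flagstat counts alignment records, so the figures are not necessarily numbers of distinct reads.

```shell
#!/usr/bin/env bash
# Counts copied from the two flagstat reports quoted in this post.
split_total=37716745;  whole_total=37739551
split_proper=35853292; whole_proper=35896074

# Hypothetical helper: prints the absolute shortfall of the split run and
# that shortfall as a percentage of the whole-file count.
shortfall() {
    local whole=$1 part=$2
    awk -v w="$whole" -v p="$part" \
        'BEGIN { printf "%d %.4f\n", w - p, 100 * (w - p) / w }'
}

echo "total alignments: $(shortfall "$whole_total" "$split_total")"
echo "properly paired:  $(shortfall "$whole_proper" "$split_proper")"
```

With these inputs the merged run is short by 22806 alignment records in total (about 0.06% of the whole-file count) and by 42782 properly paired records (about 0.12%).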

    The properly paired rate differs by only 0.06 percentage points (95.06% vs. 95.12%), but the merged run may still be missing some useful information. I have checked the split files, and the line counts match exactly.

    This thread suggests that some "low abundance splice sites" are lost: http://seqanswers.com/forums/showthr...at+fastq+split

    Would anyone have any more information about this?
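A line-count check of this kind could look roughly like the sketch below. It is not the Perl script mentioned above; `check_chunk_pair` is a hypothetical helper, and it assumes plain 4-line FASTQ records (no wrapped sequence lines).

```shell
#!/usr/bin/env bash
# Hypothetical helper: verify that one pair of FASTQ chunks is consistent.
# Assumes 4 lines per record; it does NOT compare read names.
check_chunk_pair() {
    local r1=$1 r2=$2 n1 n2
    # Arithmetic expansion strips the padding some wc implementations add.
    n1=$(( $(wc -l < "$r1") ))
    n2=$(( $(wc -l < "$r2") ))
    if [ "$n1" -ne "$n2" ]; then
        echo "MISMATCH $r1=$n1 $r2=$n2"
        return 1
    fi
    if [ $((n1 % 4)) -ne 0 ]; then
        echo "PARTIAL_RECORD $r1=$n1"
        return 1
    fi
    echo "OK $((n1 / 4)) read pairs"
}

# Usage (file names are placeholders):
#   check_chunk_pair chunk_read1_1.fastq chunk_read2_1.fastq
```

Matching line counts show that no record was cut in half and that both mates have the same number of reads per chunk. They do not prove that the mates are still in the same order, so this is a necessary check rather than a sufficient one.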

    Thanks for the help,

    Bobbie.
  • dGho
    Member
    • Jan 2013
    • 43

    #2
    split files

    I have split FASTQ files to run TopHat. From what I understand, this is a fairly common practice. Here is a hypothetical example:

    #split read 1 into smaller files of 40,000,000 lines each (10,000,000 reads; the line count must be a multiple of 4)
    split -l 40000000 wholefile_read1.fastq
    #rename resulting files
    mv xaa wholefile_read1_1.fastq
    mv xab wholefile_read1_2.fastq
    .
    .
    #split read 2 into smaller files after every 40,000,000 lines
    split -l 40000000 wholefile_read2.fastq
    #rename resulting files
    mv xaa wholefile_read2_1.fastq
    mv xab wholefile_read2_2.fastq
    .
    .
    #align split files with tophat
    tophat -o out_1 -G mm10.gtf mm10 wholefile_read1_1.fastq wholefile_read2_1.fastq
    tophat -o out_2 -G mm10.gtf mm10 wholefile_read1_2.fastq wholefile_read2_2.fastq
    .
    .
    #use samtools to put the bam files back together (tophat writes accepted_hits.bam into each output directory)
    samtools merge out.bam out_1/accepted_hits.bam out_2/accepted_hits.bam
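The manual renaming above gets tedious with many chunks, so the same steps can be sketched as a loop. This is only a sketch under assumptions: it needs GNU split (coreutils 8.16 or later for `--additional-suffix`), it assumes plain 4-line FASTQ records, and the function names `split_fastq` and `print_jobs` are hypothetical. The tophat and samtools commands are printed rather than run, so they can be reviewed or handed to a cluster scheduler.

```shell
#!/usr/bin/env bash
# Sketch only: assumes GNU split and 4-line FASTQ records.

# Split one FASTQ into chunks of $3 reads each, named
# <prefix>000.fastq, <prefix>001.fastq, ...
split_fastq() {
    local fastq=$1 prefix=$2 reads_per_chunk=$3
    split -l $((reads_per_chunk * 4)) -d -a 3 \
        --additional-suffix=.fastq "$fastq" "$prefix"
}

# Print (not run) one tophat command per chunk pair, then the final merge.
# Index and annotation names (mm10, mm10.gtf) follow the example above.
print_jobs() {
    local prefix1=$1 prefix2=$2 r1 r2 id
    local -a bams=()
    for r1 in "$prefix1"[0-9][0-9][0-9].fastq; do
        [ -e "$r1" ] || continue          # glob matched nothing
        id=${r1#"$prefix1"}
        id=${id%.fastq}
        r2=${prefix2}${id}.fastq
        echo "tophat -o out_${id} -G mm10.gtf mm10 $r1 $r2"
        bams+=("out_${id}/accepted_hits.bam")
    done
    echo "samtools merge out.bam ${bams[*]}"
}

# Usage:
#   split_fastq wholefile_read1.fastq read1_ 10000000
#   split_fastq wholefile_read2.fastq read2_ 10000000
#   print_jobs read1_ read2_
```

Splitting both mates with the same reads-per-chunk value keeps chunk N of read 1 paired with chunk N of read 2. Each chunk is still aligned on its own, so a junction supported by only a few reads overall may have too little support within any single chunk, which would be consistent with the "low abundance splice sites" remark in the linked thread.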

    • dGho
      Member
      • Jan 2013
      • 43

      #3
      I didn't answer your question

      I guess I did not exactly answer your question, though. I do not know whether the results differ when the files are split. I do know that my very experienced co-worker does it all the time, though I realize that does not necessarily help.
