  • Split fastq files for tophat analysis

    Hi,

    Does anyone see anything wrong with splitting fastq files, aligning each piece with tophat, and then merging the results afterwards?

    The reason why I want to split them is to be able to make greater use of the cluster we have available.

    I am able to split the fastq files using a script I wrote in Perl. The merging of the files seems to work, except that a few reads are missing when I compare the merged output from the split fastq files with the output from running the whole file through tophat.

    For example, the split paired-end tophat run produces this samtools flagstat output:

    $ samtools flagstat merged_accepted_hits.bam
    37716745 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    37716745 + 0 mapped (100.00%:nan%)
    37716745 + 0 paired in sequencing
    19017603 + 0 read1
    18699142 + 0 read2
    35853292 + 0 properly paired (95.06%:nan%)
    35974826 + 0 with itself and mate mapped
    1741919 + 0 singletons (4.62%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    The paired-end tophat run on the full fastq files produces:

    $ samtools flagstat accepted_hits.bam
    37739551 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    37739551 + 0 mapped (100.00%:nan%)
    37739551 + 0 paired in sequencing
    19028732 + 0 read1
    18710819 + 0 read2
    35896074 + 0 properly paired (95.12%:nan%)
    36017796 + 0 with itself and mate mapped
    1721755 + 0 singletons (4.56%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    The difference is only 0.06 percentage points in the properly paired rate (95.06% versus 95.12%), but the merged run may be missing some useful information. I have checked the split files, and the line counts are exactly the same.
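To put absolute numbers on that gap, the arithmetic can be sketched in shell. The counts are copied from the two flagstat outputs above; the function name `flagstat_delta` is made up for this example. Note that flagstat counts alignment records, so a multi-mapped read can contribute more than one record and the record difference is not necessarily the number of missing reads.

```shell
# Totals and properly paired counts copied from the two flagstat outputs above.
split_total=37716745; split_proper=35853292
full_total=37739551;  full_proper=35896074

# Hypothetical helper (not part of samtools): report the differences.
# args: split_total split_proper full_total full_proper
flagstat_delta() {
    awk -v st="$1" -v sp="$2" -v ft="$3" -v fp="$4" 'BEGIN {
        printf "missing_alignments=%d\n", ft - st
        printf "missing_proper_pairs=%d\n", fp - sp
        printf "proper_rate_split=%.2f\n", 100 * sp / st
        printf "proper_rate_full=%.2f\n", 100 * fp / ft
    }'
}

flagstat_delta "$split_total" "$split_proper" "$full_total" "$full_proper"
```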

    This thread suggests that some "low abundance splice sites" are lost: http://seqanswers.com/forums/showthr...at+fastq+split

    Would anyone have any more information about this?
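One way to narrow this down would be to list the read names that appear in the full run but not in the merged run. The sketch below is only an assumption about how that could be done: the helper `names_only_in_first` and the `*_names.txt` files are made up for illustration, and it only finds reads that have no alignment at all in the merged output, not reads whose alignments differ.

```shell
# Hypothetical helper: print names found in the first list but not in the second.
# Both inputs are plain text files with one read name per line.
names_only_in_first() {
    local a b
    a=$(mktemp)
    b=$(mktemp)
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    LC_ALL=C comm -23 "$a" "$b"   # lines unique to the first sorted file
    rm -f "$a" "$b"
}

# Producing the lists (assumes samtools is on PATH; the read name is column 1 of SAM):
#   samtools view accepted_hits.bam        | cut -f1 > full_names.txt
#   samtools view merged_accepted_hits.bam | cut -f1 > merged_names.txt
#   names_only_in_first full_names.txt merged_names.txt > missing_names.txt
```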

    Thanks for the help,

    Bobbie.

  • #2
    split files

    I have split fastq files to run Tophat. From what I understand, this is a fairly common practice. Here is a hypothetical example:

    #split read 1 into smaller files after every 40,000,000 lines (a multiple of 4, so each fastq record stays whole)
    split -l 40000000 wholefile_read1.fastq
    #rename resulting files
    mv xaa wholefile_read1_1.fastq
    mv xab wholefile_read1_2.fastq
    .
    .
    #split read 2 into smaller files after every 40,000,000 lines
    split -l 40000000 wholefile_read2.fastq
    #rename resulting files
    mv xaa wholefile_read2_1.fastq
    mv xab wholefile_read2_2.fastq
    .
    .
    #align split files with tophat
    tophat -o out_1 -G mm10.gtf mm10 wholefile_read1_1.fastq wholefile_read2_1.fastq
    tophat -o out_2 -G mm10.gtf mm10 wholefile_read1_2.fastq wholefile_read2_2.fastq
    .
    .
    #use samtools to put the bam files back together
    samtools merge out.bam out_1/accepted_hits.bam out_2/accepted_hits.bam
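The manual renaming above can also be avoided with a loop. The following is a minimal sketch, not a tested pipeline: the helper names `split_fastq` and `print_tophat_jobs` and the chunk prefixes are assumptions, and the second helper only prints the tophat commands so they can be inspected or handed to a cluster scheduler. Chunk files keep split's default suffixes (aa, ab, ...) rather than a .fastq extension, so rename them if your setup needs one.

```shell
# Hypothetical helper: split one FASTQ into chunks of whole records.
# $1 = input FASTQ, $2 = records per chunk, $3 = output prefix
split_fastq() {
    split -l $(( $2 * 4 )) "$1" "$3"   # 4 lines per FASTQ record
}

# Hypothetical helper: print (not run) one tophat command per chunk pair.
# $1 = read 1 chunk prefix, $2 = read 2 chunk prefix
print_tophat_jobs() {
    local r1 r2 suffix n1 n2
    for r1 in "$1"??; do
        suffix=${r1#"$1"}              # aa, ab, ... as assigned by split
        r2="$2$suffix"
        [ -e "$r2" ] || { echo "no mate chunk for $r1" >&2; return 1; }
        n1=$(( $(wc -l < "$r1") ))
        n2=$(( $(wc -l < "$r2") ))
        # Mates must stay in step, so both chunks need the same line count.
        [ "$n1" -eq "$n2" ] || { echo "line counts differ for $r1 and $r2" >&2; return 1; }
        echo "tophat -o out_$suffix -G mm10.gtf mm10 $r1 $r2"
    done
}

# Example usage with the file names from the post above:
#   split_fastq wholefile_read1.fastq 10000000 read1_chunk_
#   split_fastq wholefile_read2.fastq 10000000 read2_chunk_
#   print_tophat_jobs read1_chunk_ read2_chunk_ > jobs.txt
#   # after all jobs finish:
#   samtools merge out.bam out_*/accepted_hits.bam
```

Ten million records per chunk matches the 40,000,000 lines used above; smaller chunks give more jobs to spread across the cluster.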



    • #3
      I didn't answer your question

      I guess I did not exactly answer your question though. I do not know if there is any difference in results when the files are split. I do know that my very experienced co-worker does it all the time. That does not necessarily help.
