  • Split fastq files for tophat analysis

    Hi,

    Does anyone see anything wrong with splitting fastq files, aligning each piece with tophat, and then merging the results afterwards?

    The reason why I want to split them is to be able to make greater use of the cluster we have available.

    I am able to split the fastq files using a script I wrote in Perl. The merging seems to work, except that a few reads are missing when I compare the merged output from the split fastq files with the output from running the whole file through tophat.

    For example, the split paired-end tophat run produces the following samtools flagstat output:
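For anyone wanting to reproduce the splitting step, here is a minimal sketch in Python (not the Perl script mentioned above, which was not posted). It assumes plain 4-line fastq records with no wrapped sequence lines, and that mates appear in the same order in both files. The function names `split_fastq`, `read_names` and `chunks_in_sync` are made up for this example.

```python
import itertools


def split_fastq(path, records_per_chunk, prefix):
    """Split a fastq file into chunks of whole 4-line records.

    Returns the list of chunk file names that were written.
    """
    chunk_paths = []
    with open(path) as handle:
        for index in itertools.count(1):
            # Read 4 lines per record so that no record is cut in half.
            lines = list(itertools.islice(handle, 4 * records_per_chunk))
            if not lines:
                break
            if len(lines) % 4 != 0:
                raise ValueError(f"{path}: truncated fastq record near the end")
            chunk_path = f"{prefix}_{index}.fastq"
            with open(chunk_path, "w") as out:
                out.writelines(lines)
            chunk_paths.append(chunk_path)
    return chunk_paths


def read_names(path):
    """Yield the read name from every 4th line, without a /1 or /2 suffix."""
    with open(path) as handle:
        for header in itertools.islice(handle, 0, None, 4):
            name = header[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def chunks_in_sync(chunk_r1, chunk_r2):
    """Return True if both chunks hold the same read names in the same order."""
    return list(read_names(chunk_r1)) == list(read_names(chunk_r2))
```

Calling `split_fastq` on the read 1 and read 2 files with the same `records_per_chunk`, then running `chunks_in_sync` on each pair of chunks, is a quick way to confirm the mates still line up before submitting the jobs.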

    $ samtools flagstat merged_accepted_hits.bam
    37716745 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    37716745 + 0 mapped (100.00%:nan%)
    37716745 + 0 paired in sequencing
    19017603 + 0 read1
    18699142 + 0 read2
    35853292 + 0 properly paired (95.06%:nan%)
    35974826 + 0 with itself and mate mapped
    1741919 + 0 singletons (4.62%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    While the paired-end tophat run on the full fastq files produces

    $ samtools flagstat accepted_hits.bam
    37739551 + 0 in total (QC-passed reads + QC-failed reads)
    0 + 0 duplicates
    37739551 + 0 mapped (100.00%:nan%)
    37739551 + 0 paired in sequencing
    19028732 + 0 read1
    18710819 + 0 read2
    35896074 + 0 properly paired (95.12%:nan%)
    36017796 + 0 with itself and mate mapped
    1721755 + 0 singletons (4.56%:nan%)
    0 + 0 with mate mapped to a different chr
    0 + 0 with mate mapped to a different chr (mapQ>=5)

    The difference in properly paired reads is only 0.06 percentage points (95.06% vs 95.12%), but the split run may be missing some useful information. I have checked the split files, and the line counts match the original exactly.
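To put numbers on the gap, here is a small sketch that subtracts the two flagstat outputs quoted above. The counts are copied from this post; the dictionary keys and the `compare` helper are just names chosen for the example. Note that flagstat counts records in the BAM file, so if multi-mapped reads are reported more than once these are not necessarily counts of unique reads.

```python
# Counts copied from the two samtools flagstat outputs in this post.
split_run = {
    "total": 37716745,
    "read1": 19017603,
    "read2": 18699142,
    "properly_paired": 35853292,
    "singletons": 1741919,
}
whole_run = {
    "total": 37739551,
    "read1": 19028732,
    "read2": 18710819,
    "properly_paired": 35896074,
    "singletons": 1721755,
}


def compare(whole, split):
    """Return whole-minus-split for every flagstat category."""
    return {key: whole[key] - split[key] for key in whole}


diff = compare(whole_run, split_run)
# Fraction of records in the whole-file run that are absent from the merged run.
missing_fraction = diff["total"] / whole_run["total"]

for key, value in diff.items():
    print(f"{key:16s} {value:+d}")
print(f"missing records: {missing_fraction:.4%} of the whole-file run")
```

The merged run has 22,806 fewer records in total (11,129 from read 1 and 11,677 from read 2), 42,782 fewer properly paired records, and 20,164 more singletons than the whole-file run.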

    This thread suggests that some "low abundance splice sites" are lost when the input is split: http://seqanswers.com/forums/showthr...at+fastq+split

    Would anyone have any more information about this?

    Thanks for the help,

    Bobbie.

  • #2
    split files

    I have split fastq files to run Tophat. From what I understand, this is a fairly common practice. Here is a hypothetical example:

    #split read 1 into smaller files of 40,000,000 lines each (a multiple of 4, so no fastq record is cut in half)
    split -l 40000000 wholefile_read1.fastq
    #rename resulting files
    mv xaa wholefile_read1_1.fastq
    mv xab wholefile_read1_2.fastq
    .
    .
    #split read 2 into smaller files after every 40,000,000 lines
    split -l 40000000 wholefile_read2.fastq
    #rename resulting files
    mv xaa wholefile_read2_1.fastq
    mv xab wholefile_read2_2.fastq
    .
    .
    #align split files with tophat
    tophat -o out_1 -G mm10.gtf mm10 wholefile_read1_1.fastq wholefile_read2_1.fastq
    tophat -o out_2 -G mm10.gtf mm10 wholefile_read1_2.fastq wholefile_read2_2.fastq
    .
    .
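    #use samtools to put the per-chunk bam files back together (tophat writes accepted_hits.bam into each output directory)
    samtools merge out.bam out_1/accepted_hits.bam out_2/accepted_hits.bam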
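With more than two chunks, typing these commands by hand gets error-prone, so here is a rough Python sketch that only builds the command lines for N chunks without running anything. It follows the hypothetical file names from the example above (`wholefile_read1_1.fastq`, `out_1`, `mm10`, `mm10.gtf`) and assumes each tophat output directory contains `accepted_hits.bam`. The `build_commands` function is invented for this illustration.

```python
import shlex


def build_commands(n_chunks, index="mm10", gtf="mm10.gtf",
                   stem="wholefile", merged="out.bam"):
    """Build one tophat command per chunk plus a single samtools merge command."""
    align_commands = []
    bam_files = []
    for i in range(1, n_chunks + 1):
        out_dir = f"out_{i}"
        align_commands.append([
            "tophat", "-o", out_dir, "-G", gtf, index,
            f"{stem}_read1_{i}.fastq", f"{stem}_read2_{i}.fastq",
        ])
        # Assumption: tophat writes its alignments to accepted_hits.bam here.
        bam_files.append(f"{out_dir}/accepted_hits.bam")
    merge_command = ["samtools", "merge", merged] + bam_files
    return align_commands, merge_command


align_commands, merge_command = build_commands(2)
for command in align_commands:
    print(shlex.join(command))
print(shlex.join(merge_command))
```

Each printed tophat line can be submitted as a separate cluster job, and the merge line is run once all of them have finished.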

    • #3
      I didn't answer your question

      I guess I did not exactly answer your question, though. I do not know whether the results differ when the files are split. I do know that my very experienced co-worker does it all the time, although that does not necessarily help.
