Splitting a FastQ file into two

Thias

Member

Join Date: Mar 2013

Posts: 45

Post
#1

07-25-2020, 01:21 PM

Hello folks,

I have a small problem that seemed trivial at first but has kept me busy for a while now. Maybe you can help...

Problem:

I need to split 615M paired reads, currently in two FastQ files, into two file pairs with 308M reads each.

Solution attempt A:

I tried line-count-based tools like split and awk without success: since newline characters occur in the quality scores, these tools (or rather I) messed things up badly.
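One way around line counting is to delimit records by length instead: a record ends once its quality string is as long as its sequence. The following is a minimal sketch of that idea, not a tested production tool. It assumes well-formed FASTQ, and the function name split_fastq and all file names are hypothetical.

```shell
# Hypothetical helper: split_fastq INPUT N OUT_A OUT_B
# Writes the first N records of INPUT to OUT_A and the rest to OUT_B.
# Record boundaries come from comparing sequence and quality lengths,
# so wrapped (multi-line) records work as well as plain 4-line ones.
split_fastq() {
  : > "$3"; : > "$4"                  # create or truncate both outputs
  awk -v n="$2" -v a="$3" -v b="$4" '
    state == 0 && $0 == "" { next }   # tolerate blank lines between records
    state == 0 {                      # header line starts a new record
      rec++; out = (rec <= n) ? a : b
      seqlen = 0; qlen = 0; state = 1
      print > out; next
    }
    state == 1 && /^\+/ {             # separator line ends the sequence
      print > out; state = 2; next
    }
    state == 1 {                      # sequence line, possibly one of several
      seqlen += length($0); print > out; next
    }
    state == 2 {                      # quality line, possibly one of several
      qlen += length($0); print > out
      if (qlen >= seqlen) state = 0   # quality complete, expect a header next
      next
    }
  ' "$1"
}
```

Calling it with the same N on both mate files (for example `split_fastq reads_1.fq 1000 a_1.fq b_1.fq` and likewise for `reads_2.fq`) should keep the pairs in sync, provided both files list the reads in the same order. Compressed input would have to be decompressed first.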

Solution attempt B:

Code:

bbmap/reformat.sh in=... in2=... out=... out2=... reads=308000000
bbmap/reformat.sh in=... in2=... out=... out2=... skipreads=308000000

resulted in

Input is being processed as paired
Input: 615307122 reads 91326404105 bases
Output: 615307122 reads (100.00%) 91326404105 bases (100.00%)

for the first command and in

Input is being processed as paired
Input: 615307122 reads 91326404105 bases
Output: 0 reads (0.00%) 0 bases (0.00%)

for the second command. In effect, the reads and skipreads parameters seem to be ignored (BBMap version 38.76).
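One possible reading of these numbers, offered as an assumption rather than a confirmed diagnosis: if the reported total counts both mates while the limit is applied per pair (worth checking against the tool's help text), then the input holds fewer pairs than the limit, and both results follow. The arithmetic can be sketched as below; describe_limit is a hypothetical helper name.

```shell
# Hypothetical helper: describe_limit TOTAL_READS LIMIT
# Assumes TOTAL_READS counts both mates and LIMIT is applied per pair.
describe_limit() {
  total_reads=$1   # total reported by the tool
  limit=$2         # value passed as reads= or skipreads=
  pairs=$((total_reads / 2))
  if [ "$limit" -ge "$pairs" ]; then
    echo "$pairs pairs: limit $limit covers all of them"
  else
    echo "$pairs pairs: limit $limit leaves $((pairs - limit)) for the second file pair"
  fi
}

describe_limit 615307122 308000000   # numbers from the log above
describe_limit 615307122 153826780   # hypothetical limit near half the pairs
```

Under that assumption, a limit of roughly half the pair count would be the value to try for an even split.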

Solution attempt C:

Code:

famas --in=... --in2=... --out=.XXXXXX.fq.gz --out2=.XXXXXX.fq.gz -x 308000000

flooded the output directory with thousands of subfiles (instead of the two files per mate that I actually needed) until the file system couldn't cope with the number of open files anymore and the process ran out of file descriptors (famas version 0.0.12).

ERROR(famas.c|open_output_one:1056): Couldn't open =...compressed.065534.fq.gz
ERROR(famas.c|main:1163): Couldn't open output files. Exiting...

Since it took me a while to clean up that mess on the cluster, I am somewhat reluctant to try out more tools now. Any ideas what I did wrong, or suggestions for tools that work better?

Thanks a lot for reading and for your help!
Thias
Tags: awk, bbmap, famas, fastq, split
GenoMax

Senior Member

Join Date: Feb 2008

Posts: 7142

Post
#2

07-27-2020, 05:15 AM

Cross-posted and answered at: https://www.biostars.org/p/451453/
Thias

Member

Join Date: Mar 2013

Posts: 45

Post
#3

07-27-2020, 09:15 AM

Indeed, the issue has since been solved using seqkit split2.
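The thread does not show the actual command, so the following is only a sketch of what a paired seqkit split2 call might look like. The wrapper name, the DRY_RUN switch, the file names, and the output directory are all hypothetical; -1/-2 name the two mate files, -p sets the number of parts, and -O sets the output directory.

```shell
# Hypothetical wrapper: split_pairs_with_seqkit READ1 READ2 OUTDIR
# Splits a paired FASTQ data set into two parts with seqkit split2.
# With DRY_RUN set, the command is printed instead of executed.
split_pairs_with_seqkit() {
  set -- seqkit split2 -1 "$1" -2 "$2" -p 2 -O "$3"
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Example (hypothetical file names):
# split_pairs_with_seqkit reads_1.fq.gz reads_2.fq.gz split_out
```

Note that -p distributes records across the parts rather than cutting the file at the midpoint; if the first and second halves must stay contiguous, splitting by a fixed number of records per part with -s may be the better fit.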

My apologies for not mentioning this here. I had posted here first, but the thread lingered in moderation for ~24h, so I decided to ask for help on Biostars. It had not yet shown up here when I got the answer on Biostars, and I subsequently forgot to check back. Thanks a lot to everyone nonetheless!