Hi I would like to know is there any tool for removing the unpaired fastq reads from a huge set of paired end fastq data?
Seqanswers Leaderboard Ad
Collapse
Announcement
Collapse
No announcement yet.
X
-
You said huge, but how many reads are there (roughly)? One million? Ten million? More?
Would a Python script using Biopython be useful? It would help to show a small sample of the reads (say the first ten) to ensure I understand your read naming convention. You can wrap the example with [ code ] and [ /code ] tags to format it nicely on the forum.
Comment
-
About 50 million reads... that would make any approach using a list of IDs in memory rather tricky.
Can we assume the read pairs are next to each other, e.g. you started out with this:
Code:read1.f read1.r read2.f read2.r read3.f read3.r read4.f read4.r read5.f read5.r
Code:read1.f read1.r read2.r read3.f read4.f read4.r read5.r
Code:read1.f read1.r read4.f read4.r
If the reads are in a random order, things will be much harder...
Comment
-
This is a general slow version (you can switch "fastq" to any other supported file format like "fasta" or "qual"):
Code:from Bio import SeqIO mixed_file = "mixed.fastq" paired_file = "paired.fastq" def get_paired(iterator): prev = None for curr in iterator: if curr.id.endswith("/1"): prev = curr elif not curr.id.endswith("/2"): raise ValueError("Expect IDs to end /1 and /2") elif prev and prev.id == curr.id[:-2] + "/1": yield prev yield curr prev = None records = get_paired(SeqIO.parse(mixed_file, "fastq")) count = SeqIO.write(records, paired_file, "fastq") print "Saved %i records (%i pairs)" % (count, count/2)
Code:from Bio.SeqIO.QualityIO import FastqGeneralIterator mixed_file = "mixed.fastq" paired_file = "paired.fastq" out_handle = open(paired_file, "w") prev = None for curr in FastqGeneralIterator(open(mixed_file, "rU")): if curr[0].split()[0].endswith("/1"): prev = curr elif not curr[0].split()[0].endswith("/2"): raise ValueError("Expect IDs to end /1 and /2,\n%s" % curr[0]) elif prev and prev[0].split()[0] == curr[0].split()[0][:-2] + "/1": out_handle.write("@%s\n%s\n+\n%s\n" % prev) out_handle.write("@%s\n%s\n+\n%s\n" % curr) prev = None out_handle.close()
This was written and tested using Biopython 1.54beta, should be fine on Biopython 1.51 or later. I've only used a 500MB test file - it took about 30s.
If you are willing to make more assumptions about the file layout, or skip the minimal error checking, I could make this faster still - but I would need to see a sample of the data as suggested before (e.g. the first ten reads).
Comment
-
Another way to do this: first run tophat, and check the accepted_hits.sam and delete the line with "*"
HWI-EAS266_0005:1:32:2465:21083#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ^^[YVRK[bb^bbbb_babbb^bbbbbbaabbcbbb`bbbba NM:i:1
HWI-EAS266_0005:1:59:17297:6482#0 177 chr1 120 255 42M = 199385343 0 AAACCCTAACCCTAACCCTAACCCTAACCCCTAACCCTAACC ORXYaa\__bbbTbabbb`b[bbb``abbbbbb_bbb`b^bb NM:i:1
HWI-EAS266_0005:1:3:4093:8164#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG Rbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb NM:i:1
HWI-EAS266_0005:1:60:8918:19240#0 145 chr1 550 255 42M = 241327181 0 GTGCAGAGGAGAACGCAGCTCCGCCCTCGCGGTGCTCTCCGG BBBBa_\`a_`a[a[_aaaXTa```a_a[a_]]_`T]`__aa NM:i:1
HWI-EAS266_0005:1:79:5108:21091#0 73 chr1 1146 1 42M * 0 0 GCGCCCCCTGCTGGCGCCGGGGCGCTGCAGGGCCCTCTTGCT aQ_a`^ca_`a__c]`\cbbbbbb^bbb]ba^abbbbbbbbb NM:i:1
HWI-EAS266_0005:1:8:7442:18432#0 73 chr1 1168 1 42M * 0 0 CACTGCAGGGCCCTCTTGCTTACTGTATAGTGGTGGCACGCC _Sb_ab]bbb^_ba^b_b\`V]\b``b_ababbbababbbb` NM:i:0
HWI-EAS266_0005:1:35:18364:16230#0 147 chr1 1303 255 42M = 1458 0 TTCTTGCTCCAACAGTAGTGGCGGATTATAGGGAAACACCCG B_bb`bbcbbbbbcbbbbbcbbbbb`bbbbbbbabababbbb NM:i:2
HWI-EAS266_0005:1:26:17783:8912#0 137 chr1 1585 0 42M * 0 0 GGTATTTTTTTAAATTTCCACTGATGATTTTGCTGCATGGCC BBBBBababaaaa_bb]bb`bbbacbbbcc_bc`bbcbbbbb NM:i:1
Is this method reasonable?
Comment
-
I have tried to run your script but the problem is that my file does not have the /1 or /2 ending. In my case the file is told with 1 or 2 before the N on the last letters of the file (bold) have two separate fastq files which look like:
File 1:
@M00289:3:000000000-A752F:1:1102:13093:3364 1:N:0:20
GGCTGTGATCACGGGGCAAAACAGCCACTTCATTAACTTTCAGTAAGGGCTGCAAGGTAACTCCAGCAGGTGCCTTTTGTGTTTCACTCAACCCTAGGTCAAAACGTTGTTCACTCATGGCTTGTTCCAACCAAGGCGACTCTAAGG
+
BCCCBCFFFFFFGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHGHHHHHGHGGHHHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHHHHGHHHGHHHHHHHHHHHHHGGHHHHHHGGHHGHGGGGGHHHHH
@M00289:3:000000000-A752F:1:1110:7454:18361 1:N:0:20
ATCTCAACGCGTCCAAATGAGGCATCGCTGTATTCAGGTTACTTTACATAAGAGTTTTTATGTTAAATAGGACTAAAAATATACTCTAATTTTAGAGTTTTCTTTTAGGTATGATGTAAAAACATACAAGCCTAAGAGTTTAATTTAAAGG
+
CCCCCFFFDDDDGGGGGGGGGGGGHHGGHGGHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGHHHHHHHHGHHHHHHHHHHGHHHHHHGG
FIle 2:
@M00289:3:000000000-A752F:1:1112:9370:8062 2:N:0:20
CCACACCCCCTGCAGAGCGTTCTCGCAGACACAGTCCGCAAAGCCAGTGCCGACTTGAGCCACCTTGACCAGTGTTTTTATTAGAACTAGAAACTAGAGGATTTGTTGCAC
+
BBCDDCCEDEEEGGGGGGGGGGGHGGGGGHHHHHHHHGGGGGHHHFHGHHHGCGGGHHHHHHHHHHHGHHHHHGGHHHGHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
@M00289:3:000000000-A752F:1:1114:14694:7357 2:N:0:20
TAATTTAACTGTAGTTCATCCGCATTTTCCCCTCTAATGCCCCAGTTTTCTTTAGGGGTTTCAAATATAGTTATCTCAACATCAT
+
BBBBCFFFFFFFGGGGGGGGGGGGGGHHHHHHHGHHHHHHHHGGGHHHHHHHHHHHGGEGGHHHHHHHHHHHHHHHHHHHHHHFH
@M00289:3:000000000-A752F:1:1102:13093:3364 2:N:0:20
CCTTAGAGTCGCCTTGGTTGGAACAAGCCATGAGTGAACAACGTTTTGACCTAGGGTTGAGTGAAACACAAAAGGCACCTGCTGGAGTTACCTTGCAGCCCTTACTGAAAGTTAATGAAGTGGCTGTTTTGCCCCGTGATCACAGCC
+
BBBBCFFFFFCDGGGGGGGGGGHHHHHHGHHHHHHHHHHHHHGHHHGHGHHGHHHHHHGHHGHHHHHHHHGHHHGHHHGHHHHHHHHHHHHHHHHHFHHHGGHHHHHHHHHHHHHHHHHGHHHGHG0FGH03GHGF<DC2FFGHHHH
As you can see the reads are not correlative in each file so I supposse this makes the thing more complicated. My files are not very big (5.2Mb and 5.4 Mb respectively) so I wonder if you know a way for selecting only the ones which have data for both ends.
This next script should be about four times faster, but is only suitable for FASTQ files:
Code:from Bio.SeqIO.QualityIO import FastqGeneralIterator mixed_file = "mixed.fastq" paired_file = "paired.fastq" out_handle = open(paired_file, "w") prev = None for curr in FastqGeneralIterator(open(mixed_file, "rU")): if curr[0].split()[0].endswith("/1"): prev = curr elif not curr[0].split()[0].endswith("/2"): raise ValueError("Expect IDs to end /1 and /2,\n%s" % curr[0]) elif prev and prev[0].split()[0] == curr[0].split()[0][:-2] + "/1": out_handle.write("@%s\n%s\n+\n%s\n" % prev) out_handle.write("@%s\n%s\n+\n%s\n" % curr) prev = None out_handle.close()
This was written and tested using Biopython 1.54beta, should be fine on Biopython 1.51 or later. I've only used a 500MB test file - it took about 30s.
If you are willing to make more assumptions about the file layout, or skip the minimal error checking, I could make this faster still - but I would need to see a sample of the data as suggested before (e.g. the first ten reads).[/QUOTE]Last edited by chariko; 08-05-2014, 07:15 AM.
Comment
-
Originally posted by chariko View PostI have tried to run your script but the problem is that my file does not have the /1 or /2 ending. In my case the file is told with 1 or 2 before the N on the last letters of the file (bold) have two separate fastq files which look like: ...
As you can see the reads are not correlative in each file so I supposse this makes the thing more complicated. My files are not very big (5.2Mb and 5.4 Mb respectively) so I wonder if you know a way for selecting only the ones which have data for both ends.
I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:
Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.
Do you have the original Illumina FASTQ files before whatever processing was done to them?
Peter
Comment
-
Originally posted by maubp View PostIllumina changed their naming schema. There are scripts out there to convert this back to the /1 and /2 style (even as Unix one-line commands with sed or awk).
I have a related Python script (which has a Galaxy wrapper) which knows more naming schemes, but it expects the reads to be sorted/interleaved - yours are more mixed up:
Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.
Do you have the original Illumina FASTQ files before whatever processing was done to them?
Peter
Comment
-
Either use a pair-aware filtering pipeline, or, I suggest:
1. Interleave the pairs (one FASTQ file with r1, r2, r1, r2, etc)
2. Filter the interleaved file (preserving the original order)
3. Run https://github.com/peterjc/pico_gala...aired_unpaired
Comment
-
Originally posted by chariko View PostI have tried to run your script but the problem is that my file does not have the /1 or /2 ending. In my case the file is told with 1 or 2 before the N on the last letters of the file (bold) have two separate fastq files
Comment
-
Thanks SES for your suggestion. I will try it also.
Anyway, I finally made it work by doing the following:
First I changed my ending form 1:N to /1 and 2:N to /1 and /2 with:
gawk '{print((NR % 4 == 1) ? $1"/1" : $0)}' Sample1_R1.fq > Sample1_newtags_R1.fq
gawk '{print((NR % 4 == 1) ? $1"/2" : $0)}' Sample1_R2.fq > Sample1_newtags_R2.fq
(https://wikis.utexas.edu/display/bio...nux+one-liners).
I interleaved the pairs using the script interleave_fastq.py (https://gist.github.com/ngcrawford/2232505).
I filter my fastq file using the following script:
(https://github.com/brentp/bio-playgr...r/reads-utils/)
the problem is that the filtered file does not mantain the order of paired reads so you will have it to do it manually (in my case with Excel).
And finally after installing galaxy locally I run the galaxy script made by maubp.
Galaxy tools and wrappers for sequence analysis. Contribute to peterjc/pico_galaxy development by creating an account on GitHub.
When running again fastqc with my files, the duplication levels where ok.
Thanks a lot for your suggestions.Last edited by chariko; 08-08-2014, 06:44 AM.
Comment
Latest Articles
Collapse
-
by seqadmin
Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...-
Channel: Articles
09-23-2024, 06:35 AM -
-
by seqadmin
During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.
Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...-
Channel: Articles
09-09-2024, 10:59 AM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 10-02-2024, 04:51 AM
|
0 responses
13 views
0 likes
|
Last Post
by seqadmin
10-02-2024, 04:51 AM
|
||
Started by seqadmin, 10-01-2024, 07:10 AM
|
0 responses
22 views
0 likes
|
Last Post
by seqadmin
10-01-2024, 07:10 AM
|
||
Started by seqadmin, 09-30-2024, 08:33 AM
|
0 responses
26 views
0 likes
|
Last Post
by seqadmin
09-30-2024, 08:33 AM
|
||
Started by seqadmin, 09-26-2024, 12:57 PM
|
0 responses
19 views
0 likes
|
Last Post
by seqadmin
09-26-2024, 12:57 PM
|
Comment