Header Leaderboard Ad
Collapse
matching up paired-end reads after fastx-toolkit filtering
Collapse
Announcement
Collapse
No announcement yet.
X
-
Download BBMap suite. "repair.sh" is the program that will sync paired-end data files. It is part of BBtools ahd has a guide available.
-
Originally posted by Armin_P View PostSince this has generally remained a common problem
There's no harm in rolling your own solution, but people should know (if they haven't followed the whole discussion) that there are working (and tested, documented) solutions.
Leave a comment:
-
Perl script to handle to issue
Since this has generally remained a common problem, I wrote a Perl script that removes the unpaired reads and matches the paired ones. Just generate a Perl file by copying and pasting the code below into a .pl file and run it. I hope, it will help as it helped me :-)
Code:#!/usr/bin/perl use strict; use warnings; use File::Basename; ######################## sub suffix_remover($); ################ ########################### ############################ ########################### =cut Should you decide to make this script publicly available, copy the suffix_remover() subroutine into here. This script takes as input two fastq files with the forward and reverse unmatched mates for a paired-end read data set. Input: provide on the command line, the paired-end read files (forward and reverse) containing reads that need to be matched. Output: the forward and reverse mate files will be matched and written to READNAME_sorted.fastq, where 'READNAME' is simply the name of the file provided. The READNAME_singletons.fastq files contain the singleton reads (reads with no matching mates) for both the input files. by: Armin PEYMANN =cut ################ ########################### ############################ ########################### sub sequence_separator ($); sub hash_maker_using_fastq_sequences (\@); ############################################ my $path_read1 = shift @ARGV; my $path_read2 = shift @ARGV; my $qx_output_read1 = qx(wc -l $path_read1); my ($number_of_lines_read1) = split(/\s/, $qx_output_read1); my $qx_output_read2 = qx(wc -l $path_read2); my ($number_of_lines_read2) = split(/\s/, $qx_output_read2); if ($number_of_lines_read1 % 4 != 0 || $number_of_lines_read2 % 4 != 0){ die "The number of lines in one of the files is not dividable by 4.\nFor each sequence in your fastq files, you must have" . " the following lines:\n" . "\@HEADER\nSEQUENCE\n+QUALITY-HEADER\nQUALITY VALUES\n" . "Also make sure there is no empty line at the end of your file.\n"; } my @read1 = @{sequence_separator($path_read1)}; my %read1 = %{hash_maker_using_fastq_sequences(@read1)}; undef @read1; my @read2 = @{sequence_separator($path_read2)}; my %read2 = %{hash_maker_using_fastq_sequences(@read2)}; undef @read2; my $dir_read1 = dirname($path_read1); my $read1_name_without_suffix = suffix_remover($path_read1); my $path_out_read1 = $dir_read1 . "/" . $read1_name_without_suffix . "_sorted.fastq"; open(FH_OUT1, ">$path_out_read1"); my $dir_read2 = dirname($path_read2); my $read2_name_without_suffix = suffix_remover($path_read2); my $path_out_read2 = $dir_read2 . "/" . $read2_name_without_suffix . "_sorted.fastq"; open(FH_OUT2, ">$path_out_read2"); my $path_out_read1_singletons = $dir_read1 . "/" . $read1_name_without_suffix . "_singletons.fastq"; open(FH_OUT3, ">$path_out_read1_singletons"); foreach my $key (sort keys %read1){ if (exists $read2{$key}){ print FH_OUT1 $read1{$key}; print FH_OUT2 $read2{$key}; }else{ print FH_OUT3 $read1{$key}; } } close FH_OUT1; close FH_OUT2; close FH_OUT3; print "sorted reads were written to:\n"; print "Check out $path_out_read1" . "\n"; print "Check out $path_out_read2". "\n" . "\n"; my $path_out_read2_singletons = $dir_read2 . "/" . $read2_name_without_suffix . "_singletons.fastq"; open(FH_OUT4, ">$path_out_read2_singletons"); foreach my $key (sort keys %read2){ unless (exists $read1{$key}){ print FH_OUT4 $read2{$key}; } } close FH_OUT4; print "single reads with no mate were written to:\n"; print "Check out $path_out_read1_singletons" . "\n"; print "Check out $path_out_read2_singletons" . "\n"; sub hash_maker_using_fastq_sequences (\@){ my $array_ref = shift; my @read = @{$array_ref}; my %read; foreach my $sequence (@read){ my $copy_sequence = $sequence; my ($first_header) = split(/\n|\r/, $copy_sequence); $first_header =~ /(.+)[# ]/; # one single space or '#' should cover most of fastq files. my $pair_id = $1; $read{$pair_id} = $sequence; } return \%read; } sub sequence_separator ($){ my $path_read = shift; my $line_counter = 0; my $event; my @events; open(FH_IN, $path_read) or die "unable to open FH_IN1\n"; while(my $line = <FH_IN>){ $line_counter++; if ($line_counter <= 4){ $event .= $line; } if ($line_counter == 4){ push(@events, $event); undef $event; $line_counter = 0 } } close FH_IN; return \@events; } sub suffix_remover($){ #get rid of the .<format> in the file name. (If there are more my $pathOfDesiredFile = shift; #than one dot in the file name, the shortest part of the file name will be taken!) $pathOfDesiredFile =~ /([^\/]+)(?!\/)$/; my $desiredFileName = $1 if $1; $desiredFileName =~ s/(\..+)(?!\.)// if $desiredFileName =~ /\./ ; return $desiredFileName; }
Leave a comment:
-
qsub is just for clusters using UGE/SGE (as far as I know). If you are trying to operate on a cluster with a different scheduler, you would need a different command; but if you just want to run in Linux without submitting a job to a scheduler, then your command in this case would simply be "sh test.sh".
Leave a comment:
-
Originally posted by luqmanaslam View PostThanks for this, yes these IDs are not generated by any sequencing platform. These come out of Stacks pipeline when you use demultiplexing and quality filtering of RAD data.
Actually, I got another different issue. I started using a Linux machine from a different lab and I tried to submit script using qsub command and it says
-bash: qsub: command not found
when I write
which qsub
I get following
/usr/bin/which: no qsub in (/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/user/bin)
When I tried
man qsub
it opens qsub help and instruction.
I don't understand why I am not able to submit my shell script
I am trying to use following codes for submission
qsub -q all.q -cwd test.sh
I will appreciate any help or guidance in this regard.
Thanks,
Leave a comment:
-
Thanks for this, yes these IDs are not generated by any sequencing platform. These come out of Stacks pipeline when you use demultiplexing and quality filtering of RAD data. It will be good to consider this in the script which will widen the usage of program straight away for many other people like me.
Actually, I got another different issue. I started using a Linux machine from a different lab and I tried to submit script using qsub command and it says
-bash: qsub: command not found
when I write
which qsub
I get following
/usr/bin/which: no qsub in (/usr/lib64/qt-3.3/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/user/bin)
When I tried
man qsub
it opens qsub help and instruction.
I don't understand why I am not able to submit my shell script
I am trying to use following codes for submission
qsub -q all.q -cwd test.sh
I will appreciate any help or guidance in this regard.
Thanks,
Leave a comment:
-
The expected pair information is missing from these reads, at least I have not seen data of this format. The solution is to add the pair information and then pair the reads.
Code:pairfq addinfo -i Read1.fastq -o Read1_pairinfo.fastq -p 1 pairfq addinfo -i Read2.fastq -o Read2_pairinfo.fastq -p 2
Code:pairfq makepairs -f Read1_pairinfo.fastq \ -r Read2_pairinfo.fastq \ -fp Read1_p.fastq \ -rp Read2_p.fastq \ -fs Read1_s.fastq \ -rs Read2_s.fastq \ --stats
Leave a comment:
-
Pairfq
Hi! I am having issue in using pairfq. Probably, it does not identify Illumina 1.9 fastq format and gives me following error.
ERROR: Could not determine FASTA/Q format. Please see https://github.com/sestaton/Pairfq or the README for supported formats. Exiting.
could you please update it to solve this problem?
If someone has and can share tool and/or script to sort and separate paired-end common and unique reads.
My reads and header identifiers look like this
Read1.fastq
@6_1101_1236_2393_1
AATTCATAAAAAACAACTGAATTGGATCACTGTAGTACAAATTGCAAACATTCACTATGGGCAAACAAGNNNNNNNNNNNNNNNNNNNNNNNNNN
+
FDDHHHHFJJJIIJJIIIJGHHIJJEHHJJIJHIGIJJJJJHIIIJJJIJIJGIIJJJJIJHIIJJGHH##########################
@6_1101_1728_2439_1
AATTCATCGGGAACGACCTGTTGCCTTTTCAAATCTTTCTAAGTTATTCTGTTCAGCTGCAAAGACTTGACATATTTNNNNNNNNNNNNNNNNNN
+
FFFHHGHHJJJGIJJJJJJJJJJJJJJJFGGIJJIIIIJJJIJIIGJJJJJJJJJJIIJJJFJGFHDHHFFFFCEEE##################
Read2.fastq
@6_2316_20850_100973_2
AATAAAATCAATCCAGACTAGCGGCACATTTTGACTGTTAAGTTTGAACTTCCTAAAATCTGTGCAGGCTTTTAAGCTGTACTGTTTTTCTT
+
HFHHHIJJIJJIJJIDHIJJFHIBHGJIBHIJICHIIJJ>CBFBFH=CFIIIJI)[email protected]@>@CDCCDBCBCDDDDDD
@6_2316_21147_100977_2
GCTGTTTTTAAATTTAAAATTTAAAAAGTGCTTTTTGTTGTTGTTTTTTTAAAAGGAAAAGGCTTCCTATAACTTCTCATGCTGGACAACAC
+
HHHHHJJJJEGHJIJJJJJJJJIIHHIJHGHIJJJJJIJJJJGHIIJJJJJJEH=AHFF>DFDEDEEDDD>CDDEDDDDAACDCCBDAAABD
Thanks
Leave a comment:
-
SES,
I uploaded v32.27 which fixes the bug you found in repair.sh. I'd be interested in knowing whether it still prints directory information on your system.
-Brian
Leave a comment:
-
Originally posted by Brian Bushnell View PostSES,
Thanks for reporting that! I'll have a fixed version of the shellscript up later today. If you want to test it in the meantime, you can replace
repair.sh
with
java -ea -Xmx8g -cp /path/to/bbmap/current/ jgi.SplitPairsAndSingles
...where "-Xmx8g" can be adjusted to be around 80% of your node's physical memory.
For a menu, "repair.sh" or "repair.sh -h" will still work and they should not run ulimit.
But I'm a bit confused about your comment on calculating files in mounted file systems. My cluster is attached to several multi-petabyte file systems containing hundreds of millions of files, a couple of which are very slow, and "ulimit -v" always returns an answer instantly even if "ls" takes several seconds to respond.
Leave a comment:
-
SES,
Thanks for reporting that! I'll have a fixed version of the shellscript up later today. If you want to test it in the meantime, you can replace
repair.sh
with
java -ea -Xmx8g -cp /path/to/bbmap/current/ jgi.SplitPairsAndSingles
...where "-Xmx8g" can be adjusted to be around 80% of your node's physical memory.
For a menu, "repair.sh" or "repair.sh -h" will still work and they should not run ulimit.
But I'm a bit confused about your comment on calculating files in mounted file systems. My cluster is attached to several multi-petabyte file systems containing hundreds of millions of files, a couple of which are very slow, and "ulimit -v" always returns an answer instantly even if "ls" takes several seconds to respond.
Leave a comment:
-
Originally posted by Brian Bushnell View PostBBTools now has a tool to quickly re-pair arbitrarily disordered reads based on their names.
For interleaved reads:
repair.sh -Xmx29g in=reads.fq out=fixed.fq outsingle=single.fq
For paired reads in two files:
repair.sh -Xmx29g in1=read1.fq in2=read2.fq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq
You can also repair simple broken interleaving much faster and with less memory, but this will not fix arbitrarily disordered reads, just reads that were interleaved and had some of the reads thrown away:
bbsplitpairs.sh in=reads.fq out=fixed.fq outsingle=single.fq fixinterleaving
For reference, I was simply trying to compare this to Pairfq (a tool for pairing reads, which was mentioned above), since I am the author of that tool.
Leave a comment:
-
BBTools has a tool to quickly re-pair arbitrarily disordered reads based on their names.
For interleaved reads:
repair.sh in=reads.fq out=fixed.fq outsingle=single.fq
For paired reads in two files:
repair.sh in1=read1.fq in2=read2.fq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq
You can also repair simple broken interleaving much faster and with less memory, but this will not fix arbitrarily disordered reads, just reads that were interleaved and had some of the reads thrown away:
bbsplitpairs.sh in=reads.fq out=fixed.fq outsingle=single.fq fixinterleavingLast edited by Brian Bushnell; 02-13-2015, 10:31 AM.
Leave a comment:
-
Originally posted by ranu1 View PostAfter removing the adapters from cutadapt i got unsymmetrical pair end file so I want to know the script that could remove the orphan reads and make the data symmetric although I made it using hash but its very slow.The above mention script is showing error..
Also, what do you mean when you say the script is showing error? It is not possible to know what the issue is based on that information alone.
Leave a comment:
Latest Articles
Collapse
-
by seqadmin
Targeted sequencing is an effective way to sequence and analyze specific genomic regions of interest. This method enables researchers to focus their efforts on their desired targets, as opposed to other methods like whole genome sequencing that involve the sequencing of total DNA. Utilizing targeted sequencing is an attractive option for many researchers because it is often faster, more cost-effective, and only generates applicable data. While there are many approaches...-
Channel: Articles
03-10-2023, 05:31 AM -
-
by seqadmin
Using automation to prepare sequencing libraries isn’t a new concept, and most researchers are aware that there are numerous benefits to automating this process. However, many labs are still hesitant to switch to automation and often believe that it’s not suitable for their lab. To combat these concerns, we’ll cover some of the key advantages, review the most important considerations, and get real-world advice from automation experts to remove any lingering anxieties....-
Channel: Articles
02-21-2023, 02:14 PM -
ad_right_rmr
Collapse
News
Collapse
Topics | Statistics | Last Post | ||
---|---|---|---|---|
Started by seqadmin, 03-17-2023, 12:32 PM
|
0 responses
7 views
0 likes
|
Last Post
by seqadmin
03-17-2023, 12:32 PM
|
||
Started by seqadmin, 03-15-2023, 12:42 PM
|
0 responses
17 views
0 likes
|
Last Post
by seqadmin
03-15-2023, 12:42 PM
|
||
Started by seqadmin, 03-09-2023, 10:17 AM
|
0 responses
66 views
1 like
|
Last Post
by seqadmin
03-09-2023, 10:17 AM
|
||
Started by seqadmin, 03-03-2023, 12:03 PM
|
0 responses
64 views
0 likes
|
Last Post
by seqadmin
03-03-2023, 12:03 PM
|
Leave a comment: