Seqanswers Leaderboard Ad

**carmeyeii** · 01-17-2013, 05:52 PM

Originally posted by bioliyezhang View Post

maybe try FixMateInformation of picard tools.
Not sure whether it will work.

I believe FixMateInformation is to be used on ALIGNED SAM files, to make sure both mates of a paired-end read have the same info, and not on pre-mapped fastq files to ensure both files contain the same mate pairs and in the same order

**carmeyeii** · 01-17-2013, 06:29 PM

Dear btmb,
I'm afraid I still cannot run it. Sorry to keep bothering?

I have corrected tabs and spaces to avoid getting the Unexpected indent Error,

but now I get:

Traceback (most recent call last):
File "fastq_mate_pair", line 61, in <module>
if count==1:
NameError: name 'count' is not defined

Thanks again for any help,

Carmen

**a_mt** · 01-17-2013, 10:09 PM

Hi bmtb,

I hope this is what you are looking for. Even I had the same problem where reads in forward and reverse files were not in order. I use this shell script to extract common reads and orphan reads. It produces 4 files forward_matched, reverse_matched, forward_orphan and reverse_orphan.

Usage is simple

Code:

filteresCommonPEreads.sh forward.fastq reverse.fastq

Hope this helps.

P.S : change file extension to sh.

Attached Files

filteredCommonPEreads.txt (1.8 KB, 876 views)

**etwatson** · 10-21-2013, 02:11 PM

btmb,

I got your edited script to run today, thank you. However, it prints nothing out to the created files. I get the following error.

finished reading in: 46na_1_trimmed_clipped.fastq
finished reading in: 46na_1_trimmed_clipped.fastq
number of paired reads: 37819703
number of single reads left from 46na_1_trimmed_clipped.fastq: 3984165
number of single reads left from 46na_2_trimmed_clipped.fastq: 4582643
Traceback (most recent call last):
File "/usr/bin/fastqcombinepairedend_edit.py", line 106, in <module>
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')
KeyError: 'HWI-ST594:1:2110:2163:8239#TTAGGCTTAGGC'

Here are the corresponding lines in the script:

print 'number of single reads left from %s: %d'%(setoffiles[1],len(single2))
for label in single1:
SIN.write('@' + str(label) + '1\n' + str(seqs1[label]) + '\n+' + str(label) + '1\n' + str(quals1[label]) + '\n')

It looks like the dictionary (label) contains nothing, and I can not find it anywhere else in the script. This is my first Python script to use and I would like to get it running. How would I go about fixing this?

**sphil** · 10-28-2013, 05:19 AM

Hi, maybe I'm tot late but for someone looking into this. this neat one liner does the job as well:

Code:

paste file1_PE_DATA.fastq file2_PE_DATA.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf  | head | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'

it joins, sorts and splits again.

best,

Phil

**Bo-AAU** · 12-11-2013, 11:48 AM

Originally posted by sphil View Post

Hi, maybe I'm tot late but for someone looking into this. this neat one liner does the job as well:

Code:

paste file1_PE_DATA.fastq file2_PE_DATA.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | shuf  | head | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'

it joins, sorts and splits again.

best,

Phil

Hey Phil

Maybe I didn't understand but I used the "cat" command in front of the line and ran it. It produced two files, but the script just finished without error (thought it was done) after writing only a couple of reads in file1 and just some "2:N:0:ACTGAT" lines in file2. What is the problem here do you think?

-Bo

**sphil** · 12-12-2013, 05:45 AM

try this one:
paste pe_file1.fastq pe_file2.fastq | awk '{ printf("%s",$0); n++; if(n%4==0) { printf("\n");} else { printf("\t\t");} }' | sed 's/\t\t/\n/g' | awk '{print $1 > "file1.fastq"; print $2 > "file2.fatsq"}'

the shuf and head command kinda crashed it. sry..

**newuserofrnaseq** · 01-13-2014, 10:58 AM

count and name are not defined

I have the same problem as carmeyeii.
After declare count=0 name=None in fastqcombinepairedend.py, it gives me the empty files as following:

./PECombiner.sh
finished reading in: /fastqData/SRR579533_1_trimmed_clipped.fastq
finished reading in: /fastqData/SRR579533_2_trimmed_clipped.fastq
number of paired reads: 0
number of single reads left from /fastqData/SRR579533_1_trimmed_clipped.fastq: 0
number of single reads left from /fastqData/SRR579533_2_trimmed_clipped.fastq: 0

Is the way I declare the variable is not right?
Something went wrong here, is anybody can help?

Thank you in advance,

Originally posted by carmeyeii View Post

Dear btmb,

I have tried the code and it seems not to be working

'name' and 'count' are not defined before they are evaluated it seems, and the 'count' block seems to me as it sets to 0 when the header is encountered, but I'm not sure. Are there any python-saavy bioinformaticians that would care to help us out with this code?

Thanks!

**SES** · 02-24-2014, 07:40 AM

I also struggled to get that script to work and I was a little frustrated once I did get it working. Mainly because it uses a lot of memory, as someone commented previously, but also it strips the pair information off the output and creates hardcoded file names.

I ended up writing my own tool for pairing reads called Pairfq. The problem I kept running into is that most approaches assume 4 line Fastq as input and the sequence name has to be in a certain format. That means you have to come up with different ways to solve this simple task if you are using Fasta or your sequence names are a little different. It was my aim to try and solve these problems.

Here is an example of the usage:

Code:

$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq

My observations are that the above command uses about 43% as much memory as the Python script listed above in the thread. This command is a bit slower because it is not making any assumptions about the format (see below). It is also possible to specify that an index should be used. For example,

Code:

$ pairfq makepairs -f s_1_1_trimmed.fq \
-r s_1_2_trimmed.fq \
-fp s_1_1_trimmed_p.fq \
-rp s_1_2_trimmed_p.fq \
-fs s_1_1_trimmed_s.fq \
-rs s_1_2_trimmed_s.fq \
--index

This will result in almost no memory being used (15 MB RAM actually). The execution will be much slower with this option, but this is the only method to my knowledge that can handle pairing really large sequence sets without a big memory machine.

The input can be Fasta or Fastq, compressed (with gzip or bzip2) or uncompressed, and the sequence identifiers can be in Casava 1.4 or 1.8+ format as explained on the project wiki (note that pairing the reads is just one of the functions of Pairfq). The outputs are separate files of paired and unpaired forward and reverse reads (which can be optionally compressed).

Hopefully, this will save you some time and help to avoid crafting custom shell commands for this task.

**Brian Bushnell** · 02-24-2014, 10:26 AM

I also have a tool for fixing paired reads from interleaved files. It's extremely fast (over 200MB/s) and uses little memory, but cannot fix 2-file input, only single-file interleaved. It's based on the fact that if you corrupt an interleaved file by throwing away some of the reads, all remaining pairs will still be adjacent to their mate. So, you never need to store more than 1 read in memory.

To run it, download BBMap and us use this:
bbsplitpairs.sh in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

It reads/writes fasta or fastq, raw or gzipped. It can also do trimming and quality filtering, for example:

bbsplitpairs.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlen=40
This will quality-trim the right and left sides of reads to Q10 and discard anything that ends up less than 40bp. For pairs, if one read is shorter than 40bp and the mate is longer, the longer one will be written to the 'outsingle' destination and the shorter will be discarded. in2 and out2 are only needed for paired 2-file input. If a single file is input, then whether it is interleaved or single-ended will be autodetected based on read names, but this can be overridden with the 'int=t' or 'int=f' flag.

**SES** · 02-24-2014, 10:58 AM

Originally posted by Brian Bushnell View Post

I also have a tool for fixing paired reads from interleaved files. It's extremely fast (over 200MB/s) and uses little memory, but cannot fix 2-file input, only single-file interleaved. It's based on the fact that if you corrupt an interleaved file by throwing away some of the reads, all remaining pairs will still be adjacent to their mate. So, you never need to store more than 1 read in memory.

To run it, download BBMap and us use this:
bbsplitpairs.sh in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

It reads/writes fasta or fastq, raw or gzipped. It can also do trimming and quality filtering, for example:

bbsplitpairs.sh in=read1.fq in2=read2.fq out=trimmed1.fq out2=trimmed2.fq outsingle=singletons.fq qtrim=rl trimq=10 minlen=40
This will quality-trim the right and left sides of reads to Q10 and discard anything that ends up less than 40bp. For pairs, if one read is shorter than 40bp and the mate is longer, the longer one will be written to the 'outsingle' destination and the shorter will be discarded. in2 and out2 are only needed for paired 2-file input. If a single file is input, then whether it is interleaved or single-ended will be autodetected based on read names, but this can be overridden with the 'int=t' or 'int=f' flag.

I tried the command above, and I can't get it to run. Here is the output:

Code:

$ ./bbsplitpairs.sh in=bbsplittest_trimmed_cat.fastq out=bbsplit_test_p.fq outsingle=bbsplit_test_s.fq fixpairs
: invalid option
Usage:	/bin/bash [GNU long option] [option] ...
	/bin/bash [GNU long option] [option] script-file ...
GNU long options:
	--debug
	--debugger
	--dump-po-strings
	--dump-strings
	--help
	--init-file
	--login
	--noediting
	--noprofile
	--norc
	--posix
	--protected
	--rcfile
	--rpm-requires
	--restricted
	--verbose
	--version
	--wordexp
Shell options:
	-irsD or -c command or -O shopt_option		(invocation only)
	-abefhkmnptuvxBCHP or -o option

If I try to execute with bash:

Code:

$ bash bbsplitpairs.sh in=bbsplittest_trimmed_cat.fastq out=bbsplit_test_p.fq outsingle=bbsplit_test_s.fq fixpairs
: command not foundne 3: 
: command not foundne 6: 
'bsplitpairs.sh: line 8: syntax error near unexpected token `{
'bsplitpairs.sh: line 8: `calcXmx () {

As a side note, you might want to put your scripts in directory because people get kind of annoyed when you untar a project and it inflates a bunch of scripts.

**Brian Bushnell** · 02-24-2014, 11:36 AM

Originally posted by SES View Post

As a side note, you might want to put your scripts in directory because people get kind of annoyed when you untar a project and it inflates a bunch of scripts.

Ahh, sorry about that; I'll change it.

The shellscripts are for bash, so you may have to say "bash bbsplitpairs.sh". If your environment does not have bash, then instead of the shellscript, you would do this:

java -Xmx200m -cp /path/to/current/ jgi.SplitPairsAndSingles in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

**SES** · 02-24-2014, 12:14 PM

Originally posted by Brian Bushnell View Post

The shellscripts are for bash, so you may have to say "bash bbsplitpairs.sh". If your environment does not have bash, then instead of the shellscript, you would do this:

java -Xmx200m -cp /path/to/current/ jgi.SplitPairsAndSingles in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

Thanks for the response. As you can see in my post above, that is how I executed the script ("bash bbsplitpairs.sh"; on Linux with bash 3.2.25) and it produced the error I posted. As far as the Java command, I will experiment and see what I find...

**SES** · 02-24-2014, 12:19 PM

Originally posted by Brian Bushnell View Post

Ahh, sorry about that; I'll change it.

The shellscripts are for bash, so you may have to say "bash bbsplitpairs.sh". If your environment does not have bash, then instead of the shellscript, you would do this:

java -Xmx200m -cp /path/to/current/ jgi.SplitPairsAndSingles in=broken.fq out=pairs.fq outsingle=singletons.fq fixpairs

Okay, I have tried the Java command and I got it run but it did not produce any pairs, whereas Pairfq and the Python script mentioned above produced the same result. What is the expected input?

**Brian Bushnell** · 02-24-2014, 12:27 PM

Can you post the top 16 or so lines from the input file?

The expected input is something like this:

@blahA /1
ACGT
+
????
@blahA /2
ACGT
+
????
@blahB /2
ACGT
+
????
@blahC /1
ACGT
+
????
@blahC /2
ACGT
+
????

In this case, "blahA /1" and "blahA /2" would be output as a pair, as would "blahC /1" and "blahC /2", while "blahB /2" would be output to singletons.

If the reads don't contain a " /1" and " /2" or a " 1:" and a " 2:", it won't work.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 30 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 32 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News