Seqanswers Leaderboard Ad

**GenoMax** · 06-23-2015, 02:00 PM

You can use one of the shell script ideas here to iterate over the set of files you have: http://stackoverflow.com/questions/1...-files-in-unix
and http://stackoverflow.com/questions/1...s-in-directory

If you are submitting to a job scheduler on a cluster a similar loop can be used.

**mgogol** · 06-24-2015, 06:48 AM

I bet you could do something snazzy with gnu parallel.

GNU parallel for Bioinformatics: my notebook

http://figshare.com/articles/GNU_parallel_for_Bioinformatics_my_notebook/822138

This tutorial follows Ole Tange’s GNU parallel tutorial ( http://www.gnu.org/software/parallel/parallel_tutorial.html ) but I tried to use some bioinformatics-related examples (align with BWA, Samtools, etc.. ).

**slefevre** · 08-20-2015, 04:13 AM

Hi - for starters, I am a completely newbie at this, so bare with me if it's basic.

I understand how to do a for loop with one object, but if you have paired-end reads, trim-galore takes two files as input, and I just can't figure out how to make a for loop including a list of pairs.... Would be easy if trim-galore accepted wildcards in the two filenames, but I am not sure of it does? Or is it possible to specify two objects to use as input? i.e.
for read1 in /*/xxxxR1.fastq
for read2 in /*/xxxxR2.fastq
do trim-galore etc etc "$read1$ "$read2"
done

Any thoughts?

**Michael.Ante** · 08-20-2015, 05:39 AM

You can deduce the read2 variable from read1:

Code:

for read1 in */xxxxR1.fastq; do read2=$(echo $read1| sed 's/R1.fastq/R2.fastq/'); trim-galore etc etc $read1 $read2 ; done

**mgogol** · 08-20-2015, 05:40 AM

Ah. The way I usually handle stuff like this is just I have a text file listing all the files - in this case, in two columns. Then I have a perl script with a for loop going through each line in the text file and generating the commands I want to run. It's useful later on in R or whatever as well.

**slefevre** · 08-20-2015, 05:46 AM

Originally posted by Michael.Ante View Post

You can deduce the read2 variable from read1:

Code:

for read1 in */xxxxR1.fastq; do read2=$(echo $read1| sed 's/R1.fastq/R2.fastq/'); trim-galore etc etc $read1 $read2 ; done

That makes sense - and even better, it seems to be working. Thank you so much!

**Michael.Ante** · 08-20-2015, 07:44 AM

Originally posted by mgogol View Post

Ah. The way I usually handle stuff like this is just I have a text file listing all the files - in this case, in two columns. Then I have a perl script with a for loop going through each line in the text file and generating the commands I want to run. It's useful later on in R or whatever as well.

In case you want to speed up a bit, you can use such a file as input for GNU parallel:

Code:

parallel -j 8 --colsep '\t'  trim-galore etc etc {1} {2} :::: read-pairs-location.tsv

Just adjust the number of threads (-j) according to your CPUs.

**yaseen.ladak** · 11-13-2015, 01:44 PM

Hello Michael,

I am trying to do exactly as you suggested just need a bit of help with that code.

what is --colsep and when you say '\t' means that you read-pairs-location.tsv is tab separated if it was a CSV you would write --colsep ',' please correct me if I am wrong. Also it means that it will use 8 threads for each of the trim_galore command?

I am currently testing on a Quad code i7 on my mac that has hyper threading so in total I have 8 cores 4 physical and 4 virtual so I can use -j8 although I need to test for what value of thread I get the best performance as using maximum threads generally gives lower performance.

Thanks,
Yaseen

**GenoMax** · 11-13-2015, 02:02 PM

That --colsep option should work. Give it a try

Otherwise convert your file into tsv format and then use the original \t option.

Having 8 cores does not mean that you will be able to use them efficiently. They are connected to your storage subsystem through a common bus which can only allow a certain amount of data to be read/written.

Again experiment with a small subset of data (test file) starting with 4 cores and go up and down in number to see what number works best for your setup.

**yaseen.ladak** · 11-13-2015, 03:45 PM

Thank you GenoMax.

I ran the following command:

parallel -j 3 --colsep '\t' trim_galore --output_dir tg/3 --paired --fastqc -a CTGTCTCTTATA -a2 CTGTCTCTTATA {1} {2} :::: tg.tsv

Is there a way that I can give a name to this command? and how would i check the number of threads being used. I am using a mac and when I check the activity monitor i cannot see 3 i.e. number of threads. Had i named this command it would have shown up in the activity monitor and or using top? Indeed this command is running as it is generating output. The tg.tsv has 4 files names with their complete paths i.e. 2 paired end samples.

Can someone please advise? thanks

**yaseen.ladak** · 11-13-2015, 03:52 PM

In my activity monitor when i run the above command its shows gzip command with 1 thread as i think it uncompressed the fastq.gz file initially? Whats the best to see the number of threads run by this command on a mac or a linux. I have a mac.
Can I name this command to something so i can quickly find it in top etc?

**GenoMax** · 11-13-2015, 05:54 PM

When you run "top" processes that are actively consuming CPU cycles should show up at the top of the list. You can also use the activity monitor: https://support.apple.com/en-us/HT201464

Your file contains the two file name pairs separated by a tab on each line, right?

**yaseen.ladak** · 11-14-2015, 03:11 AM

When I do a top or a activity monitor it shows gzip most of the time and the number of threads as 1 even when i run with 3 threads.

Yes tsv in a text editor, with two reads each separated by a tab. I did not put two samples as my mac gives me out of space error so for testing I can only use 1 sample with paired end read. I then labelled the extension of the file with .tsv The below is the section how my tab separated file looks like, the gap between the two files is a tab.

/Users/Yaseen/S1R1.fastq.gz /Users/Yaseen/S1R2.fastq.gz

Topics	Statistics	Last Post
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 19 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 17 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 60 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM

Seqanswers Leaderboard Ad

Announcement

cutadapt trimming for multiple files

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News