Seqanswers Leaderboard Ad

**pepperoni** · 10-17-2011, 11:27 AM

Originally posted by maubp View Post

Very misguided given PHRED zero would have been fine for this

Thanks for the information. I'm not sure what off the shelf solution to recommend here - personally I'd write a Python script to filter out these duff reads...

Being a non-programmer I used the tools form scriptome. Here is my pipeline to get rid of those reads:

1Fasta to tab
perl -e ' $count=0; $len=0; while(<>) { s/\r?\n//; s/\t/ /g; if (s/^>//) { if ($. != 1) { print "\n" } s/ |$/\t/; $count++; $_ .= "\t"; } else { s/ //g; $len += length($_) } print $_; } print "\n"; warn "\nConverted $count FASTA records in $. lines to tabular format\nTotal sequence length: $len\n\n"; ' Corrida_16_01RMDSPFR004_2_F3.csfasta > Trepo_2.tab

2Choose lines not containing "."
perl -e ' $string=q{.}; $count=0; while(<>) { if (!/\Q$string\E/) { print $_; $count++ } } warn "\nChose $count lines with string [$string] out of $. total lines.\n"; ' Trepo_2.tab > TrepoNoDot_2.tab

3Choose col 0
perl -e ' @cols=(0); while(<>) { s/\r?\n//; @F=split /\t/, $_; print join("\t", @F[@cols]), "\n" } warn "\nChose columns ", join(", ", @cols), " for $. lines\n\n" ' TrepoNoDot_2.tab > Trepo2ID.lst

4 tab to csfasta
perl -e ' $len=0; while(<>) { s/\r?\n//; @F=split /\t/, $_; print ">$F[0]"; if (length($F[1])) { print " $F[1]" } print "\n"; $s=$F[2]; $len+= length($s); $s=~s/.{60}(?=.)/$&\n/g; print "$s\n"; } warn "\nConverted $. tab-delimited lines to FASTA format\nTotal sequence length: $len\n\n"; ' TrepoNoDot_2.tab > TrepoNoDot_2_F3.csfasta

Qual file

5 Extract the corresponding .qual values from a list of IDs
perl -e ' ($id,$fasta)=@ARGV; open(ID,$id); while (<ID>) { s/\r?\n//; /^>?(\S+)/; $ids{$1}++; } $num_ids = keys %ids; open(F, $fasta); $s_read = $s_wrote = $print_it = 0; while (<F>) { if (/^>(\S+)/) { $s_read++; if ($ids{$1}) { $s_wrote++; $print_it = 1; delete $ids{$1} } else { $print_it = 0 } }; if ($print_it) { print $_ } }; END { warn "Searched $s_read FASTA records.\nFound $s_wrote IDs out of $num_ids in the ID list.\n" } ' Trepo2ID.lst Corrida_16_01RMDSPFR004_2_F3_QV.qual > TrepoNoDot_2_F3_QV.qual

cs · 07-05-2012, 01:42 PM

Originally posted by maubp View Post

How would you supply a potentially very long list of IDs to grep?

using the '-f FILE' option, where FILE contains the list of (regexps matching the) readnames, or, if the readnames are all fixed strings, then also use '-F' switch to grep in addition to '-f FILE'.

cdbfasta/cdbyank is very useful but there seems to be an upper limit on the number of reads that can be indexed, and hence retrieved.

**NicoBxl** · 09-09-2012, 10:48 PM

to extract reads from fastq file based on a id list (one id per line)

Code:

cat in.fastq | grep -f id.txt -A 3 | grep -v "\-\-" > out.fastq

**jrvalverde** · 04-25-2013, 07:53 AM

Originally posted by pmiguel View Post

cdbfasta -Q and cdbyank are the best solutions I have found. Fast!

But also, would not the GNU version of grep work since it allows you to return a number of lines after the line that matches with the -A parameter?

--
Phillip

You can specify multiple patterns with the -f option. Place the IDs in a file and run

grep file.fastq -F -A 3 -f filter.ids | grep -v '^--$' > output.fastq

You need the second grep because if two matches are consecutive they will be separated by a '---' line. Using -F speeds up the search as it specifies that the IDs are fixed strings.

Whether that will be faster or slower than other options is open to debate, but as it does not require file parsing I'd bet it is fastest.

And I concur with Phillip: whenever there is a basic standalone command that can do the job, it is generally best to use it directly instead of making new, more specialized and unneeded programs. Of course, writing programs is fun and helps you learn, but I can think of better ways for using my scarce time.

**syfo** · 04-26-2013, 07:33 AM

Don't forget the "-w" option (otherwise looking for "@HWI-ST314:228:C1NU0ACXX:6:1101:15011:2455" will pull you "@HWI-ST314:228:C1NU0ACXX:6:1101:15011:24553").

Code:

grep -F -w -A3 -f IDs.list all.fastq | grep -v '^--$'

**syfo** · 05-03-2013, 01:54 AM

Grep mystery: no output if input file too long?

Well, the grep command above seems to work only with a list of limited length. With a subset of the list many records are retrieved but with the full list the output is an empty file. Any idea why grep could just fail with no error message?

**syfo** · 05-06-2013, 01:38 AM

I eventually used fqextract.c and it worked even on a large list.

Topics	Statistics	Last Post
The Role of Enhancers in Defining Cell Fate by seqadmin Started by seqadmin, Today, 10:49 AM	0 responses 9 views 0 likes	Last Post by seqadmin Today, 10:49 AM
Expanding the Horizons of Cellular Research with the Single Cell Atlas by seqadmin Started by seqadmin, 04-25-2024, 11:49 AM	0 responses 21 views 0 likes	Last Post by seqadmin 04-25-2024, 11:49 AM
Genetic Variants and Diabetes Risk in Childhood Cancer Survivors by seqadmin Started by seqadmin, 04-24-2024, 08:47 AM	0 responses 20 views 0 likes	Last Post by seqadmin 04-24-2024, 08:47 AM
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 62 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM

Seqanswers Leaderboard Ad

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News