**rahularjun86** · 03-13-2012, 04:12 AM

Dear Alex,
You can use a perl script: read the file line by line, split each line that starts with @HWI or +HWI on "/", and print only the first part. Use an else branch to print the remaining sequence and quality lines unchanged.
Or you can use unix 'awk': set FS to "/" in the BEGIN block, then print $1 if the line starts with the sequence ID prefix @HWI or +HWI, and print every other line as is.
Best wishes,
Rahul
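
For readers who have not used awk, the approach described above might look roughly like the sketch below. The helper name `strip_suffix_awk` is made up for illustration. The sketch assumes every ID line starts with @HWI or +HWI and that the only "/" on those lines is the one before the 1 or 2.

```shell
#!/bin/sh
# strip_suffix_awk is a hypothetical helper name, not an existing tool.
# It reads FASTQ from the named files (or stdin) and writes to stdout.
strip_suffix_awk() {
    awk 'BEGIN { FS = "/" }
         # ID lines: keep only the text before the first "/"
         /^[@+]HWI/ { print $1; next }
         # sequence and quality lines: print unchanged
         { print }
    ' "$@"
}

# Demonstration on one record, shortened from the example in this thread.
printf '%s\n' \
    '@HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1' \
    'NCAGCTGCAG' \
    '+HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1' \
    'BS\caccceg' | strip_suffix_awk
```

Lines that do not start with @HWI or +HWI are printed whole, so a "/" inside a quality string is left alone.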

**ehlin** · 03-13-2012, 05:56 AM

Originally posted by alexd106:

Dear all,

I have paired end illumina sequences in two large (20GiB) fastq files, one containing the forward reads, the other the reverse reads. Each file contains sequence IDs with either a /1 or /2 suffix. I would like to remove these suffixes (for some downstream analysis) from all reads and output 2 fastq files.

i.e.

change

@HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
+HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

to

@HWI-ST182_0249:5:1101:1093:2017#GTATGACG
NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
+HWI-ST182_0249:5:1101:1093:2017#GTATGACG
BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

I am new to bioinformatics and would appreciate a few pointers on the best way to get this done.
Thanks a million
Alex

Hi Alex, while perl scripting is a good option, if you are new to bioinformatics there might be easier options for you, for example the FASTX-Toolkit:

http://hannonlab.cshl.edu/fastx_toolkit/

**alexd106** · 03-13-2012, 06:00 AM

Hi Rahul,

Thank you very much for your suggestions. As I mentioned, I am new to bioinformatics and am just trying to teach myself some perl (and have never used awk). Would you mind providing a little more detail on the perl code you would use? No worries if not.

Cheers
Alex

**kmcarr** · 03-13-2012, 06:18 AM

awk is good but sed might be faster and easier to learn.

Code:

sed -i.bak -e '/^[@+]HWI/ s/\/[12]$//' <yourFileName>

This sed script will look for lines starting with @HWI or +HWI, strip off either a /1 or /2 from the ends of those lines and save the result to the same file name as the original. The original file will be saved as <yourFileName>.bak.
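
If you would rather try the expression before editing a 20GiB file in place, a streaming variant could look like the sketch below. It uses the same sed expression without -i, so the result goes to stdout or to a new file. The helper name `strip_suffix_sed` and the file names `forward.fastq` and `reverse.fastq` are placeholders, not anything from the original post.

```shell
#!/bin/sh
# strip_suffix_sed is a made-up helper name, not part of sed or any toolkit.
# It applies the same expression as above but writes to stdout.
strip_suffix_sed() {
    sed -e '/^[@+]HWI/ s/\/[12]$//' "$@"
}

# Preview on a single record before touching the real data.
printf '%s\n' \
    '@HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1' \
    'NCAGCTGCAG' \
    '+HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1' \
    'BS\caccceg' | strip_suffix_sed

# Hypothetical file names; each existing input gets a new output next to it.
for f in forward.fastq reverse.fastq; do
    [ -f "$f" ] || continue
    strip_suffix_sed "$f" > "${f%.fastq}.nosuffix.fastq"
done
```

This leaves the originals untouched and does not rely on the -i option, which is not part of POSIX sed.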

**alexd106** · 03-13-2012, 06:52 AM

Thanks very much for the info.

All the best
Alex

**rahularjun86** · 03-13-2012, 07:01 AM

Hi Alex,

Here is the perl code:

Code:

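use strict;
use warnings;

my ($file_in, $file_out) = @ARGV;
die "Usage: perl $0 Input_file.fq Out_file.fq\n" unless defined $file_out;

# Three-argument open with lexical filehandles
open my $in,  '<', $file_in  or die "Cannot read $file_in: $!";
open my $out, '>', $file_out or die "Cannot write $file_out: $!";

my $num = 0;
# while() also copes with an empty input file, unlike do { } until eof
while (my $line = <$in>) {
    chomp $line;
    # ID lines (@HWI... and +HWI...): remove only a trailing /1 or /2
    if ($line =~ /^[\@+]HWI/) {
        $num++;
        $line =~ s{/[12]$}{};
    }
    # Sequence and quality lines are printed unchanged
    print {$out} "$line\n";
}
close $in;
close $out or die "Cannot close $file_out: $!";

# Each read has two matching lines (@HWI and +HWI)
my $pr = $num / 2;
print "\nProcessed reads: $pr\n";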

Usage: perl program_name.pl Input_file.fq Out_file.fq
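
The "Processed reads" number the script prints can be cross-checked from the shell. This is only a rough sketch: `count_reads` is a made-up helper name, and it assumes every record is exactly four lines (no wrapped sequence lines).

```shell
#!/bin/sh
# count_reads is a hypothetical helper name.
# Assumes strict 4-line FASTQ records; reads the named files or stdin.
count_reads() {
    awk 'END { print NR / 4 }' "$@"
}

# Two small records piped in as a demonstration.
printf '%s\n' \
    '@HWI-a/1' 'ACGT' '+HWI-a/1' 'hhhh' \
    '@HWI-b/1' 'TTGA' '+HWI-b/1' 'gggg' | count_reads
```

Running it on the input and on the output file should give the same number, matching what the perl script reports.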

**alexd106** · 03-13-2012, 07:34 AM

Dear all, thanks for all the really useful suggestions. What a great community this is. I hope I can contribute sometime in the future when I have a little more experience.

[ehlin] I thought of using FASTX-Toolkit but couldn't see the appropriate tool. I looked at

$ fastx_renamer -h
usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
Part of FASTX Toolkit 0.0.10 by A. Gordon ([email protected])

[-n TYPE] = rename type:
SEQ - use the nucleotides sequence as the name.
COUNT - use simply counter as the name.

but it looks like the renaming is restricted to either a sequence or counter.

The sed command seemed to do the trick, and I will look at the perl solution in an attempt to educate myself.
Cheers again
Alex

remove suffix from fastq sequence ID