  • remove suffix from fastq sequence ID

    Dear all,

    I have paired-end Illumina sequences in two large (20 GiB) FASTQ files, one containing the forward reads and the other the reverse reads. Each file contains sequence IDs with either a /1 or /2 suffix. I would like to remove these suffixes from all reads (for some downstream analysis) and output two FASTQ files.

    i.e.

    change

    @HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
    NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
    +HWI-ST182_0249:5:1101:1093:2017#GTATGACG/1
    BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

    to

    @HWI-ST182_0249:5:1101:1093:2017#GTATGACG
    NCAGCTGCAGGGAGTTAATTCACAGCAGTTGAGAGCCCTTGCTGTACCAACAAAGGGATGTGTGATCTCCCGGTCCCTCTGCCCCCTCCCCTCCCAGCCGC
    +HWI-ST182_0249:5:1101:1093:2017#GTATGACG
    BS\cacccegggehgghhhhh_ghhhhhhhhhhhhghhhhhhhhgghhhhhhhhhhhbghghghhhgeggedd`bb^bbbbbbaaaaaa_abaaabbaaaa

    I am new to bioinformatics and would appreciate a few pointers on the best way to get this done.
    Thanks a million
    Alex

  • #2
    Dear Alex,
    You can use a Perl script: read the file line by line, split each line that starts with @HWI or +HWI on "/", and print only the first part. Use an else branch to print the remaining sequence and quality lines unchanged.
    Or you can use Unix 'awk': set FS in the BEGIN block, then print $1 if the line starts with the sequence ID prefix @HWI or +HWI.
    Best wishes,
    Rahul
    Rahul Sharma,
    Ph.D
    Frankfurt am Main, Germany
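
The awk suggestion above could look roughly like the following. This is only a sketch: the function name `strip_suffix_awk` is made up for illustration, and it assumes every read ID starts with @HWI or +HWI and contains no "/" other than the one in the /1 or /2 suffix.

```shell
# Sketch of the awk approach: split header lines on "/" and keep field 1.
# Assumption: IDs start with @HWI or +HWI and hold no other "/" character.
strip_suffix_awk() {
    awk 'BEGIN { FS = "/" }
         /^[@+]HWI/ { print $1; next }   # header lines: keep text before the first "/"
         { print }                        # sequence and quality lines pass through unchanged
    ' "$1"
}
```

It would be called as `strip_suffix_awk reads_1.fq > reads_1.nosuffix.fq`, where both file names are placeholders.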

    • #3
      Hi Alex, while Perl scripting is a good option, there may be easier options if you are new to bioinformatics, for example the FASTX-Toolkit.

      • #4
        Hi Rahul,

        Thank you very much for your suggestions. As I mentioned, I am new to bioinformatics and am just trying to teach myself some Perl (and have never used awk). Would you mind providing a little more detail on the Perl code you would use? No worries if not.

        Cheers
        Alex

        • #5
          awk is good but sed might be faster and easier to learn.

          Code:
          sed -i.bak -e '/^[@+]HWI/ s/\/[12]$//' <yourFileName>
          This sed script will look for lines starting with @HWI or +HWI, strip off either a /1 or /2 from the ends of those lines and save the result to the same file name as the original. The original file will be saved as <yourFileName>.bak.
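
If you would rather write to new files than edit in place, the same substitution can be run as a stream. A minimal sketch, assuming a POSIX shell; the function names `strip_suffix_sed` and `strip_pair` and the `<prefix>_1.fq` / `<prefix>_2.fq` naming are placeholders, not something from this thread:

```shell
# Same substitution as the one-liner above, reading stdin and writing stdout.
strip_suffix_sed() {
    sed -e '/^[@+]HWI/ s/\/[12]$//'
}

# Apply it to both mates of a pair.
# Assumption: the inputs are named <prefix>_1.fq and <prefix>_2.fq.
strip_pair() {
    prefix=$1
    for mate in 1 2; do
        strip_suffix_sed < "${prefix}_${mate}.fq" > "${prefix}_${mate}.nosuffix.fq" || return 1
    done
}
```

Calling `strip_pair reads` would then leave reads_1.fq and reads_2.fq untouched and write reads_1.nosuffix.fq and reads_2.nosuffix.fq next to them.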

          • #6
            Thanks very much for the info.

            All the best
            Alex

            • #7
              Hi Alex,

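              Here is the Perl code:
              Code:
                use strict;
                use warnings;

                my ($file_in, $file_out) = @ARGV;
                die "Usage: perl $0 Input_file.fq Out_file.fq\n" unless defined $file_out;

                # Three-argument open with lexical filehandles
                open my $in,  '<', $file_in  or die "Cannot read $file_in: $!";
                open my $out, '>', $file_out or die "Cannot write $file_out: $!";

                my $num = 0;    # number of header lines (@HWI and +HWI) rewritten
                while (my $line = <$in>) {
                    chomp $line;
                    if ($line =~ /^[\@+]HWI/) {
                        # Header line: keep only the part before the first "/"
                        $num++;
                        my @parts = split /\//, $line;
                        print {$out} "$parts[0]\n";
                    }
                    else {
                        # Sequence and quality lines are copied unchanged
                        print {$out} "$line\n";
                    }
                }
                close $in;
                close $out or die "Cannot close $file_out: $!";

                # Each read has two header lines, so halve the count
                my $pr = $num / 2;
                print "\nProcessed reads: $pr\n";

              Usage: perl program_name.pl Input_file.fq Out_file.fq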
              Last edited by rahularjun86; 03-13-2012, 07:04 AM.
              Rahul Sharma,
              Ph.D
              Frankfurt am Main, Germany
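
A variation on the same idea is to key on the position of each line within its record rather than on the @HWI / +HWI prefix, which avoids the unlikely case of a quality line that happens to start with one of those prefixes. The sketch below assumes strictly four lines per read (no wrapped sequence lines); the function name `strip_suffix_by_position` is hypothetical. Like the Perl script, it reports the number of reads processed.

```shell
# Strip a trailing /1 or /2 from lines 1 and 3 of every 4-line record.
# Assumption: each read occupies exactly four lines.
# Usage: strip_suffix_by_position in.fq out.fq
strip_suffix_by_position() {
    awk '
        NR % 4 == 1 || NR % 4 == 3 { sub(/\/[12]$/, "") }   # ID lines only
        { print }                                            # every line is written out
        END { printf("Processed reads: %d\n", NR / 4) > "/dev/stderr" }
    ' "$1" > "$2"
}
```

The read count goes to standard error, so it does not end up in the output file.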

              • #8
                Dear all, thanks for all the really useful suggestions. What a great community this is. I hope I can contribute sometime in the future when I have a little more experience.

                [ehlin] I thought of using FASTX-Toolkit but couldn't see the appropriate tool. I looked at

                $ fastx_renamer -h
                usage: fastx_renamer [-n TYPE] [-h] [-z] [-v] [-i INFILE] [-o OUTFILE]
                Part of FASTX Toolkit 0.0.10 by A. Gordon

                [-n TYPE] = rename type:
                SEQ - use the nucleotides sequence as the name.
                COUNT - use simply counter as the name.

                but it looks like the renaming is restricted to either a sequence or counter.

                The sed approach seemed to do the trick, and I will look at the Perl solution in an attempt to educate myself.
                Cheers again
                Alex
