  • lisann_5
    Junior Member
    • Jan 2012
    • 6

    manipulate sequences in Fastq files

    Dear All,

    I have 20x Illumina sequence data in large FASTQ files. The reads in each file are 21 nucleotides long. I would like to remove the first 4 nucleotides from every read in the files.

    i.e.

    @D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
    CATGATTTGATATTTAGGGCTT
    +
    HIFHIEGHIIFHGIIGHIIIDH
    @D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
    CATGATGACATAGAAATAATTT
    +
    IIFIIIIIIIIIIIFIFIIIFI
    @D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
    CATGAAGACAAAGCCTCTATGA

    to

    @D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:
    ATTTGATATTTAGGGCTT
    +
    HIFHIEGHIIFHGIIGHIIIDH
    @D5N3XBQ1:129:C0T9LACXX:1:1101:1150:2163 1:N:0:
    ATGACATAGAAATAATTT
    +
    IIFIIIIIIIIIIIFIFIIIFI
    @D5N3XBQ1:129:C0T9LACXX:1:1101:1155:2248 1:N:0:
    AAGACAAAGCCTCTATGA

    I am new to bioinformatics and would appreciate a few pointers on the best way to get this done with the command line in Linux. Thanks, Lisanne
  • TiborNagy
    Senior Member
    • Mar 2010
    • 329

    #2
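    Trim both the sequence line and the quality line (lines 2 and 4 of each 4-line record) so that they stay the same length:

    awk '{if(NR%2==0){print substr($0,5)}else{print}}' file.fastq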
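For anyone who prefers Python, the line-based idea behind the awk one-liner can be sketched with the standard library alone. This is only a rough sketch: the function name `trim_fastq_lines` and the parameter `n` are made up for illustration, and it assumes every record is exactly 4 lines (no wrapped sequence or quality lines). It trims the quality line along with the sequence line, since the two must be the same length in a valid FASTQ record.

```python
def trim_fastq_lines(lines, n=4):
    """Yield FASTQ lines with the first n characters removed from the
    sequence and quality lines (lines 2 and 4 of each 4-line record).

    Hypothetical helper; assumes strict 4-line records.
    """
    for i, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        # Even-numbered lines are the sequence (2) and quality (4) lines
        if i % 2 == 0:
            yield line[n:]
        else:
            yield line


# Small demo using the first record from the question
record = [
    "@D5N3XBQ1:129:C0T9LACXX:1:1101:1227:2122 1:N:0:",
    "CATGATTTGATATTTAGGGCTT",
    "+",
    "HIFHIEGHIIFHGIIGHIIIDH",
]
trimmed = list(trim_fastq_lines(record))
print("\n".join(trimmed))
```

For a real file you could pass an open file handle instead of the list, e.g. `trim_fastq_lines(open("file.fastq"))`, and write each yielded line followed by a newline to the output file.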

    • maasha
      Senior Member
      • Apr 2009
      • 153

      #3
      Using Biopieces:

      Code:
      read_fastq -i in.fq | extract_seq -b 4 | write_fastq -o out.fq -x

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        You should probably learn to program - e.g. Perl, Python, Ruby - whatever your local gurus use would be sensible as you'd have someone nearby to help.

        Here's a high-level Biopython solution:

        Code:
        from Bio import SeqIO
        records = (rec[4:] for rec in SeqIO.parse("input.fastq", "fastq"))
        count = SeqIO.write(records, "output.fastq", "fastq")
        print("Trimmed %i FASTQ records" % count)
        That uses lots of objects and would be a bit slow on large files, but it is quite simple and could be used on many other supported file formats. See http://news.open-bio.org/news/2009/0...on-fast-fastq/ which would suggest something like this using Python strings (much faster but FASTQ specific):

        Code:
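        from Bio.SeqIO.QualityIO import FastqGeneralIterator
        # Trim the first 4 bases and their quality scores from every read;
        # the with-block closes both files even if an error occurs
        with open("input.fastq") as in_handle, open("output.fastq", "w") as out_handle:
            for title, seq, qual in FastqGeneralIterator(in_handle):
                out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq[4:], qual[4:]))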
        Similarly if you want to learn Perl or Ruby or Java, there are FASTQ modules in BioPerl, BioRuby and BioJava. See http://dx.doi.org/10.1093/nar/gkp1137
        Last edited by maubp; 10-25-2012, 02:30 AM. Reason: typo

        • lisann_5
          Junior Member
          • Jan 2012
          • 6

          #5
          Thanks!

          Thank you all for the replies. I solved the problem using maasha's solution!
