  • Conversion from ‘qseq.txt’ to ‘fastq’ format

    Hi
    Can anybody suggest a program or share some code to convert Illumina _qseq.txt files to the FASTQ format used as input for Bowtie?
    Thanks,
    Joseph

  • #2
    Here is some Perl code which does the bare minimum to convert qseq to fastq:

    (name the script qseq2fastq.pl)

    Code:
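    #!/usr/bin/perl
    
    use warnings;
    use strict;
    
    # Read qseq records (one tab-separated line per read) from STDIN or from files named on the command line.
    while (<>) {
    	chomp;
    	# Fields: 0 machine, 1 run, 2 lane, 3 tile, 4 x, 5 y, 6 index, 7 read number, 8 sequence, 9 quality, 10 filter
    	my @parts = split /\t/;
    	# Header line: @machine:lane:tile:x:y#index/read
    	print "@","$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]\n";
    	print "$parts[8]\n";
    	print "+\n";
    	print "$parts[9]\n";
    }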
    To use this code (assuming you are using pipeline 1.3.2), cd to your GERALD directory, with the script saved there, and type the following:

    cat Temp/s_1_1_0???_ub_qseq.txt | perl qseq2fastq.pl > s_1_1.fastq

    This will collect all of the files for lane 1, read 1 in the GERALD/Temp directory and output a single fastq for lane 1, read 1.

    I use the qseq files in the GERALD/Temp directory instead of the Bustard directory because the two sets of files represent positions with no base call or an ambiguous base call differently. The files in the Bustard directory have a '.' in the sequence string, while those in the GERALD/Temp directory have an 'N'. If you want to use the qseq files in the Bustard directory, you will have to add some code to change every '.' in the sequence line to 'N'.
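
    A minimal sketch of that extra step, assuming the same 11-column qseq layout the script above relies on. The sub name qseq_to_fastq and the sample record are hypothetical, made up for illustration; only the sequence field is touched.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Convert one qseq line to a four-line FASTQ record, replacing '.'
# (the no-call symbol in Bustard qseq files) with 'N' in the sequence only.
sub qseq_to_fastq {
    my ($line) = @_;
    chomp $line;
    my @parts = split /\t/, $line;
    (my $seq = $parts[8]) =~ tr/./N/;   # the quality string is left untouched
    my $name = "$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]";
    return "\@$name\n$seq\n+\n$parts[9]\n";
}

# Hypothetical Bustard-style record (made-up values, not real data).
my $sample = join "\t", 'HWI-EAS000', 1, 7, 1, 54, 1892, 0, 1, 'AC.GT', 'abcde', 1;
print qseq_to_fastq($sample);
```

    To run it as a filter on real files, the last two statements could be replaced with: print qseq_to_fastq($_) while <>;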

    The quality scores are still in Illumina's phred64 encoding, so don't forget to either convert them before running Bowtie or pass the --phred64-quals option (--phred64 in Bowtie 2) when running Bowtie.
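
    If you would rather convert than pass an option, a sketch along these lines may work. It assumes the qualities are true phred64 (pipeline 1.3 or later), not the older Solexa-scaled scores, and the sub name phred64_to_phred33 is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Re-encode a phred64 quality string as phred33 by shifting each
# character down by 31 (64 - 33). Dies on characters that cannot be
# valid phred64, which usually means the input uses another encoding.
sub phred64_to_phred33 {
    my ($qual) = @_;
    return join '', map {
        my $code = ord($_) - 31;
        die "quality character '$_' is below the phred64 range\n" if $code < 33;
        chr $code;
    } split //, $qual;
}

# Demonstration on a made-up quality string.
print phred64_to_phred33('hgfB@'), "\n";
```

    Apply it only to the quality line (field 9 of the qseq record, or every fourth line of the FASTQ file), never to the sequence or header lines.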

    • #3
      trying qseq2fastq.pl

      Hi
      Thank you for your help.
      To try your script, I copied 4 files to a dir along with qseq2fastq.pl.
      Here is the content of the dir:
      ls
      qseq2fastq.pl s_7_1_0001_ub_qseq.txt s_7_1_0002_ub_qseq.txt s_7_1_0003_ub_qseq.txt s_7_1_0004_ub_qseq.txt

      I got an error:
      cat s_7_1_0???_ub_qseq.txt | qseq2fastq.pl > s_7_1.fastq
      -bash: qseq2fastq.pl: command not found

      When I added perl to the command, I did not get the error but the output file s_7_1.fastq was empty:

      cat s_7_1_0???_ub_qseq.txt | perl qseq2fastq.pl > s_7_1.fastq

      any suggestions?
      Thanks,
      Joseph

      • #4
        please ignore

        please ignore the above message. I got it to work. Thanks

        • #5
          Need help on this topic

          I need to get this script to work with the qseq.txt files in GERALD from pipeline 1.5. Could someone modify it so that it works? There may be a difference from the qseq.txt files of previous pipelines, because the script throws a lot of error messages like this:

          mymac:fasta programmer$ perl s_1_1_0???_ub_qseq.txt | qseq2fastq.pl > s_1_1.fastq
          -bash: qseq2fastq.pl: command not found
          Number found where operator expected at s_1_1_0001_ub_qseq.txt line 1, near "1 1"
          (Missing operator before 1?)
          Number found where operator expected at s_1_1_0001_ub_qseq.txt line 1, near "1 1"
          (Missing operator before 1?)
          Number found where operator expected at s_1_1_0001_ub_qseq.txt line 1, near "1 0"
          (Missing operator before 0?)
          Number found where operator expected at s_1_1_0001_ub_qseq.txt line 1, near "0 24"
          (Missing operator before 24?)
          Number found where operator expected at s_1_1_0001_ub_qseq.txt line 1, near "240"
          (Missing operator before 0?)
          Number found where operator expected at s_1_1_0001_ub_qseq.txt line 1, near "0 1"


          This will be highly appreciated!!

          • #6
            Your command ran perl on the qseq files themselves rather than on the script. It should be written as:
            Code:
            cat s_1_1_0???_ub_qseq.txt | perl qseq2fastq.pl > s_1_1.fastq
            Xi Wang

            • #7
              one more question

              Xi Wang,

              Thank you for your help, it works now!

              One more question: The fastq file from GERALD has the following structure for each read:

              @HWI-EAS385_0042:7:1:54:1892#0/1
              TAGCTCTTGATCCGGCAAACAAACCAACGCTTGGAGCGGGGGGGGGACGGGAGAGGGGGTCAGCCGGAGGGCGGGGACCA
              +HWI-EAS385_0042:7:1:54:1892#0/1
              `baabaaaaa`][aaaa_a^^aaaa^`a``\[]`UXQVW^_\MFX^[ZMDT\\^WFX_^XXK[TH^YIS\XYONVYQY^[

              (this is: @read name, another line with the read sequence, another line with +read name and another line with quality scores)

              However, this script gives the following structure:

              @HWUSI-EAS594-R:1:3:1453:1350#0/1
              CCCAGTTCCGACGATCGATTTGCACGTCAGAATCGCTACGGACCTCCATCAGGGTTTCCCCTGACTTCGTCCTGACCAGG
              +
              eea^cdfdffgggggggggggeggggdggdffgdbdgddgggg`g^dfbfgdggcfbgfffcb]gffbfcfcefbbBBBB

              (does not give the read name after the +). Will it recognize that the quality scores belong to that read? How can I make the read name appear after the + to make it exactly similar to the GERALD file?

              Thanks again for your expert help!

              Alvaro

              • #8
                Originally posted by Alvaro Hernandez
                How can I make the read name appear after the +?
                You can use the script below (name it qseq2fastq.pl and replace the former one):

                Code:
                #!/usr/bin/perl
                
                use warnings;
                use strict;
                
                while (<>) {
                	chomp;
                	my @parts = split /\t/;
                	# Build the read name once and reuse it on both the '@' and '+' lines.
                	my $name = "$parts[0]:$parts[2]:$parts[3]:$parts[4]:$parts[5]#$parts[6]/$parts[7]";
                	print "\@$name\n";
                	print "$parts[8]\n";
                	print "+$name\n";
                	print "$parts[9]\n";
                }
                Xi Wang

                • #9
                  ***Thanks!!

                  Xi Wang,

                  Your modified script ran perfectly. Thanks again for sharing your expertise and following up on this. You saved my day.

                  AH

                  • #10
                    Originally posted by Alvaro Hernandez
                    (does not give the read name after the +). Will it recognize that the quality scores belong to that read? How can I make the read name appear after the + to make it exactly similar to the GERALD file?
                    Alvaro
                    The official definition of the FASTQ format (see here and here) states that the name following the '+' is optional. Including exactly the same information twice is inefficient. Any software which reads FASTQ files should follow the standard and not require the name after the +.
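
                    As an illustration of a reader that follows this rule, here is a minimal sketch. It assumes unwrapped four-line records, and the sub name parse_fastq and the demo reads are made up for the example. Whatever follows the '+' is simply ignored.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse four-line FASTQ records from a list of lines. Only the leading
# '+' of the separator line is checked, so "+" and "+name" both pass.
# Assumes sequence and quality each occupy exactly one line.
sub parse_fastq {
    my @lines = @_;
    chomp @lines;
    my @records;
    while (@lines >= 4) {
        my ($head, $seq, $plus, $qual) = splice @lines, 0, 4;
        die "expected '\@' header, got: $head\n" unless $head =~ /^\@(.*)/;
        my $name = $1;
        die "expected '+' separator, got: $plus\n" unless $plus =~ /^\+/;
        die "sequence/quality length mismatch for $name\n"
            unless length($seq) == length($qual);
        push @records, { name => $name, seq => $seq, qual => $qual };
    }
    return @records;
}

# Made-up records: one with a bare '+', one with the name repeated.
my @demo = (
    "\@read1/1\n", "ACGT\n", "+\n",        "hhhh\n",
    "\@read2/1\n", "TTGA\n", "+read2/1\n", "gggg\n",
);
print "$_->{name}\t$_->{seq}\t$_->{qual}\n" for parse_fastq(@demo);
```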

                    • #11
                      Thanks

                      Thanks kmcarr, that is good to know. I usually just deal with the standard pipeline, but this time GERALD could not finish the alignment and I had to merge these files, so I am learning all these things.

                      And the original script was yours so I appreciate you sharing it with everybody and Xi Wang for showing me how to use it.

                      • #12
                        A quick and easy way is to convert it using CLC Genomics Workbench.

                        • #13
                          qseq vs. illumina-fastq

                          Hi All,

                          I am wondering if qseq format is the same thing as fastq-illumina (differing from fastq-sanger only in the calculation of the quality score) or if qseq and fastq-illumina are different formats.

                          Thanks!

                          • #14
                            They are the same format, the qseq files are simply tile-divided sequence files. <- Not correct! See kmcarr's response below.

                            Following in the footsteps of perl - there is always more than one way to do it - did you (Alvaro) try configuring GERALD to just output the sequences instead of aligning them?
                            It can be done in the config file for the GERALD script by simply telling GERALD to output the sequence from the given lanes, e.g.:
                            123:ANALYSIS sequence

                            genbio64, only if you have the money for CLC
                            Last edited by Thomas Doktor; 03-25-2010, 04:23 AM.

                            • #15
                              Most aligners come with scripts to perform those conversions. BFAST has one (scripts/ill2fastq.pl).
                              -drd
