  • Maq - sol2sanger problem - different sizes for the pair?

    Hi, All

    I just used "maq sol2sanger" to convert Illumina's _sequence.txt files to .fastq format. This is a paired-end run, so I have the following two txt files:

    s_1_1_sequence.txt ; size 4116883072
    s_1_2_sequence.txt ; size 4116883072

    After sol2sanger conversion, the fastq files don't have the same size:

    s_1_1_sequence.fastq; size 3644668984
    s_1_2_sequence.fastq; size 3644660878

    This is weird. The two files should come out the same size, right? Besides, in all the other lanes, this conversion produced files of the same size for each pair.

    Can anyone help me answer this question?

    Thanks very much!

    -Cliff

  • #2
    That does look like something has gone wrong.

    Also, assuming you are using FASTQ files from Illumina pipeline 1.3+, then don't use sol2sanger, use ill2sanger (requires a patch to MAQ - search the forum).

    Or BioPerl, or EMBOSS, or an ad-hoc Perl script, or ... there are lots of examples on the forum. My biased suggestion would be to use Biopython, http://news.open-bio.org/news/2009/0...vert-function/

    See also: http://en.wikipedia.org/wiki/FASTQ_format
    Last edited by maubp; 12-07-2009, 11:55 AM. Reason: Typo

    • #3
      Originally posted by cliff View Post
      This is weird. The two files should come out the same size, right? Besides, in all the other lanes, this conversion produced files of the same size for each pair.
      Have you checked the files? sol2sanger doesn't repeat the sequence header on the "+" line, so

      @seqID
      CGATCGTAGCTAGC
      +seqID
      BBBBBBBBBBBBBB

      becomes

      @seqID
      CGATCGTAGCTAGC
      +
      ##############

      (the scores are completely random in this example ^__^)

      hence you may be missing bytes
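
As a rough illustration of the point above, here is a minimal sketch of how many bytes one record loses when the header is not repeated on the "+" line. The function name `fastq_record_size` is made up for this example, and it assumes Unix newlines and four-line records; the header and the read length of 101 are taken from the records posted later in this thread.

```python
def fastq_record_size(header: str, seq_len: int, repeat_header: bool) -> int:
    """Bytes in one four-line FASTQ record, assuming '\n' line endings."""
    title_line = 1 + len(header) + 1                            # '@' + header + newline
    seq_line = seq_len + 1                                      # bases + newline
    plus_line = 1 + (len(header) if repeat_header else 0) + 1   # '+' [+ header] + newline
    qual_line = seq_len + 1                                     # qualities + newline
    return title_line + seq_line + plus_line + qual_line


header = "BILLIEHOLIDAY:1:1:3:1204#0/1"  # header text without the leading '@'
before = fastq_record_size(header, 101, repeat_header=True)
after = fastq_record_size(header, 101, repeat_header=False)
print(before, after, before - after)
```

The saving per record is just the header length, so both files of a pair should shrink by the same amount as long as their headers have the same lengths and they hold the same number of records.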

      • #4
        I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

        cliff - how about posting the first few records of each file?
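
Beyond eyeballing the first few records, one way to hunt for the discrepancy is to walk both files in step and stop at the first record where they disagree. The sketch below is only an assumption about how such a check could look; `read_records` and `first_mismatch` are hypothetical helper names, and it assumes strict four-line records with read names ending in /1 and /2.

```python
from itertools import zip_longest


def read_records(lines):
    """Yield (header, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        seq = next(it, "")
        next(it, "")  # skip the '+' line
        qual = next(it, "")
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def first_mismatch(lines1, lines2):
    """Return (record_number, reason) for the first disagreement, or None."""
    pairs = zip_longest(read_records(lines1), read_records(lines2))
    for number, (rec1, rec2) in enumerate(pairs, start=1):
        if rec1 is None or rec2 is None:
            return number, "one file has more records than the other"
        # Compare read names with the trailing /1 or /2 removed.
        if rec1[0].rsplit("/", 1)[0] != rec2[0].rsplit("/", 1)[0]:
            return number, "read names differ"
        if len(rec1[1]) != len(rec1[2]) or len(rec2[1]) != len(rec2[2]):
            return number, "sequence and quality lengths differ"
        if len(rec1[0]) != len(rec2[0]) or len(rec1[1]) != len(rec2[1]):
            return number, "records differ in length"
    return None


# Possible usage on the files from this thread:
# with open("s_1_1_sequence.fastq") as f1, open("s_1_2_sequence.fastq") as f2:
#     print(first_mismatch(f1, f2))
```

Because it reads both files lazily, this should cope with multi-gigabyte inputs, although a pure Python loop over that many records will take a while.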

        • #5
          Originally posted by maubp View Post
          I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

          cliff - how about posting the first few records of each file?
          You're right! On a second read I realize the issue here is not "the sizes differ before and after conversion" but "the paired reads differ in size after conversion"... Whoops!

          d

          • #6
            Thanks for all your replies. Here are the fastq files:

            1: $ more s_1_1_sequence.fastq

            @BILLIEHOLIDAY:1:1:3:1204#0/1
            GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
            +
            BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
            @BILLIEHOLIDAY:1:1:3:277#0/1
            TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
            +
            BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            2: $ more s_1_2_sequence.fastq
            @BILLIEHOLIDAY:1:1:3:1204#0/2
            TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
            +
            BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
            @BILLIEHOLIDAY:1:1:3:277#0/2
            GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
            +
            A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>

            • #7
              Or using the [ code ] tags, since otherwise the forum mangles them:

              1: $ more s_1_1_sequence.fastq

              Code:
              @BILLIEHOLIDAY:1:1:3:1204#0/1
              GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
              +
              BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
              @BILLIEHOLIDAY:1:1:3:277#0/1
              TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
              +
              BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
              2: $ more s_1_2_sequence.fastq
              Code:
              @BILLIEHOLIDAY:1:1:3:1204#0/2
              TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
              +
              BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
              @BILLIEHOLIDAY:1:1:3:277#0/2
              GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
              +
              A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>
              At first glance, I see nothing amiss with the FASTQ representation. Interestingly the read quality of the forward reads trails off much more quickly than the reverse reads.

              • #8
                Thanks, maubp!

                We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other fastq files have the same size for both reads of a pair. Can I just try ill2sanger on the paired reads which differ in .fastq size?

                Thanks~

                • #9
                  Originally posted by cliff View Post
                  Thanks, maubp!

                  We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other fastq files have the same size for both reads of a pair. Can I just try ill2sanger on the paired reads which differ in .fastq size?

                  Thanks~
                  This probably won't make any difference to the file size oddity. The difference between sol2sanger and ill2sanger is how they map the quality scores.

                  If your data is from Illumina 1.3 or later, use ill2sanger.

                  If your data is from Solexa 1.0 up to Illumina 1.2, use sol2sanger.
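
To make the difference in quality mapping concrete, here is a minimal sketch of the two conversions. It is not MAQ's code: the function names are invented for this example, and the Solexa branch uses the standard Solexa-to-Phred formula with simple rounding, which may differ from MAQ's exact rounding at the margins.

```python
import math


def illumina13_to_sanger(qual: str) -> str:
    """Illumina 1.3+ stores Phred scores at offset 64; Sanger uses offset 33."""
    return "".join(chr(ord(c) - 64 + 33) for c in qual)


def solexa_to_sanger(qual: str) -> str:
    """Solexa/Illumina <1.3 stores Solexa scores at offset 64; convert them to Phred."""
    out = []
    for c in qual:
        q_solexa = ord(c) - 64
        q_phred = 10 * math.log10(10 ** (q_solexa / 10) + 1)
        out.append(chr(int(q_phred + 0.5) + 33))
    return "".join(out)


print(illumina13_to_sanger("Bh"))  # same input string, Illumina 1.3+ interpretation
print(solexa_to_sanger("Bh"))      # same input string, Solexa interpretation
```

The two mappings agree for high scores and diverge for low ones: an input "B" becomes "#" under the Illumina 1.3+ offset but "%" under the Solexa formula. Either way each quality character maps to exactly one output character, which is why the choice of tool should not change the file size.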

                  • #10
                    maubp, thanks. I just downloaded ill2sanger from here http://sourceforge.net/tracker/?func...15&atid=938895

                    Do you know how to install and use this maq-ill2sanger.patch?

                    I'm sorry, I don't have a CS background.
                    Last edited by cliff; 12-11-2009, 11:55 AM.

                    • #11
                      Originally posted by cliff View Post
                      maubp, thanks. I just downloaded ill2sanger from here http://sourceforge.net/tracker/?func...15&atid=938895

                      Do you know how to install and use this maq-ill2sanger.patch?

                      I'm sorry, I don't have a CS background.
                      There was a discussion on this here:
                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                      Basically (and this isn't going to be detailed enough), grab the MAQ source code, use the patch command to make this change, compile MAQ, install MAQ. If you didn't install MAQ in the first place, this might be tricky.

                      --

                      Alternatively, there are non-MAQ options for converting the FASTQ files.

                      If you like Perl, there are plenty of scripts to do this in Perl (some using BioPerl) - search the forum.

                      You could also use the seqret tool from EMBOSS 6.1.0 patch 1 or later.

                      Other options include installing Biopython 1.52 or later, and using a tiny Python script like http://www.biopython.org/wiki/Reading_from_unix_pipes or like this:
                      Code:
                      from Bio import SeqIO

                      # Convert Illumina 1.3+ FASTQ (Phred+64) to Sanger FASTQ (Phred+33)
                      count = SeqIO.convert("s_1_1_sequence.txt", "fastq-illumina", "s_1_1_sequence.fastq", "fastq-sanger")
                      print("Converted %i forward reads" % count)
                      count = SeqIO.convert("s_1_2_sequence.txt", "fastq-illumina", "s_1_2_sequence.fastq", "fastq-sanger")
                      print("Converted %i reverse reads" % count)
                      Last edited by maubp; 12-09-2009, 07:13 AM. Reason: Clarity; adding link

                      • #12
                        I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
                        "#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
                        #1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
                        #2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
                        #3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>)
                        at fastq2bfq.c:137
                        #4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
                        #5 0x00000000004019b9 in _start ()"

                        Any suggestions in solving the problem(s) would be greatly appreciated.

                        Thanks,
                        Harold

                        • #13
                          Originally posted by HESmith View Post
                          I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
                          "#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
                          #1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
                          #2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
                          #3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>)
                          at fastq2bfq.c:137
                          #4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
                          #5 0x00000000004019b9 in _start ()"

                          Any suggestions in solving the problem(s) would be greatly appreciated.

                          Thanks,
                          Harold
                          Interesting... can you tell me your system configuration (hardware/software)? Also, can you test whether sol2sanger works? ill2sanger is nothing but a different version of sol2sanger, so a segfault should be raised in that case too.

                          • #14
                            As dawe suggested, retry sol2sanger on your newly compiled MAQ to see if that crashes.

                            It would also be worth re-downloading the FASTQ files (from your service provider, collaborator, or wherever you got them from) just in case there was a corruption on transfer. That could explain the file size oddity. It's a long shot though.
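
If the provider publishes checksums for the files (an assumption, not every service does), comparing them is a quicker test for transfer corruption than downloading everything again. A minimal sketch using only the Python standard library, with `file_md5` as a hypothetical helper name:

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Possible usage on the original files from this thread:
# print(file_md5("s_1_1_sequence.txt"))
# print(file_md5("s_1_2_sequence.txt"))
```

A digest that differs from the provider's value, or from the digest of a fresh download, would point to a damaged copy.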

                            • #15
                              Hi, maubp

                              I have tried ill2sanger, but still got the same problem.

                              The original txt files for Read 1 and Read 2 of the same lane are the same size, as shown below:

                              4116883072 read1.txt
                              4116883072 read2.txt

                              But, after ill2sanger, the two reads have different sizes:

                              3644668984 read1.fastq
                              3644660878 read2.fastq

                              This problem is exactly the same as what I saw after sol2sanger. And all the other lanes are fine except this one.

                              Do you have thoughts on this?

                              Thanks
