  • cliff
    Member
    • Oct 2009
    • 41

    Maq - sol2sanger problem - different sizes for the pair?

    Hi, All

    I just used "maq sol2sanger" to convert Illumina's _sequence.txt files to .fastq format. The run was paired-end, and I have the following two txt files:

    s_1_1_sequence.txt ; size 4116883072
    s_1_2_sequence.txt ; size 4116883072

    After sol2sanger conversion, the fastq files don't have the same size:

    s_1_1_sequence.fastq; size 3644668984
    s_1_2_sequence.fastq; size 3644660878

    This is weird. They should have come out the same size, right? Besides, in all the other lanes, this conversion produced the same size for both files of the pair.

    Can anyone help me answer this question?

    Thanks very much!

    -Cliff
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    That does look like something has gone wrong.

    Also, assuming you are using FASTQ files from Illumina pipeline 1.3+, then don't use sol2sanger, use ill2sanger (requires a patch to MAQ - search the forum).

    Or use BioPerl, EMBOSS, an ad hoc Perl script, ... there are lots of examples on the forum. My biased suggestion would be to use Biopython, http://news.open-bio.org/news/2009/0...vert-function/

    See also: http://en.wikipedia.org/wiki/FASTQ_format
    Last edited by maubp; 12-07-2009, 11:55 AM. Reason: Typo

    • dawe
      Senior Member
      • Apr 2009
      • 258

      #3
      Originally posted by cliff:
      This is weird. They should have come out the same size, right? Besides, in all the other lanes, this conversion produced the same size for both files of the pair.

      Have you checked the files? The sol2sanger command doesn't repeat the sequence header on the "+" line, so

      @seqID
      CGATCGTAGCTAGC
      +seqID
      BBBBBBBBBBBBBB

      becomes

      @seqID
      CGATCGTAGCTAGC
      +
      ##############

      (the scores are completely random in this example ^__^)

      Hence you may be missing bytes.
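
To put numbers on the point above: dropping the repeated header from the "+" line saves one copy of the read ID per record. A minimal sketch of that arithmetic, assuming four-line records and one-byte Unix newlines (the helper name `fastq_record_size` is made up for illustration):

```python
def fastq_record_size(read_id, seq_len, repeat_header):
    """Bytes used by one four-line FASTQ record, newlines included."""
    header = len("@" + read_id) + 1                     # "@seqID\n"
    sequence = seq_len + 1                              # bases + "\n"
    plus = len("+" + (read_id if repeat_header else "")) + 1
    quality = seq_len + 1                               # one quality char per base
    return header + sequence + plus + quality


# Toy record from the example above: ID "seqID", 14 bases.
before = fastq_record_size("seqID", 14, repeat_header=True)
after = fastq_record_size("seqID", 14, repeat_header=False)
print(before, after, before - after)
```

For the toy record this prints `44 39 5`. The saving equals the length of the ID, so as long as the mates' IDs have the same length it shrinks both files of a pair by the same amount.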

      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        I'd wondered about that too dawe, and while it does explain why the converted files are smaller than the originals, it does not explain why they are different sizes to each other.

        cliff - how about posting the first few records of each file?
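
One way to narrow this down without eyeballing gigabytes of text is to walk both files in step and stop at the first record pair that looks wrong. The sketch below is only an illustration: it assumes plain four-line FASTQ records with no blank lines, mate IDs that differ only in the trailing /1 or /2, and the function names are hypothetical.

```python
from itertools import zip_longest


def read_records(path):
    """Yield (header, sequence, plus, quality) tuples from a four-line FASTQ file."""
    with open(path) as handle:
        while True:
            lines = [handle.readline().rstrip("\n") for _ in range(4)]
            if not lines[0]:          # empty header line: treat as end of file
                return
            yield tuple(lines)


def first_mismatch(path1, path2):
    """Return (index, record1, record2) for the first suspicious pair, or None."""
    pairs = zip_longest(read_records(path1), read_records(path2))
    for index, (rec1, rec2) in enumerate(pairs):
        if rec1 is None or rec2 is None:
            return index, rec1, rec2  # one file ran out of records first
        # Mates should share an ID apart from the trailing /1 or /2.
        if rec1[0].rsplit("/", 1)[0] != rec2[0].rsplit("/", 1)[0]:
            return index, rec1, rec2
        # Within each record, sequence and quality must be the same length.
        for rec in (rec1, rec2):
            if len(rec[1]) != len(rec[3]):
                return index, rec1, rec2
    return None
```

Calling `first_mismatch("s_1_1_sequence.fastq", "s_1_2_sequence.fastq")` would return `None` if every pair passes these checks, or the zero-based index and the two records where they first fail.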

        • dawe
          Senior Member
          • Apr 2009
          • 258

          #5
          You're right! On a second read, I realize the issue here is not "the sizes differ before and after conversion" but "the paired files differ in size after conversion"... Whoops!

          d

          • cliff
            Member
            • Oct 2009
            • 41

            #6
            Thanks for all your replies. Here are the first records of each fastq file:

            1: $ more s_1_1_sequence.fastq

            @BILLIEHOLIDAY:1:1:3:1204#0/1
            GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
            +
            BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
            @BILLIEHOLIDAY:1:1:3:277#0/1
            TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
            +
            BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

            2: $ more s_1_2_sequence.fastq
            @BILLIEHOLIDAY:1:1:3:1204#0/2
            TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
            +
            BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
            @BILLIEHOLIDAY:1:1:3:277#0/2
            GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
            +
            A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>

            • maubp
              Peter (Biopython etc)
              • Jul 2009
              • 1544

              #7
              Or using the [ code ] tags, since otherwise the forum mangles them:

              1: $ more s_1_1_sequence.fastq

              Code:
              @BILLIEHOLIDAY:1:1:3:1204#0/1
              GACCACACCCTGNAGCCCTTTCTGTCCAAACAGAAAGTAAGATATTCCTTGGGCTGGTTGGTCTGAGGACCTGAGGTTGTAGGTGGACACCCTCATGGAGG
              +
              BBBCBCCCCBB.&6;=-:>9>7?.>@=B7B>+?1.2=0;-90?<B<>;><@@3/6<*4*>47584.:597<723>9%%%%%%%%%%%%%%%%%%%%%%%%%
              @BILLIEHOLIDAY:1:1:3:277#0/1
              TTGAGACAAGAGNATCACTTGAACCCAGAAATTCGAGGCCAGCCTGGGCAACAGAGAGAGCCCTCATTTCTACAAAAAATAAAAATATTAGCCAGGCATGG
              +
              BABBA?BB8?<9&40C4BA@:?@BBB:B?A@>8B=@)7B><8@B6:>>=<4=38?8?9;739%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
              2: $ more s_1_2_sequence.fastq
              Code:
              @BILLIEHOLIDAY:1:1:3:1204#0/2
              TCATTACCTACTTTATTGCTCACACATAGCCTGTTTGGTGGTCTCTTCACACGGACGCGTGTGACATTTGGTGCCAAAACCCAGGACAGGAGGAGCNCTTT
              +
              BCBBCAACB@CCCBAC@?B@B?C<CB@CBBBAB=ABB?@A45BCCBBAACCB?BB@BBB6B@B@B@@A@AA@BA@?3?8@?@B<@-@@<6@A%%%%%%%%%
              @BILLIEHOLIDAY:1:1:3:277#0/2
              GATAGGGTTTAGATGTCGTTTAGGCTGGAGTGCAGTGGTACATCACGGCTCACTGCAGCCTCGACCTCCCAGGCTTAAGCAGCCCTCCCACCTCAGCCTCC
              +
              A=@9B?>6??B7ABA=BC>9B@6BC@B>@B4BBBB;BB1B<;BABBAABB<(3@=?@A>@=@A>6A>?>>>>??8=93?5>=):>=;A92;81?26>226>
              At first glance, I see nothing amiss with the FASTQ representation. Interestingly the read quality of the forward reads trails off much more quickly than the reverse reads.
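
The difference in how the qualities trail off can be checked numerically. A minimal sketch, assuming the converted files use the Sanger encoding (ASCII offset 33); the helper names are hypothetical:

```python
def sanger_to_phred(quality):
    """Decode a Sanger FASTQ quality string (ASCII offset 33) to Phred scores."""
    return [ord(char) - 33 for char in quality]


def trailing_run(quality, char="%"):
    """Count how many copies of `char` end the quality string."""
    return len(quality) - len(quality.rstrip(char))


print(sanger_to_phred("BC%"))   # -> [33, 34, 4]
print(trailing_run("AB%%%"))    # -> 3
```

Under this encoding each "%" decodes to Phred 4, so running `trailing_run` on the quality lines above gives the length of the low-quality tail of each read.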

              • cliff
                Member
                • Oct 2009
                • 41

                #8
                Thanks, maubp!

                We use Illumina pipeline 1.5. I am thinking of trying ill2sanger. Do I need to use ill2sanger to convert all my _sequence.txt files to .fastq files? As I said, all the other fastq files have the same size within each pair. Can I just try ill2sanger on the pair whose .fastq files differ in size?

                Thanks

                • maubp
                  Peter (Biopython etc)
                  • Jul 2009
                  • 1544

                  #9
                  This probably won't make any difference to the file size oddity. The difference between sol2sanger and ill2sanger is how they map the quality scores.

                  If your data is from Illumina 1.3 or later, use ill2sanger.

                  If your data is from Solexa 1.0 up to Illumina 1.2, use sol2sanger.
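
The two mappings can be sketched as follows. Both input encodings use an ASCII offset of 64, but Illumina 1.3+ stores Phred scores while the older files store Solexa scores, which relate to Phred by Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). This is an illustration rather than MAQ's actual code: the function names are made up, and rounding to the nearest integer is an assumption.

```python
import math


def illumina13_char_to_phred(char):
    """Illumina 1.3+: the character holds a Phred score at ASCII offset 64."""
    return ord(char) - 64


def solexa_char_to_phred(char):
    """Solexa to Illumina 1.2: the character holds a Solexa score at offset 64."""
    solexa = ord(char) - 64
    return 10 * math.log10(10 ** (solexa / 10) + 1)


def to_sanger_char(phred):
    """Encode a Phred score as a Sanger FASTQ character (ASCII offset 33)."""
    return chr(int(round(phred)) + 33)   # rounding rule is an assumption


for char in "Bh":
    print(char,
          to_sanger_char(illumina13_char_to_phred(char)),
          to_sanger_char(solexa_char_to_phred(char)))
```

With these assumptions a high score such as "h" maps to "I" either way, but a low score such as "B" maps to "#" under the Phred mapping and "%" under the Solexa mapping. So the choice of tool mainly changes low-quality characters, not the number of bytes written, which is consistent with the runs of "%" in the records posted earlier in the thread.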

                  • cliff
                    Member
                    • Oct 2009
                    • 41

                    #10
                    maubp, thanks. I just downloaded the ill2sanger patch from here: http://sourceforge.net/tracker/?func...15&atid=938895

                    Do you know how to install and use this maq-ill2sanger.patch?

                    Sorry, I don't have a CS background.
                    Last edited by cliff; 12-11-2009, 11:55 AM.

                    • maubp
                      Peter (Biopython etc)
                      • Jul 2009
                      • 1544

                      #11
                      There was a discussion on this here:
                      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc


                      Basically (and this isn't going to be detailed enough), grab the MAQ source code, use the patch command to make this change, compile MAQ, install MAQ. If you didn't install MAQ in the first place, this might be tricky.

                      --

                      Alternatively, there are non-MAQ options for converting the FASTQ files.

                      If you like Perl, there are plenty of scripts to do this in Perl (some using BioPerl) - search the forum.

                      You could also use the seqret tool from EMBOSS 6.1.0 patch 1 or later.

                      Other options include installing Biopython 1.52 or later, and using a tiny Python script like http://www.biopython.org/wiki/Reading_from_unix_pipes or like this:
                      Code:
                      # Convert Illumina 1.3+ FASTQ (offset 64) to Sanger FASTQ (offset 33).
                      from Bio import SeqIO
                      # SeqIO.convert returns the number of records written.
                      count = SeqIO.convert("s_1_1_sequence.txt", "fastq-illumina", "s_1_1_sequence.fastq", "fastq-sanger")
                      print("Converted %i forward reads" % count)
                      count = SeqIO.convert("s_1_2_sequence.txt", "fastq-illumina", "s_1_2_sequence.fastq", "fastq-sanger")
                      print("Converted %i reverse reads" % count)
                      Last edited by maubp; 12-09-2009, 07:13 AM. Reason: Clarity; adding link

                      • HESmith
                        Senior Member
                        • Oct 2009
                        • 512

                        #12
                        I'm having a different issue with the ill2sanger patch in updating an existing install of MAQ. Downloaded the patch from sourceforge, ran the patch command (which modified fastq2bfq.c, main.c, and main.h), compiled MAQ with "make" and installed with "make install". Tried to run the ill2sanger command, which exited with a segmentation fault. Ran the command in gdb, which returned "Program received signal SIGSEGV, Segmentation fault. 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6". Backtrace returned the following:
                        "#0 0x000000340fc44c85 in vfprintf () from /lib64/libc.so.6
                        #1 0x000000340fc4faa8 in fprintf () from /lib64/libc.so.6
                        #2 0x0000000000405369 in ill2sanger (fpin=0x63a010, fpout=0x0) at fastq2bfq.c:105
                        #3 0x0000000000405424 in ma_ill2sanger (argc=<value optimized out>, argv=<value optimized out>) at fastq2bfq.c:137
                        #4 0x000000340fc1ea2d in __libc_start_main () from /lib64/libc.so.6
                        #5 0x00000000004019b9 in _start ()"

                        Any suggestions in solving the problem(s) would be greatly appreciated.

                        Thanks,
                        Harold

                        • dawe
                          Senior Member
                          • Apr 2009
                          • 258

                          #13
                          Interesting... can you tell me your system configuration (hardware/software)? Also, can you test whether sol2sanger works? ill2sanger is just a modified version of sol2sanger, so a segfault should occur there too.

                          • maubp
                            Peter (Biopython etc)
                            • Jul 2009
                            • 1544

                            #14
                            As dawe suggested, retry sol2sanger on your newly compiled MAQ to see if that crashes.

                            It would also be worth re-downloading the FASTQ files (from your service provider or collaborator, wherever you got them from) just in case they were corrupted in transfer. That could explain the file size oddity. It's a long shot though.

                            • cliff
                              Member
                              • Oct 2009
                              • 41

                              #15
                              Hi, maubp

                              I have tried ill2sanger, but still got the same problem.

                              The original txt files for Read 1 and Read 2 of the same lane are the same size, as shown below:

                              4116883072 read1.txt
                              4116883072 read2.txt

                              But, after ill2sanger, the two reads have different sizes:

                              3644668984 read1.fastq
                              3644660878 read2.fastq

                              This problem is exactly the same as what I saw after sol2sanger. And all the other lanes are fine except this one.

                              Do you have thoughts on this?

                              Thanks
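
The two converted files differ by 8106 bytes (3644668984 - 3644660878). One way to see where those bytes sit is to total the bytes per line type in each file and compare. A rough sketch, assuming strict four-line records; the function names are hypothetical:

```python
def byte_totals(path):
    """Sum bytes per line type for a four-line-per-record FASTQ file."""
    names = ("header", "sequence", "plus", "quality")
    totals = dict.fromkeys(names, 0)
    totals["lines"] = 0
    with open(path, "rb") as handle:          # binary mode so newline bytes count
        for number, line in enumerate(handle):
            totals[names[number % 4]] += len(line)
            totals["lines"] += 1
    return totals


def compare_totals(path1, path2):
    """Return {key: (value1, value2)} for every total that differs between files."""
    first, second = byte_totals(path1), byte_totals(path2)
    return {key: (first[key], second[key])
            for key in first if first[key] != second[key]}
```

`compare_totals("read1.fastq", "read2.fastq")` returns an empty dict when the files agree on every total. A difference in "lines" would point to missing or extra records, while a difference only in "sequence" or "quality" would point to lines of unexpected length.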
