  • carolW
    Senior Member
    • Apr 2013
    • 103

    fastq2bam

    Hi,
    Which tool is better for converting FASTQ to BAM: Picard, samtools, or any other that you may suggest? It seems that Picard has different converters depending on which technology generated the FASTQ. Does it matter if I apply a converter that does not match the technology, for example fastq-solexa when the FASTQ was not generated by Solexa?

    Cheers,

    Carol
  • maubp
    Peter (Biopython etc)
    • Jul 2009
    • 1544

    #2
    Generally FASTQ to BAM means aligning reads to a reference.

    And yes, it is important that the FASTQ encoding is correctly set for this. Using the old (and no longer used) Solexa/Illumina FASTQ encoding rather than the (now standard) Sanger FASTQ encoding would result in wrong read quality scores in the BAM file.
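
A minimal sketch of how the encoding could be checked before conversion. It is a heuristic, not what Picard or samtools do internally: Sanger-style FASTQ stores qualities as ASCII with offset 33, while the old Solexa/Illumina encodings use offset 64, so the range of characters in the quality lines usually gives the encoding away. The function names and the thresholds (';' as the lowest offset-64 character, 'J' as the usual top of modern offset-33 data) are assumptions made for illustration.

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) of FASTQ quality strings.

    Heuristic only: returns None when the characters seen fit both encodings.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    if min(codes) < 59:
        # Characters below ';' cannot occur in the offset-64 encodings.
        return 33
    if max(codes) > 74:
        # Above 'J' (Phred 41 at offset 33) is unusual for offset-33
        # short-read data, so offset 64 is the likelier explanation.
        return 64
    return None


def phred64_to_phred33(quality):
    """Re-encode an Illumina 1.3+ (Phred+64) quality string as Sanger (Phred+33).

    Not valid for the original Solexa scores, which use a different scale.
    """
    return "".join(chr(ord(ch) - 31) for ch in quality)


if __name__ == "__main__":
    print(guess_phred_offset(["IIII#5"]))
    print(guess_phred_offset(["hhhhbb"]))
    print(phred64_to_phred33("hhhh"))
```

If the guess comes back as None, it is safer to look up which instrument and pipeline version produced the file than to pick an encoding blindly.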


    • GenoMax
      Senior Member
      • Feb 2008
      • 7142

      #3
      @Peter: I think @carolW is referring to FastqToSam from Picard tools which stores reads in unaligned BAM format.


      • maubp
        Peter (Biopython etc)
        • Jul 2009
        • 1544

        #4
        Good point. But yes, if you do mean storing unaligned reads from FASTQ files as SAM/BAM files, the same applies to checking the quality score encoding.


        • carolW
          Senior Member
          • Apr 2013
          • 103

          #5
          As a matter of fact, I want to convert BAM to FASTQ, as FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool that converts in the reverse direction, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

          and what would be the best tool? picard or any other tool?


          • GenoMax
            Senior Member
            • Feb 2008
            • 7142

            #6
            Use BamHash to compare the data: https://github.com/DecodeGenetics/BamHash

            Raw file sizes are not a good indicator.
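
To illustrate why comparing content beats comparing sizes, here is a small sketch of an order-independent fingerprint over FASTQ records. It is written in the spirit of a read-level hash comparison but is not BamHash's actual implementation; the function names, the use of MD5, and the 64-bit sum are all assumptions chosen for illustration.

```python
import hashlib


def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines.

    Assumes plain 4-line records; a truncated final record is ignored.
    """
    rows = (line.rstrip("\n") for line in lines)
    # Zipping the same iterator four times groups the lines into records.
    for header, seq, _plus, qual in zip(rows, rows, rows, rows):
        yield header[1:].split()[0], seq, qual


def fingerprint(records):
    """Return (read_count, checksum) that does not depend on read order."""
    total = 0
    count = 0
    for name, seq, qual in records:
        digest = hashlib.md5(f"{name}\t{seq}\t{qual}".encode()).digest()
        # Summing per-read hashes modulo 2**64 makes the result order-independent.
        total = (total + int.from_bytes(digest[:8], "big")) % 2**64
        count += 1
    return count, total


if __name__ == "__main__":
    a = ["@r1\n", "ACGT\n", "+\n", "IIII\n", "@r2\n", "GGCC\n", "+\n", "FFFF\n"]
    b = a[4:] + a[:4]  # same reads, different order
    print(fingerprint(read_fastq(a)) == fingerprint(read_fastq(b)))
```

Two files with the same reads give the same fingerprint even if compression level, read order, or container format differ, which is exactly the case where file sizes mislead.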


            • Brian Bushnell
              Super Moderator
              • Jan 2014
              • 2709

              #7
              Originally posted by carolW:
              As a matter of fact, I want to convert BAM to FASTQ, as FASTQ takes less space, and yes, the BAMs are unaligned. In parallel, I wanted a tool that converts in the reverse direction, to find out whether the FASTQ files contain all the necessary information from the original BAM files. Would it be enough to compare the size of the BAM converted from FASTQ with the size of the original BAM to determine whether the FASTQ is equivalent to the original BAM?

              and what would be the best tool? picard or any other tool?
              I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space, or use pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

              Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
              And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
              Last edited by Brian Bushnell; 01-28-2016, 07:09 PM.


              • maubp
                Peter (Biopython etc)
                • Jul 2009
                • 1544

                #8
                Brian, see https://github.com/DecodeGenetics/BamHash as mentioned earlier, the authors describe it as a tool to: "Hash BAM and FASTQ files to verify data integrity... The result can be compared to verify that the pair of FASTQ files contain the same read information as the aligned BAM file."


                • carolW
                  Senior Member
                  • Apr 2013
                  • 103

                  #9
                  Originally posted by Brian Bushnell:
                  I would simply gzip-compress to a high level (such as 8, using pigz) if you want to save space, or use pbzip2 for even higher compression. SAM and BAM are poor formats for unaligned reads, as it is much more difficult to determine how the read pairing is organized compared to FASTQ, which is the universal standard for raw sequence data. Storing data in anything other than the universal standards (FASTQ, FASTA, and gzip) gives you a small increase in compression for a huge increase in the probability that you made a very bad choice.

                  Edit - SRA is a great example of why this is a bad idea. It causes problems for everyone who uses it.
                  And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?

                  How do pigz and pbzip2 compare with EBI's CRAM? Does pbzip2 compress even more than CRAM?

                  The original data are BAM. BAM can very well be used for alignment, but I convert to FASTQ for alignment. Moreover, as FASTQ is text, I thought it could be compressed significantly more than BAM.

                  I tried to compress a 40G BAM with pbzip2 using the -9 option and didn't gain anything, as the bz2 file was also 40G at the end. This might be due to the fact that the BAM file is a collection of smaller BAM files in one BAM file.
                  Last edited by carolW; 01-29-2016, 07:16 AM.


                  • Brian Bushnell
                    Super Moderator
                    • Jan 2014
                    • 2709

                    #10
                    Compressing a compressed file will usually not give any benefit; you have to compress the raw data. In fact, compressing a compressed file will often result in slightly larger output.
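
A quick way to see this effect is a sketch using Python's zlib (the same DEFLATE family that gzip and BAM's BGZF blocks use) on synthetic data. The random DNA string, its length, and the seed are assumptions made for the demonstration; real read data will give different ratios.

```python
import random
import zlib


def compression_sizes(n_bases=200_000, seed=0):
    """Return (raw, compressed once, compressed twice) sizes in bytes."""
    rng = random.Random(seed)
    raw = "".join(rng.choice("ACGT") for _ in range(n_bases)).encode("ascii")
    once = zlib.compress(raw, 9)    # compressing the raw text helps a lot
    twice = zlib.compress(once, 9)  # recompressing the output gains nothing useful
    return len(raw), len(once), len(twice)


if __name__ == "__main__":
    raw, once, twice = compression_sizes()
    print(f"raw={raw} once={once} twice={twice}")
```

The first pass shrinks the text substantially, while the second pass leaves the size essentially unchanged, which matches what happens when pbzip2 is run on an existing BAM.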

                    For unaligned reads, bam compression is not much better than gzipped fastq. I don't have any numbers but I would expect gzipped fastq to be a few percent bigger than bam, and bzip2 to be a few percent smaller (on the order of 5-10%, I'd imagine), and cram to be even smaller. For mapped sorted reads, though, bam and cram become substantially more efficient.

                    Incidentally, I wrote a program called "Clumpify" that can rearrange sequence data (fastq, fasta, sam, whatever) files to compress smaller by putting overlapping reads near each other. It's in the BBMap package. If you want to maximally compress the data, and it is not aligned, you can run that prior to putting the files in whatever format you decide on.


                    • dpryan
                      Devon Ryan
                      • Jul 2011
                      • 3478

                      #11
                      Originally posted by Brian Bushnell:
                      And, I don't know of any tool that compares aligned and unaligned files to see if they have the same data. Can you access the original non-BAM data?
                      bamHash can do that. It was originally made to compare fastq and BAM files, but one could just as easily compare multiple BAM files.

                      Edit: I should have scrolled down! Peter already mentioned it!


                      • carolW
                        Senior Member
                        • Apr 2013
                        • 103

                        #12
                        If Picard converts both BAM to FASTQ and FASTQ to BAM, is there any way to get the original BAM back through these two conversions? If so, which parameters should be used, and if not, why not? What would differ between the two BAMs?


                        • dpryan
                          Devon Ryan
                          • Jul 2011
                          • 3478

                          #13
                          It would depend on whether the initial BAM file contained only unaligned reads and nothing else. Conversion to fastq is otherwise a lossy process.
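
The lossiness can be sketched on a single SAM text record: FASTQ has room only for the read name, the bases, and the qualities, so everything else is left behind. This is a generic illustration based on the SAM column layout, not a description of Picard's code; the function name and the "/1", "/2" name suffixes are assumptions.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def sam_record_to_fastq(sam_line):
    """Convert one SAM text line to a FASTQ record, and report what is dropped."""
    fields = sam_line.rstrip("\n").split("\t")
    name, flag, seq, qual = fields[0], int(fields[1]), fields[9], fields[10]
    if flag & 0x10:
        # Reverse-strand alignments store the reverse complement of the read,
        # so restore the orientation the sequencer produced.
        seq = seq.translate(COMPLEMENT)[::-1]
        qual = qual[::-1]
    if flag & 0x40:
        name += "/1"  # first read of a pair
    elif flag & 0x80:
        name += "/2"  # second read of a pair
    dropped = {
        "RNAME": fields[2],
        "POS": fields[3],
        "MAPQ": fields[4],
        "CIGAR": fields[5],
        "tags": fields[11:],
    }
    return f"@{name}\n{seq}\n+\n{qual}\n", dropped


if __name__ == "__main__":
    aligned = "r1\t16\tchr1\t100\t60\t4M\t*\t0\t0\tAACG\t!#%I\tNM:i:0"
    fastq, dropped = sam_record_to_fastq(aligned)
    print(fastq, dropped)
```

For a record with no alignment and no optional tags, the dropped fields carry no information and the round trip can preserve the read data; for an aligned or tagged record, the position, CIGAR, mapping quality, and tags cannot be recovered from the FASTQ alone.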


                          • GenoMax
                            Senior Member
                            • Feb 2008
                            • 7142

                            #14
                            Can you tell us again what exactly you are trying to do?

                            Are you asking if bam_start would be identical to bam_new in this example? (bam_start --> Picard bam2fastq --> Fastq --> Picard fastq2bam --> bam_new)

                            You can use bamhash on the two files and let us know what you find.


                            • carolW
                              Senior Member
                              • Apr 2013
                              • 103

                              #15
                              Yes, I am asking whether bam_start will be the same as bam_new. Does the file size not matter?
