  • Is there a standard file format for FASTQ paired-end reads?

    Hi, just a general question: my experience is that there seems to be no standard way of defining paired-end reads in a FASTQ file.
    For example, Trinity expects that paired-end reads are defined with

    Code:
    @nameofthesequence /1
    or
    Code:
    @nameofthesequence /2
    in the header line of a FASTQ file, and that the pairs are in the same order in two separate files for left and right reads (as I understand it).
    Trinity does add the /1 or /2 during the assembly process, but some preprocessing steps (e.g. in silico read normalization) expect that these tags are already there.
    Some programs expect the pairs to be interleaved in a single file, some don't (leading to scripts like shufflesequences.pl etc and Galaxy tools to interleave and un-interleave things).
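
A minimal Python sketch of the interleaving step mentioned above, assuming plain 4-line FASTQ records (no wrapped sequence lines) and two inputs with reads in the same order. The function names `fastq_records` and `interleave` are made up for this illustration and are not taken from any of the tools named in the thread.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(left: Iterable[str], right: Iterable[str]) -> Iterator[str]:
    """Alternate left and right records: L1, R1, L2, R2, ..."""
    right_records = fastq_records(right)
    for left_record in fastq_records(left):
        right_record = next(right_records, None)
        if right_record is None:
            raise ValueError("right input has fewer records than left input")
        yield from left_record
        yield from right_record
    if next(right_records, None) is not None:
        raise ValueError("right input has more records than left input")
```

With real files this could be called as `interleave(open("left.fastq"), open("right.fastq"))` and the result written out line by line. It pairs reads purely by position and does not check that the names match.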

    I was wondering, is there some standard way of defining paired-end sequences that I'm not aware of? If not, could we, as a community, come up with one? Thoughts?

    As a side issue, I've seen some unsafe code to add the /1 and /2 tags onto the end of FASTQ files; for example, prior to in silico read normalisation as described here:



    Anything that matches on the @ at the start of the FASTQ header line is potentially unsafe, since the @ (and any other characters following it) can also turn up in the quality scores, even at the very start of a quality line, so that a quality line can be indistinguishable from a header line. For example:

    Code:
    sed -i '/^@M00/ s/\ .\+/\/1/g' *_R1.fastq
    is potentially unsafe since it matches @M00 at the start of a line (the @ is standard FASTQ, the M00 is presumably some tag from a MiSeq), and it's possible (given millions/billions of reads) that some quality score lines might start with @M00 too. My alternative approach is just to add /1 (or /2 for right reads) to every fourth line, starting with the first.

    Code:
    sed '1~4 s/$/ \/1/g' your_fastq_file.fastq > your_new_fastq_file.fastq
    (for left reads), or
    Code:
    sed '1~4 s/$/ \/2/g' your_fastq_file.fastq > your_new_fastq_file.fastq
    (for right reads).

    This simply adds ' /1' (i.e. a space, a slash and a 1) to the end of every 4th line, starting with the first line. If your file is in 4-line FASTQ format this should work (it works for me anyway). Note that the 1~4 address is a GNU sed extension. It wouldn't be too hard to modify this to add the tags to interleaved paired-end FASTQ files too. You can use the sed -i option to edit the file in place rather than redirecting to a new file if you want.
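
For anyone who would rather not rely on sed, here is a rough Python equivalent of the position-based approach. It is only a sketch: `tag_headers` is a hypothetical helper name, it assumes 4-line records, and the example record is made up by hand so that its quality line starts with @M00.

```python
import re
from typing import Iterable, Iterator


def tag_headers(lines: Iterable[str], member: int) -> Iterator[str]:
    """Append ' /<member>' to lines 1, 5, 9, ... (the header of each 4-line record)."""
    if member not in (1, 2):
        raise ValueError("member must be 1 or 2")
    for index, line in enumerate(lines):
        if index % 4 == 0:
            # Header line: chosen by position, never by matching on '@'.
            yield line.rstrip("\n") + f" /{member}\n"
        else:
            yield line


# Hand-made record whose quality line happens to start with "@M00".
record = [
    "@M00123:1:000-A:1:1101:100:200 1:N:0:1\n",
    "ACGTACGT\n",
    "+\n",
    "@M00IIII\n",
]

tagged = list(tag_headers(record, 1))

# A pattern-based approach matches both the header and the quality line.
pattern_hits = [line for line in record if re.match(r"^@M00", line)]
```

Here `tagged` differs from `record` only in its first line, while `pattern_hits` contains two lines, which is exactly the ambiguity described above.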

  • #2
    The way paired-end reads are denoted for Illumina sequencing has changed over the years. You can find the evolution here: http://en.wikipedia.org/wiki/FASTQ_format (Illumina sequence identifiers)



    • #3
      Originally posted by GenoMax:
      The way paired-end reads are denoted for Illumina sequencing has changed over the years. You can find the evolution here: http://en.wikipedia.org/wiki/FASTQ_format (Illumina sequence identifiers)
      I agree that page is very useful regarding the evolution of the FASTQ format with regards to quality encoding and Illumina "machine" tags but actually contains no information at all on how paired-end reads are defined. EDIT: sorry, yes it does.

      HTML Code:
      /1 	the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
      But this seems to be very inconsistently applied: I'm working on data right now that is paired-end but does not have these tags. This is possibly because the Illumina 1.8 / 1.9 pipeline no longer adds them.
      I guess I was just saying that it all seems a bit inconsistent (does a read have a tag or not, are the files produced as interleaved files or not), which can be problematic for tool development.
      Last edited by danwiththeplan; 01-16-2014, 02:54 PM. Reason: fix



      • #4
        Actually, to answer my own question: I see that even Illumina 1.8 (and presumably 1.9) has designations for paired-end reads. It's just that they are now buried in the middle of the header line, and some tools still expect them to be at the end, as in earlier pipeline versions.
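
A small Python sketch of how a tool might accept both conventions. The helper name `pair_member` and both regular expressions are my own assumptions, based on the identifier layouts described on the Wikipedia page linked above (a trailing /1 or /2 in older pipelines, and a space followed by read:filtered:control:index in 1.8+). Real data may need looser patterns.

```python
import re
from typing import Optional

# Old style: the header ends in /1 or /2.
OLD_STYLE = re.compile(r"/([12])$")
# Illumina 1.8+ style: a space, then "<read>:<filtered Y/N>:<control bits>:<index>".
NEW_STYLE = re.compile(r"^\S+ ([12]):[YN]:\d+:\S*$")


def pair_member(header: str) -> Optional[int]:
    """Return 1 or 2 if the header says which mate it is, else None."""
    header = header.strip()
    old = OLD_STYLE.search(header)
    if old:
        return int(old.group(1))
    new = NEW_STYLE.match(header)
    if new:
        return int(new.group(1))
    return None
```

Returning `None` for untagged headers leaves it to the caller to decide whether to fall back on file position or to refuse the input.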



        • #5
          If you don't know the format type, I usually go with FastQC to see what it is.

          Overall the basics are:

          Sanger/Illumina 1.8+ (1.9) are phred33
          Illumina 1.3 to 1.5 are phred64
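
To make the two offsets concrete, here is a short Python sketch. `decode_quality` and `guess_offset` are hypothetical helper names, and the guess is only a crude heuristic of my own (any character below ASCII 59 rules out a +64 encoding), not a description of how FastQC works.

```python
from typing import List


def decode_quality(quality: str, offset: int) -> List[int]:
    """Convert a FASTQ quality string to integer scores for the given ASCII offset."""
    scores = [ord(ch) - offset for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("character below the given offset; wrong encoding?")
    return scores


def guess_offset(quality: str) -> int:
    """Crude guess: characters below ';' (ASCII 59) cannot occur with a +64 offset."""
    return 33 if any(ord(ch) < 59 for ch in quality) else 64
```

For example, `decode_quality("!I@", 33)` gives `[0, 40, 31]`, so the @ character is simply a score of 31 under the +33 offset. A guess based on a single high-quality read is unreliable, so in practice the heuristic should be run over many reads.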

          Also please take a look at page 27 in this pdf document: https://www.msi.umn.edu/sites/defaul...Module%201.pdf

          All the best.



          • #6
            Originally posted by Zapages:
            if you don't know the format type... I usually go with FastQC to see what the format type is.

            Overall the basics are:

            Sanger/Illumina 1.8+ (1.9) are phred33
            Illumina 1.3 to 1.5 are phred64
            It seems like you've missed the nature of this question, which was about how the end pair status was encoded in the sequence header, rather than how the quality scores are encoded.



            • #7
              Originally posted by gringer:
              It seems like you've missed the nature of this question, which was about how the end pair status was encoded in the sequence header, rather than how the quality scores are encoded.
              Correct, but I really should have looked at the wiki, since this info is included.

              However, it was also more of an observation that the way paired-end status is encoded has changed without warning and is very inconsistently applied in practice, which is a bit of a pain for developers of analysis tools. I also wanted to warn about the unsafe code some people seem to be using to apply the old-style paired-end tags (that some tools still look for) to new formats.

              Also, FastQC is not that great in my opinion. Or rather, it was originally designed to do one thing (analysing Illumina data, where all the reads are the same length and shorter than 100 bp), but people then go and use it for absolutely everything. For example, if your reads are longer than that, it compresses the end of the reads into the last bit of the graph. It also does not deal particularly well with reads that are not all the same length (e.g. Ion Torrent). I prefer custom R scripts in this case.
              Last edited by danwiththeplan; 01-16-2014, 06:02 PM.
