  • How to add a suffix to fastq file

    Hi everyone,

    I'm trying to do some alignments, but my latest Illumina data came with a rather strange suffix. For a read pair, the identifiers look like this:
    @HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 1:N:0:
    @HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 2:N:0:

    This seems to be confusing a number of programs. Does anyone know of a good script to trim off that 1:N:0: and just add a more standard .1 or .F kind of thing?

    Thanks

  • #2
    Your question made me finally write a blogpost about this: http://contig.wordpress.com/2011/09/...-fastq-header/. The awk command mentioned can be adjusted as needed. Also, check out http://en.wikipedia.org/wiki/Fastq for a comparison of old- and new-style headers.
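For readers who would rather not adapt awk, here is a minimal Python sketch of the same kind of header rewrite. It is not the command from the blog post. It assumes plain 4-line FASTQ records and new-style headers of the form `@name mate:filter:control:index`, and it writes the old-style `/1` and `/2` suffix. The names `to_old_style` and `convert_fastq` are made up for this example.

```python
import re

# New-style header: "@<name> <mate>:<Y|N>:<control number>:<index, possibly empty>"
NEW_STYLE = re.compile(r"^(@\S+)\s+([12]):[YN]:\d+:\S*$")


def to_old_style(header: str) -> str:
    """Return '@name/1' or '@name/2' for a new-style header; leave anything else unchanged."""
    line = header.rstrip("\n")
    match = NEW_STYLE.match(line)
    if match is None:
        return line
    return f"{match.group(1)}/{match.group(2)}"


def convert_fastq(lines):
    """Yield FASTQ lines with rewritten title lines, assuming 4-line records."""
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        # Only every 4th line is a title line; quality lines may also start with '@'.
        yield to_old_style(line) if i % 4 == 0 else line
```

To rewrite a file, something like `out.writelines(l + "\n" for l in convert_fastq(open("reads.fastq")))` should do, where `reads.fastq` is a placeholder path. Change the f-string in `to_old_style` if you prefer `.1` or `.F` instead of `/1`.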


    • #3
      That looks great, thank you. I was all set to attempt to write my own script, but I'm very new to actually writing scripts, so I'm glad it didn't come to that.


      • #4
        Hi,

        Is it possible to ADD a term to the illumina FASTQ file? Will that interfere with programs that map reads to a reference? Or will those programs generally ignore everything after the "#"?

        Thanks,
        Andor


        • #5
          You can add whatever you want to the read name, but you can't add new lines or anything to the lines with bases or qualities.
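As a rough illustration of that rule, the Python sketch below appends a term to the title line of each record and leaves the sequence, `+`, and quality lines alone. It assumes plain 4-line FASTQ records; `add_tag` and the example tag are hypothetical names, not part of any existing tool.

```python
def add_tag(lines, tag):
    """Yield FASTQ lines with `tag` appended to each title line (every 4th line)."""
    # A line break inside the tag would add a new line to the record and corrupt it.
    if "\n" in tag or "\r" in tag:
        raise ValueError("tag must not contain line breaks")
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield f"{line} {tag}" if i % 4 == 0 else line
```

For example, `list(add_tag(["@r1\n", "ACGT\n", "+\n", "IIII\n"], "sample=A"))` gives `["@r1 sample=A", "ACGT", "+", "IIII"]`. The tag is separated by a space here, so a tool that drops everything after the first whitespace would drop the tag as well.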


          • #6
            Originally posted by cement_head:
            Is it possible to ADD a term to the illumina FASTQ file? Will that interfere with programs that map reads to a reference? Or will those programs generally ignore everything after the "#"?
            As Brian Bushnell said, the fastq header line is free text. Quoting from Cock et al. 2010:

            ‘@’ title line which often holds just a record identifier. This is a free format field with no length limit—allowing arbitrary annotation or comments to be included...
            However, you can't assume that downstream programs, e.g. aligners, won't expect more stringent constraints, e.g. absence of blank spaces. Also, some programs expect PE reads to have the same name (not sure whether the fastq spec requires this).


            • #7
              I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...


              • #8
                Originally posted by dpryan:
                I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...
                Indeed, and what is more annoying is when programs discard anything after the first part of the name because then the pair information is lost. Examples: seqtk and the readfq library (I submitted a patch for the Perl version a long time ago, and this fix is on github).

                If the pair information has been lost, or you need to adjust the format for some aligner, you can use Pairfq (specifically with the subcommand addinfo). For this one simple task though, it is probably just as easy to write out a shell command. If this is a useful tool you are using, it would probably be worth asking the developers to support Illumina Fastq files.


                • #9
                  Originally posted by SES:
                  Indeed, and what is more annoying is when programs discard anything after the first part of the name because then the pair information is lost. Examples: seqtk and the readfq library (I submitted a patch for the Perl version a long time ago, and this fix is on github).
                  I just want to mention that nothing in the BBTools package does this, so you can subsample, normalize, trim, filter, etc. while leaving the names intact. But some pipelines require that everything after the first whitespace be truncated, on the assumption that these are comments. For example, sam format requires read 1 and read 2 to have the exact same name, while in Illumina's output they have different names (/1 and /2, for example). So BBMap has a couple of related flags: "trimreaddescriptions", which will truncate everything after the first whitespace (for both reads and reference contigs), default false; and "keepnames", which will force read 1 and read 2 to retain their original name, even though the resulting sam file will not technically be spec-compliant (it's still useful in many situations). By default, for paired reads, read 1 and read 2 will both get the full name of read 1 so as to produce a valid sam file.
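To make the effect of truncating at the first whitespace concrete, here is a small Python sketch using the two headers from the first post. It only illustrates the idea and is not how BBMap implements "trimreaddescriptions"; the function name `trim_description` is invented for this example.

```python
def trim_description(header: str) -> str:
    """Keep only the part of a title line before the first whitespace."""
    parts = header.split(maxsplit=1)
    return parts[0] if parts else header


r1 = "@HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 1:N:0:"
r2 = "@HWI-ST201:195:BB0036ABXX:6:1101:1407:1941 2:N:0:"

# After truncation both mates carry the same name, which is what sam format wants,
# but the 1/2 mate number that was stored after the space is gone.
same_name = trim_description(r1) == trim_description(r2)
```

Old-style names such as `@read/1` contain no whitespace, so this kind of truncation leaves them untouched and the two mates keep different names.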

                  Originally posted by dpryan:
                  I've also seen a couple programs require paired-end read names to end in /1 and /2, even though that's neither a standard nor common practice (in fact, it's a stupid requirement). You'd be surprised how easy it is to break some aligners...
                  Also, BBTools makes use of that information for autodetecting whether a single file is paired and interleaved, but it can be overridden. And it's certainly not required.
                  Last edited by Brian Bushnell; 08-27-2014, 09:23 AM.
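A name-based guess at whether a file is interleaved could look roughly like the Python sketch below. This is a simplified heuristic written for illustration, not the detection logic BBTools actually uses. It looks only at the title lines of the first two records and recognises the old `/1`, `/2` style and the new `1:N:0:` style; all function names are hypothetical.

```python
import re

# Old style: "@name/1". New style: "@name 1:N:0:..." (mate number starts the comment).
OLD = re.compile(r"^@(\S+)/([12])$")
NEW = re.compile(r"^@(\S+)\s+([12]):")


def name_and_mate(header):
    """Return (base_name, mate) or (None, None) if no mate marker is recognised."""
    for pattern in (OLD, NEW):
        m = pattern.match(header.strip())
        if m:
            return m.group(1), int(m.group(2))
    return None, None


def looks_interleaved(headers):
    """Guess from the first two title lines whether a file holds interleaved pairs."""
    if len(headers) < 2:
        return False
    (n1, m1), (n2, m2) = name_and_mate(headers[0]), name_and_mate(headers[1])
    # Interleaved pairs: same base name, mate 1 immediately followed by mate 2.
    return n1 is not None and n1 == n2 and (m1, m2) == (1, 2)
```

Files whose names carry no mate marker at all make this return False, which is one reason a tool would want a way to override the guess.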


                  • #10
                    Originally posted by Brian Bushnell:
                    I just want to mention that nothing in the BBTools package does this, so you can subsample, normalize, trim, filter, etc. while leaving the names intact. But some pipelines require that everything after the first whitespace be truncated, on the assumption that these are comments. For example, sam format requires read 1 and read 2 to have the exact same name, while in Illumina's output they have different names (/1 and /2, for example). So BBMap has a couple of related flags: "trimreaddescriptions", which will truncate everything after the first whitespace (for both reads and reference contigs), default false; and "keepnames", which will force read 1 and read 2 to retain their original name, even though the resulting sam file will not technically be spec-compliant (it's still useful in many situations). By default, for paired reads, read 1 and read 2 will both get the full name of read 1 so as to produce a valid sam file.
                    That is helpful information. I guess the developers of some tools assume you are only going to be mapping to a reference and working with SAM files (thus, trimming the read names to be valid). Of course, this is not the case for many of us but I can see how that assumption is valid for some (possibly most) use cases.
