  • space character in headers

    Hi,

    I was wondering if anyone has had problems with space characters in the header of a FASTQ file.

    We have two technical replicates, and our sequencing people told us it would save us time if we merged the two files and analysed them together.
    The problem is that the headers are the same except for the number of the technical replicate:
    Code:
    technical replicate 1
    @HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC
    technical replicate 2
    @HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC
    Q1: Are space characters allowed in the header of a FASTQ file?

    When we analysed the merged file we got the same results as for the first of the two original files.
    This led me to think that the space and everything after it are being ignored. We are working with MAQGene to try to find SNPs in C. elegans.

    Q2: Does it make sense at all to merge the two technical replicates to save analysis time? Do I lose information by merging them?

    Thanks for the help

    Assa

  • #2
    I'm not sure about Q1, but for Q2 it will probably be safer to keep them separate; analysing them separately would take at most twice as long. Realistically, if you end up fiddling around a lot with the analysis for the first one, then it will take much less time to analyze the second one once you have all the kinks worked out. If it's only a one-time thing, I wouldn't worry about screwing things up while merging.

    • #3
      Originally posted by frymor View Post
      Q1: Are space characters allowed in the header of a FASTQ file?
      Yes, and they mark the end of the identifier and the start of the free format text.
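To illustrate the point about the space, here is a minimal Python sketch that splits a header line at the first whitespace into the identifier and the free-format text. The function name `split_fastq_header` is made up for this example, and this is only how a typical parser might behave, not necessarily what MAQGene does internally.

```python
def split_fastq_header(header_line):
    """Split a FASTQ header line into (identifier, free_text).

    Hypothetical helper for illustration: the identifier is everything
    between the leading '@' and the first whitespace, the rest is free text.
    """
    if not header_line.startswith("@"):
        raise ValueError("FASTQ header lines must start with '@'")
    # Drop the '@' and any line ending, then split once at the first whitespace.
    parts = header_line[1:].rstrip("\r\n").split(None, 1)
    identifier = parts[0] if parts else ""
    free_text = parts[1] if len(parts) > 1 else ""
    return identifier, free_text


# The two header lines from the original post.
header_a = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC"
header_b = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC"

id_a, text_a = split_fastq_header(header_a)
id_b, text_b = split_fastq_header(header_b)

print(id_a == id_b)    # True: both lines carry the same identifier
print(text_a, text_b)  # 1:Y:0:TTAGGC 2:Y:0:TTAGGC
```

A tool that keeps only the identifier would treat these two records as having the same name, which would be consistent with the merged file behaving like one of the originals.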

      You are suffering from the Illumina 1.8 switch from /1 and /2 suffixes to putting the segment number separately from the identifier, i.e. they switched from a separate ID for each read segment to a shared ID (similar to how SAM/BAM records parts one and two with the same ID but different FLAG values). See also:
      Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

      I think it is time to retire the FASTQ file format in favour of storing unaligned reads in SAM/BAM format. I will try to explain, as thi...
      Last edited by maubp; 11-02-2011, 11:50 AM. Reason: typo

      • #4
        Those reads you have shown are not two different technical replicates; they are read 1 and read 2 from a paired-end run. The format of the headers shows them to be generated by Illumina's CASAVA v1.8+. The CASAVA pipeline normally produces two separate files, one for each read. They should be kept this way until you know what you are going to do with them. Some software requires that the two reads be supplied as separate files; other programs want the reads shuffled (interleaved) together in one file. I don't know what MAQGene expects for paired reads.
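For programs that want the reads shuffled together, a rough Python sketch of interleaving two FASTQ files could look like the following. It assumes plain four-line records (no wrapped sequence lines) and that both files list the reads in the same order; the function names and the toy sequences are invented for the example.

```python
import io
from itertools import islice, zip_longest


def fastq_records(handle):
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        # Normalise line endings so records can be written back safely.
        yield [line.rstrip("\r\n") + "\n" for line in record]


def interleave(handle_r1, handle_r2, out):
    """Write read 1 and read 2 records alternately; return the number of pairs."""
    pairs = 0
    for rec1, rec2 in zip_longest(fastq_records(handle_r1), fastq_records(handle_r2)):
        if rec1 is None or rec2 is None:
            raise ValueError("the two files contain different numbers of reads")
        # The part before the first whitespace must match within a pair.
        if rec1[0].split()[0] != rec2[0].split()[0]:
            raise ValueError("read identifiers are out of sync")
        out.writelines(rec1)
        out.writelines(rec2)
        pairs += 1
    return pairs


# Toy data: the header is from the thread, sequence and qualities are invented.
r1 = io.StringIO("@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC\nACGT\n+\nIIII\n")
r2 = io.StringIO("@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC\nTGCA\n+\nIIII\n")
merged = io.StringIO()
n_pairs = interleave(r1, r2, merged)
print(n_pairs)  # 1
```

With real data the `io.StringIO` objects would be replaced by opened files. Simply concatenating the two files is not the same thing, because the two reads of a pair would then no longer sit next to each other.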

        • #5
          Originally posted by kmcarr View Post
          Those reads you have shown are not two different technical replicates, they are read 1 and read 2 from a paired end format run.
          Are you sure these are two files of a paired-end reads data set?

          As far as we understood from the technicians here at the sequencing centre, they didn't do any PE analysis, but produced technical replicates for us.

          Where can I find out what the header means?
          What would it look like if these were two sets of technical replicates instead of PE reads?

          Thanks a lot for the info about the PE data. It needs a totally different analysis. Up until now we analyzed them separately, because we didn't know they were connected.

          We are working with the MAQGene software and I am not even sure whether it has an option for working explicitly with PE reads.
          Does it make sense at all to merge the two files, if they are really paired-end reads?
          Last edited by frymor; 11-03-2011, 12:03 AM.

          • #6
            Originally posted by frymor View Post
            Are you sure these are two files of a paired-end reads data set?
            Pretty sure based on the naming.
            Originally posted by frymor View Post
            As far as we understood from the technicians here at the sequencing centre, they didn't do any PE analysis, but produced technical replicates for us.
            Odd. Double check that with them.
            Originally posted by frymor View Post
            Where can I find out what the header means?
            Read the thread I linked to earlier, and/or the official Illumina 1.8 documentation:
            Discussion of next-gen sequencing related bioinformatics: resources, algorithms, open source efforts, etc

            • #7
              Originally posted by frymor View Post
              Are you sure these are two files of a paired-end reads data set?
              Yes, absolutely.

              Originally posted by frymor View Post
              As far as we understood from the technicians here at the sequencing centre, they didn't do any PE analysis, but produced technical replicates for us.
              I think you had better speak with your sequencing facility to clarify exactly what they are telling you.

              Originally posted by frymor View Post
              Where can I find out what the header means?
              I have attached a PDF I normally give to our clients which describes the file name and FASTQ header line format for CASAVA 1.8. (This file was heavily cribbed from Illumina's documentation.)

              What can be seen from your examples:
              The read IDs are identical, so they are two reads from the same cluster: reads 1 and 2 of a paired read. The barcode tag sequence for that cluster is TTAGGC. The read did NOT pass the Illumina quality filter (a "Y" in that field means the read was filtered out).
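The same reading of the header can be sketched in Python. The field order below follows the CASAVA 1.8 header layout (instrument, run number, flowcell ID, lane, tile, x, y, then read number, filter flag, control number, index sequence); the names `Casava18Header` and `parse_casava18_header` are just labels chosen for this example.

```python
from collections import namedtuple

# Field names are descriptive labels chosen for this sketch.
Casava18Header = namedtuple(
    "Casava18Header",
    "instrument run_number flowcell_id lane tile x_pos y_pos "
    "read is_filtered control_number index_sequence",
)


def parse_casava18_header(header_line):
    """Parse a CASAVA 1.8 style FASTQ header line into its fields."""
    text = header_line.strip()
    if not text.startswith("@"):
        raise ValueError("FASTQ header lines must start with '@'")
    try:
        # Part before the space locates the cluster, part after describes the read.
        location, tags = text[1:].split(" ", 1)
        instrument, run, flowcell, lane, tile, x_pos, y_pos = location.split(":")
        read, filtered, control, index = tags.split(":")
    except ValueError:
        raise ValueError("not a CASAVA 1.8 style header: %r" % header_line)
    if filtered not in ("Y", "N"):
        raise ValueError("filter flag must be 'Y' or 'N'")
    return Casava18Header(
        instrument, int(run), flowcell, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read),
        filtered == "Y",  # Y = read was filtered out, i.e. it failed the quality filter
        int(control), index,
    )


header = parse_casava18_header("@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC")
print(header.read, header.is_filtered, header.index_sequence)  # 2 True TTAGGC
```

Filtering on `is_filtered` would be one way to drop reads that failed the quality filter before further analysis.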

              Originally posted by frymor View Post
              What would it look like if these were two sets of technical replicates instead of PE reads?
              I can't say, since I'm not completely clear on what you mean by "two sets of technical replicates". If you mean they just ran the same library twice, then the read ID information would be entirely different and there would be no matching of reads from one replicate to the other.
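One way to check this on your own files is to compare the read identifiers (the part before the space) between the two files. The sketch below is only an illustration with invented names and toy records; it assumes four-line FASTQ records and holds all identifiers in memory, so for full-size files you might run it on the first few thousand reads only.

```python
import io
from itertools import islice


def read_identifiers(handle):
    """Collect the identifiers (text before the first whitespace) of all records."""
    identifiers = set()
    while True:
        record = list(islice(handle, 4))
        if not record:
            break
        fields = record[0][1:].split()
        if len(record) < 4 or not record[0].startswith("@") or not fields:
            raise ValueError("truncated or malformed FASTQ record")
        identifiers.add(fields[0])
    return identifiers


def shared_fraction(handle_a, handle_b):
    """Fraction of identifiers in file A that also occur in file B."""
    ids_a = read_identifiers(handle_a)
    ids_b = read_identifiers(handle_b)
    if not ids_a:
        return 0.0
    return len(ids_a & ids_b) / len(ids_a)


# Toy data: the header is from the thread, sequence and qualities are invented.
file_a = io.StringIO("@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC\nACGT\n+\nIIII\n")
file_b = io.StringIO("@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC\nTGCA\n+\nIIII\n")
fraction = shared_fraction(file_a, file_b)
print(fraction)  # 1.0 here, because both toy files contain the same cluster
```

A fraction close to 1 points to read 1 and read 2 of a paired run; a fraction close to 0 is what you would expect from two independent runs of the same library.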

              Originally posted by frymor View Post
              Thanks a lot for the info about the PE data. It needs a totally different analysis. Up until now we analyzed them separately, because we didn't know they were connected.

              We are working with the MAQGene software and I am not even sure whether it has an option for working explicitly with PE reads.
              Does it make sense at all to merge the two files, if they are really paired-end reads?
              I don't know if/how MAQGene manages paired reads. You'll have to consult the documentation for that.
              Attached Files
