space character in headers

  • space character in headers

    Hi,

    I was wondering if anyone has problems with space characters in the header of a fastq file

    We have two technical replicates, and our sequencing people told us it would save time to merge the two files and analyse them together.
    The problem is that the headers are the same except for the number of the technical replicate:
    Code:
    technical replicate 1
    @HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC
    technical replicate 2
    @HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC
    Q1: Are space characters allowed in the header of a fastq file?

    When we analysed the merged file, we got the same results as for the first of the two original files.
    This made me think that the space and everything after it are being ignored. We are working with maqgene to try to find SNPs in C. elegans.

    Q2: Does it make sense at all to merge the two technical replicates to save analysis time? Do I lose information by merging them?

    Thanks for the help

    Assa

  • #2
    I'm not sure about 1, but for 2, it will probably be safer to keep them separate and it would take at most twice as long to analyze. Realistically, if you end up fiddling around a lot with the analysis for the first one then it will take much less time to analyze the second one when you get all the kinks worked out. If it's only a one time thing I wouldn't worry about screwing things up while merging.



    • #3
      Originally posted by frymor View Post
      Q1: Are space characters allowed in the header of a fastq file?
      Yes, and they mark the end of the identifier and the start of the free format text.

      You are suffering from the Illumina 1.8 switch from /1 and /2 suffixes to putting the segment number separately from the identifier, i.e. they switched from a separate ID for each read segment to a shared ID (similar to how SAM/BAM records parts one and two with the same ID but different FLAG values). See also:
      http://seqanswers.com/forums/showthread.php?t=8895
      http://blastedbio.blogspot.com/2011/...ve-sambam.html
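
To illustrate the point about the space: a minimal Python sketch, assuming a tool that keeps only the text before the first whitespace as the read name (the helper name `split_header` is made up for this example). With the two headers from the first post, both reads end up with the same identifier, which would explain why a merged file behaves as if it had duplicate names.

```python
def split_header(line):
    """Split a FASTQ header line into (identifier, free-format description)."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    parts = line[1:].strip().split(None, 1)  # split once, on the first whitespace
    if not parts:
        raise ValueError("empty FASTQ header")
    identifier = parts[0]
    description = parts[1] if len(parts) > 1 else ""
    return identifier, description


# The two headers quoted in the first post of this thread.
header1 = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC"
header2 = "@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC"

id1, desc1 = split_header(header1)
id2, desc2 = split_header(header2)
print(id1 == id2)    # the identifiers are the same
print(desc1, desc2)  # only the free-format text differs
```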
      Last edited by maubp; 11-02-2011, 11:50 AM. Reason: typo



      • #4
        Originally posted by frymor View Post
        Hi,

        I was wondering if anyone has problems with space characters in the header of a fastq file

        We have two technical replicates, and our sequencing people told us it would save time to merge the two files and analyse them together.
        The problem is that the headers are the same except for the number of the technical replicate:
        Code:
        technical replicate 1
        @HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC
        technical replicate 2
        @HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 2:Y:0:TTAGGC
        Q1: Are space characters allowed in the header of a fastq file?

        When we analysed the merged file, we got the same results as for the first of the two original files.
        This made me think that the space and everything after it are being ignored. We are working with maqgene to try to find SNPs in C. elegans.

        Q2: Does it make sense at all to merge the two technical replicates to save analysis time? Do I lose information by merging them?

        Thanks for the help

        Assa
        Those reads you have shown are not two different technical replicates, they are read 1 and read 2 from a paired end format run. The format of the headers shows them to be generated by Illumina's CASAVA v1.8+. The CASAVA pipeline normally produces two separate files, one for each read. They should be kept this way until you know what you are going to do with them. Some software requires that the two reads be supplied as separate files, other programs want the reads shuffled together in one file. I don't know what maqgene expects for paired reads.
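
For programs that want the two reads shuffled together in one file, interleaving could look roughly like the Python sketch below. It is only an illustration, not what any particular tool does internally: it assumes unwrapped four-line FASTQ records, files that end with a newline, and the same read order in both files. The function names are made up for this example.

```python
import io
from itertools import islice


def read_records(handle):
    """Yield FASTQ records as lists of four lines (assumes unwrapped records)."""
    while True:
        record = list(islice(handle, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(handle1, handle2, out):
    """Write read 1 and read 2 records alternately, checking that the IDs match."""
    # Note: zip stops at the shorter input, so unequal file lengths are not detected here.
    for rec1, rec2 in zip(read_records(handle1), read_records(handle2)):
        id1 = rec1[0].split()[0]  # text before the first whitespace
        id2 = rec2[0].split()[0]
        if id1 != id2:
            raise ValueError(f"ID mismatch: {id1} vs {id2}")
        out.writelines(rec1)
        out.writelines(rec2)


# Tiny in-memory demonstration with made-up reads.
reads1 = io.StringIO("@id1 1:N:0:TTAGGC\nACGT\n+\nIIII\n")
reads2 = io.StringIO("@id1 2:N:0:TTAGGC\nTTTT\n+\nIIII\n")
merged = io.StringIO()
interleave(reads1, reads2, merged)
print(merged.getvalue(), end="")
```

With real data the `io.StringIO` objects would be replaced by opened files, one per read.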



        • #5
          Originally posted by kmcarr View Post
          Those reads you have shown are not two different technical replicates, they are read 1 and read 2 from a paired end format run.
          Are you sure these are two files of a paired-end reads data set?

          As far as we understood from our technicians here at the sequencing centre, they didn't do any PE analysis but produced technical replicates for us.

          Where can I find out what is the meaning of the header?
          What would it look like if it were two sets of technical replicates instead of PE reads?

          Thanks a lot for the info about the PE data. It needs a totally different analysis. Up until now we analyzed them separately, because we didn't know they were connected.

          We are working with the MAQGene software and I am not even sure if there's an option for explicit work with PE reads.
          Does it make sense at all to merge the two files, if they are really paired-end reads?
          Last edited by frymor; 11-03-2011, 12:03 AM.



          • #6
            Originally posted by frymor View Post
            Are you sure these are two files of a paired-end reads data set?
            Pretty sure based on the naming.
            Originally posted by frymor View Post
            As far as we understood from our technicians here at the sequencing centre, they didn't do any PE analysis but produced technical replicates for us.
            Odd. Double check that with them.
            Originally posted by frymor View Post
            Where can I find out what is the meaning of the header?
            Read the thread I linked to earlier, and/or the official Illumina 1.8 documentation:
            http://seqanswers.com/forums/showthread.php?t=8895



            • #7
              Originally posted by frymor View Post
              Are you sure these are two files of a paired-end reads data set?
              Yes, absolutely.

              As far as we understood from our technicians here at the sequencing centre, they didn't do any PE analysis but produced technical replicates for us.
              I think you had better speak with your sequencing facility to clarify exactly what they are telling you.

              Where can I find out what is the meaning of the header?
              I have attached a PDF I normally give to our clients which describes the file name and FASTQ header line format for CASAVA 1.8. (This file was heavily cribbed from Illumina's documentation.)

              What can be seen from your examples:
              The read IDs are identical, so they are two reads from the same cluster. They are reads 1 and 2 of a paired read. The barcode tag sequence for that cluster is TTAGGC. The read did NOT pass the Illumina quality filter (in the is-filtered field, "Y" means the read was filtered out, "N" means it passed).
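
The fields read off above can also be pulled out programmatically. This is a rough Python sketch, assuming the CASAVA 1.8 layout `@instrument:run:flowcell:lane:tile:x:y read:is_filtered:control:index`; the names `Casava18Header` and `parse_casava18` are made up for this example.

```python
from collections import namedtuple

# Field names chosen for this sketch; they follow the CASAVA 1.8 header layout.
Casava18Header = namedtuple(
    "Casava18Header",
    "instrument run flowcell lane tile x y read is_filtered control index",
)


def parse_casava18(line):
    """Parse one CASAVA 1.8 style FASTQ header line into named fields."""
    if not line.startswith("@"):
        raise ValueError("FASTQ header lines start with '@'")
    identifier, description = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = identifier.split(":")
    read, is_filtered, control, index = description.split(":")
    return Casava18Header(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y),
        int(read),
        is_filtered == "Y",  # True means the read did NOT pass the quality filter
        int(control),
        index,
    )


header = parse_casava18("@HWI-ST0764:77:C03HCACXX:4:1101:1242:2075 1:Y:0:TTAGGC")
print(header.read, header.is_filtered, header.index)
```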

              What would it look like if it were two sets of technical replicates instead of PE reads?
              I can't say since I'm not completely clear on what you mean by "two sets of technical replicates". If you mean they just ran the same library twice, then the read ID information would be entirely different and there would be no matching of reads from one replicate to the other.
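
One way to check this yourself is to compare the read IDs in the two files. The Python sketch below assumes unwrapped four-line FASTQ records and takes the identifier to be the text before the first whitespace; the function names are made up for this example. If nearly all IDs are shared, the files are read 1 and read 2 of the same clusters; if almost none are, they come from separate runs.

```python
import io
from itertools import islice


def read_ids(handle):
    """Return the set of identifiers found in a FASTQ stream."""
    # Every fourth line, starting with the first, is a header (assumes 4-line records).
    return {line[1:].split()[0] for line in islice(handle, 0, None, 4)}


def shared_fraction(handle_a, handle_b):
    """Fraction of IDs shared between two files, relative to the smaller set."""
    ids_a = read_ids(handle_a)
    ids_b = read_ids(handle_b)
    if not ids_a or not ids_b:
        return 0.0
    return len(ids_a & ids_b) / min(len(ids_a), len(ids_b))


# Tiny in-memory demonstration with made-up reads that share one ID.
file_a = io.StringIO("@c1 1:N:0:TTAGGC\nACGT\n+\nIIII\n")
file_b = io.StringIO("@c1 2:N:0:TTAGGC\nTTTT\n+\nIIII\n")
print(shared_fraction(file_a, file_b))
```

Holding all IDs in memory is fine for a quick check on a subset of reads; for full lanes it may be more practical to compare only the first few thousand records.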

              Thanks a lot for the info about the PE data. It needs a totally different analysis. Up until now we analyzed them separately, because we didn't know they were connected.

              We are working with the MAQGene software and I am not even sure if there's an option for explicit work with PE reads.
              Does it make sense at all to merge the two files, if they are really paired-end reads?
              I don't know if/how MAQGene manages paired reads. You'll have to consult the documentation for that.
