All sequence bases have the same quality score.

  • All sequence bases have the same quality score.

    Hi all,
    I am doing some analysis on the dataset here:

    https://trace.ddbj.nig.ac.jp/DRASear...acc=ERX1434776

    Some basic info for the data without looking into above link:
    ----
    Illumina Genome Analyzer IIx paired end sequencing
    shotgun sequencing
    WGS
    Pseudomonas fluorescens
    Paired-end
    ----

    When I searched for 'Genome Analyzer IIx', I could find the quality encoding information. In this dataset, though, the quality score of every base is '?', e.g.

    @ERR1363506.14 226/1
    GTCCACTACAGGTCGAAGCCGAAGGCGACGAGTTGCGTGTTTACGCGCCCAATCGTTTTGTTCTCGACTGGGTCAACGAGAAGTACCTGAGCCGCGTGCT
    +
    ????????????????????????????????????????????????????????????????????????????????????????????????????

    My question is:
    Is it normal to have an identical quality score for all bases?
    When I analyze the data, some bioinformatics tools report errors saying they cannot detect the quality offset or quality encoding. Is the above the cause of these errors?

    Thanks.
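For reference, one plausible reason the tools give up: a file that contains only '?' (ASCII 63) fits more than one of the common FASTQ encodings, so there is no range of characters to tell them apart. Below is a minimal sketch of that check. The table of ASCII ranges follows the commonly cited FASTQ variants, and the function name `compatible_encodings` is made up for illustration; real tools use their own heuristics.

```python
# name -> (offset, lowest valid ASCII code, highest valid ASCII code)
# Ranges follow the commonly cited FASTQ variants; treat them as a guide.
ENCODINGS = {
    "Sanger / Illumina 1.8+ (Phred+33)": (33, 33, 126),
    "Solexa (Solexa+64)": (64, 59, 126),
    "Illumina 1.3+ (Phred+64)": (64, 64, 126),
}


def compatible_encodings(quality_strings):
    """Return every encoding whose ASCII range covers all observed characters.

    The value for each encoding is the (min, max) score the data would
    decode to under that encoding's offset.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    lo, hi = min(codes), max(codes)
    return {
        name: (lo - offset, hi - offset)
        for name, (offset, ascii_min, ascii_max) in ENCODINGS.items()
        if ascii_min <= lo and hi <= ascii_max
    }


if __name__ == "__main__":
    # The quality line from the read shown above: 100 copies of '?'.
    for name, score_range in compatible_encodings(["?" * 100]).items():
        print(name, score_range)
```

With only '?' present, both the Phred+33 and the Solexa+64 ranges still fit, so a detector that relies on the observed character range cannot decide between them.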

  • #2
    This is an odd dataset.

    First of all there are three files for a PE dataset (I thought one was a file for the barcode/tags, but that does not appear to be the case). The fastq headers are non-standard and then there is that issue of every Q-score set to ? for the entire dataset in all three files.

    You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.


    • #3
      Originally posted by GenoMax
      This is an odd dataset.

      First of all there are three files for a PE dataset (I thought one was a file for the barcode/tags, but that does not appear to be the case). The fastq headers are non-standard and then there is that issue of every Q-score set to ? for the entire dataset in all three files.

      You should try to find out more information (directly from the submitter, if you can) before spending time analyzing this data.

      Thanks for your answer.

      This data can be found in DRASearch, NCBI SRA, and EBI.
      All of these sources show the same strange quality values.
      I wasn't able to find the contact info of the submitter, but I emailed EBI help and got the following reply:

      CRAM files are compressed NGS read files. The sequences can be retrieved by using the reference, but quality scores are quantised into a smaller range in
      order to use less space. It looks like the compression on this cram file is such
      that all quality scores average into the same value. These are probably low
      value quality scores, or the quality scores were not available in the first
      place.
      I would just leave the data, or set the --offset =33 for the tool, just to pass the analysis.
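To make the reply above concrete, here is a rough sketch of what quantising quality scores means and how the extreme case ends up as one repeated character. The bin edges and representative values are hypothetical and chosen only for illustration; they are not the scheme used for this CRAM file. The single value 30 in the extreme case is picked because '?' decodes to 30 with offset 33; that this is what the archive actually did is an assumption.

```python
import bisect

# Hypothetical bins: scores below 10 become 6, 10-19 become 15, and so on.
BIN_EDGES = [10, 20, 30, 40]
BIN_VALUES = [6, 15, 25, 35, 40]


def quantise(scores, edges=BIN_EDGES, values=BIN_VALUES):
    """Replace each score by the representative value of the bin it falls in."""
    return [values[bisect.bisect_right(edges, s)] for s in scores]


def encode_phred33(scores):
    """Turn integer Phred scores into a FASTQ quality string (offset 33)."""
    return "".join(chr(s + 33) for s in scores)


def decode_phred33(quality):
    """Turn a FASTQ quality string back into integer Phred scores (offset 33)."""
    return [ord(c) - 33 for c in quality]


if __name__ == "__main__":
    original = [2, 12, 27, 33, 38, 41]
    print(quantise(original))
    # Extreme case: a single bin, so every score collapses to one value.
    flat = quantise(original, edges=[], values=[30])
    print(encode_phred33(flat))
    print(decode_phred33("?"))
```

If a tool only needs an offset in order to run, decoding '?' with offset 33 gives a constant score of 30 for every base, which carries no per-base information.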


      • #4
        Ok. So we have an explanation for the Q-scores but what about the presence of 3 files, all of which have the same length sequence data?

        Edit: I think the third file likely contains single reads whose mate was discarded during trimming. You can check that possibility by seeing whether the headers there are absent from the _1 and _2 files.
        Last edited by GenoMax; 06-24-2016, 07:37 AM.
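The header check suggested in the edit could be sketched like this. It assumes well-formed 4-line FASTQ records and that the first token of the header (e.g. ERR1363506.14) identifies the spot; the function names and the example records other than the first header are made up for illustration.

```python
import io


def read_ids(handle):
    """Yield the read ID (first header token, without '@') of each 4-line record."""
    for i, line in enumerate(handle):
        if i % 4 == 0:
            yield line[1:].split()[0]


def shared_ids(single, mate1, mate2):
    """Return IDs from the third file that also occur in the _1 or _2 file."""
    paired = set(read_ids(mate1)) | set(read_ids(mate2))
    return set(read_ids(single)) & paired


if __name__ == "__main__":
    # Tiny in-memory stand-ins for the three files (made-up records).
    mate1 = io.StringIO("@ERR1363506.14 226/1\nGTCC\n+\n????\n")
    mate2 = io.StringIO("@ERR1363506.14 226/2\nACGT\n+\n????\n")
    single = io.StringIO("@ERR1363506.99 900\nTTGA\n+\n????\n")
    overlap = shared_ids(single, mate1, mate2)
    print("IDs shared with the paired files:", sorted(overlap))
```

An empty overlap is consistent with the third file holding reads that have no mate in the other two files. On the real data the `io.StringIO` objects would be replaced by opened files, assuming they are plain uncompressed FASTQ.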


        • #5
          Originally posted by GenoMax
          Ok. So we have an explanation for the Q-scores but what about the presence of 3 files, all of which have the same length sequence data?
          Usually, when the .sra files of paired-end reads are split using fastq-dump from the SRA Toolkit, the --split-3 parameter is used. Its documentation says:


          Legacy 3-file splitting for mate-pairs: First 2 biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq. If only 1 biological read is dumpable, it is placed in *.fastq.

          So the third file (*.fastq, usually the smaller one) holds the unpaired reads, i.e. the reads for which no mate could be found.

          https://www.biostars.org/p/11111/
          http://www.ncbi.nlm.nih.gov/Traces/s...ew=toolkit_doc
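The splitting rule quoted above can be mimicked in a few lines. This is only a toy model of the described behaviour, not the SRA Toolkit implementation: the function name `split3` is made up, and "dumpable" is assumed here to simply mean a non-empty read.

```python
def split3(spots):
    """Distribute reads the way the --split-3 description reads.

    spots: iterable of (spot_id, [read, ...]) pairs.
    Returns three lists standing in for *_1.fastq, *_2.fastq and *.fastq.
    """
    file_1, file_2, unpaired = [], [], []
    for spot_id, reads in spots:
        # Assumption: a read is "dumpable" if it is non-empty.
        dumpable = [r for r in reads if r]
        if len(dumpable) >= 2:
            # First two dumpable reads go to the paired files.
            file_1.append((spot_id, dumpable[0]))
            file_2.append((spot_id, dumpable[1]))
        elif len(dumpable) == 1:
            # A lone read goes to the third file.
            unpaired.append((spot_id, dumpable[0]))
    return file_1, file_2, unpaired


if __name__ == "__main__":
    # Made-up spots: one complete pair, one spot with an empty mate.
    spots = [("spot.1", ["ACGT", "TTGA"]), ("spot.2", ["GGCC", ""])]
    for name, content in zip(("_1", "_2", "unpaired"), split3(spots)):
        print(name, content)
```

Under this model a spot ID never appears in both the paired files and the third file.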


          • #6
            See the edit I just made to the post above.


            • #7
              Originally posted by GenoMax
              See the edit I just made to the post above.
              Saw it.
              I think there is no trimming involved at/before that stage. The third file is a collection of unloved ones.
