
  • Question on DNA Formatting

    Hello, this is my first post here, and I am hoping for some assistance. I downloaded some genomes from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/) to review. I am trying to develop a more advanced compression algorithm for DNA, but I noticed some interesting things about the files. Some letters are upper-case while others are lower-case. There also seems to be a line break every so many characters. Is there a reason for this? I am also assuming the large runs of N represent missing data; is that right? I am trying to get the files as small as possible, and I am wondering whether preserving case or line breaks is necessary for the tools that use these files. Thanks in advance.

  • #2
    Sounds exciting!

    Lower-case letters mean "masked", which usually implies that the sequence is repetitive or low-confidence; the exact meaning is application-specific. Typically, programs will either ignore lower-case letters (convert them to N) or make them upper-case and treat them like all of the other upper-case letters. For lossless compression this information must be preserved, even though it's not relevant to most applications. An official genome of an organism is all upper-case; the ones with lower-case letters have been processed in a specific way for some specific application.
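
    One way to preserve masking without paying for it in the sequence stream is to store the lower-case regions separately as runs. The following is only a minimal sketch under that assumption; `split_case` and `restore_case` are hypothetical names for illustration, not part of any existing tool.

```python
def split_case(seq):
    """Return (upper-cased sequence, list of (start, length) lower-case runs)."""
    runs = []
    start = None
    for i, ch in enumerate(seq):
        if ch.islower():
            if start is None:
                start = i          # a lower-case (masked) run begins here
        elif start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:          # run extends to the end of the sequence
        runs.append((start, len(seq) - start))
    return seq.upper(), runs


def restore_case(upper_seq, runs):
    """Invert split_case by lower-casing the recorded runs again."""
    chars = list(upper_seq)
    for start, length in runs:
        for i in range(start, start + length):
            chars[i] = chars[i].lower()
    return "".join(chars)


masked = "ACGTacgtnnNNACgt"            # invented example sequence
upper, runs = split_case(masked)
print(upper)                           # ACGTACGTNNNNACGT
print(runs)                            # [(4, 6), (14, 2)]
assert restore_case(upper, runs) == masked
```

    Because masked regions tend to be long contiguous stretches, the run list should usually stay small relative to the sequence, though that depends on how the file was masked.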

    Also, raw reads, which are much more interesting from a compression standpoint (since they amount to hundreds of thousands of times as much data as genomes), do not have lower-case letters. So even a case-insensitive compression program would still be useful, though obviously less likely to catch on. I suggest you design it to handle all-upper-case ACGTN efficiently, and be capable of handling other things without regard to efficiency, or give it an option to convert lower-case to N, for example. There are also other degenerate bases to watch for (IUPAC codes such as R and Y).
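
    As a rough illustration of "handle upper-case ACGTN efficiently", here is a sketch that packs A, C, G and T into 2 bits each and records runs of N on the side. The layout and the names `pack` and `unpack` are assumptions made for this example; anything outside ACGTN (lower-case, other degenerate bases) raises a `KeyError` here and would need a separate, slower path.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack upper-case ACGTN into (length, N runs, 2-bit packed bytes)."""
    n_runs = []
    bits = bytearray((len(seq) + 3) // 4)   # four bases per byte
    run_start = None
    for i, ch in enumerate(seq):
        if ch == "N":
            if run_start is None:
                run_start = i
            code = 0                        # placeholder bits, overwritten on unpack
        else:
            if run_start is not None:
                n_runs.append((run_start, i - run_start))
                run_start = None
            code = CODE[ch]                 # KeyError for anything outside ACGTN
        bits[i // 4] |= code << (2 * (i % 4))
    if run_start is not None:
        n_runs.append((run_start, len(seq) - run_start))
    return len(seq), n_runs, bytes(bits)


def unpack(length, n_runs, packed):
    """Rebuild the original sequence from the output of pack."""
    chars = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for start, run_len in n_runs:
        chars[start:start + run_len] = "N" * run_len
    return "".join(chars)


seq = "ACGTNNNNACGTA"                       # invented example sequence
length, n_runs, packed = pack(seq)
assert unpack(length, n_runs, packed) == seq
print(len(seq), "bases ->", len(packed), "bytes plus", len(n_runs), "N run(s)")
```

    Plain 2-bit packing is only a baseline (a 4x reduction over one byte per base); a real compressor would presumably layer modelling and entropy coding on top of it.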

    However!

    There are two other components: names and quality scores. Genomes normally don't have quality scores, but reads do (see the fastq format). And names can contain essentially anything other than newlines. So, compression of fasta files (only names and sequence) is dominated by the sequence, while compression of fastq files is usually dominated by qualities and names.
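
    Since the three components behave so differently, one common approach is to separate them into their own streams before compressing each one. The sketch below assumes the usual four-line fastq layout (`@name`, sequence, `+`, qualities); `split_fastq` is a hypothetical helper and the records are invented.

```python
import io


def split_fastq(handle):
    """Split four-line fastq records into parallel name, sequence and quality lists."""
    names, seqs, quals = [], [], []
    while True:
        header = handle.readline()
        if not header:
            break                      # end of file
        if not header.strip():
            continue                   # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline()
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("not a four-line fastq record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        names.append(header[1:].rstrip("\r\n"))
        seqs.append(seq)
        quals.append(qual)
    return names, seqs, quals


data = "@read1 extra\nACGT\n+\nIIII\n@read2\nGGN\n+read2\n#!5\n"
names, seqs, quals = split_fastq(io.StringIO(data))
print(names)
print(seqs)
print(quals)
```

    Reading by line position rather than scanning for "@" matters because "@" is also a valid quality character, so a quality line may start with it.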

    As for the number of letters before a newline: that's legacy stuff, probably for Fortran and fixed-width consoles. In fasta format, sequence lines may be any length and newlines within the sequence are irrelevant; they are typically wrapped at 70 characters. If you input a genome with 70-character wrapping and output it with 100-character wrapping, that is still the same genome, and no correctly-written program will differentiate between them. Fastq is much more convenient because newlines actually have a meaning.
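
    To make the wrapping point concrete, here is a small sketch of a fasta reader that joins the sequence lines and a writer that wraps at any width. `read_fasta` and `write_fasta` are hypothetical names, and the record is invented; the point is only that both wrappings parse back to the same data.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; line breaks inside a sequence are dropped."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []    # the name is everything after ">"
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_fasta(records, width):
    """Format (name, sequence) pairs as fasta text wrapped at `width` characters."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


records = [("chr_demo some description", "ACGT" * 50)]
wrapped_70 = write_fasta(records, 70)
wrapped_100 = write_fasta(records, 100)

assert wrapped_70 != wrapped_100                               # different text on disk
assert list(read_fasta(io.StringIO(wrapped_70))) == records    # same genome
assert list(read_fasta(io.StringIO(wrapped_100))) == records
```

    A compressor could therefore drop the line breaks and store just the wrap width (if every line but the last has the same length) in order to reproduce the original file byte for byte.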

    Oh, and "N" means unknown. If you only care about compressing actual genomes as tightly as possible, you can just handle capital ACGTN, but you still must handle the names (in fasta, that's everything from the ">" to the next newline).
    Last edited by Brian Bushnell; 12-01-2014, 06:38 PM.



    • #3
      Thank you for the very detailed response, Brian. My background is in computer science, so I have very much been flying in the dark here. I think I could make the program detect lower and upper case and preserve the formatting without much issue, but mixed case is also going to significantly degrade compression, so I will have to come up with a good solution. At any rate, my current progress is 92% compression, which is not the best, and I can compress a 250 MB chromosome in 9 seconds, which is also not the best. Thanks again for the much-needed help.
