Hello, this is my first post here, and I am hoping to get some help. I downloaded some genomes from the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genomes/) to review. I am trying to develop a more advanced compression algorithm for DNA, but I noticed some interesting things about the files. Some letters are upper-case while others are not. There also seems to be a line break every so many characters. Is there a reason for this? Also, I am assuming the large runs of N represent no data? I am trying to get the files as small as possible, but I am wondering whether preserving case or line breaks is necessary for the tools that use these files. Thanks in advance.
-
Sounds exciting!
Lower-case letters mean "masked", usually implying they are repetitive or low-confidence; the exact meaning is application-specific. Typically, programs will either ignore lower-case letters (convert them to N) or make them upper-case and use them like all of the other upper-case letters. A lossless compressor must preserve this information, even though it's not relevant to most applications. An official genome of an organism is all upper-case; the ones with lower-case letters are processed in a specific way for some specific application.
Also, raw reads, which are much more interesting from a compression standpoint (since they amount to hundreds of thousands of times as much data as genomes) do not have lower-case letters. Ultimately, if you made a compression program that was case-insensitive, it would still be useful for that reason, though obviously less likely to catch on. I suggest you design it to handle all-upper-case ACGTN efficiently, and be capable of handling other things without regard to efficiency. Or have an option to convert lower-case to N, for example. There are also other degenerate bases to watch for.
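As a rough illustration of "handle all-upper-case ACGT efficiently, and everything else without regard to efficiency", here is a minimal Python sketch. The names `pack` and `unpack` and the exception-list layout are made up for this example and are not taken from any existing tool. It stores upper-case A/C/G/T at 2 bits per base and records every other character (N, lower-case, degenerate codes) separately with its position.

```python
# Hypothetical sketch: 2 bits per base for upper-case A/C/G/T, plus an
# exception list for anything else (N, lower-case, degenerate bases).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Return (length, packed bytes, exceptions) for one sequence string."""
    packed = bytearray((len(seq) + 3) // 4)  # four bases per byte
    exceptions = []  # (position, original character) for non-ACGT symbols
    for i, ch in enumerate(seq):
        code = CODE.get(ch)
        if code is None:
            exceptions.append((i, ch))
            code = 0  # placeholder bits; the exception list overrides them
        packed[i // 4] |= code << (2 * (i % 4))
    return len(seq), bytes(packed), exceptions


def unpack(length, packed, exceptions):
    """Rebuild the original string from the output of pack()."""
    out = [BASES[(packed[i // 4] >> (2 * (i % 4))) & 3] for i in range(length)]
    for pos, ch in exceptions:
        out[pos] = ch
    return "".join(out)
```

A per-position exception list is deliberately naive: it round-trips any input, but it becomes expensive on long runs of N or on heavily masked sequence, where a run-based encoding would be the more sensible choice.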
However!
There are 2 other components - names and quality scores. Genomes normally don't have quality scores, but reads do (see the fastq format). And names can essentially contain anything other than newlines. So, compression of fasta files (only names and sequence) is dominated by the sequence, while compression of fastq files is usually dominated by qualities and names.
As for the number of letters before a newline - that's legacy stuff, probably for Fortran and fixed-width consoles. In fasta format, lines may be any length and newlines are irrelevant; they are typically wrapped at 70 characters. If you input a genome with 70-character wrapping and output it with 100-character wrapping, that is still the same genome, and no correctly-written program will differentiate between them. Fastq is much more convenient because newlines actually have a meaning.
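To make the point about wrapping concrete, here is a small Python sketch. The helper names `read_fasta` and `write_fasta` are hypothetical and written just for this example. The reader drops line breaks inside a sequence, so the same record written at two different widths parses to identical data, which suggests a compressor only needs to store the name and the unwrapped sequence (plus the line width, if byte-for-byte reproduction of the original file is wanted).

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; line breaks inside a sequence are dropped."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []  # name = everything after ">"
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_fasta(records, handle, width=70):
    """Write records with sequence lines wrapped at `width` characters."""
    for name, seq in records:
        handle.write(">" + name + "\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


# Same record, two wrap widths: the text differs, the parsed content does not.
records = [("chr_example some description", "ACGT" * 50)]
wrapped_70, wrapped_100 = io.StringIO(), io.StringIO()
write_fasta(records, wrapped_70, width=70)
write_fasta(records, wrapped_100, width=100)
same = (list(read_fasta(io.StringIO(wrapped_70.getvalue())))
        == list(read_fasta(io.StringIO(wrapped_100.getvalue()))))
```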
Oh, and "N" means unknown. If you only care about compressing actual genomes as tightly as possible, you can just handle capital ACGTN, but you still must handle the names (in fasta, that's everything from the ">" to the next newline).
Last edited by Brian Bushnell; 12-01-2014, 06:38 PM.
-
Thank you for the very detailed response, Brian. My background is in computer science, so I have been very much flying blind here. I think I could make the program detect lower- and upper-case and preserve the formatting without much issue, but mixed case is also going to significantly degrade compression, so I'm going to have to come up with a good solution. At any rate, my current progress is 92% compression, which is not the best, and I am able to compress a 250MB chromosome in 9 seconds, which is also not the best. Thanks again for the much-needed help.
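One possible way to keep case without hurting the sequence stream much is to separate the two: compress an all-upper-case copy of the sequence, and store the case information as a list of alternating run lengths. The Python sketch below uses hypothetical helper names (`split_case`, `restore_case`) and assumes that lower-case regions come in reasonably long stretches, as repeat masking usually produces; with case that flips every few bases, the run list would grow large.

```python
def split_case(seq):
    """Return (upper-case sequence, run lengths).

    Runs alternate upper, lower, upper, ... and always start with an
    upper-case run, which has length 0 if the sequence begins in lower case.
    """
    runs = []
    current_lower = False
    count = 0
    for ch in seq:
        is_lower = ch.islower()
        if is_lower != current_lower:
            runs.append(count)  # close the run that just ended
            current_lower, count = is_lower, 0
        count += 1
    runs.append(count)
    return seq.upper(), runs


def restore_case(upper_seq, runs):
    """Invert split_case(): re-apply lower case to every second run."""
    out, pos, lower = [], 0, False
    for length in runs:
        piece = upper_seq[pos:pos + length]
        out.append(piece.lower() if lower else piece)
        pos += length
        lower = not lower
    return "".join(out)
```

The same run-length idea could be applied to long stretches of N, so that the main stream only ever carries A, C, G, and T.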