**Brian Bushnell** · 12-01-2014, 06:20 PM

Sounds exciting!

Lower-case letters mean "masked", usually implying they are repetitive, or low-confidence; the exact meaning is application-specific. Typically, programs will either ignore lower-case letters (convert them to N) or make them upper-case and use them like all of the other upper-case letters. This information must be preserved; however, it's not relevant to most applications. An official genome of an organism is all upper-case; the ones with lower-case letters are processed in a specific way for some specific application.
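To make the two usual treatments of masked bases concrete, here is a minimal Python sketch. The function name `normalize_case` and its `mode` values are made up for illustration; they do not come from any particular tool.

```python
def normalize_case(seq: str, mode: str = "upper") -> str:
    """Handle soft-masked (lower-case) bases in one of the two usual ways.

    mode="upper": treat masked bases like any other base.
    mode="mask":  ignore masked bases by turning each one into N.
    """
    if mode == "upper":
        return seq.upper()
    if mode == "mask":
        # Replace every lower-case letter with N; leave everything else as-is.
        return "".join("N" if c.islower() else c for c in seq)
    raise ValueError(f"unknown mode: {mode}")


print(normalize_case("ACgtNa", "upper"))
print(normalize_case("ACgtNa", "mask"))
```

Either way the masking is lost, so a compressor that wants to be lossless has to store the case information somewhere else.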

Also, raw reads, which are much more interesting from a compression standpoint (since they amount to hundreds of thousands of times as much data as genomes), do not have lower-case letters. So even a case-insensitive compression program would still be useful, though obviously less likely to catch on. I suggest you design it to handle all-upper-case ACGTN efficiently, and be capable of handling other things without regard to efficiency. Or have an option to convert lower-case to N, for example. There are also other degenerate bases to watch for (IUPAC codes such as R, Y, and K).
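One possible way to "handle all-upper-case ACGTN efficiently" is to pack A, C, G, and T into 2 bits each and keep the positions of N runs in a side list. This is only a sketch under that assumption, not a description of any existing compressor; `pack`, `unpack`, and the run-list layout are invented for the example, and anything outside ACGTN is rejected rather than handled.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack an upper-case ACGTN string into (bytes, length, n_runs).

    Each base takes 2 bits. N has no 2-bit code, so it is stored as A and
    its position is recorded in n_runs as [start, run_length] pairs.
    """
    out = bytearray((len(seq) + 3) // 4)
    n_runs = []
    for i, c in enumerate(seq):
        if c == "N":
            if n_runs and n_runs[-1][0] + n_runs[-1][1] == i:
                n_runs[-1][1] += 1      # extend the current run of N
            else:
                n_runs.append([i, 1])   # start a new run of N
            code = 0
        else:
            code = CODE[c]              # KeyError for anything outside ACGTN
        out[i >> 2] |= code << ((i & 3) * 2)
    return bytes(out), len(seq), n_runs


def unpack(data, length, n_runs):
    """Inverse of pack()."""
    seq = [BASES[(data[i >> 2] >> ((i & 3) * 2)) & 3] for i in range(length)]
    for start, run in n_runs:
        seq[start:start + run] = "N" * run
    return "".join(seq)


packed, length, n_runs = pack("ACGTNNNACGTTN")
assert unpack(packed, length, n_runs) == "ACGTNNNACGTTN"
```

A real tool would put an entropy coder on top of this; the point is only that the common case costs 2 bits per base while N stays cheap because it tends to come in long runs.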

However!

There are 2 other components - names and quality scores. Genomes normally don't have quality scores, but reads do (see the fastq format). And names can essentially contain anything other than newlines. So, compression of fasta files (only names and sequence) is dominated by the sequence, while compression of fastq files is usually dominated by qualities and names.
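Since the three components behave so differently, a compressor will probably want to separate them before doing anything else. Here is a rough sketch that splits a fastq file into name, sequence, and quality streams. It assumes the common 4-line record layout (`@name`, sequence, `+`, qualities) with no line wrapping; `read_fastq` and `split_streams` are hypothetical names.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, qualities) from a 4-line-per-record fastq file."""
    while True:
        header = handle.readline()
        if not header:
            return                      # end of file
        seq = handle.readline().rstrip("\r\n")
        handle.readline()               # the '+' separator line
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or len(seq) != len(qual):
            raise ValueError("malformed fastq record")
        yield header[1:].rstrip("\r\n"), seq, qual


def split_streams(handle):
    """Collect names, bases, and qualities into three separate lists."""
    names, seqs, quals = [], [], []
    for name, seq, qual in read_fastq(handle):
        names.append(name)
        seqs.append(seq)
        quals.append(qual)
    return names, seqs, quals


example = io.StringIO("@read1\nACGTN\n+\nIIII#\n@read2\nGGCA\n+\nFFF:\n")
names, seqs, quals = split_streams(example)
```

Each of the three lists can then be handed to a model suited to it, since names, bases, and qualities have very different statistics.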

As for the number of letters before a newline - that's legacy stuff, probably for Fortran and fixed-width consoles. In fasta format, lines may be any length and newlines are irrelevant; they are typically wrapped at 70 characters. If you input a genome with 70-character wrapping and output it with 100-character wrapping, that is still the same genome, and no correctly-written program will differentiate between them. Fastq is much more convenient because newlines actually have a meaning.
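To illustrate that wrapping carries no information, here is a small sketch of a fasta reader that joins wrapped lines, plus a writer with an adjustable width. The names `read_fasta` and `write_fasta` are hypothetical, and the record used is toy data.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; line wrapping is discarded."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []  # name = everything after ">"
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def write_fasta(records, width=70):
    """Return fasta text with sequences wrapped at `width` characters."""
    lines = []
    for name, seq in records:
        lines.append(">" + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


records = [("chr_toy some description", "ACGT" * 50)]
wrapped_70 = write_fasta(records, 70)
wrapped_100 = write_fasta(records, 100)

# The text differs, but the parsed genome is identical.
assert wrapped_70 != wrapped_100
assert list(read_fasta(io.StringIO(wrapped_70))) == list(read_fasta(io.StringIO(wrapped_100)))
```

For a compressor this means the line width can be stored once (or dropped entirely) instead of encoding every newline.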

Oh, and "N" means unknown. If you only care about compressing actual genomes as tightly as possible, you can just handle capital ACGTN, but you still must handle the names (in fasta, that's everything from the ">" to the next newline).

**AKatawazi** · 12-01-2014, 07:41 PM

Thank you for the very detailed response, Brian. My background is in computer science, so I have been very much flying in the dark here. I think I could make the program detect the use of lower and upper case and preserve the formatting without much issue, but mixing lower case and upper case is also going to significantly degrade compression, so I'm going to have to come up with a good solution. At any rate, my current progress is 92% compression, which is not the best, and I am able to compress a 250MB chromosome in 9 seconds, which is also not the best. Thanks again for the much-needed help.
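One possible approach to the case problem (a suggestion only, not something either poster describes) is to strip the case into a separate run-length list, so the sequence stream stays all upper-case. Masked regions tend to be long runs, so the list is usually short. `split_case` and `restore_case` are hypothetical names for this sketch.

```python
def split_case(seq):
    """Return (upper-case sequence, run lengths).

    The run lengths alternate upper, lower, upper, ... and always start
    with an upper-case run, which may have length 0.
    """
    runs = []
    lower = False
    count = 0
    for c in seq:
        is_lower = c.islower()
        if is_lower != lower:
            runs.append(count)          # close the current run
            lower, count = is_lower, 0
        count += 1
    runs.append(count)
    return seq.upper(), runs


def restore_case(upper_seq, runs):
    """Inverse of split_case()."""
    out, pos, lower = [], 0, False
    for n in runs:
        piece = upper_seq[pos:pos + n]
        out.append(piece.lower() if lower else piece)
        pos += n
        lower = not lower
    return "".join(out)


original = "ACGTacgtnnNNAC"
upper_seq, runs = split_case(original)
assert restore_case(upper_seq, runs) == original
```

The upper-case stream can then go through whatever ACGTN coder is used, and the run list is compressed on its own.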


Question on DNA Formatting
