A strange size difference of fastq file

qinhao13

Junior Member

Join Date: Jan 2015

Posts: 1
#1


01-06-2015, 01:47 AM

Hi, I'm currently working with Illumina sequencing data in fastq format. I downloaded it from a publicly available database (TCGA) as a zipped file. After unzipping and trimming, the file is about 16G. Here is the interesting part: after I copied this file to another partition, the new copy was only 7.6G. The number of lines, the number of reads and the read length distribution are the same in the two files, so I guess they have the same content and the new copy is not truncated.

Moreover, when I run Tophat2/Cufflinks on the 16G copy, it takes much longer to finish and the result looks strange, whereas everything is quite normal with the 7.6G copy. This might not be a bioinformatics question, but it's quite interesting. What happened to the file? What might account for the additional size?

Thanks a lot.
dariober

Senior Member

Join Date: May 2010

Posts: 311
#2

01-06-2015, 02:56 AM

I can't tell... But one thing you can try to get some hints is:

Code:

cat -vet my_strange_reads.fq | less

This will show you non-printable characters in the file. In a typical fastq file you shouldn't see anything beyond the usual alphanumeric characters and some metacharacters in the read names, apart from the $ that cat -e appends at each line end.
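
Paging through a 16G file by eye can take a while, so a count of the unexpected bytes may be a quicker first hint. The following is only a minimal sketch: count_nonprintable is a hypothetical helper name, and it assumes that tab, newline and printable ASCII are the only bytes a fastq file should contain.

```shell
# Hypothetical helper (not part of any tool): count the bytes in a file that
# are NOT tab, newline or printable ASCII. Assumption: a plain fastq file
# should contain nothing else, so a large count points at stray bytes such as
# NULs (shown as ^@ by cat -v) or carriage returns (shown as ^M).
count_nonprintable() {
    # tr -d deletes the expected bytes (octal 011 = tab, 012 = newline,
    # 040-176 = printable ASCII); wc -c counts whatever is left over.
    LC_ALL=C tr -d '\011\012\040-\176' < "$1" | wc -c | tr -d ' '
}
```

For example, count_nonprintable my_strange_reads.fq prints 0 for a file that contains only the expected characters.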

In practice, I would download the file again in case something got corrupted in the process.
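
Since both copies are still on disk, a byte-level comparison could also show whether they really hold the same content, which matching line and read counts alone cannot prove. This is a rough sketch rather than a diagnosis: compare_copies is a hypothetical helper name, and the file names in the usage note are placeholders.

```shell
# Hypothetical helper: print the byte count of each copy, then say whether
# the two files are byte-for-byte identical.
compare_copies() {
    # wc -c reads the file contents, so it reports the real number of bytes
    a_bytes=$(wc -c < "$1" | tr -d ' ')
    b_bytes=$(wc -c < "$2" | tr -d ' ')
    echo "bytes: $a_bytes vs $b_bytes"
    # cmp -s is silent and exits 0 only when the files are identical
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

Usage would look like compare_copies reads_16G.fq reads_7.6G.fq. If the result is "different" even though the line counts match, the extra bytes are probably inside the lines themselves, which the cat -vet output should make visible.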