Hi all,
I often subsample my fastq files with the unix 'head' command, rather than drawing reads from random positions in the file. My setup is as follows:

Casava output:
Code:
file1.fastq.gz file2.fastq.gz . . file#.fastq.gz

I concatenate these files using the following:
Code:
gunzip -c *.gz | gzip -9 > file.fastq.gz

In this way, the first part of the combined file contains the reads from the original 'file1.fastq.gz'. I then subsample 25,000 reads (100,000 lines, since each fastq record spans four lines) from this file using the following command:
Code:
zcat file.fastq.gz | head -n 100000 | <downstream analysis, e.g. blastn>

My worry is that by doing this I introduce some sort of bias into my analysis, as I am only taking the 'head' of the first part of all my reads. Is this a valid concern? I.e., are the reads in the first part of, say, 'file1.fastq.gz' somehow different from, say, the middle part of 'file4.fastq.gz'?
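For comparison, if I wanted a truly random subsample instead, I believe something like the following one-liner would work; this is just a sketch assuming GNU coreutils ('shuf'), and 'sample.fastq' is a placeholder name:

Code:
# pack each 4-line fastq record onto one tab-separated line,
# pick 25,000 records at random, then unpack back to 4 lines per record
zcat file.fastq.gz | paste - - - - | shuf -n 25000 | tr '\t' '\n' > sample.fastq

A dedicated tool such as seqtk ('seqtk sample -s100 file.fastq.gz 25000') should do the same with a reproducible seed, if it is available.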
Thanks very much in advance