Hello
My name is Gabriel. I have asked this previously in the Illumina subforum but it seems that my post belongs here.
I'm writing because I'm analyzing Illumina reads (generated on a HiSeq 2000) from the genome of a particular insect species. The sequencing facility gave me the FASTQ files with the adapters already removed, but when I check the filtered FASTQ files with the latest FastQC version (v0.11.2) I see a weird k-mer pattern at the 5' end. It seems that a particular sequence is overrepresented, yet the Overrepresented Sequences module does not show anything unusual.
Also, the overrepresented k-mers seem to have a strong GC bias (e.g. GGCCCGG, GCCCGGG and so on). I've managed to overlap the k-mers into the sequence CTAGTATGGCCCGGGGGATCC, but so far I have not been able to find anything related to it. I'm concerned about whether it is OK to just trim this sequence, as I don't know what this pattern means. The sequence is present in both paired-end files, and FastQC shows the k-mer content peak at the 5' end of both files.
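In case it helps, here is roughly how the overlapping of k-mers can be sketched in Python. This is a minimal greedy merge, not what FastQC does internally; the function names (`overlap`, `greedy_merge`) and the `min_len` threshold are my own choices for illustration, and a greedy merge can give wrong results when the k-mers contain repeats.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_merge(kmers, min_len=3):
    """Repeatedly merge the pair of sequences with the largest overlap.

    Returns the list of sequences left when no pair overlaps by min_len or more.
    """
    seqs = list(dict.fromkeys(kmers))  # drop exact duplicates, keep order
    while len(seqs) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing left to merge
        merged = seqs[best_i] + seqs[best_j][best_n:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)
    return seqs
```

For example, `greedy_merge(["GGCCCGG", "GCCCGGG"])` joins the two k-mers listed above through their 6-base overlap.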
When searching for this pattern with grep in my files, I noticed that several reads seem to be duplicated, as the read sequence is identical. I don't know whether these duplicated reads should be removed or kept.
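The same check can also be sketched in Python instead of grep, which makes it easier to see where in the read the pattern starts and how many reads are exact duplicates. This is only a sketch: it assumes plain, uncompressed FASTQ with exactly four lines per record, and the names `read_sequences`, `summarize` and `MOTIF` are mine.

```python
from collections import Counter

MOTIF = "GGCCCGG"  # one of the k-mers flagged by FastQC


def read_sequences(handle):
    """Yield the sequence line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record is the sequence
            yield line.strip()


def summarize(handle, motif=MOTIF):
    """Count motif start positions and exact duplicate reads among motif-containing reads."""
    positions = Counter()   # 0-based start of the first motif hit -> number of reads
    seq_counts = Counter()  # read sequence -> number of times it was seen
    for seq in read_sequences(handle):
        pos = seq.find(motif)
        if pos != -1:
            positions[pos] += 1
            seq_counts[seq] += 1
    duplicated = {s: n for s, n in seq_counts.items() if n > 1}
    return positions, duplicated
```

To use it, open one of the FASTQ files and pass the file handle to `summarize`. If most hits start at the same position near the start of the read, that agrees with the 5' peak that FastQC reports.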
So far in my web search, I've only seen similar k-mer patterns in RNA-seq data, which is not my case. Also, the "bad sequence" example on the FastQC webpage shows a similar pattern, but at the 3' end, not at the 5' end as in my data.
It is worth noting that I have paired-end (2x100) files, and both files (1 and 2) show the same pattern.
I have attached the Kmer module graphs in these links:
I can add more information if needed.
Thank you very much, (and sorry for my english :P)