Hi Gabriel,
Thank you so much. It worked for me.
-
Dear Gabriel,
very interesting post. I would like to know if you have a list of the Illumina adapters and the control sequences as well, to use as the adapters.fa file. I cannot find them anywhere.
Thanks a lot,
nike00
-
Hi gauravdube:
I found and used tools from the BBMap package. Brian helped me out by guiding me on how to use the bbduk tool.
I used the following command line: bbduk.sh -Xmx4g -in=(file).fastq.gz -in2=(file).fastq.gz ref=adapters.fa -out=out1.fastq -out2=out2.fastq
The adapters file has all the adapters that I could find for Illumina platforms, including the control sequences from the libraries, in FASTA format. That worked for me; hopefully it will work for you too!
Best regards,
Gabriel
-
Hi gab0,
I am facing exactly the same k-mer content issue, so I didn't create a separate thread when I came across yours. My question to you is: which tool did you use to retain the valid duplicate reads and remove only the control reads? Thanks in advance.
-
Thanks for your help! So apart from the Kmer problem, the files look OK for downstream analysis.
Well, I've found and (partially) fixed the Kmer problem, so here I'll write up how I solved it:
When checking the files with FastQC v0.11.2, I saw this strange Kmer pattern. Looking at the Kmers, I figured out that they were displaced by 1 bp, so I started to assemble the Kmer sequence (just by eye). Then, searching for the Kmer pattern with grep, I found that there were some repeated sequences/reads, like this one:
"ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA"
Looking further, I found a variant of this read, like this one:
"AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT"
As you can see, the variant is shifted by 3 bp: it lacks the first 3 bases at the 5' end and has 3 extra bases at the 3' end.
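For anyone who wants to check this kind of shift without comparing reads by eye, here is a minimal Python sketch. It is only an illustration: `shift_between` and the `min_overlap` parameter are hypothetical names, and it assumes the two reads match exactly over their overlap (no sequencing errors).

```python
def shift_between(read_a, read_b, min_overlap=20):
    """Return the smallest shift s such that read_b starts s bases into read_a.

    Assumes an exact match over the overlapping part; returns None when no
    shift leaves at least `min_overlap` matching bases.
    """
    for s in range(len(read_a) - min_overlap + 1):
        tail = read_a[s:]
        n = min(len(tail), len(read_b))
        if n >= min_overlap and tail[:n] == read_b[:n]:
            return s
    return None


# The two reads quoted in this post.
read_a = "ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA"
read_b = "AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT"

shift = shift_between(read_a, read_b)
print(shift)  # prints 3
```

Reads that are shifted copies of one long sequence, rather than exact duplicates, are what you would expect if they come from a single spiked-in control fragment.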
Searching the web again, I found a document from Illumina, the Illumina Customer Sequence Letter. There I found some sequences that matched my reads, listed as: "Process Controls for TruSeq® Sample Preparation Kits Included in TruSeq DNA and RNA (v1/v2/LT/HT) and TruSeq Exome Kits"
So it seems that these reads came in as part of the library control, and they were not filtered by the sequencing facility.
I tested a couple of tools for filtering out the repeated reads. I used fastx_collapser, but it turns out that it produces FASTA files as output, not FASTQ files. Then I tested fastq-mcf, which filtered out the repeated reads: both the legitimate repeated reads and the control library reads.
After filtering out the repeated reads, I now had FASTQ files without Kmer warnings. Yoo-hoo!
Now I have to search for another tool that removes only the control reads while keeping the valid duplicate reads. I was thinking of using prinseq to remove these reads.
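To make the goal concrete (drop only reads that look like the control sequences, keep genuine duplicates), here is a rough Python sketch of reference k-mer filtering. It is not how any particular tool is implemented: the function names, the choice of k=25, and the made-up genomic read are all assumptions for illustration, and it handles a single uncompressed FASTQ stream with strict 4-line records rather than paired files.

```python
import io

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def build_kmer_set(references, k):
    """Collect every k-mer from the references and their reverse complements."""
    kmers = set()
    for ref in references:
        for strand in (ref, revcomp(ref)):
            for i in range(len(strand) - k + 1):
                kmers.add(strand[i:i + k])
    return kmers


def read_fastq(handle):
    """Yield (header, seq, plus, qual) from a strict 4-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def filter_control_reads(handle, out, control_kmers, k):
    """Write reads that share no k-mer with the controls; return (kept, removed)."""
    kept = removed = 0
    for header, seq, plus, qual in read_fastq(handle):
        if any(seq[i:i + k] in control_kmers for i in range(len(seq) - k + 1)):
            removed += 1
            continue
        out.write(f"{header}\n{seq}\n{plus}\n{qual}\n")
        kept += 1
    return kept, removed


# Control sequence: the repeated read quoted earlier in this post.
control = "ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA"
# The shifted variant of it, also quoted earlier.
variant = "AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT"
# Made-up "genomic" read, present twice to stand in for a valid duplicate.
genomic = "TTGACCATAGCTAGGCTAACGTTAGCATCGATCGGATTACA"

records = [("@r1", genomic), ("@r2", genomic), ("@r3", variant)]
fastq_text = "".join(f"{name}\n{seq}\n+\n{'I' * len(seq)}\n" for name, seq in records)

k = 25
control_kmers = build_kmer_set([control], k)
out = io.StringIO()
kept, removed = filter_control_reads(io.StringIO(fastq_text), out, control_kmers, k)
print(kept, removed)
```

Because the decision is made per read against a reference, the two identical genomic reads both survive, which is the difference from a duplicate-collapsing tool. For paired-end data, both mates of a pair would need to be dropped together to keep the files in sync.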
Thanks for your help!
-
Apart from Kmer content, every parameter looks fine in the FastQC report. The number of over-represented Kmers is low (although that is unusual to see in balanced genomes) and I do not think it should be of any concern. The over-represented Kmers could come from duplicate reads (there is a small bump in % total sequences in the duplication plot at the >10 level), which can be checked by removing duplicates and running FastQC again, or they could result from bias in at least one step of library prep due to the AT-rich nature of the genome. Whether duplicates should be removed or not depends, I think, on the downstream application, and I will let a bioinformatician comment on that.
-
Hi nucacidhunter:
Thanks for replying. I'll answer by quoting what you posted.
Originally posted by nucacidhunter:
What kit was used for library prep
I asked them to sequence my library on an Illumina HiSeq 2000 machine, in paired-end runs (2x100 bp). As I found out from the index and adapter sequences that were sent to me after I received my reads, they did multiplexing.
Originally posted by nucacidhunter:
and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters.
TruSeq Universal Adapter
5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
TruSeq Adapter, Index 5
5' GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG
Attached to the post are the plots for both read files. I have uploaded the plots for forward and reverse files (2nd plot of each category would be the reverse plot).
Finally the kmer content
These files should let you download the full FastQC report (Ver 0.11.2) in case you want to see it
Thank you very much,
Gabriel
Last edited by gab0; 08-07-2014, 07:20 AM.
-
What kit was used for library prep, and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters?
-
weird kmer content in 5' end from genomic DNA PE reads
Hello
My name is Gabriel. I have asked this previously in the Illumina subforum but it seems that my post belongs here.
I'm writing because I'm analyzing Illumina reads (generated on a HiSeq 2000) from the genome of a particular insect species. The sequencing facility gave me the FASTQ files without adapters, but when checking the filtered FASTQ files with the latest FastQC version (v0.11.2) I am seeing a weird Kmer pattern in the 5' region. It seems that a particular sequence is overrepresented, but the overrepresented sequences module does not show anything weird.
Also, the overrepresented Kmers seem to have a strong bias towards GC (e.g. GGCCCGG, GCCCGGG and so on). I've also managed to overlap the Kmers into this sequence: CTAGTATGGCCCGGGGGATCC, but so far I've not been able to find anything related to it. I'm concerned about whether it is OK to just trim this sequence, as I don't know what this particular pattern means. This sequence is present in both paired-end files, and FastQC shows the Kmer content peak in the 5' end of both files.
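In case it helps others reproduce the overlap step, here is a minimal Python sketch of chaining Kmers that are shifted by one base. The function name `assemble_kmers` is hypothetical, and the Kmer list in the example is regenerated from the assembled sequence above rather than copied from the FastQC report. It assumes the Kmers form a single unbranched chain.

```python
def assemble_kmers(kmers):
    """Chain k-mers that overlap by k-1 bases into a single sequence.

    Assumes all k-mers have the same length (k >= 2) and form one
    unbranched path; raises ValueError otherwise.
    """
    kmers = list(dict.fromkeys(kmers))  # drop exact duplicates, keep order
    k = len(kmers[0])
    if k < 2 or any(len(km) != k for km in kmers):
        raise ValueError("k-mers must share one length of at least 2")

    # Index each k-mer by its first k-1 bases.
    by_prefix = {}
    for km in kmers:
        if km[:-1] in by_prefix:
            raise ValueError("branching path at prefix " + km[:-1])
        by_prefix[km[:-1]] = km

    # The starting k-mer is the one whose prefix is nobody's suffix.
    suffixes = {km[1:] for km in kmers}
    starts = [km for km in kmers if km[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("expected exactly one starting k-mer")

    contig = starts[0]
    seen = {contig}
    while True:
        nxt = by_prefix.get(contig[-(k - 1):])
        if nxt is None or nxt in seen:
            break
        contig += nxt[-1]  # each step extends the sequence by one base
        seen.add(nxt)
    return contig


target = "CTAGTATGGCCCGGGGGATCC"
kmers = [target[i:i + 7] for i in range(len(target) - 6)]
kmers.reverse()  # the input order does not matter
assembled = assemble_kmers(kmers)
print(assembled)
```

With real FastQC output only some of the Kmers are reported, so the chain may break into several pieces; this sketch would then raise an error instead of guessing.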
When searching for this pattern with grep in my files, I noticed that several reads seem to be duplicated, as the read sequence is exactly the same. I don't know if these duplicated reads should be removed or left.
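A small Python sketch of the same check, for readers who want counts rather than raw grep output. It looks only at the sequence line of each FASTQ record and tallies identical reads that contain the pattern. The helper name `count_pattern_reads` and the toy reads are made up for illustration, and it assumes strict 4-line FASTQ records.

```python
import io
from collections import Counter


def count_pattern_reads(handle, pattern):
    """Count identical read sequences that contain `pattern`.

    Assumes strict 4-line FASTQ records, so the sequence is always the
    second line of each record.
    """
    counts = Counter()
    for line_no, line in enumerate(handle):
        if line_no % 4 == 1:
            seq = line.strip()
            if pattern in seq:
                counts[seq] += 1
    return counts


pattern = "CTAGTATGGCCCGGGGGATCC"
# Toy reads: two identical reads with the pattern, one different read
# with the pattern, and one read without it.
toy_reads = [
    "A" + pattern + "TACG",
    "A" + pattern + "TACG",
    "GG" + pattern,
    "ACGTACGTACGT",
]
fastq_text = "".join(
    f"@read{i}\n{seq}\n+\n{'I' * len(seq)}\n" for i, seq in enumerate(toy_reads)
)

counts = count_pattern_reads(io.StringIO(fastq_text), pattern)
for seq, n in counts.most_common():
    print(n, seq)
```

Sorting by count makes it easy to see whether the hits are a few sequences repeated many times or many distinct reads that merely share the pattern.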
So far, during my web search, I've only seen similar Kmer patterns in RNA-seq data, but that is not my case. Also, the "bad sequence" example from the FastQC webpage shows a similar pattern, but in the 3' end, not in the 5' region as in my scenario.
It is worth noting that I have Paired end (2x100) files, and both files (1 and 2) have the same pattern.
I have attached the Kmer module graphs in these links:
I can add more information if needed.
Thank you very much (and sorry for my English :P)