**nucacidhunter** · 08-05-2014, 01:43 AM

What kit was used for library prep and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters.

**gab0** · 08-05-2014, 07:45 AM

Hi nucacidhunter:

Thanks for replying. I'll answer by quoting what you posted.

Originally posted by nucacidhunter

What kit was used for library prep

I sent the samples to an external facility and I don't know which kit they used, so I'll find out ASAP.

I asked them to sequence my library on an Illumina HiSeq 2000 in paired-end mode (2x100 bp). Judging by the index in my reads and the adapter sequences they sent me later, they multiplexed the samples.

Originally posted by nucacidhunter

and could you post FastQC plots for per sequence GC content, sequence duplication levels and Illumina adapters.

They did tell me which adapters were used (when asked!), which would be these:

TruSeq Universal Adapter

5' AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT

TruSeq Adapter, Index 5

5' GATCGGAAGAGCACACGTCTGAACTCCAGTCACACAGTGATCTCGTATGCCGTCTTCTGCTTG

Attached to the post are the plots for both read files, forward and reverse (in each category, the second plot is the reverse one).

Finally, the kmer content.

These links should let you download the full FastQC reports (v0.11.2) in case you want to see them:

https://dl.dropboxusercontent.com/u/9000360/insect-1_fastqc.zip

https://dl.dropboxusercontent.com/u/9000360/insect-2_fastqc.zip

Thank you very much,

Gabriel

**nucacidhunter** · 08-05-2014, 04:34 PM

Apart from Kmer content, every parameter looks fine in the FastQC report. The number of over-represented Kmers is low (although it is unusual to see them in balanced genomes) and I do not think it should be of any concern. The over-represented Kmers could come from duplicate reads (there is a small bump in % total sequences in the duplication plot at >10); this can be checked by removing duplicates and running FastQC again. Alternatively, they could be the result of bias in at least one step of library prep due to the AT-rich nature of the genome. Whether duplicates should be removed or not depends, I think, on the downstream application, and I will let a bioinformatician comment on that.
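To illustrate the "remove duplicates and run FastQC again" check, here is a minimal Python sketch, not a tool anyone in this thread used. It assumes plain 4-line FASTQ records and treats two reads as duplicates only when their sequences are identical; the function names `read_fastq` and `drop_exact_duplicates` are made up for this example.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:          # end of file
            break
        seq = handle.readline().rstrip("\n")
        handle.readline()       # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def drop_exact_duplicates(records):
    """Keep only the first record seen for each distinct sequence."""
    seen = set()
    for header, seq, qual in records:
        if seq in seen:
            continue
        seen.add(seq)
        yield header, seq, qual

# Tiny in-memory example: r2 is an exact duplicate of r1.
demo = io.StringIO(
    "@r1\nACGT\n+\nIIII\n"
    "@r2\nACGT\n+\nIIII\n"
    "@r3\nTTTT\n+\nIIII\n"
)
kept = list(drop_exact_duplicates(read_fastq(demo)))
print(len(kept), "of 3 reads kept")
```

For paired-end data one would normally deduplicate on the pair (both mates together) rather than on each file separately, so that the two output files stay in sync. The deduplicated output can then be written back as FASTQ and run through FastQC again.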

**gab0** · 08-07-2014, 07:18 AM

Hi

Thanks for your help! So apart from the Kmer problem, the files look OK for downstream analysis.

Well, I've found and (partially) fixed the kmer problem, so I'll write up here how I solved it:

When I checked the files with FastQC v0.11.2, I saw this strange kmer pattern. Looking at the over-represented Kmers, I noticed that each one was offset from the next by 1 bp, so I started to assemble them (just by eye) into a longer sequence. Then, searching the reads for that sequence with grep, I found that there were some repeated sequences/reads, like this one:

"ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA"
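Repeated reads like this can also be found without knowing the sequence in advance, by counting how often each read sequence occurs. This is a rough Python sketch of that idea, not what I actually ran (I used grep); `top_sequences` is a made-up helper name and it assumes plain 4-line FASTQ records.

```python
from collections import Counter

def top_sequences(fastq_lines, n=5):
    """Return the n most frequent read sequences as (sequence, count) pairs.

    fastq_lines is any iterable of lines from a 4-line-per-record FASTQ file;
    the sequence is the second line of every record.
    """
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:
            counts[line.strip()] += 1
    return counts.most_common(n)

# Tiny in-memory example with one sequence occurring twice.
demo = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "ACGT", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
]
for seq, count in top_sequences(demo):
    print(count, seq)

# With a real file (the file name is only a placeholder):
#   import gzip
#   with gzip.open("reads_1.fastq.gz", "rt") as fh:
#       print(top_sequences(fh, n=10))
```

This keeps every distinct sequence in memory, so on a full HiSeq lane it may be worth running it on a subsample first.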

Looking further, I found a variant of this read, like this one:

"AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT"

As you can see, the variant is shifted by 3 bp: it lacks the first 3 bases at the 5' end and has 3 extra bases at the 3' end.

Searching the web again, I found a document from Illumina, the Illumina Customer Sequence Letter. There I found some sequences that matched my reads, listed as: "Process Controls for TruSeq® Sample Preparation Kits Included in TruSeq DNA and RNA (v1/v2/LT/HT) and TruSeq Exome Kits"

So it seems that these reads came in as part of the library control, and they were not filtered out by the sequencing facility.

I tested a couple of tools for removing the repeated reads. I first used fastx_collapser, but it turns out that it produces FASTA files as output, not FASTQ files. Then I tested fastq-mcf, which filtered out all the repeated reads: both the genuine duplicate reads and the control library reads.

After filtering out the repeated reads, I had FASTQ files without kmer warnings. Yoo-hoo!

Now I have to find another tool that removes only the control reads while keeping the valid duplicate reads. I was thinking of using prinseq for this.
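The offset between two such reads can also be checked programmatically rather than by eye. A small Python sketch, using the two reads quoted above; `find_shift` and its `min_overlap` parameter are names invented for this example.

```python
def find_shift(first, second, min_overlap=20):
    """Return the smallest k such that first[k:] is a prefix of second.

    In other words, second is first shifted k bases towards the 3' end.
    Returns None if no shift leaves at least min_overlap matching bases.
    """
    for k in range(len(first) - min_overlap + 1):
        if second.startswith(first[k:]):
            return k
    return None

read_a = ("ACTAGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTT"
          "GCTCTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTA")
read_b = ("AGTATGGCCCGGGGGATCCTACGTTCCAAATGCAGCGAGCTCGTATAACCCTTTAAGAGTTGCT"
          "CTTTTTGTTTGGTAAGTTGCAAATCGAAGTTTTAGAT")

print("shift between the two reads:", find_shift(read_a, read_b))
```

Reads that are shifted copies of each other like this all come from the same underlying template sequence, which is also why the over-represented Kmers in FastQC appeared offset from one another.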

Thanks for your help!

**gauravdube** · 04-07-2015, 05:35 AM

Hi gab0,

I am facing exactly the same k-mer content issue, so I didn't create a separate thread when I came across yours. My question is: which tool did you use to retain the valid duplicate reads and remove only the control reads? Thanks in advance.

**gab0** · 04-07-2015, 06:05 AM

Hi gauravdube:

I found and used tools from the BBMap package. Brian helped me out, guiding me on how to use the bbduk tool.

I used the following command line: bbduk.sh -Xmx4g -in=(file).fastq.gz -in2=(file).fastq.gz ref=adapters.fa -out=out1.fastq -out2=out2.fastq

The adapters file contains all the adapters that I could find for Illumina platforms, including the control sequences from the libraries, in FASTA format. That worked for me; hopefully it will work for you too!
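For anyone curious why this keeps the valid duplicates: the reads are removed because they match a reference sequence, not because they are repeated. Below is a deliberately simplified Python sketch of that idea. It is not BBDuk's actual implementation, and the function names, the toy reads and the very short k are all assumptions made for the example (real reads would call for a much longer k).

```python
def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]

def build_reference_kmers(reference_seqs, k):
    """Collect k-mers from both strands of every reference sequence."""
    ref = set()
    for seq in reference_seqs:
        ref |= kmers(seq, k)
        ref |= kmers(reverse_complement(seq), k)
    return ref

def filter_pairs(pairs, ref_kmers, k):
    """Yield only the pairs in which neither mate shares a k-mer with the reference."""
    for mate1, mate2 in pairs:
        if kmers(mate1, k) & ref_kmers or kmers(mate2, k) & ref_kmers:
            continue  # at least one mate looks like a control/adapter read
        yield mate1, mate2

# Toy data: the "control" is the start of the repeated read from this thread.
control = "ACTAGTATGGCCCGGGGGATCC"
pairs = [
    ("TTACTAGTATGGCCAA", "CATCATCATCATCATC"),  # mate 1 contains control k-mers
    ("GATTACAGATTACAGA", "CATCATCATCATCATC"),  # genuine pair
    ("GATTACAGATTACAGA", "CATCATCATCATCATC"),  # genuine duplicate of the pair above
]
ref = build_reference_kmers([control], k=8)
kept = list(filter_pairs(pairs, ref, k=8))
print(len(kept), "of", len(pairs), "pairs kept")
```

The control pair is dropped, while both copies of the genuine duplicated pair survive, which is the behaviour I wanted and could not get from a duplicate-removal tool.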

Best regards,

Gabriel

**nike00** · 06-12-2015, 06:33 AM

Dear Gabriel,

Very interesting post. I would like to know if you have a list of the Illumina adapters and the control sequences as well, to use as the adapters.fa file. I cannot find them anywhere.

Thanks a lot,
nike00

**NextGenSeq** · 06-12-2015, 07:02 AM

It looks like Nextera bias to me.

**Brian Bushnell** · 06-12-2015, 09:41 AM

If you download the BBMap package, the adapters are in the resources directory - nextera.fa.gz, truseq.fa.gz, and truseq_rna.fa.gz. You can use all of them with the flag "ref=nextera.fa.gz,truseq.fa.gz,truseq_rna.fa.gz" (with the appropriate paths).
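If you are scripting this, the comma-separated flag can be assembled from the directory path. A small Python sketch; the install location `/opt/bbmap/resources` and the read file names are placeholders, and the helper name `bbduk_ref_flag` is made up for this example. It only prints the command and does not run anything.

```python
import shlex
from pathlib import PurePosixPath

ADAPTER_FILES = ("nextera.fa.gz", "truseq.fa.gz", "truseq_rna.fa.gz")

def bbduk_ref_flag(resources_dir, names=ADAPTER_FILES):
    """Build the comma-separated ref= argument from a resources directory."""
    paths = [str(PurePosixPath(resources_dir) / name) for name in names]
    return "ref=" + ",".join(paths)

# Placeholder paths and file names; adjust to your own installation and data.
command = [
    "bbduk.sh", "-Xmx4g",
    "in=reads_1.fastq.gz", "in2=reads_2.fastq.gz",
    bbduk_ref_flag("/opt/bbmap/resources"),
    "out=out1.fastq", "out2=out2.fastq",
]
print(shlex.join(command))
```

To include process-control sequences that are not in those files, a custom FASTA file can be appended to the same comma-separated list.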

**gauravdube** · 10-03-2015, 09:16 AM

Hi Gabriel,

Thank you so much. It worked for me.

**nike00** · 10-04-2015, 02:36 AM

Thank you very much!

weird kmer content in 5' end from genomic DNA PE reads