This topic is closed.
-
That would be a great new feature Simon! There is a list of adapters at;
There are some datasets on the Illumina website that you can use as a test.
Especially the Nextera adapter would be nice to include.
-
Originally posted by boetsie:
That would be a great new feature Simon! There is a list of adapters at;
There are some datasets on the Illumina website that you can use as a test.
Especially the Nextera adapter would be nice to include.
I'll take a look through that list but I think all of the Nextera adapters use the same common core as the bulk of their adapters so would get caught by the sequence we're already using.
Another request - if anyone has any nice examples of datasets heavily contaminated with different adapters and would be willing to run a test version of FastQC on them then it would be nice to get some confirmation that we're catching the cases we're after with this new module.
-
Adapter Contamination Detection
Originally posted by simonandrews:
Thanks for sending that. Is that really an official posting on Illumina's site? They've been so tight over the years about not officially releasing the sequences of their adapters (so we didn't use sequences supplied by Illumina with FastQC for example), and then they go and post them on their website (along with the warning that you shouldn't post these anywhere!).
I'll take a look through that list but I think all of the Nextera adapters use the same common core as the bulk of their adapters so would get caught by the sequence we're already using.
Another request - if anyone has any nice examples of datasets heavily contaminated with different adapters and would be willing to run a test version of FastQC on them then it would be nice to get some confirmation that we're catching the cases we're after with this new module.
This blog post is also of relevance here.
I am assembling a bacterial genome.
Library details: Illumina MiSeq (from a commercial sequencing provider)
Paired-end library:
150 bp read length
450 bp fragment length
Mate-pair library:
250 bp read length
300-1200 bp (average 700 bp) fragment length
I ran FastQC (with -k 10) on the mate-pair data, both untrimmed and trimmed (using Trimmomatic with Nextera adapters).
The FastQC k-mer profile plots for the untrimmed data:
[plot: Untrimmed Read 1]
[plot: Untrimmed Read 2]
The FastQC k-mer profile plots for the trimmed data
(using Trimmomatic 0.32 with Nextera adapters only):
[plot: Trimmed Read 1]
[plot: Trimmed Read 2]
An interesting observation is that this problem does not appear in the paired-end data for the same sample. In my opinion this might be due to the shorter read length (150 bp) compared with the mate-pair data (250 bp).
Hope this helps.
--
prakhar
-
Ha, I just wanted to post that blog prakhar!
In addition, the datasets from Illumina's BaseSpace are said to be publicly available: https://basespace.illumina.com/home/index
-
Thanks - the blog was really useful and I've added the Nextera transposase sequence as an extra check in the default set. I think the barcode k-mers in that blog are just read-through effects from the same adapters, so they don't need to be considered separately.
I can improve this over time (and of course people can add their own sequences in manually) but I'd like to get as useful a default set as possible when the new version ships.
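As a rough illustration of what such an adapter check could look like, here is a minimal Python sketch that reports, for each read position, the fraction of reads in which a given adapter sequence has already started. This is not FastQC's actual implementation; the function name is hypothetical, and the 12-base sequence used below is only an assumed stand-in for the start of the Nextera transposase sequence, so check it against your own kit documentation before relying on it.

```python
from typing import Iterable, List


def cumulative_adapter_fraction(
    reads: Iterable[str], adapter: str, read_length: int
) -> List[float]:
    """Fraction of reads whose first adapter hit starts at or before each position."""
    first_hits = [0] * read_length
    total = 0
    for seq in reads:
        total += 1
        pos = seq.find(adapter)  # -1 when the adapter is absent
        if 0 <= pos < read_length:
            first_hits[pos] += 1

    fractions = []
    running = 0
    for count in first_hits:
        running += count
        fractions.append(running / total if total else 0.0)
    return fractions


# Assumed adapter fragment, for illustration only.
ADAPTER = "CTGTCTCTTATA"

# Toy reads, all 24 bases long.
reads = [
    "ACGTACGTAC" + ADAPTER + "GG",  # adapter starts at position 10
    "ACGT" + ADAPTER + "ACGTACGT",  # adapter starts at position 4
    "ACGTACGTACGTACGTACGTACGT",     # no adapter
    "TTTTTTTTTTTTTTTTTTTTTTTT",     # no adapter
]

profile = cumulative_adapter_fraction(reads, ADAPTER, 24)
print(profile[4], profile[10], profile[-1])
```

A curve that climbs towards the 3' end is the typical read-through pattern, since short inserts let the sequencer run off the fragment and into the adapter.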
-
Hi there!
Three quick questions....
1. What is the maximum amount of data FastQC can handle?
I am trying to analyze a huge concatenated sample of Illumina data, but it has been stuck at "Starting analysis" for a while now. There is enough RAM, and the server is idle except for FastQC. I also see a running java command in top.
2. Any recommendations for filtering out the bad sequences that FastQC has identified? mothur has filter.seqs; is there something similar for Illumina data?
3. Do the whiskers in the boxplots represent 100% of the data?
Thank you very much!
-
Originally posted by nouse:
Three quick questions....
1. What is the maximum amount of data FastQC can handle?
Originally posted by nouse:
I am trying to analyze a huge concatenated sample of Illumina data, but it has been stuck at "Starting analysis" for a while now. There is enough RAM, and the server is idle except for FastQC. I also see a running java command in top.
Originally posted by nouse:
2. Any recommendations for filtering out the bad sequences that FastQC has identified? mothur has filter.seqs; is there something similar for Illumina data?
Originally posted by nouse:
3. Do the whiskers in the boxplots represent 100% of the data?
-
Thanks for the quick answer.
My 460 million reads were processed overnight without trouble. I was just impatient.
I have paired-end data, and it seems that some of the samples have problematic reverse reads (whiskers going down to Phred < 10 at some positions in some samples). This, however, seems to affect my downstream processing, so I want to get rid of, say, anything that has stretches of low quality over n bases, and anything with more than n ambiguous base calls.
mothur could do that fairly well with filter.seqs, but it's just too slow for my dataset, and I would also need to convert FASTQ to FASTA. SILVA ngs is able to do that too, but it is a web service.
I checked Trim Galore and SolexaQA, but in the end I don't want to trim; I want to reject reads completely. I am a little surprised that those reads could make it out of the HiSeq (I was told the denoising is done internally).
Just to make sure I read the boxplots correctly: the whiskers cover the 75-90% and 10-25% ranges respectively. So 20% of the data, including the outliers, still lies outside the whiskers, correct?
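To make the whisker arithmetic concrete, here is a small Python sketch that computes the five values such a plot would draw for one read position, assuming whiskers at the 10th and 90th percentiles and a box from the 25th to the 75th. The function names and the nearest-rank percentile method are assumptions for illustration and need not match how FastQC calculates its values internally.

```python
from typing import Dict, Sequence


def nearest_rank_percentile(sorted_values: Sequence[int], percent: int) -> int:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = -(-percent * len(sorted_values) // 100)  # ceiling division
    return sorted_values[max(rank, 1) - 1]


def boxplot_summary(qualities: Sequence[int]) -> Dict[str, int]:
    """Whisker, box and median values for the quality scores at one position."""
    ordered = sorted(qualities)
    return {
        "lower_whisker": nearest_rank_percentile(ordered, 10),
        "box_bottom": nearest_rank_percentile(ordered, 25),
        "median": nearest_rank_percentile(ordered, 50),
        "box_top": nearest_rank_percentile(ordered, 75),
        "upper_whisker": nearest_rank_percentile(ordered, 90),
    }


# Toy data: 100 reads at one position, 10% of them with a very low score.
qualities = [2] * 10 + [20] * 15 + [30] * 25 + [35] * 25 + [38] * 15 + [40] * 10

summary = boxplot_summary(qualities)
print(summary)
```

With whiskers at 10% and 90%, roughly a tenth of the reads can sit below the lower whisker and roughly a tenth above the upper one, which is where the 20% figure comes from.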
-
You can do this sort of filtering with Trim Galore. Basically you tell it to trim based only on quality and then reject anything whose final length is shorter than its starting length. This completely removes any read that had any data trimmed off.
Another thing to remember is that Illumina uses very low quality scores (a Phred score of 2, if I remember correctly) as a flag for calls it doesn't like rather than as a true error probability. This is why you'll often see whiskers on FastQC plots suddenly jump down to very low values; it's not really indicative of a sudden problem, just that some reads have crossed a threshold. There is an option to turn this off in the sequencing pipeline, but I don't think anyone routinely uses it.
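If you would rather reject reads outright with your own script, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: qualities are Phred+33 encoded, every record is exactly four lines, and the function names and thresholds (min_quality, max_low_run, max_ambiguous) are hypothetical placeholders, not options of Trim Galore or any other tool.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def longest_low_quality_run(qual: str, min_quality: int, offset: int = 33) -> int:
    """Length of the longest stretch of consecutive bases below min_quality."""
    longest = current = 0
    for char in qual:
        if ord(char) - offset < min_quality:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def keep_read(seq: str, qual: str, min_quality: int = 10,
              max_low_run: int = 3, max_ambiguous: int = 0) -> bool:
    """Reject reads with too many N calls or too long a low-quality stretch."""
    if seq.upper().count("N") > max_ambiguous:
        return False
    return longest_low_quality_run(qual, min_quality) <= max_low_run


# Toy input: 'I' is Q40 and '#' is Q2 in Phred+33.
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACGTACGT\n+\nII####II\n"   # run of four low-quality bases
    "@r3\nACNTACGT\n+\nIIIIIIII\n"   # one ambiguous base
    "@r4\nACGTACGT\n+\nI#I#I#II\n"   # only isolated low-quality bases
)

kept = [header for header, seq, qual in read_fastq(example) if keep_read(seq, qual)]
print(kept)
```

For paired-end data the same decision has to be applied to both mates together, otherwise the two files fall out of sync.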
-
Hello,
I really like this application, and have used it successfully on several files, but now I'm trying to compare it to a trimmed file, and the trimmed file gives this exception:
Exception in thread "Thread-4" java.lang.NullPointerException
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:141)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:105)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
at java.lang.Thread.run(Unknown Source)
Has anyone encountered this before or know of any possible solution?
I am using data from IlluminaBodyMap2.
I trimmed it with Trimmomatic using these options: -phred33,
ILLUMINACLIP:/home/Trimmomatic-0.30/adapters/TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:20
The FastQC application is the latest Windows version. (I just keep transferring the files to a Linux VM.)
Thanks to anyone who can help. The run refuses to complete because of the exception, but it reads 1362859 sequences before stopping.
-
Hi Susanna - that error indicates that your FASTQ file stopped in the middle of an entry (each entry is 4 lines long), which suggests that the file has been truncated. There will be a nicer error message in the next release, but it will still mean that you've lost some data during one of your transfers, and you'll need to go back to the original source to make sure you have the rest of the file. It's a good idea to check that the file sizes match after you've downloaded a file and, if possible, to check the md5sums of the downloaded files so you know you have the same data.
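Both checks are easy to script. The Python sketch below counts the lines in an uncompressed FASTQ file to see whether it ends partway through a four-line entry, and computes an MD5 checksum to compare against the copy at the source. The helper names are hypothetical, and the temporary files exist only to demonstrate the two cases.

```python
import hashlib
import os
import tempfile
from typing import Tuple


def md5sum(path: str, chunk_size: int = 1 << 20) -> str:
    """MD5 hex digest of a file, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def count_complete_records(path: str) -> Tuple[int, int]:
    """Return (complete four-line records, leftover lines) for a FASTQ file."""
    lines = 0
    with open(path, "rb") as handle:
        for _ in handle:
            lines += 1
    return divmod(lines, 4)


# Demonstration with one intact file and one cut off in the middle of an entry.
record = "@r1\nACGT\n+\nIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, "good.fastq")
    bad = os.path.join(tmp, "truncated.fastq")
    with open(good, "w") as handle:
        handle.write(record * 3)
    with open(bad, "w") as handle:
        handle.write(record * 2 + "@r3\nACGT\n")

    good_result = count_complete_records(good)
    bad_result = count_complete_records(bad)
    checksums_match = md5sum(good) == md5sum(bad)

print(good_result, bad_result, checksums_match)
```

A non-zero leftover count means the file stops inside an entry. A file cut exactly at an entry boundary would pass the line count, which is why comparing checksums against the source is the more reliable test.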