Two peaks on FastQC plot "Per sequence GC content"

Dr khani replied

12-30-2017, 09:27 AM
Thank you. I cannot run the BBMap tools on Windows; I get an error.
luc replied

12-29-2017, 10:06 AM
As mentioned above, the two peaks could very well be a sign of a mixed sample (contamination).
You could remove all the high-GC reads and see if this improves the assembly.
BBTools (BBDuk?) has a GC content filter.
Dr khani replied

12-29-2017, 09:23 AM
My FASTQ GC content report has two peaks. Can anyone help me with how to assemble this type of data?
Attached Files

index.png (90.5 KB, 40 views)
Brian Bushnell replied

10-15-2014, 08:17 AM
Unfortunately, it looks like that tool does not merge reads with insert size shorter than read length, which was the point of the exercise. But from the graph I can infer that maybe 30% of the reads are indeed in that category, so there are a few possibilities:

1) The twin peaks are indeed from exon-capture bias, though I kind of doubt that, as it does not explain why trimming the reads would reduce it; and I would have expected such a bias to shift the peak center rather than creating a bimodal distribution, but of course it depends on the bait design.
2) There is an exonic and intronic peak, or gene and non-gene peak. The GC content of a gene changes markedly once you get just outside of its bounds. For example, just upstream of the gene, it becomes very AT-rich, IIRC. But, I don't really like that explanation either.
3) The adapter-trimming is unsuccessful or incomplete. From your GC content by base position, it looks fairly flat across the read, aside from the first 20 bp... so that doesn't make much sense either. Still, it wouldn't hurt to confirm. What was the total percentage of reads and bases trimmed during adapter-trimming? I would expect something like 30% of the reads and maybe 5-10% of the bases. If you are using Nextera adapters, be sure you use those sequences for trimming.
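
If your trimming tool does not report these numbers, they can be estimated by comparing the reads before and after trimming. This is only a minimal sketch in Python: the function name `trimming_summary` is made up for illustration, and it assumes the trimmer keeps every read and preserves read order, which not every tool does.

```python
from typing import Dict, Sequence


def trimming_summary(raw_seqs: Sequence[str], trimmed_seqs: Sequence[str]) -> Dict[str, float]:
    """Percent of reads shortened and percent of bases removed by trimming.

    Assumption: both inputs hold the same reads in the same order
    (the trimmer shortened reads but did not discard or reorder them).
    """
    if len(raw_seqs) != len(trimmed_seqs):
        raise ValueError("read counts differ; reads were probably discarded")
    if not raw_seqs:
        raise ValueError("no reads given")

    total_bases = sum(len(s) for s in raw_seqs)
    if total_bases == 0:
        raise ValueError("raw reads contain no bases")

    # A read counts as trimmed if it came out shorter than it went in.
    reads_trimmed = sum(
        1 for raw, trimmed in zip(raw_seqs, trimmed_seqs) if len(trimmed) < len(raw)
    )
    bases_removed = total_bases - sum(len(s) for s in trimmed_seqs)

    return {
        "pct_reads_trimmed": 100.0 * reads_trimmed / len(raw_seqs),
        "pct_bases_trimmed": 100.0 * bases_removed / total_bases,
    }
```

Values far below the rough expectations above would suggest that the adapter sequences used for trimming do not match the library.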

I suggest that you bin some of your reads by GC - just split them into pairs with GC<50% and GC>50%. Map both to human and look at the mapping rates (ideally, forcing unclipped global alignments). If they are equivalent, then the issue is not caused by contamination or adapter sequence, and it's probably safe to ignore.

You can split the reads by GC content with my reformat tool:

reformat.sh in1=read1.fq in2=read2.fq out1=low1.fq out2=low2.fq maxgc=0.5

reformat.sh in1=read1.fq in2=read2.fq out1=high1.fq out2=high2.fq mingc=0.5
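
For anyone who cannot run the shell scripts, the same binning idea can be sketched in plain Python. This is not how reformat.sh is implemented: the helper names are made up, the GC fraction is computed over both mates combined (an assumption; the tool may treat pairs differently), and pairs at exactly the threshold go to the high bin.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(lines: Iterable[str]) -> Iterator[Record]:
    """Yield records from 4-line FASTQ text (no multi-line sequences)."""
    it = iter(lines)
    while True:
        header = next(it, "").strip()
        if not header:
            return
        seq = next(it, "").strip()
        next(it, "")  # skip the '+' separator line
        qual = next(it, "").strip()
        yield header, seq, qual


def gc_fraction(seq: str) -> float:
    """GC fraction over called bases only (N and other symbols are ignored)."""
    s = seq.upper()
    called = sum(s.count(b) for b in "ACGT")
    if called == 0:
        return 0.0
    return (s.count("G") + s.count("C")) / called


def split_pairs_by_gc(
    r1_lines: Iterable[str], r2_lines: Iterable[str], threshold: float = 0.5
) -> Tuple[List[Tuple[Record, Record]], List[Tuple[Record, Record]]]:
    """Return (low_gc_pairs, high_gc_pairs), keeping mates together."""
    low, high = [], []
    for rec1, rec2 in zip(read_fastq(r1_lines), read_fastq(r2_lines)):
        # Assumption: bin the pair by the GC fraction of both mates combined.
        gc = gc_fraction(rec1[1] + rec2[1])
        (low if gc < threshold else high).append((rec1, rec2))
    return low, high
```

With real files, pass open file handles, e.g. `split_pairs_by_gc(open("read1.fq"), open("read2.fq"))`, and write each bin back out as two FASTQ files before mapping.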
Khillo81 replied

10-15-2014, 06:27 AM
Thanks for your response. I should first mention that I don't have a very strong background in bioinformatics and am using the CLC Genomics Workbench (ver. 7.5), which has a GUI and runs on Windows. I used the Workbench's 'Merge Overlapping Pairs' function to generate the histogram below (I'm guessing it's similar to the BBMerge tool mentioned by Brian). I also haven't used FastQC, only the native QC check in the Workbench; I'm attaching its output here. As you can see, there is no severe drop in quality along the reads, and apart from the peaks in GC content observed at the end of the read (as I understand it, typical for Illumina data), the GC content along the read length is around 45%. And the samples are human.
Attached Files

Merged pairs length distribution.png (16.7 KB, 121 views)

HT1159_22212-PR1_S2_L001_R1_001 (paired) - graphical QC report.pdf (201.5 KB, 148 views)
nucacidhunter replied

10-14-2014, 11:01 AM
Would you be able to post all of the FastQC output plots for comparison with other runs? For now, I would mention that exome capture does not sample the genome randomly, so it is not unusual to see what you are reporting.
Brian Bushnell replied

10-14-2014, 08:17 AM
This is sometimes a sign of contamination, though if trimming the reads reduces it, that's a bit odd. Is this supposed to be human data? Human should peak around 50%, which does not correspond to either of your peaks. The most important question is what organism this is supposed to be, and what its average GC% is.
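
To compare the peaks against an expected average GC% outside of a QC tool, the per-read GC distribution can be tabulated directly. A rough Python sketch follows; the function names are made up, reads are binned to whole percentages, and FastQC's own calculation may differ in detail.

```python
from typing import Iterable, List


def gc_histogram(seqs: Iterable[str]) -> List[int]:
    """Count reads per whole-percent GC bin (index 0..100).

    Non-ACGT symbols such as N are ignored; reads with no called
    bases are skipped.
    """
    hist = [0] * 101
    for seq in seqs:
        s = seq.upper()
        called = sum(s.count(b) for b in "ACGT")
        if called == 0:
            continue
        pct = 100.0 * (s.count("G") + s.count("C")) / called
        hist[int(round(pct))] += 1
    return hist


def mean_gc(hist: List[int]) -> float:
    """Mean per-read GC percentage, computed from the binned counts."""
    n_reads = sum(hist)
    if n_reads == 0:
        raise ValueError("histogram is empty")
    return sum(pct * count for pct, count in enumerate(hist)) / n_reads
```

Two well-separated humps in `gc_histogram`, neither of them near the expected value for the organism, is the pattern being discussed in this thread.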

Also, please post an insert-size histogram, which will help determine if the problem is caused by short inserts. You can get one quickly using BBMerge:

bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt
Khillo81 replied

10-14-2014, 04:51 AM
Hi!

I have two problems: one is two peaks in the per sequence GC-content and another is a weird profile which I'm attaching here.

We're trying out Agilent's SureSelect enrichment protocol for Exome-Seq and have just concluded our first run on samples that were previously processed using Illumina's Nextera kit (so we have another run with which to compare our results). The first run was sequenced on the Illumina HiSeq while this run was done on a MiSeq. Also, the first run was a 100bp paired-end run while this was a 150bp paired-end run. Anyway, upon running QC on the FASTQ files I got this weird profile for the per-sequence GC content. I had already removed the low-quality reads and trimmed the adaptors, but that didn't change anything. The only thing that helped was trimming 25 nucleotides from each end of the reads. Since we lose a lot of information that way, I'd prefer not to do this and want to ask if anyone has seen anything like this. I have no idea what might cause this.
Attached Files

GC-content.png (17.1 KB, 269 views)
Last edited by Khillo81; 10-14-2014, 04:55 AM.
GenoMax replied

08-19-2014, 03:10 AM
Everything is ok
chariko replied

08-19-2014, 01:37 AM
Originally posted by simonandrews View Post

The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.

As you can see in the per base composition plot, the C content goes down at position 5 (as seen in the per base GC plot before) and goes up at position 9. I assume, as the manual says, that the bias in the first 12 positions could be a selection bias.
I assume everything is OK then, since the GC content in the species is around 40%.

It was a Nextera MiSeq bacterial genome sequencing experiment.

Thank you very much for your help
Attached Files

per_base_sequence _content.png (27.9 KB, 260 views)
Last edited by chariko; 08-19-2014, 01:40 AM.
simonandrews replied

08-19-2014, 12:34 AM
Originally posted by chariko View Post

I updated FastQC to version 0.11.2 and my error disappeared. I wonder if it was an old-version problem...

The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.
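
To look at those leading positions directly, the per-position base composition can be tabulated from the read sequences. The following is an illustrative Python sketch, not FastQC's code; the function name and the default of 12 positions are assumptions chosen for this example.

```python
from collections import Counter
from typing import Dict, Iterable, List


def per_base_composition(seqs: Iterable[str], n_positions: int = 12) -> List[Dict[str, float]]:
    """Percent A/C/G/T at each of the first n_positions read positions.

    Percentages are relative to all symbols seen at that position,
    so N calls lower the A/C/G/T totals below 100.
    """
    counts = [Counter() for _ in range(n_positions)]
    for seq in seqs:
        for i, base in enumerate(seq[:n_positions].upper()):
            counts[i][base] += 1

    table = []
    for position_counts in counts:
        total = sum(position_counts.values())
        table.append(
            {b: (100.0 * position_counts[b] / total if total else 0.0) for b in "ACGT"}
        )
    return table
```

A dip in C or G at the same positions where the old per base GC plot deviated would show up here as a changed percentage in that row.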
chariko replied

08-19-2014, 12:29 AM
Originally posted by nucacidhunter View Post

I think it would be helpful if you could provide more information, such as library type, input material, kit used for library prep, and graphs from the new version of FastQC.

I updated FastQC to version 0.11.2 and my error disappeared. I wonder if it was an old-version problem...
nucacidhunter replied

08-18-2014, 03:55 PM
I think it would be helpful if you could provide more information, such as library type, input material, kit used for library prep, and graphs from the new version of FastQC.
chariko replied

08-18-2014, 06:21 AM
Originally posted by simonandrews View Post

They might show some effects. If you have adapter dimers then you'll see the adapter sequence superimposed on the sequence content graphs. If your adapters have markedly different GC content than your library in general then you might also see an overall effect on the GC level.

In the latest fastqc release there is a graph specifically to measure adapter content which will show exactly what proportion of the library is composed of read-through adapter which will illustrate this much better than trying to use sequence content plots.

I am having a similar problem with my run (2x150). As you can see, there are two peaks in my run. I expect a GC content of about 40% (bacterial genome), but I don't know why I obtained these two peaks.

[PASS] Basic Statistics
[PASS] Per base sequence quality
[PASS] Per sequence quality scores
[FAIL] Per base sequence content
[FAIL] Per base GC content
[WARNING] Per sequence GC content
[PASS] Per base N content
[WARNING] Sequence Length Distribution
[WARNING] Sequence Duplication Levels
[WARNING] Overrepresented sequences
[WARNING] Kmer Content

Oversequencing is probably not the problem, because I in fact obtained fewer reads than expected. Could it be due to an adaptor problem? Any clue would be really appreciated.
Attached Files

per_base_gc_content.png (13.9 KB, 329 views)
MichalGordon replied

06-16-2014, 02:40 AM
Thank you!