
  • Dr khani
    replied
    Thank you. I cannot run the BBMap tools on Windows; I get an error.



  • luc
    replied
    As mentioned above, the two peaks could very well be a sign of a mixed sample (contamination).
    You could remove all the high-GC-content reads and see if this improves the assembly.
    BBtools (BBduk?) has a GC content filter.



  • Dr khani
    replied
    My FASTQ GC content report has two peaks. Can anyone help me with how to assemble this type of data?
    Attached Files



  • Brian Bushnell
    replied
    Unfortunately, it looks like that tool does not merge reads with insert size shorter than read length, which was the point of the exercise. But from the graph I can infer that maybe 30% of the reads are indeed in that category, so there are a few possibilities:

    1) The twin peaks are indeed from exon-capture bias, though I kind of doubt that, as it does not explain why trimming the reads would reduce it; and I would have expected such a bias to shift the peak center rather than creating a bimodal distribution, but of course it depends on the bait design.
    2) There is an exonic and intronic peak, or gene and non-gene peak. The GC content of a gene changes markedly once you get just outside of its bounds. For example, just upstream of the gene, it becomes very AT-rich, IIRC. But, I don't really like that explanation either.
    3) The adapter-trimming is unsuccessful or incomplete. From your GC content by base position, it looks fairly flat across the read, aside from the first 20 bp... so that doesn't make much sense either. Still, it wouldn't hurt to confirm. What was the total percentage of reads and bases trimmed during adapter-trimming? I would expect something like 30% of the reads and maybe 5-10% of the bases. If you are using Nextera adapters, be sure you use those sequences for trimming.


    I suggest that you bin some of your reads by GC - just split them into pairs with GC<50% and GC>50%. Map both to human and look at the mapping rates (ideally, forcing unclipped global alignments). If they are equivalent, then the issue is not caused by contamination or adapter sequence, and it's probably safe to ignore.

    You can split the reads by GC content with my reformat tool:

    reformat.sh in1=read1.fq in2=read2.fq out1=low1.fq out2=low2.fq maxgc=0.5

    reformat.sh in1=read1.fq in2=read2.fq out1=high1.fq out2=high2.fq mingc=0.5
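
For anyone who cannot run BBTools (for example on Windows), the same binning idea can be sketched in plain Python. This is only an illustration of splitting pairs at a GC threshold, not a description of how reformat.sh works internally: computing GC over both mates combined, ignoring N bases, and sending pairs at exactly the threshold to the high bin are all assumptions made here, and the function names are hypothetical.

```python
import io

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, not needed here
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def gc_fraction(seq):
    """GC fraction over A/C/G/T only; N and other symbols are ignored (assumption)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else 0.0

def split_pairs_by_gc(r1, r2, threshold=0.5):
    """Bin read pairs by the GC fraction of both mates combined (assumption)."""
    low, high = [], []
    for rec1, rec2 in zip(read_fastq(r1), read_fastq(r2)):
        pair_gc = gc_fraction(rec1[1] + rec2[1])
        (low if pair_gc < threshold else high).append((rec1, rec2))
    return low, high

# Tiny made-up example: one AT-rich pair and one GC-rich pair.
r1 = io.StringIO("@p1/1\nATATATAT\n+\nIIIIIIII\n@p2/1\nGCGCGCGC\n+\nIIIIIIII\n")
r2 = io.StringIO("@p1/2\nATATGCAT\n+\nIIIIIIII\n@p2/2\nGCGCATGC\n+\nIIIIIIII\n")
low, high = split_pairs_by_gc(r1, r2)
print(len(low), "low-GC pairs;", len(high), "high-GC pairs")
```

With real data, open the two FASTQ files instead of the in-memory strings and write each bin to its own pair of output files; for large files, write records as you go rather than collecting them in lists.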



  • Khillo81
    replied
    Thanks for your response. I first have to mention that I don't have a very strong background in bioinformatics and am using the CLC Genomics Workbench (ver. 7.5), which has a GUI and runs on Windows. I have used the Workbench's 'Merge Overlapping Pairs' function to generate the histogram below (I'm guessing it's similar to the BBMerge mentioned by Brian). I also haven't used FastQC, but rather the native QC check in the Workbench. I'm attaching the output here. As you can see, there is no severe drop in quality along the reads, and besides the peaks in GC content observed at the end of the read (as I understand it, typical for Illumina data), the GC content along the read length is around 45%. And the samples are human.
    Attached Files



  • nucacidhunter
    replied
    Would you be able to post all of the FastQC output plots for comparison with other runs? For now, I would mention that exome capture does not sample the genome randomly, so it is not unusual to see what you are reporting.



  • Brian Bushnell
    replied
    This is sometimes a sign of contamination, though if trimming the reads reduces it, that's a bit odd. Is this supposed to be human data? Human should peak around 50%, which does not correspond to either of your peaks. The most important question is what organism this is supposed to be, and what its average GC% is.

    Also, please post an insert-size histogram, which will help determine if the problem is caused by short inserts. You can get one quickly using BBMerge:

    bbmerge.sh in1=read1.fq in2=read2.fq ihist=ihist.txt
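
Once the histogram file exists, the fraction of pairs whose insert is shorter than the read length can be tallied with a few lines of Python. This is a rough sketch that assumes the file is tab- or space-separated with an insert size and a count per row, and that summary and header lines start with '#'; check your own ihist.txt before relying on it. The function name and the example rows are made up for illustration.

```python
def short_insert_fraction(lines, read_length):
    """Fraction of histogram pairs whose insert size is below read_length.

    Assumes data rows look like '<insert_size> <count>' and that any
    comment/header rows start with '#'.
    """
    total = short = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        size_text, count_text = line.split()[:2]
        size, count = int(size_text), int(count_text)
        total += count
        if size < read_length:
            short += count
    return short / total if total else 0.0

# Made-up rows standing in for a real histogram file.
example = ["#InsertSize\tCount", "100\t30", "150\t20", "200\t50"]
fraction = short_insert_fraction(example, read_length=150)
print(f"{fraction:.0%} of pairs in the histogram have inserts shorter than the read")
```

For a real file, pass the open file handle, e.g. `short_insert_fraction(open("ihist.txt"), 150)`. Note that the result is relative to the pairs counted in the histogram, not necessarily to all pairs in the run.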



  • Khillo81
    replied
    Hi!

    I have two problems: one is two peaks in the per sequence GC-content and another is a weird profile which I'm attaching here.

    We're trying out Agilent's SureSelect enrichment protocol for Exome-Seq and have just concluded our first run on samples that were already done before using Illumina's Nextera kit (so we have another run with which to compare our results). The first run was sequenced on the Illumina HiSeq while this run was done on a MiSeq. Also, the first run was a 100bp paired end run while this was 150bp paired end run. Anyway, upon running a QC on the Fastq files I got this weird profile for the per-sequence GC content. I had already removed the low-quality reads and trimmed the adaptors but that didn't change anything. The only thing that helped was trimming 25 nucleotides from each end of the reads. Since we lose a lot of information that way, I'd prefer not to do this and want to ask if anyone has seen anything like this. I have no idea what might cause this.
    Attached Files
    Last edited by Khillo81; 10-14-2014, 04:55 AM.



  • GenoMax
    replied
    Everything is ok



  • chariko
    replied
    Originally posted by simonandrews
    The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.
    As you can see in the per base composition plot, the C content goes down at position 5 (as seen in the per base GC plot before) and goes up at position 9. As the manual says, I assume the bias in the first 12 positions could be a selection bias.
    I assume everything is OK then, since the GC content in the species is around 40%.


    It was a Nextera MiSeq bacterial genome sequencing experiment.

    Thank you very much for your help
    Attached Files
    Last edited by chariko; 08-19-2014, 01:40 AM.



  • simonandrews
    replied
    Originally posted by chariko
    I updated FastQC to the 11.2 version and my error disappeared. I wonder if it was an old version problem...
    The per base GC plot was removed in the latest version since it mostly replicated information which was in the per base composition plot. You should still be able to see the biased positions as a deviation in the composition of C or G content at the same positions, but it's possible it's not enough of a deviation to trigger a warning.



  • chariko
    replied
    Originally posted by nucacidhunter
    I think it will be helpful if you could provide more information such as library type, input material, kit used for library prep and graphs from new version of FastQC.
    I updated FastQC to the 11.2 version and my error disappeared. I wonder if it was an old version problem...



  • nucacidhunter
    replied
    I think it will be helpful if you could provide more information such as library type, input material, kit used for library prep and graphs from new version of FastQC.



  • chariko
    replied
    Originally posted by simonandrews
    They might show some effects. If you have adapter dimers then you'll see the adapter sequence superimposed on the sequence content graphs. If your adapters have markedly different GC content than your library in general then you might also see an overall effect on the GC level.

    In the latest fastqc release there is a graph specifically to measure adapter content which will show exactly what proportion of the library is composed of read-through adapter which will illustrate this much better than trying to use sequence content plots.
    I am having a similar problem with my run (2x150). As you can see, there are two peaks in my run. I expect a GC content of 40% (bacterial genome), but I don't know why I obtained these two peaks.

    [PASS] Basic Statistics
    [PASS] Per base sequence quality
    [PASS] Per sequence quality scores
    [FAIL] Per base sequence content
    [FAIL] Per base GC content
    [WARNING] Per sequence GC content
    [PASS] Per base N content
    [WARNING] Sequence Length Distribution
    [WARNING] Sequence Duplication Levels
    [WARNING] Overrepresented sequences
    [WARNING] Kmer Content

    Oversequencing is probably not the problem, because in fact I obtained fewer reads than expected. Could it be due to an adaptor problem? Any clue would be really appreciated.
    Attached Files
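
One quick way to check the adaptor question without the newer FastQC adapter plot is to count how many reads contain the start of the adapter sequence. The sketch below is only a rough stand-in for that plot: it looks for exact matches of a 12-base adapter prefix, and the default sequence used here (the start of the Nextera transposase read-through sequence) is an assumption that should be checked against the documentation for your library kit. The function name and the example reads are made up.

```python
def adapter_fraction(reads, adapter="CTGTCTCTTATA"):
    """Fraction of reads containing an exact match to the adapter prefix.

    The default adapter is assumed to be the start of the Nextera
    transposase read-through sequence; replace it for other kits.
    """
    reads = list(reads)
    if not reads:
        return 0.0
    hits = sum(1 for seq in reads if adapter in seq.upper())
    return hits / len(reads)

# Made-up reads: two contain the adapter prefix, two do not.
example_reads = [
    "ACGTACGTCTGTCTCTTATACACA",
    "ACGTACGTACGTACGTACGTACGT",
    "ctgtctcttataGGGG",
    "TTTTTTTT",
]
fraction_with_adapter = adapter_fraction(example_reads)
print(f"{fraction_with_adapter:.0%} of reads contain the adapter prefix")
```

Exact matching will miss adapters with sequencing errors or adapters cut off at the very end of a read, so treat the number as a lower bound.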



  • MichalGordon
    replied
    Thank you!

