Illumina PE 100bp and allele content

yog77 replied

07-13-2011, 03:02 AM
Thanks, all, for the informative answers.

I have another question regarding this approach:

Q2) A concern some of my colleagues have raised is that, with a PCR approach, we would get over-representation of the start and end of the reads (i.e. the start and end of the desired 200bp amplicon) and much lower, if not absent, coverage of the middle portion. Does anyone have any comments on whether this would be the case and, if so, whether there are ways around it?
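
The concern can be checked with simple arithmetic. The sketch below is illustrative only: it assumes the PCR products are sequenced directly without fragmentation, so that read 1 always starts at one end of the amplicon and read 2 at the other. The function name amplicon_coverage is made up for this example.

```python
def amplicon_coverage(amplicon_len, read_len):
    """Per-base coverage from one read pair sequenced off one amplicon.

    Assumes read 1 starts at the first base of the amplicon and read 2
    starts at the last base (reading inwards), i.e. no fragmentation.
    """
    cov = [0] * amplicon_len
    n = min(read_len, amplicon_len)
    for i in range(n):                               # read 1: left end
        cov[i] += 1
    for i in range(amplicon_len - n, amplicon_len):  # read 2: right end
        cov[i] += 1
    return cov


for read_len in (50, 75, 100):
    uncovered = amplicon_coverage(200, read_len).count(0)
    print(f"2x{read_len}bp on a 200bp amplicon: {uncovered} uncovered bases")
```

Under that assumption, 2x50bp leaves the middle 100 bases unread, while 2x100bp reads every base of a 200bp amplicon exactly once, so the gap in the middle would depend on read length relative to amplicon length.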
Jeremy37 replied

07-01-2011, 09:11 AM
You should ask whoever is doing the sequencing for you what to expect in terms of read quality towards the end of the reads. When we were sequencing with the GA, we had extremely high quality all the way to base 100 -- often a median > Q35.

With the HiSeqs they have been optimizing some things. We have had some runs "fail", but they were eventually redone, and typically we have quality >Q30 at base 100. Especially since you expect to have many mismatches, I think that the value of longer reads would be very high for your application. You can always quality trim your reads using a tool like the FASTX-Toolkit (fastq_quality_trimmer). Even when the quality trails off at the end of a run, a large fraction of the reads will not need to be trimmed.
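
To make the idea of 3' quality trimming concrete, here is a minimal sketch in the same spirit. It is not the FASTX-Toolkit implementation; the function name, the Q20 threshold and the Phred+33 offset are assumptions for the example (older Illumina pipelines wrote Phred+64, in which case phred_offset should be 64).

```python
def trim_3prime(seq, qual, min_q=20, phred_offset=33):
    """Trim low-quality bases from the 3' end of one FASTQ record.

    Bases are removed from the end until a base with quality >= min_q
    is reached; everything before that base is kept unchanged.
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


# 'I' is Q40, '5' is Q20 and '#' is Q2 under Phred+33
seq, qual = trim_3prime("ACGTACGTAC", "IIIIIII5##")
print(seq, qual)  # ACGTACGT IIIIIII5
```

A read whose last base already meets the threshold is returned untouched, which is why only part of a run needs trimming even when average quality trails off.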
yog77 replied

07-01-2011, 08:32 AM
Originally posted by fkrueger:

I would assume that you wouldn't lose many reads due to ambiguous mapping if you aligned 2x50 or even 2x75bp reads to the whole genome instead of just your region of interest. Mapping to the region alone might be a bit quicker but shouldn't make such a big difference. If in doubt, you could just compare the number of reads mapped against the whole genome with the number mapped against your region of interest, and if they don't differ very much I would possibly use the whole genome approach, as this can be informative about whether your experiment worked the way you intended, and it is probably easier to justify for a publication at some point...

If I had a choice I would opt for 2x50 or 2x75bp reads, the latter might need to be run through a quality and/or adapter trimmer just to be sure. Low quality sequence can lead to wrong methylation calls, in rare cases even to mis-mappings (which generally produce random methylation calls). And of course many mismatches can bring down your mapping efficiency quite quickly if you use reasonably strict mapping parameters. So I suggest short to medium reads and possibly quality trimming, then you should be fine. Let me know if I can be of any further help with your project.

Thanks for the informative reply. I will give it some more thought and may have some more questions.
lh3 replied

07-01-2011, 05:53 AM
All the HiSeq data I have seen so far have good quality at the end. Another potential concern is that not all BS mappers are optimized for 100bp reads. They may have better performance for 50bp reads.
ECO replied

07-01-2011, 05:50 AM
Moving to ILMN forum.
fkrueger replied

07-01-2011, 04:53 AM
I agree that longer = better IF quality stays up until the end. The latest iPS BS-Seq datasets from Lister et al. have excellent qualities for reads >100bp for instance. However we have received loads of emails from people where the quality of their data deteriorated quite early on (as mentioned above).
lh3 replied

07-01-2011, 04:47 AM
How long a read you can sequence depends on many factors, such as machine, chemistry and optimization. All the HiSeq users I know can confidently get 2*100bp reads without much quality drop at the end. I have seen that an optimized GAIIx can also reach this level of accuracy. With 100bp reads, we have far fewer alignment artifacts than with 2*50bp reads. If your machine (e.g. HiSeq) can do that and you are not very constrained by funding, you should try to get 2*100bp reads. Roche used to advertise "longer is better". That is true.

Also, in my previous post, I just wanted to say that overlapping ends do not cause mapping problems. How to deal with them is largely the task of downstream tools.

Last edited by lh3; 07-01-2011, 04:51 AM.
fkrueger replied

07-01-2011, 04:01 AM
I would assume that you wouldn't lose many reads due to ambiguous mapping if you aligned 2x50 or even 2x75bp reads to the whole genome instead of just your region of interest. Mapping to the region alone might be a bit quicker but shouldn't make such a big difference. If in doubt, you could just compare the number of reads mapped against the whole genome with the number mapped against your region of interest, and if they don't differ very much I would possibly use the whole genome approach, as this can be informative about whether your experiment worked the way you intended, and it is probably easier to justify for a publication at some point...

If I had a choice I would opt for 2x50 or 2x75bp reads, the latter might need to be run through a quality and/or adapter trimmer just to be sure. Low quality sequence can lead to wrong methylation calls, in rare cases even to mis-mappings (which generally produce random methylation calls). And of course many mismatches can bring down your mapping efficiency quite quickly if you use reasonably strict mapping parameters. So I suggest short to medium reads and possibly quality trimming, then you should be fine. Let me know if I can be of any further help with your project.
yog77 replied

07-01-2011, 03:40 AM
Thank you all for your comments. They have been really helpful, as I don't have any hands-on experience with Illumina sequencing just yet. It's been "Illuminating".

Jeremy37 - My reason for hoping for longer reads is to associate the methylation on specific reads (each originating from a single cluster, a bit like a single molecule) with the SNPs on the same read, so the longer the read, the more potential SNPs there are to associate the methylation status with.

fkrueger - OK, I get that overlapping won't be an issue, but I now appreciate that the quality is going to drop off from 50-75bp onwards, so wasting half the money! Also, I am aware that the C or T SNPs will need to be confirmed by genomic re-sequencing.

As you have experience with BS-seq, I wondered: wouldn't mapping to a defined region (such as my 300kb gene region) be less complex than mapping to a whole genome, so that we might be more successful in mapping the poorer-quality ends of reads? Or would you still recommend shorter reads, say 75bp PE, in the hope that the last 25 bases are of OK-ish quality, or in your experience would this still be poor quality for BS-seq?
fkrueger replied

06-30-2011, 10:56 PM
Technically, overlapping reads should not be a problem (unless they are completely contained within each other). However, if reads overlap you will potentially call the methylation state of the overlapping part twice, and you need to think about a strategy for dealing with this (e.g. use methylation calls from only a single read, from both reads...).
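
One possible strategy for the overlapping part can be sketched as follows. The representation is made up for illustration (a dictionary from reference position to True/False per read, not the output format of any particular mapper): each position is counted once per fragment, the read 1 call is kept where the two reads overlap, and disagreements are reported separately.

```python
def merge_pair_calls(read1_calls, read2_calls):
    """Combine methylation calls from the two reads of one pair.

    Each argument maps a reference position to True (methylated) or
    False (unmethylated). Positions covered by both reads are counted
    once, keeping the read 1 call; conflicting positions are returned too.
    """
    merged = dict(read2_calls)
    merged.update(read1_calls)  # read 1 wins in the overlap
    conflicts = sorted(
        pos
        for pos in read1_calls.keys() & read2_calls.keys()
        if read1_calls[pos] != read2_calls[pos]
    )
    return merged, conflicts


merged, conflicts = merge_pair_calls(
    {101: True, 120: False}, {120: True, 160: False}
)
print(sorted(merged.items()))  # [(101, True), (120, False), (160, False)]
print(conflicts)               # [120]
```

Preferring read 1 is only one choice; dropping conflicting positions or keeping the call with the higher base quality would be equally defensible.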

I am also quite concerned about a read length of 100bp. In our experience the basecall qualities drop steadily towards the end of reads, usually starting at around 50-70bp. BS-Seq is very dependent on good quality reads, especially if you also want to look at SNPs later on. We have seen numerous examples where long reads (75-108bp) had to be trimmed uniformly to ~50bp (or trimmed with adaptive quality trimmers) in order to obtain a good mapping efficiency. This essentially means wasting half of the data and thus money. If I understood it correctly you should have many different products of your amplified gene, and I think more but shorter reads will be more useful than one 2x100bp run with low qualities.

If you have good coverage, SNP calling is possible, but it is a bit trickier than normal because SNPs involving Cs or Ts can only be called by looking at reads from the opposing strand (before BS conversion).
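
The strand logic can be sketched as follows. This is a simplified illustration, not a real SNP caller: the function name, the 20% threshold and the return labels are arbitrary choices for the example. It also assumes reads from both original strands are available, which may not hold for amplicons generated with primers against only one converted strand.

```python
def classify_ref_c(top, bottom, min_snp_fraction=0.2):
    """Interpret a reference C using bisulphite reads from both strands.

    All base calls are given in forward-strand (reference) coordinates.
    top:    calls from reads derived from the strand carrying the C; a T
            here is ambiguous (unmethylated C or a genuine C>T SNP).
    bottom: calls from reads derived from the opposite strand, which
            carries a G that bisulphite does not convert, so a T here
            points to a real SNP.
    Returns (label, methylation_fraction or None).
    """
    if not bottom:
        return "undetermined", None
    t_fraction = bottom.count("T") / len(bottom)
    if t_fraction >= min_snp_fraction:
        return "C>T SNP", None
    informative = top.count("C") + top.count("T")
    methylation = top.count("C") / informative if informative else None
    return "no SNP", methylation


print(classify_ref_c("CCTT", "CCCC"))  # ('no SNP', 0.5)
print(classify_ref_c("TTTT", "TTCT"))  # ('C>T SNP', None)
print(classify_ref_c("TT", ""))        # ('undetermined', None)
```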
lh3 replied

06-30-2011, 01:38 PM
Overlapping ends are not a problem for any of the major read mappers. They could pose minor issues for SNP calling, but only minor ones.
Jeremy37 replied

06-30-2011, 09:28 AM
I don't see any problem with what you're trying to do.
I'm not sure why you are concerned about the read length though. It seems to me that you could do this even with 50 bp reads if you wanted. You would be demultiplexing the samples yourself using your adapter sequences, I guess.

I'm not sure how the SNP calling would work, since with the bisulphite treatment (which I just had to look up) you're going to have a lot of differences from the reference. I think you need someone who knows about methylation analysis to comment...
yog77 replied

06-30-2011, 08:49 AM
Originally posted by TonyBrooks:

Have you thought about combining 150bp paired end reads into one pseudo read of 200bp using the 50bp overlap?
These guys used that approach for their metagenomic work, but it should also work for other applications.

http://www.plosone.org/article/info%...l.pone.0011840

Thanks, will look into that.
yog77 replied

06-30-2011, 08:44 AM
Sorry, I'm new here and not sure whether you got my response directly, Jeremy37.

I was hoping to generate similar-sized bisulphite PCR amplicons (~200bp) so there would be no need for size selection, and this is a reliably obtainable size for bisulphite PCR.

I plan to do bisulphite sequencing and align to a small genomic region (a large gene) where all my amplicons will come from (a 300kb region of bisulphite-converted sequence, which will be used specifically for the alignment). Some of these ~200bp amplicons will overlap with one another (say, where I am interested in a stretch of 3kb or so).

In essence, I want as much read length as possible (100bp x2) from the 200bp PCR amplicons, to look at methylated CpGs and SNPs in the same amplicon. So I was thinking there would be no unsequenced insert between the two reads, as I want data on the whole 200bp PCR amplicon. Is this possible to achieve?
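
For reference, a bisulphite-converted version of a region can be produced in silico. Below is a minimal sketch of the commonly used three-letter conversion (all C to T for one strand, all G to A for the other, both in forward-strand coordinates). The function name is made up for the example, and reads would need the same conversion before alignment, with methylation then inferred by comparison against the unconverted sequence.

```python
def bisulfite_convert_reference(seq):
    """Return fully converted versions of a reference sequence.

    c_to_t: every C replaced by T (matches reads from the forward strand)
    g_to_a: every G replaced by A (matches reads from the reverse strand,
            expressed in forward-strand coordinates)
    """
    seq = seq.upper()
    return seq.replace("C", "T"), seq.replace("G", "A")


c_to_t, g_to_a = bisulfite_convert_reference("ACGTCCGGA")
print(c_to_t)  # ATGTTTGGA
print(g_to_a)  # ACATCCAAA
```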
TonyBrooks replied

06-30-2011, 08:02 AM
Have you thought about combining 150bp paired end reads into one pseudo read of 200bp using the 50bp overlap?
These guys used that approach for their metagenomic work, but it should also work for other applications.

Unlocking Short Read Sequencing for Metagenomics

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011840

Background: Different high-throughput nucleic acid sequencing platforms are currently available but a trade-off currently exists between the cost and number of reads that can be generated versus the read length that can be achieved.

Methodology/Principal Findings: We describe an experimental and computational pipeline yielding millions of reads that can exceed 200 bp with quality scores approaching that of traditional Sanger sequencing. The method combines an automatable gel-less library construction step with paired-end sequencing on a short-read instrument. With appropriately sized library inserts, mate-pair sequences can overlap, and we describe the SHERA software package that joins them to form a longer composite read.

Conclusions/Significance: This strategy is broadly applicable to sequencing applications that benefit from low-cost high-throughput sequencing, but require longer read lengths. We demonstrate that our approach enables metagenomic analyses using the Illumina Genome Analyzer, with low error rates, and at a fraction of the cost of pyrosequencing.
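
The general idea of joining an overlapping pair can be sketched as below. This is not SHERA's algorithm, just a naive illustration: the function names, the minimum overlap and the mismatch tolerance are assumptions, and base qualities are ignored. The merged length is simply len(read1) + len(read2) minus the overlap.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def merge_pair(read1, read2, min_overlap=10, max_mismatch_rate=0.1):
    """Join a read pair into one composite read if their ends overlap.

    read2 is reverse-complemented, then the longest suffix of read1 that
    matches a prefix of it (within the mismatch rate) is taken as the
    overlap. Returns None when no acceptable overlap is found.
    """
    mate = reverse_complement(read2)
    for olen in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        mismatches = sum(a != b for a, b in zip(read1[-olen:], mate[:olen]))
        if mismatches <= max_mismatch_rate * olen:
            return read1 + mate[olen:]
    return None


# Toy 40bp fragment read as 2x25bp, giving a 10bp overlap
fragment = "ACGTTGCAAGGCTTAACCGTAGGCTAGCTTACGGATCCAT"
read1 = fragment[:25]
read2 = reverse_complement(fragment[-25:])
# Exact matching is requested here only to keep the toy example unambiguous
print(merge_pair(read1, read2, min_overlap=5, max_mismatch_rate=0.0) == fragment)
```

For bisulphite data the overlap comparison would have to be done with care, since conversion reduces sequence complexity and makes spurious overlaps more likely.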
