New to Next Gen not sure where to go from here

Melissa replied

09-08-2013, 09:59 AM
3) This data will not be in my MSc and I won't be doing much work with it beyond posting here; but I would like to include a short section in my thesis on the data we collect here. Is it possible to get average read lengths, fold number and other quality statistics about the data to include in my thesis, or does that not make any sense?
>>> I would advise against including the data in your thesis. It might not fit very well in your thesis. More importantly, it might invite questions from your examiners. Or worse, confuse them.

Another piece of advice: forget everything you just learned from this post until you have finished writing your thesis.
Leave a comment:
tjeffe01 replied

08-15-2013, 06:55 AM
Thanks for all the info everybody

Wow, this forum is amazing, thanks for answering everyone.

A couple updates:
- After talking to our colleague who actually arranged the sequencing: we purchased 4 lanes, and he purchased 12 in total. I apologize if I don't use the technical jargon properly. My background is in molecular genetics and biochemistry, and while I understand how RNAseq and other next gen sequencing technologies work, I'm not very familiar with the technical side of the process or with the computer science side.

- In terms of candidate genes, we have sequenced the EPSPS gene and found no polymorphisms between resistant and susceptible plants, and we haven't found any evidence of gene duplication or overexpression. Glyphosate resistance tends to be caused either by EPSPS mutation or by some other mechanism unrelated to EPSPS (for example changes in translocation patterns or sequestration in the vacuole). Having ruled out EPSPS mutation as a mechanism, we are attempting to use a reverse genetic approach to find anything that may be linked to our resistance trait.

- I've forwarded all of this information on to my supervisor and he has decided that he either needs to sit on this data for a few years until the technology advances enough to make it easier, or he needs to hire a bioinformatician.

- Unfortunately sequencing the genome is not really a possibility. Funding for plant genomes is cropping up (excuse the pun) but is focused solely on crop plants. There just isn't funding available for sequencing a genome for a plant species that only affects a small portion of the U.S. and Canada. In addition there's no guarantee that sequencing the genome would lead to a new control mechanism for resistant plants. Farmers and other researchers would rather just blast the plants with more Roundup and other herbicides.

Thanks everyone, I'll update you as new data becomes available,
Taylor
Leave a comment:
GenoMax replied

08-12-2013, 09:36 AM
Originally posted by westerman View Post

That is what I meant -- 450M total reads, 225M PE reads.

That is Illumina marketing speak.

I prefer to think of it as 225M unique clusters. The libraries being referred to must be extra good quality libraries, something that is not guaranteed (there is beginner's luck, if applicable here).

We have given tjeffe01 a ton of advice but Taylor is not going to do much with the data as indicated in the original post. Hopefully these suggestions will be passed on to the person who would be doing the data analysis.

Taylor: Do report back in this thread as to how the sequencing turned out. Good luck.
Leave a comment:
westerman replied

08-12-2013, 09:01 AM
Originally posted by Wallysb01 View Post

Do you mean 450M reads as in 225M PE reads? I don't think counting a read on the same fragment twice is the right thing to do here, if that's indeed what you're doing.

That is what I meant -- 450M total reads, 225M PE reads. For 12 samples per lane that would give about 37M total reads or about 3700 Mbases per sample.

Considering assembly only and assuming that the samples are not overwhelmed with rRNA or other highly expressed transcripts, then there are about 14800 Mbases to work with -- if all 4 samples are merged together to create the assembly (which is the only way I would do it). With the transcriptome being, what, around 100M base pairs? Then we have 148x coverage. Even with rRNA and the highly expressed genes using up a lot of the bases, the coverage should be high enough for a rough assembly. Perhaps not high enough to tease out very low expression transcripts but still good enough to get a handle on the transcriptome.
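
For anyone who wants to rerun this arithmetic with their own numbers, here is a minimal Python sketch. It assumes 100 bp per read (which is what the 37M reads to 3700 Mbases conversion above implies) and a perfectly balanced lane. The function names are only illustrative.

```python
def per_sample_yield(total_reads, samples_per_lane, read_length_bp):
    """Return (reads, bases) per sample for an evenly balanced lane."""
    reads = total_reads / samples_per_lane
    return reads, reads * read_length_bp


def fold_coverage(total_bases, target_size_bp):
    """Average depth if the bases were spread evenly over the target."""
    return total_bases / target_size_bp


# Figures from the post; the 100 bp read length is an assumption.
reads, bases = per_sample_yield(450e6, 12, 100)
pooled_bases = 4 * bases                       # all 4 samples merged
coverage = fold_coverage(pooled_bases, 100e6)  # ~100 Mbp transcriptome guess

print(f"{reads / 1e6:.1f}M reads and {bases / 1e6:.0f} Mbases per sample")
print(f"{pooled_bases / 1e6:.0f} Mbases pooled, about {coverage:.0f}x")
```

Without the intermediate rounding the pooled figure comes out at 150x rather than 148x, which makes no practical difference. Real coverage will be very uneven because expression levels differ so much between transcripts.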

On a side note, I am finding this thread to be interesting.
Leave a comment:
chadn737 replied

08-09-2013, 05:03 PM
I understand the pitfalls of this suggestion, so nobody rip me to pieces.

I suggest it only because there are VERY obvious candidate genes in this experimental design. For those not familiar with how roundup works, the target enzyme is EPSPS and roundup-resistant crops carry a resistant EPSPS gene (no offense to those who know all this already, I just don't want to be attacked for suggesting this).

Find some sequences of candidate genes, whether from sunflower or other organisms. You may even be able to find the sequence of EPSPS from ragweed in a database somewhere. Then just align your sequences against this small reference. Obviously, this can lead to a lot of misalignment, but it would give a very quick look at any reads aligning to candidate genes. I would suggest this only as an initial quick and dirty look at your data while you are running a de novo assembly or something, not as an approach to getting your data published.
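
To make the idea concrete, here is a rough Python sketch of a k-mer screen that flags reads sharing exact k-mers with a candidate sequence. To be clear, this is not a real aligner and is not what a proper read mapper does internally; it is only a stand-in to show the "pull out reads that look like the candidate" step. The sequence below is a made-up placeholder (not a real EPSPS sequence), and `k` and `min_hits` are arbitrary choices.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement a DNA string (A, C, G, T, N)."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """Set of all distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(candidates, k):
    """Map each k-mer (both strands) to the candidate names containing it."""
    index = {}
    for name, seq in candidates.items():
        seq = seq.upper()
        for strand in (seq, reverse_complement(seq)):
            for kmer in kmers(strand, k):
                index.setdefault(kmer, set()).add(name)
    return index


def screen_read(read, index, k, min_hits=2):
    """Names of candidates sharing at least min_hits distinct k-mers with read."""
    hits = {}
    for kmer in kmers(read.upper(), k):
        for name in index.get(kmer, ()):
            hits[name] = hits.get(name, 0) + 1
    return {name for name, n in hits.items() if n >= min_hits}


# Hypothetical placeholder sequence, for illustration only.
candidates = {"toy_gene": "ATGGCGCAAGTTAGCAGAATCTGCAATGGTGTGCAGAACC"}
index = build_index(candidates, k=11)
read = candidates["toy_gene"][5:30]

print(screen_read(read, index, k=11))
print(screen_read(reverse_complement(read), index, k=11))
print(screen_read("T" * 25, index, k=11))
```

Exact k-mer matching will miss reads that diverge from the candidate, which matters when the reference comes from another species, so a real aligner is still the tool to use for anything beyond a first look.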

What do people think?

Last edited by chadn737; 08-09-2013, 05:08 PM.
Leave a comment:
Wallysb01 replied

08-09-2013, 04:45 PM
Originally posted by chadn737 View Post

I would actually combine the reads from samples to make the assembly, or at least combine the reads from resistant varieties and combine the reads from non-resistant varieties. Then realign your reads back to this reference transcriptome to get differential expression.

This, definitely this. I'd suggest assembling them all together, given your fairly limited sequencing depth. But I'd do both myself, and compare.
Leave a comment:
chadn737 replied

08-09-2013, 04:40 PM
Since it is a roundup-resistant weed, you have obvious candidate genes. Part of the problem is that you don't have biological replicates, if I read it right, at least not for expression.

What you do have is biological replicates in sequence....

The first thing I would try to do is mine this data for any sequence variants in candidate genes...especially EPSPS. Imagine if you find variants in EPSPS in the resistant variety that are not in the non-resistant variety. That would be a very obvious candidate for resistance.

You could try to make a de novo transcriptome assembly. I would actually combine the reads from samples to make the assembly, or at least combine the reads from resistant varieties and combine the reads from non-resistant varieties. Then realign your reads back to this reference transcriptome to get differential expression.
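
Once reads have been counted per assembled transcript, the simplest possible comparison looks something like the Python sketch below: scale each sample to counts per million and take a log2 ratio. The counts are invented, the pseudocount of 1 is an arbitrary choice, and with no replicates this gives a ranking to look through, not a statistical test.

```python
import math


def counts_per_million(counts):
    """Scale raw read counts so that each sample sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2(treated / control) on CPM values, with a pseudocount."""
    cpm_t = counts_per_million(treated)
    cpm_c = counts_per_million(control)
    genes = set(cpm_t) | set(cpm_c)
    return {
        g: math.log2((cpm_t.get(g, 0.0) + pseudocount)
                     / (cpm_c.get(g, 0.0) + pseudocount))
        for g in genes
    }


# Invented read counts for two hypothetical transcripts.
sprayed = {"transcript_a": 300, "transcript_b": 100}
unsprayed = {"transcript_a": 100, "transcript_b": 300}

lfc = log2_fold_change(sprayed, unsprayed)
for gene in sorted(lfc):
    print(gene, round(lfc[gene], 2))
```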

Last edited by chadn737; 08-09-2013, 04:42 PM.
Leave a comment:
Wallysb01 replied

08-09-2013, 04:15 PM
Originally posted by westerman View Post

A one-lane 8-sample experiment (similar to tjeffe01's) that recently came through our center yielded a total of 450M reads. So assuming that tjeffe01's sequencing center can balance across those 12 samples then he will get over 30M reads per sample. From that Trinity will be able to provide a nice assembly. Not to human/mouse standards but for us plant & animal guys ... well, we just take what we can.

Do you mean 450M reads as in 225M PE reads? I don't think counting a read on the same fragment twice is the right thing to do here, if that's indeed what you're doing.

But I guess we should ask tjeffe01, how many PE reads did you get for each sample? Or is it not completed yet?
Leave a comment:
westerman replied

08-09-2013, 11:42 AM
Originally posted by Wallysb01 View Post

So 18Gbp? That genome isn't getting sequenced anytime soon.

Most plants aren't.

The bright side of working with the uncharacterized part of Life is that experiments don't have to stick to rigorous statistical principles. Which, in your case, is a good thing since you have neither biological nor technical replicates. Instead you can treat this as a "fishing expedition".

@GenoMax. I agree that there isn't a lot of data but they should be able to get enough even with 1/3 of a lane. For RNA-seq we shoot for at least 30M reads per sample. A one-lane 8-sample experiment (similar to tjeffe01's) that recently came through our center yielded a total of 450M reads. So assuming that tjeffe01's sequencing center can balance across those 12 samples then he will get over 30M reads per sample. From that Trinity will be able to provide a nice assembly. Not to human/mouse standards but for us plant & animal guys ... well, we just take what we can.

I should emphasize what Wallysb01 said. Trinity does the assembly and, via its Trinotate package, the annotation and expression analysis. I am a bit behind the times by still using Blast and Blast2Go for my annotation but Trinity is becoming a one-stop solution.
Leave a comment:
Wallysb01 replied

08-09-2013, 11:02 AM
Originally posted by tjeffe01 View Post

Using c-value our genome size is about 1.8 x 10^10 bp.

So 18Gbp? That genome isn't getting sequenced anytime soon.
Leave a comment:
Wallysb01 replied

08-09-2013, 11:01 AM
Originally posted by GenoMax View Post

If this is a single lane of sequencing with 12 samples (if that is what you mean by lane of 12) then that would not be a lot of data.

Indeed. In fact, it's probably far too little. At best, you're looking at about 12M reads per sample. That's just not enough sequencing depth. A single lane really shouldn't be split more than 6 ways, or equivalent (i.e. 12 samples spread over 2 lanes). And the fact that there isn't a reference genome makes it even harder, as to do any meaningful analysis, genes/transcripts first need to be assembled, which requires much higher coverage than pure DE analysis.

So, it's probably a good thing this is just a "future direction" for the OP's thesis.
Leave a comment:
GenoMax replied

08-09-2013, 10:16 AM
Originally posted by tjeffe01 View Post

I also should clarify that we didn't buy four lanes. Our colleague bought a lane of 12 to himself and we bought 4 spots on that lane. We provided the sequencing facility with RNA from 4 plants representing 4 different states: Resistant sprayed (after 2 hours) and unsprayed to look for differential expression and susceptible sprayed (after 2 hours) and unsprayed to eliminate differences that are just a normal response to glyphosate.

If this is a single lane of sequencing with 12 samples (if that is what you mean by lane of 12) then that would not be a lot of data.
Leave a comment:
tjeffe01 replied

08-09-2013, 10:10 AM
Thanks for your replies everyone.

I feel I should clarify my third point. I don't necessarily want to include the metrics about the data as a part of my thesis. I guess I would like to be able to include a paragraph and part of a slide in my defense about where the research is going beyond the work I've done so far. Being able to cite some metrics about the data sounds a little more scientific than "We ran RNAseq and got back a lot of data"
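
For what it's worth, the kind of metrics meant here (number of reads, average read length, average base quality) can be pulled from a FASTQ file with a few lines of Python. This is only a sketch: it assumes plain 4-line FASTQ records and Phred+33 quality encoding, and the two records in the demo are made up.

```python
import io


def fastq_stats(handle, phred_offset=33):
    """Summarise a FASTQ stream: read count, mean length, mean base quality."""
    n_reads = n_bases = qual_sum = 0
    while True:
        header = handle.readline()
        if not header:          # end of file
            break
        if not header.strip():  # tolerate blank lines between records
            continue
        seq = handle.readline().strip()
        handle.readline()       # the '+' separator line
        qual = handle.readline().strip()
        n_reads += 1
        n_bases += len(seq)
        qual_sum += sum(ord(ch) - phred_offset for ch in qual)
    return {
        "reads": n_reads,
        "mean_length": n_bases / n_reads if n_reads else 0.0,
        "mean_quality": qual_sum / n_bases if n_bases else 0.0,
    }


# Two made-up records, purely for illustration.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\n######\n")
stats = fastq_stats(demo)
print(stats)
```

For a real file, something like `gzip.open(path, "rt")` can be passed in place of the `StringIO` object. Averaging Phred scores directly is a simplification, so treat the quality number as a rough descriptor only.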

I also should clarify that we didn't buy four lanes. Our colleague bought a lane of 12 to himself and we bought 4 spots on that lane. We provided the sequencing facility with RNA from 4 plants representing 4 different states: Resistant sprayed (after 2 hours) and unsprayed to look for differential expression and susceptible sprayed (after 2 hours) and unsprayed to eliminate differences that are just a normal response to glyphosate.
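
The logic of that design can be written down as a small Python sketch: given a per-gene log2 fold change (sprayed vs unsprayed) for the resistant plant and for the susceptible plant, keep only the genes whose response differs between the two. The gene names, values and the threshold of 1.0 are all invented for illustration, and how the fold changes are computed is left open.

```python
def resistance_specific(lfc_resistant, lfc_susceptible, min_difference=1.0):
    """Genes whose spray response differs between resistant and susceptible.

    Both arguments map gene name -> log2 fold change (sprayed vs unsprayed).
    Genes that respond the same way in both plants are treated as the
    normal glyphosate response and dropped.
    """
    shared = lfc_resistant.keys() & lfc_susceptible.keys()
    difference = {g: lfc_resistant[g] - lfc_susceptible[g] for g in shared}
    return {g: d for g, d in difference.items() if abs(d) >= min_difference}


# Invented values for three hypothetical genes.
resistant = {"gene_a": 3.0, "gene_b": 2.0, "gene_c": 0.1}
susceptible = {"gene_a": 2.8, "gene_b": -0.5, "gene_c": 0.0}

candidates = resistance_specific(resistant, susceptible)
print(candidates)
```

Here `gene_a` goes up in both plants, so it is filtered out as a general response, while `gene_b` only goes up in the resistant plant and is kept.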

The closest plant with a sequenced and aligned genome is sunflower, which is much too distantly related to be of use.

Using c-value our genome size is about 1.8 x 10^10 bp.

Like I said, this isn't part of my project. I did the RNA extraction and the paperwork but that is where my responsibility ends in my opinion. Looks like I need to make the recommendation to my supervisor that if he really wants to work with this data he needs to get a genome sequence first. Otherwise he could use Trinity but he'll probably need to hire a new grad student or post doc to do it.

Thanks for all of the answers everyone.
Leave a comment:
Wallysb01 replied

08-09-2013, 09:33 AM
4 lanes is a lot these days. You can expect to get about 800 Million reads from that, and you really only need maybe 30-50 Million per replicate, and 3 replicates per sample. So you can easily stick in about 12-20 total samples in those 4 lanes.

So, you should have some sort of experimental design. Don't just throw some random stuff in each lane. You'll over-sequence them and end up regretting it later. Now, you're probably not going to find many people able to help with the experimental design. But you should think about whether there is some sort of developmental time course, drug/condition treatment and control, or different tissues from an adult, that would actually give some interesting comparisons. Once you have a few conditions/tissues/time-points picked out, you need to have 3 (or more) replicates for any sort of meaningful statistical analysis.

Now, if you can't come up with more than 2-3 different samples to sequence, you may want to consider genome sequencing. But how big is your genome, or do you have any idea? Because if you want maybe 9 samples for RNA-seq, that only leaves you with about 400M reads for the genome. With 2x100bp reads, you're really only going to have useful depth of sequencing for 2Gbp size genomes or smaller. And even then, it's going to be pretty fragmented due to lack of mate pair reads (though you could do 300bp and 800bp libraries now). But if you plan to continue working on this species, it may be useful to get the genome sequencing effort started, adding things like mate pair libraries at a later time. This is something your advisor should be heavily involved in deciding, since most genome sequencing projects outlive a single graduate student (especially if you're already in year 3 or 4).
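
As a rough check on the depth argument, here is a small Python sketch. It assumes the 400M figure counts individual 100 bp reads (if it counts read pairs, double the results), and it adds the 1.8 x 10^10 bp C-value estimate given elsewhere in this thread for comparison.

```python
def genome_depth(n_reads, read_length_bp, genome_size_bp):
    """Average fold depth if the reads were spread evenly over the genome."""
    return n_reads * read_length_bp / genome_size_bp


# Assumption: "400M reads" means individual 100 bp reads, not pairs.
leftover_reads = 400e6
depth_2g = genome_depth(leftover_reads, 100, 2e9)
depth_18g = genome_depth(leftover_reads, 100, 1.8e10)

print(f"2 Gbp genome: about {depth_2g:.0f}x")
print(f"18 Gbp genome: about {depth_18g:.1f}x")
```

Under these assumptions a 2 Gbp genome gets about 20x, while a genome of 1.8 x 10^10 bp gets only about 2x.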

Now, for RNAseq analysis without a genome, I highly recommend Trinity (linked above). It makes assembly, orthology assignment and expression analysis all very user friendly (for command line stuff).
Leave a comment:
JackieBadger replied

08-09-2013, 09:25 AM
You can also use MIRA and Newbler for transcriptome assembly.
Leave a comment:
