This is all going to be in pseudocode/explanations, but can someone verify my pipeline, or let me know if there's a better way of doing something?
The project has 10 samples (5 male: 1 control plus 2 experimental conditions with a replicate each; same for female) from an organism without a reference genome. We are using de novo assembly to assemble all 10 samples (Trinity, Oases, Bridger, etc.).
After we assemble the samples, we want to create a reference transcriptome to use for differential expression. We will merge the 10 assemblies, run CD-HIT-EST to remove redundancy, and then annotate the resulting FASTA. We plan to run blastx and save the output as XML, import the XML into Blast2GO, remove all the non-annotated transcripts, and export the annotated FASTA. We will use this FASTA as our reference for mapping.
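To make the "remove all the non-annotated transcripts" step concrete, here is a minimal Python sketch of the same filtering done outside Blast2GO. It assumes the annotated transcript IDs are already available as a set (for example, exported from the annotation tool) and that the ID is the first whitespace-delimited token of each FASTA header. The function names `read_fasta` and `keep_annotated` are hypothetical, not part of any of the tools mentioned.

```python
from typing import Iterable, Iterator, Set, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


def keep_annotated(
    lines: Iterable[str], annotated_ids: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose ID is in annotated_ids.

    Assumption: the ID is the first token of the header line.
    """
    for header, seq in read_fasta(lines):
        tokens = header.split()
        seq_id = tokens[0] if tokens else ""
        if seq_id in annotated_ids:
            yield header, seq


if __name__ == "__main__":
    # Toy input; real input would come from open("merged.fasta").
    fasta = [">t1 len=4", "ACGT", ">t2", "GG", "CC", ">t3", "TTTT"]
    for header, seq in keep_annotated(fasta, {"t1", "t3"}):
        print(f">{header}\n{seq}")
```

With a real file, passing an open file handle as `lines` should work the same way, since the function only iterates over lines.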
We then take the annotated FASTA and map the raw reads back to it using bowtie2 or BWA to generate SAM files, then use samtools to produce sorted BAMs.
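As a rough sketch of what the mapping step might look like with bowtie2 and samtools, the Python below only builds the command lines; it does not run anything. The file names, the index prefix, and the helper names `index_command` and `sample_commands` are hypothetical, and whether the reads are single-end or paired-end is an assumption handled by an optional second reads file.

```python
from typing import List, Optional


def index_command(reference_fasta: str, index_prefix: str) -> List[str]:
    """Command to build the bowtie2 index once for the reference."""
    return ["bowtie2-build", reference_fasta, index_prefix]


def sample_commands(
    index_prefix: str,
    sample: str,
    reads_1: str,
    reads_2: Optional[str] = None,
) -> List[List[str]]:
    """Commands to align one sample and produce a coordinate-sorted BAM."""
    align = ["bowtie2", "-x", index_prefix]
    if reads_2 is None:
        align += ["-U", reads_1]  # single-end reads
    else:
        align += ["-1", reads_1, "-2", reads_2]  # paired-end reads
    sam = f"{sample}.sam"
    align += ["-S", sam]
    sort = ["samtools", "sort", "-o", f"{sample}.sorted.bam", sam]
    return [align, sort]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(" ".join(index_command("annotated.fasta", "annotated_idx")))
    for cmd in sample_commands("annotated_idx", "male_ctrl", "male_ctrl.fastq"):
        print(" ".join(cmd))
```

Each command list could then be passed to `subprocess.run(cmd, check=True)`, looping over all 10 samples after building the index once.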
We convert our annotated reference to GFF3 and use htseq-count to get read counts, then run DESeq to find our DE genes.
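For the FASTA-to-GFF3 conversion, one simple approach I have in mind is to emit one feature per transcript that spans its full length. The sketch below does that in Python. The `exon` feature type, the `denovo` source label, the `+` strand placeholder, and the function name `transcripts_to_gff3` are all assumptions made for illustration, and it assumes transcript IDs contain no characters that need escaping in GFF3.

```python
from typing import Iterable, List, Tuple


def transcripts_to_gff3(
    records: Iterable[Tuple[str, str]],
    source: str = "denovo",
    feature_type: str = "exon",
) -> List[str]:
    """Build GFF3 lines with one full-length feature per transcript.

    records: (header, sequence) pairs; the ID is taken as the first
    whitespace-delimited token of the header (an assumption).
    """
    lines = ["##gff-version 3"]
    for header, seq in records:
        tokens = header.split()
        if not tokens or not seq:
            continue  # skip records without an ID or sequence
        seq_id = tokens[0]
        columns = [
            seq_id,          # seqid: the transcript itself
            source,          # source label (placeholder)
            feature_type,    # feature type counted by the counting tool
            "1",             # start (1-based)
            str(len(seq)),   # end = transcript length
            ".",             # score
            "+",             # strand (placeholder assumption)
            ".",             # phase
            f"ID={seq_id}",  # attribute used to group counts
        ]
        lines.append("\t".join(columns))
    return lines


if __name__ == "__main__":
    demo = [("t1 len=4", "ACGT"), ("t3", "TTTTTT")]
    print("\n".join(transcripts_to_gff3(demo)))
```

With a file like this, I would expect a call along the lines of `htseq-count -f bam -r pos -s no -t exon -i ID sample.sorted.bam reference.gff3`, where `-s no` assumes an unstranded library and the file names are placeholders.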
Does this sound like a good plan?
We're currently at the "reference transcript" stage, and we will be submitting the reference to our local blast cluster in the next few days. I just want to verify that what I'm thinking is correct, or if there's something else I should be doing.
Thank you!