Sorry for the length of the post. I am a long-time bioinformatics guy but a total noob at NGS.
I am trying to help a lab that sequenced human fecal samples from various individuals with two distinct phenotypes. They used a company that used an Illumina GA to generate 76 bp single-end (non-paired) reads, and I have the data as 4 pools: 2 pools for phenotype A and 2 for phenotype B. Each pool is a pool of RNA extracted from 4 individuals. The sequences are actually only 72 bp long, because the company removed an individual tag from each read.
From what I understand after communicating with the company, the lab jumped the gun and went straight to metabolome analyses without doing any of the prerequisite analyses, such as 16S profiling and paired-end "deep" sequencing. The current data files, after cleaning by the company, contain between 3 and 4 million sequences per pool.
I can match about 10% of the data to human RNAs (using UCSC known genes or the genome assemblies as a reference) but only 3-5% to NCBI bacterial genomes. I have also downloaded tons of data from other meta-bacterial sequencing projects, and while I get more hits (still < 10%), these sequences carry nothing in the way of annotation, so using these publicly available scaffolds has not helped matters at all.
My goal is to do something with the data: at the very least, get a dendrogram showing that pools 1 and 2 are separable from pools 3 and 4. But since the majority of the data is just raw short sequences that I can neither align to a reference nor assemble, I do not really know what I am supposed to do with it.
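To make the dendrogram goal concrete, here is a minimal sketch of one reference-free idea: compare the pools by their k-mer composition and cluster them, so nothing has to be aligned or assembled. Everything in it is an assumption on my part rather than an established pipeline for this data: the function names, the choice of k, Bray-Curtis as the distance, and the naive average-linkage clustering are all just illustrative, and reading the actual FASTQ files is left out.

```python
from collections import Counter
from itertools import combinations

def kmer_profile(reads, k=4):
    """Relative k-mer frequencies for an iterable of read strings."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip k-mers with N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

def bray_curtis(p, q):
    """Bray-Curtis dissimilarity between two profiles (0 means identical)."""
    keys = set(p) | set(q)
    num = sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in keys)
    den = sum(p.get(x, 0.0) + q.get(x, 0.0) for x in keys)
    return num / den if den else 0.0

def average_linkage(names, dist):
    """Naive average-linkage clustering; returns the tree as nested tuples."""
    clusters = [(name, [name]) for name in names]  # (subtree, member names)
    while len(clusters) > 1:
        best = None
        for (i, (_, a)), (j, (_, b)) in combinations(enumerate(clusters), 2):
            # mean pairwise distance between the members of the two clusters
            d = sum(dist[frozenset((x, y))] for x in a for y in b) / (len(a) * len(b))
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return clusters[0][0]

def cluster_pools(pools, k=4):
    """pools maps a pool name to its reads; returns (tree, pairwise distances)."""
    profiles = {name: kmer_profile(reads, k) for name, reads in pools.items()}
    dist = {frozenset((a, b)): bray_curtis(profiles[a], profiles[b])
            for a, b in combinations(profiles, 2)}
    return average_linkage(list(profiles), dist), dist
```

Calling `cluster_pools({"pool1": reads1, "pool2": reads2, "pool3": reads3, "pool4": reads4})` would return a nested tuple describing the tree plus the pairwise distances. For an actual plot, the same distances could be passed to `scipy.cluster.hierarchy.linkage` and `dendrogram`. With only 4 pools, any separation this shows would be suggestive at best.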
Can someone please point me in the right direction? I have read quite a bit of this forum today and there is a ton of info, maybe too much in terms of the different programs available, but my problem just does not appear to fit in any of the categories and does not seem to be solvable based on the descriptions of the 10-15 programs I have read about.
Thanks in advance