So, first and foremost, greetings to all of you! I've been browsing here for a while, but this is my first post, and I thought I'd make it a good one.
Up front, I am not a Biologist or Bioinformatician...merely a curious and adventurous manager of HPC for an R1 university. That being said, I am working on a de novo assembly of a 1.7Gb reptilian genome from 209GB of Illumina data. While I have successfully assembled a few Mb-sized organisms in the past, I have never tried anything on this scale! The data I currently have is as follows:
Illumina HiSeq 2000 Mate Pairs (Reverse Comped - 2012)
-rw-r--r-- 1 jpummil jpummil 21G Mar 25 21:49 RattleSnakeReverseCompedRead1.fastq
-rw-r--r-- 1 jpummil jpummil 21G Mar 25 21:50 RattleSnakeReverseCompedRead2.fastq
Illumina HiSeq 2000 Paired End (2012)
-rw-r--r-- 1 jpummil jpummil 27G Mar 25 21:53 s_1_1_sequence.fastq
-rw-r--r-- 1 jpummil jpummil 27G Mar 25 21:53 s_1_2_sequence.fastq
Illumina HiSeq 2000 Paired End (2012)
-rw-r--r-- 1 jpummil jpummil 24G Mar 25 21:54 s_2_1_sequence.fastq
-rw-r--r-- 1 jpummil jpummil 24G Mar 25 21:54 s_2_2_sequence.fastq
Illumina HiSeq 2000 Paired End (2014)
-rw-r--r-- 1 jpummil jpummil 19G Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R1_001.fastq
-rw-r--r-- 1 jpummil jpummil 19G Mar 25 21:57 3_Snake_4117_TSDR27_ATTCCT_L003_R2_001.fastq
Illumina MiSeq 2000 Paired End (2014)
-rw-r--r-- 1 jpummil jpummil 16G Mar 25 21:57 MIKE_S1_L001_R1_001.fastq
-rw-r--r-- 1 jpummil jpummil 16G Mar 25 21:57 MIKE_S1_L001_R2_001.fastq
So, as I look at this pile of data, I am thinking that my steps should be as follows:
Trim adapters and low-quality bases with FastX or Trimmomatic.
Maybe make a second copy of the trimmed data and use FLASH to merge overlapping read pairs?
Possibly apply digital normalization to reduce redundant high-coverage reads and keep just the "best" data?
Determine the best way of selecting assembly parameters, such as kmer size, to optimize the result.
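To make sure I understand the digital normalization idea before trying it on real data, here is a toy sketch in Python of how I believe it works: a read is kept only if the median count of its kmers, among the reads kept so far, is still below a cutoff. This is my own illustration, not the code of any real tool. The function names, the tiny k, and the cutoff are made up for the example, and it uses an exact in-memory counter and ignores reverse complements, which would not be practical at the scale of this data set.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def normalize(reads, k=4, cutoff=2):
    """Toy digital normalization (hypothetical helper, for illustration).

    Keeps a read only while the median count of its kmers, counted over
    the reads kept so far, is below `cutoff`.
    """
    counts = Counter()
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read is shorter than k, nothing to count
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)  # only kept reads contribute to the counts
    return kept


if __name__ == "__main__":
    # Five copies of a "high coverage" read plus one rare read.
    reads = ["ACGTTGCAAG"] * 5 + ["TTTTGGGGCC"]
    for read in normalize(reads, k=4, cutoff=2):
        print(read)
```

With these made-up reads, the redundant copies are thinned out while the rare read survives, which is the behavior I would hope for on the over-sequenced parts of the genome.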
I have been using Ray and Velvet in my early experiments on the 2012 data without much success, I think because our overall coverage was poor. BUT...with the new data added, I want to start off on a fresh note and learn something as I go that I can document and pass on to the REAL biologists if/when they are confronted with such a project.
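Since I keep guessing at coverage, here is a minimal Python sketch of how I plan to measure it rather than assume it: count the bases in each FASTQ file and divide by the expected genome size. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function names are my own. The 1.7Gb genome size is the estimate mentioned above, not a measured value.

```python
import io


def count_bases(handle):
    """Sum sequence lengths in a FASTQ stream with 4 lines per record."""
    total = 0
    for i, line in enumerate(handle):
        if i % 4 == 1:  # the second line of each record is the sequence
            total += len(line.strip())
    return total


def estimated_coverage(total_bases, genome_size):
    """Average depth if the bases were spread evenly over the genome."""
    return total_bases / genome_size


if __name__ == "__main__":
    # Tiny in-memory example; for real data, open each .fastq file instead.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
    bases = count_bases(demo)
    print(bases, estimated_coverage(bases, genome_size=5))
```

For the real files, I would sum `count_bases(open(path))` over every file and call `estimated_coverage(total, 1.7e9)`, once before trimming and once after, to see how much usable depth is actually left.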
If there are responses to this thread, it is highly likely that I will ask some very non-biological questions as follow-ups, but it is all in the process of learning and I will be very appreciative of explanations and pointers.
Cheers,
--Jeff