  • #31
    Originally posted by jpummil View Post
    So, I do recall the group who sequenced the MP data mentioning an insertion size of 6600? You also mentioned a few posts back using the MP data for scaffolding...did you intend for me to omit it from the initial assemblies and perform a secondary scaffolding step with that data?

    Ahhh...yup, gzip allowed me to attach the files. If I can figure out why the PE file in question didn't merge well, I want to correct that and then try an assembly in Ray using the merged data (as single reads) and the unmerged files as paired ends as you indicated.
    That PE library seems to peak at 479bp. So if the reads are less than 250bp, they generally won't overlap.
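    As a quick sanity check on that arithmetic (numbers taken from above), a pair can only be merged when twice the read length exceeds the insert size:

    ```shell
    # Insert size vs. read length: pairs overlap only when 2 * readlen > insert
    insert=479
    readlen=250
    overlap=$((2 * readlen - insert))
    echo "$overlap"   # prints 21: only 21 bp of overlap, too short for confident merging
    ```

    So even 250bp reads only barely reach each other at a 479bp insert, which is why most of that library won't merge.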

    Your MP library seems to peak at 91 bp (having a short-insert peak is not atypical for LMP libraries), but it also seems to have another peak at around 6000bp, which is good. You could map all of the LMP data to the contigs for scaffolding. It may or may not be useful to first remove all reads that map paired to the same contig.
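    A sketch of that optional filtering step, assuming you have already mapped the LMP reads to the contigs and have a SAM file (the function name is hypothetical): in SAM, field 7 (RNEXT) is "=" when both reads of a pair hit the same reference, so dropping those records keeps only the pairs that could actually link two contigs.

    ```shell
    # Keep SAM headers plus only records whose mate mapped to a different contig
    # (RNEXT, field 7, is "=" when the mate is on the same reference)
    cross_contig_pairs() {
        awk '/^@/ || $7 != "="' "$1"
    }
    ```

    e.g. `cross_contig_pairs lmp_mapped.sam > lmp_crosscontig.sam` before handing the links to a scaffolder.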

    Whether or not you include the LMP data in with the normal data for contig generation is up to you; I don't know exactly how Ray uses different libraries information. AllPathsLG, for example, requires you to give it both short and long insert data and uses the two datasets in different ways. There are also various standalone scaffolding tools for use with LMP data and assembled contigs, but I don't know which ones are good.
    Last edited by Brian Bushnell; 04-26-2014, 08:12 AM.

    Comment


    • #32
      Miss me?

      Been running different variations of the assembly to see if any trends toward improvement would be revealed. As of yet, little seems to make a difference. I even tried adding in the 20GB of .sff data to the last run in Ray...also to no real improvement.

      Merging data seemed to not improve things either. Though a couple of sets of the Illumina data didn't merge, I did have three sets that did, so I ran them as such (with the remains, still paired, run as well).

      So now, I am branching off to see if different assemblers will yield any drastic results. I think first up will be SOAPdenovo2, since it is well proven in large assemblies such as this. Just working on the .config file now. I expect I'll try Velvet soon as well, but it will require a move to a big SGI I have time on, since the memory requirements the last time I ran this (with about 50GB less data) exceeded 1TB.
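      For reference, a minimal SOAPdenovo2 .config for a single PE library looks roughly like this (insert size and filenames assumed from earlier in the thread; check the field names against your SOAPdenovo2 version's manual):

      ```text
      max_rd_len=250          # longest read after trimming
      [LIB]
      avg_ins=479             # the insert-size peak reported above
      reverse_seq=0           # 0 = forward-reverse (PE); 1 = reverse-forward (MP)
      asm_flags=3             # use this library for both contig building and scaffolding
      rank=1                  # libraries with the same rank are used together
      q1=New_PE_R1.fastq
      q2=New_PE_R2.fastq
      ```

      Each additional library (including the LMP data, with reverse_seq=1 and a higher rank) gets its own [LIB] block.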

      Comment


      • #33
        Originally posted by jpummil View Post
        Miss me?

        Been running different variations of the assembly to see if any trends toward improvement would be revealed. As of yet, little seems to make a difference. I even tried adding in the 20GB of .sff data to the last run in Ray...also to no real improvement.

        Merging data seemed to not improve things either. Though a couple of sets of the Illumina data didn't merge, I did have three sets that did, so I ran them as such (with the remains, still paired, run as well).

        So now, I am branching off to see if different assemblers will yield any drastic results. I think first up will be SOAPdenovo2, since it is well proven in large assemblies such as this. Just working on the .config file now. I expect I'll try Velvet soon as well, but it will require a move to a big SGI I have time on, since the memory requirements the last time I ran this (with about 50GB less data) exceeded 1TB.
        In our most recent direct comparison of Soap to Ray on the same data (a metagenome), Soap gave a somewhat more fragmented assembly, but with a much lower error rate. In that test, Soap gave terrible results when using merged reads, while Ray was run with merged reads.

        Comment


        • #34
          Sorry all, got side tracked. Back on things again now :-D

          I have a velveth run completed on the data, and am awaiting a re-run of the velvetg step (ran out of RAM...specified 1.5TB and it wouldn't quite fit).

          In the meantime, I need to finish up my SOAP config file and run that as well.

          I would also like to use bbtrim to do some quality trimming. I have already used bbduk to remove adapters, but have not yet run a quality trim. Why am I running a big assembly prior to doing this? To see what I end up with both ways!

          So, for using bbtrim on PE data, what additional flags are needed to ensure the data stays "paired"? Is defining in= and in2= along with out= and out2= enough? I plan to leave the trimq at 10.

          Comment


          • #35
            Originally posted by jpummil View Post
            So, for using bbtrim on PE data, what additional flags are needed to ensure the data stays "paired"? Is defining in= and in2= along with out= and out2= enough? I plan to leave the trimq at 10.
            Yes, that's sufficient. As long as you define both in and in2, all of my tools will keep pairs together.

            Comment


            • #36
              Thanks Brian!

              As I am just doing the quality trimming, I won't be providing a reference, correct? I assume the reference was for the adapter-trimming portion.

              Also, will I need to change any settings to go ahead and do the Mate Paired reads I am using? They have already been trimmed of adapters and reverse-complemented.

              Essentially, I will be running this: bbtrim.sh in1=$DATA/New_PE_R1.fastq in2=$DATA/New_PE_R2.fastq out=$DATA2/New_PE_R1.fastq out2=$DATA2/New_PE_R2.fastq trimq=10 qtrim=rl

              Comment


              • #37
                That's right; a reference is not needed for quality trimming. And that command line looks fine. You could, optionally, include a flag like 'minlength=50' to throw away reads that end up shorter than 50bp after trimming, as they won't be very useful.

                Comment


                • #38
                  Wow! It's certainly FAST! Trimmed all 209GB in under 7 minutes using 12 compute cores ;-)

                  Before:

                  -rw-rw-r-- 1 jpummil jpummil 16G Apr 21 11:46 MiSeq_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 16G Apr 21 11:46 MiSeq_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G Apr 18 20:14 New_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G Apr 18 20:14 New_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 21G Apr 21 15:55 Old_MP_RC01.fastq
                  -rw-rw-r-- 1 jpummil jpummil 21G Apr 21 15:54 Old_MP_RC02.fastq
                  -rw-rw-r-- 1 jpummil jpummil 23G Apr 21 11:52 Old_PE_s_1_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 23G Apr 21 11:52 Old_PE_s_1_2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 20G Apr 21 11:54 Old_PE_s_2_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 20G Apr 21 11:54 Old_PE_s_2_2.fastq
                  After:
                  -rw-rw-r-- 1 jpummil jpummil 16G May 14 12:36 MiSeq_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 16G May 14 12:36 MiSeq_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:35 New_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:35 New_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G May 14 12:41 Old_MP_RC01.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:41 Old_MP_RC02.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G May 14 12:38 Old_PE_s_1_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 21G May 14 12:38 Old_PE_s_1_2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 17G May 14 12:39 Old_PE_s_2_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:39 Old_PE_s_2_2.fastq

                  And, despite differences in the file sizes of the paired reads, line numbers are still identical as you stated :-D
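                  If you ever want to double-check that after a trimming step, comparing line counts is enough, since FASTQ uses exactly four lines per read (the function name here is just for illustration):

                  ```shell
                  # Paired FASTQ files are still in sync iff their line counts match
                  pairs_ok() {
                      [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
                  }
                  # e.g.: pairs_ok New_PE_R1.fastq New_PE_R2.fastq && echo "still paired"
                  ```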

                  Comment


                  • #39
                    I have fun making programs fast =)

                    Comment


                    • #40
                      Still some oddities when I use Ray for assembly (Velvet run is still cooking on a big mem machine...). So, by default, doesn't Ray scaffold things together before it completes? Just finished a run with the quality trimmed data and as I delve into it, I can see that we get both a Contigs.fasta and a Scaffolds.fasta of the same size:

                      [jpummil@razor-l1 RayOut57-TRIM-ALL-454]$ ls -lh
                      total 2.3G
                      drwxrwx--- 7 jpummil jpummil 32K May 16 14:58 BiologicalAbundances
                      -rw-rw-r-- 1 jpummil jpummil 13M May 16 14:07 ContigLengths.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.2G May 16 12:59 Contigs.fasta
                      -rw-rw-r-- 1 jpummil jpummil 371 May 14 21:11 CoverageDistributionAnalysis.txt
                      -rw-rw-r-- 1 jpummil jpummil 275K May 14 21:11 CoverageDistribution.txt
                      -rw-rw-r-- 1 jpummil jpummil 696 May 15 04:21 degreeDistribution.txt
                      -rw-rw-r-- 1 jpummil jpummil 2.2K May 16 14:58 ElapsedTime.txt
                      -rw-rw-r-- 1 jpummil jpummil 884 May 14 16:05 FilePartition.txt
                      -rw-rw-r-- 1 jpummil jpummil 2.6K May 15 04:20 GraphPartition.txt
                      -rw-rw-r-- 1 jpummil jpummil 28K May 15 17:35 LibraryData.xml
                      -rw-rw-r-- 1 jpummil jpummil 2.0K May 15 17:35 LibraryStatistics.txt
                      -rw-rw-r-- 1 jpummil jpummil 151 May 16 14:58 NeighbourhoodRelations.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.9K May 14 15:56 NetworkTest.txt
                      -rw-rw-r-- 1 jpummil jpummil 3.0K May 14 16:05 NumberOfSequences.txt
                      -rw-rw-r-- 1 jpummil jpummil 468 May 16 14:58 OutputNumbers.txt
                      -rw-rw-r-- 1 jpummil jpummil 15M May 16 12:28 ParallelPaths.txt
                      drwxrwxr-x 3 jpummil jpummil 32K May 16 13:36 quast_results
                      -rw-rw-r-- 1 jpummil jpummil 1.5K May 14 15:56 RayCommand.txt
                      -rw-rw-r-- 1 jpummil jpummil 18 May 14 15:56 RayPlatform_Version.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.5K May 14 15:56 RaySmartCommand.txt
                      -rw-rw-r-- 1 jpummil jpummil 10 May 14 15:56 RayVersion.txt
                      -rw-rw-r-- 1 jpummil jpummil 25M May 16 14:07 ScaffoldComponents.txt
                      -rw-rw-r-- 1 jpummil jpummil 11M May 16 14:07 ScaffoldLengths.txt
                      -rw-rw-r-- 1 jpummil jpummil 4.2M May 16 14:07 ScaffoldLinks.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.2G May 16 14:44 Scaffolds.fasta
                      -rw-rw-r-- 1 jpummil jpummil 27K May 15 17:25 SeedLengthDistribution.txt
                      -rw-rw-r-- 1 jpummil jpummil 2.1K May 14 16:05 SequencePartition.txt

                      Running Quast on both, the Scaffolds.fasta seems to have slightly better N50, a much larger "longest contig", and substantially fewer contigs:

                      Contigs.fasta
                      Assembly Contigs
                      # contigs (>= 0 bp) 598776
                      # contigs (>= 1000 bp) 348277
                      Total length (>= 0 bp) 1143773762
                      Total length (>= 1000 bp) 999982483
                      # contigs 490368
                      Largest contig 38923
                      Total length 1105527612
                      GC (%) 38.05
                      N50 3183
                      N75 1701
                      L50 98949
                      L75 218345
                      # N's per 100 kbp 0.00

                      Scaffolds.fasta
                      Assembly Scaffolds
                      # contigs (>= 0 bp) 536335
                      # contigs (>= 1000 bp) 297797
                      Total length (>= 0 bp) 1165606230
                      Total length (>= 1000 bp) 1031474385
                      # contigs 429232
                      Largest contig 64271
                      Total length 1127878012
                      GC (%) 38.05
                      N50 4279
                      N75 2048
                      L50 73691
                      L75 168757
                      # N's per 100 kbp 1936.87

                      Obviously I am confused over some details again...what am I neglecting?

                      Comment


                      • #41
                        Scaffolding will find read pairs (typically from the LMP library) that map to different contigs, and merge those contigs together. The overall number of bases should change very little, aside from the addition of a bunch of Ns that fill gaps between contigs. So, after scaffolding you would expect fewer contigs (since some got merged) that are, on average, longer, and with more Ns. So, your results look like what I would expect, although often scaffolding can have a much greater impact than it did in this case.
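                        A quick way to see the gap Ns that scaffolding introduced (the same thing behind the jump in "# N's per 100 kbp" in the QUAST output above); the function name is just for illustration:

                        ```shell
                        # Count N characters in the sequence lines of a FASTA
                        count_ns() {
                            grep -v '^>' "$1" | tr -cd 'Nn' | wc -c
                        }
                        # e.g.: compare count_ns Scaffolds.fasta against count_ns Contigs.fasta
                        ```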

                        Comment


                        • #42
                          Finally got a run completed with Velvet on a big SGI:

                          Job Name: Velvet
                          resources_used.memused=1491094532kb
                          resources_used.walltime=212:08:33

                          Used all the Illumina data (trimmed of adapters, but not yet quality trimmed), kmer set at 57, and insert lengths I attained by the mapping process Brian suggested.

                          Assembly contigs
                          # contigs (>= 0 bp) 675115
                          # contigs (>= 1000 bp) 334078
                          Total length (>= 0 bp) 1573635522
                          Total length (>= 1000 bp) 1378785964
                          # contigs 523061
                          Largest contig 334422
                          Total length 1514624042
                          GC (%) 38.38
                          N50 6382
                          N75 2114
                          L50 59468
                          L75 166760
                          # N's per 100 kbp 24159.38

                          A couple of other metrics from Velvet:
                          Median coverage depth = 15.875000
                          Using 517419173/737118134 reads (just 2/3 of the reads utilized???)


                          I wonder if I could improve things a bit by setting a value for -cov_cutoff and -exp_cov rather than using "auto"?
                          ./velvetg Snakev2-k57 -cov_cutoff auto -exp_cov auto -min_contig_lgth 300 -ins_length 479 -ins_length2 150 -ins_length3 150 -ins_length4 150 -ins_length5 6000 -shortMatePaired5 yes

                          Also ran KmerGenie against all the sequencer data (except the 454 stuff) and it came up with a kmer estimate of 65. I think my next run with Velvet may be to select 65 for my kmer estimate and use the data with both adapters removed AND quality trimmed.

                          Thought I might also re-run FastQC to double-check for highly repetitive data as well. Haven't checked since I used BBDuk and BBTrim. I might have missed some stuff, as I only specified one adapter to remove...
                          Attached Files
                          Last edited by jpummil; 05-23-2014, 07:26 AM.

                          Comment


                          • #43
                            Originally posted by jpummil View Post
                            A couple of other metrics from Velvet:
                            Median coverage depth = 15.875000
                            Using 517419173/737118134 reads (just 2/3 of the reads utilized???)

                            I wonder if I could improve things a bit by setting a value for -cov_cutoff and -exp_cov rather than using "auto"?
                            I can only suggest that you try assembling various different ways and see how it affects the assembly, in things like N50 and mapping rates and mapping error rates based on mapping the source reads back to the assembly.

                            Also ran KmerGenie against all the sequencer data (except the 454 stuff) and it came up with a kmer estimate of 65. I think my next run with Velvet may be to select 65 for my kmer estimate and use the data with both adapters removed AND quality trimmed.
                            As long as you have sufficient coverage, and sufficiently regular coverage, increasing the kmer length should ALWAYS give you a better assembly until you approach the read length. The optimal kmer length can be predicted by various tools, but again, the only way to know for sure is to actually assemble with different lengths.

                            I do suggest quality-trimming, though, at minimum to Q6. Going up from there may improve things, but again, that's just another variable to test.

                            It's usually best to only vary one parameter at a time when optimizing assemblies, though, keeping the others fixed.
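                            The N50 comparisons between trial assemblies don't need the assembler re-run; a small function over any FASTA does it (handles line-wrapped sequences; written here as an illustration, not the QUAST algorithm verbatim):

                            ```shell
                            # N50: the contig length at which the cumulative sum (largest first)
                            # passes half the total assembly length
                            n50() {
                                awk '/^>/ { if (len) print len; len = 0; next }
                                     { len += length($0) }
                                     END { if (len) print len }' "$1" |
                                sort -rn |
                                awk '{ lens[NR] = $1; total += $1 }
                                     END { half = total / 2; run = 0
                                           for (i = 1; i <= NR; i++) {
                                               run += lens[i]
                                               if (run >= half) { print lens[i]; exit }
                                           } }'
                            }
                            ```

                            e.g. `n50 Contigs.fasta` after each run while varying one parameter at a time.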

                            Comment


                            • #44
                              So, my LMP data is still "dirty", and while I am re-working it, I wanted to ask a few additional questions. First, "what's in a name"? In other words, do the names of files coming from your sequencing provider lend any hints about the actual process? For example, my Illumina LMP data was named as follows:

                              Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq
                              Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq

                              Next, how does one determine whether or not the files need to be reverse complemented? Is it possible to just reference something in the files that acts as a flag to alert you as to whether or not this additional step needs to be taken? I had been doing it "just because" someone had told me it probably needed it...rather than asking for the details. Now I am curious...

                              So, as I wanted to re-trace the whole process "to be sure", I went back and ran FastQC on both of the files again. Interestingly, they have a different set of "overrepresented sequences" from file to file:

                              GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC 12525355 14.98623291791357 TruSeq Adapter, Index 11 (100% over 50bp)
                              AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATG 1466041 1.7540765825168967 TruSeq Adapter, Index 11 (100% over 49bp)

                              and

                              GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG 10203340 12.20800765971617 Illumina Single End PCR Primer 1 (100% over 50bp)
                              AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCC 1435340 1.717343704541553 Illumina Single End PCR Primer 1 (100% over 50bp)
                              GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG 437070 0.5229418903841435 Illumina Single End PCR Primer 1 (98% over 50bp)
                              GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGGCGCCG 111902 0.13388757731660014 Illumina Single End PCR Primer 1 (98% over 50bp)
                              GATCGGAAGAGCGTCGTGTAGGGGAAGAGTGTAGATCTCGGTGGTCGCCG 96938 0.11598357464492669 Illumina Single End PCR Primer 1 (98% over 50bp)

                              These are identified as adapters on the first file, and single end PCR primers on the second file. Can I just add all of these to the trim list in bbtrim and do away with them to start with?

                              Comment


                              • #45
                                Originally posted by jpummil View Post
                                So, my LMP data is still "dirty", and while I am re-working it, I wanted to ask a few additional questions. First, "what's in a name"? In other words, do the names of files coming from your sequencing provider lend any hints about the actual process? For example, my Illumina LMP data was named as follows:

                                Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq
                                Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq
                                No, someone named them that. GGCTAC is probably a bar code and L3 probably means they came from lane 3; for the others, you could contact whoever gave you the data, as they probably have some conventions. For LMP it is usually important to know the protocol, as different protocols have different adapter types and orientations.

                                Next, how does one determine whether or not the files need to be reverse complemented? Is it possible to just reference something in the files that acts as a flag to alert you as to whether or not this additional step needs to be taken? I had been doing it "just because" someone had told me it probably needed it...rather than asking for the details. Now I am curious...
                                You should map the reads to a draft assembly and see what their orientations are; different methods give different orientations - not simply reverse-complemented, but things like same direction, innie, outie, etc. Also, assemblers will often let you specify an orientation rather than requiring you to change the reads yourself.

                                So, as I wanted to re-trace the whole process "to be sure", I went back and ran FastQC on both of the files again. Interestingly, they have a different set of "overrepresented sequences" from file to file:

                                GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC 12525355 14.98623291791357 TruSeq Adapter, Index 11 (100% over 50bp)
                                AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATG 1466041 1.7540765825168967 TruSeq Adapter, Index 11 (100% over 49bp)

                                and

                                GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG 10203340 12.20800765971617 Illumina Single End PCR Primer 1 (100% over 50bp)
                                AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCC 1435340 1.717343704541553 Illumina Single End PCR Primer 1 (100% over 50bp)
                                GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG 437070 0.5229418903841435 Illumina Single End PCR Primer 1 (98% over 50bp)
                                GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGGCGCCG 111902 0.13388757731660014 Illumina Single End PCR Primer 1 (98% over 50bp)
                                GATCGGAAGAGCGTCGTGTAGGGGAAGAGTGTAGATCTCGGTGGTCGCCG 96938 0.11598357464492669 Illumina Single End PCR Primer 1 (98% over 50bp)

                                These are identified as adapters on the first file, and single end PCR primers on the second file. Can I just add all of these to the trim list in bbtrim and do away with them to start with?
                                Probably... depending on the orientation. As long as the bases to the left of these sequences are 'good' and the ones to the right are 'bad' (which is the case for some common LMP protocols), then yes, just don't reverse-complement read 2 beforehand.
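                                A sketch of that, building the command without running it here (filenames are placeholders; literal=, ktrim, k, mink, and hdist are documented BBDuk options, but verify them against your version's shell script help):

                                ```shell
                                # The two dominant FastQC hits, passed to BBDuk as literal kmers for right-trimming
                                a1=GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC   # TruSeq Adapter, Index 11
                                a2=GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG   # Illumina SE PCR Primer 1
                                cmd="bbduk.sh in1=LMP_R1.fastq in2=LMP_R2.fastq out1=LMP_clean_R1.fastq out2=LMP_clean_R2.fastq literal=$a1,$a2 ktrim=r k=23 mink=11 hdist=1"
                                echo "$cmd"
                                ```

                                mink=11 lets shorter adapter fragments at read ends match, and hdist=1 tolerates one mismatch, which covers the 98%-identity variants FastQC reported.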

                                Comment
