  • #31
    Originally posted by jpummil View Post
    So, I do recall the group who sequenced the MP data mentioning an insertion size of 6600? You also mentioned a few posts back using the MP data for scaffolding...did you intend for me to omit it from the initial assemblies and perform a secondary scaffolding step with that data?

    Ahhh...yup, gzip allowed me to attach the files. If I can figure out why the PE file in question didn't merge well, I want to correct that and then try an assembly in Ray using the merged data (as single reads) and the unmerged files as paired ends as you indicated.
    That PE library seems to peak at 479bp. So if the reads are less than 250bp, they generally won't overlap.
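    As a quick sanity check on that arithmetic (numbers taken from above), a pair can only be merged when twice the read length exceeds the insert size:

    ```shell
    # Insert size vs. read length: pairs overlap only when 2 * readlen > insert
    insert=479
    readlen=250
    overlap=$((2 * readlen - insert))
    echo "$overlap"   # prints 21: only 21 bp of overlap, too short for confident merging
    ```

    So even 250bp reads only barely reach each other at a 479bp insert, which is why most of that library won't merge.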

    Your MP library seems to peak at 91 bp (having a short-insert peak is not atypical for LMP libraries), but it also seems to have another peak at around 6000bp, which is good. You could map all of the LMP data to the contigs for scaffolding. It may or may not be useful to first remove all reads that map paired to the same contig.
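    A sketch of that optional filtering step, assuming you have already mapped the LMP reads to the contigs and have a SAM file (the function name is hypothetical): in SAM, field 7 (RNEXT) is "=" when both reads of a pair hit the same reference, so dropping those records keeps only the pairs that could actually link two contigs.

    ```shell
    # Keep SAM headers plus only records whose mate mapped to a different contig
    # (RNEXT, field 7, is "=" when the mate is on the same reference)
    cross_contig_pairs() {
        awk '/^@/ || $7 != "="' "$1"
    }
    ```

    e.g. `cross_contig_pairs lmp_mapped.sam > lmp_crosscontig.sam` before handing the links to a scaffolder.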

    Whether or not you include the LMP data in with the normal data for contig generation is up to you; I don't know exactly how Ray uses different libraries information. AllPathsLG, for example, requires you to give it both short and long insert data and uses the two datasets in different ways. There are also various standalone scaffolding tools for use with LMP data and assembled contigs, but I don't know which ones are good.
    Last edited by Brian Bushnell; 04-26-2014, 08:12 AM.

    Comment


    • #32
      Miss me?

      Been running different variations of the assembly to see if any trends toward improvement would be revealed. As of yet, little seems to make a difference. I even tried adding in the 20GB of .sff data to the last run in Ray...also to no real improvement.

      Merging data seemed to not improve things either. Though a couple of sets of the Illumina data didn't merge, I did have three sets that did, so I ran them as such (with the remains, still paired, run as well).

      So now, I am branching off to see if different assemblers will yield any drastic results. I think first up will be SOAPdenovo2, since it is well proven in large assemblies such as this. Just working on the .config file now. I expect I'll try Velvet soon as well, but it will require a move to a big SGI I have time on, since the memory requirements the last time I ran this (with about 50GB less data) exceeded 1TB.
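      For reference, a minimal SOAPdenovo2 .config for a single PE library looks roughly like this (insert size and filenames assumed from earlier in the thread; check the field names against your SOAPdenovo2 version's manual):

      ```text
      max_rd_len=250          # longest read after trimming
      [LIB]
      avg_ins=479             # the insert-size peak reported above
      reverse_seq=0           # 0 = forward-reverse (PE); 1 = reverse-forward (MP)
      asm_flags=3             # use this library for both contig building and scaffolding
      rank=1                  # libraries with the same rank are used together
      q1=New_PE_R1.fastq
      q2=New_PE_R2.fastq
      ```

      Each additional library (including the LMP data, with reverse_seq=1 and a higher rank) gets its own [LIB] block.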

      Comment


      • #33
        Originally posted by jpummil View Post
        Miss me?

        Been running different variations of the assembly to see if any trends toward improvement would be revealed. As of yet, little seems to make a difference. I even tried adding in the 20GB of .sff data to the last run in Ray...also to no real improvement.

        Merging data seemed to not improve things either. Though a couple of sets of the Illumina data didn't merge, I did have three sets that did, so I ran them as such (with the remains, still paired, run as well).

        So now, I am branching off to see if different assemblers will yield any drastic results. I think first up will be SOAPdenovo2, since it is well proven in large assemblies such as this. Just working on the .config file now. I expect I'll try Velvet soon as well, but it will require a move to a big SGI I have time on, since the memory requirements the last time I ran this (with about 50GB less data) exceeded 1TB.
        In our most recent direct comparison of Soap to Ray on the same data (a metagenome), Soap gave a somewhat more fragmented assembly, but with a much lower error rate. In that test, Soap gave terrible results when using merged reads, while Ray was run with merged reads.

        Comment


        • #34
          Sorry all, got side tracked. Back on things again now :-D

          I have a velveth run completed on the data, and am awaiting a re-run of the velvetg step (ran out of RAM...specified 1.5TB and it wouldn't quite fit).

          In the meantime, I need to finish up my SOAP config file and run that as well.

          I would also like to use bbtrim to do some quality trimming. I have already used bbduk to remove adapters, but have not yet run a quality trim. Why am I running a big assembly prior to doing this? To see what I end up with both ways!

          So, for using bbtrim on PE data, what additional flags are needed to ensure the data stays "paired"? Is defining in= and in2= along with out= and out2= enough? I plan to leave the trimq at 10.

          Comment


          • #35
            Originally posted by jpummil View Post
            So, for using bbtrim on PE data, what additional flags are needed to ensure the data stays "paired"? Is defining in= and in2= along with out= and out2= enough? I plan to leave the trimq at 10.
            Yes, that's sufficient. As long as you define both in and in2, all of my tools will keep pairs together.

            Comment


            • #36
              Thanks Brian!

              As I am just doing the quality trimming, I won't be providing a reference, correct? I assume the reference was for the adapter-trimming portion.

              Also, will I need to change any settings to go ahead and do the Mate Paired reads I am using? They have already been trimmed of adapters and reverse-complemented.

              Essentially, I will be running this: bbtrim.sh in1=$DATA/New_PE_R1.fastq in2=$DATA/New_PE_R2.fastq out=$DATA2/New_PE_R1.fastq out2=$DATA2/New_PE_R2.fastq trimq=10 qtrim=rl

              Comment


              • #37
                That's right; a reference is not needed for quality trimming. And that command line looks fine. You could, optionally, include a flag like 'minlength=50' to throw away reads that end up shorter than 50bp after trimming, as they won't be very useful.

                Comment


                • #38
                  Wow! It's certainly FAST! Trimmed all 209GB in under 7 minutes using 12 compute cores ;-)

                  Before:

                  -rw-rw-r-- 1 jpummil jpummil 16G Apr 21 11:46 MiSeq_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 16G Apr 21 11:46 MiSeq_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G Apr 18 20:14 New_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G Apr 18 20:14 New_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 21G Apr 21 15:55 Old_MP_RC01.fastq
                  -rw-rw-r-- 1 jpummil jpummil 21G Apr 21 15:54 Old_MP_RC02.fastq
                  -rw-rw-r-- 1 jpummil jpummil 23G Apr 21 11:52 Old_PE_s_1_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 23G Apr 21 11:52 Old_PE_s_1_2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 20G Apr 21 11:54 Old_PE_s_2_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 20G Apr 21 11:54 Old_PE_s_2_2.fastq
                  After:
                  -rw-rw-r-- 1 jpummil jpummil 16G May 14 12:36 MiSeq_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 16G May 14 12:36 MiSeq_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:35 New_PE_R1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:35 New_PE_R2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G May 14 12:41 Old_MP_RC01.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:41 Old_MP_RC02.fastq
                  -rw-rw-r-- 1 jpummil jpummil 19G May 14 12:38 Old_PE_s_1_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 21G May 14 12:38 Old_PE_s_1_2.fastq
                  -rw-rw-r-- 1 jpummil jpummil 17G May 14 12:39 Old_PE_s_2_1.fastq
                  -rw-rw-r-- 1 jpummil jpummil 18G May 14 12:39 Old_PE_s_2_2.fastq

                  And, despite differences in the file sizes of the paired reads, line numbers are still identical as you stated :-D
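                  If you ever want to double-check that after a trimming step, comparing line counts is enough, since FASTQ uses exactly four lines per read (the function name here is just for illustration):

                  ```shell
                  # Paired FASTQ files are still in sync iff their line counts match
                  pairs_ok() {
                      [ "$(wc -l < "$1")" -eq "$(wc -l < "$2")" ]
                  }
                  # e.g.: pairs_ok New_PE_R1.fastq New_PE_R2.fastq && echo "still paired"
                  ```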

                  Comment


                  • #39
                    I have fun making programs fast =)

                    Comment


                    • #40
                      Still some oddities when I use Ray for assembly (Velvet run is still cooking on a big mem machine...). So, by default, doesn't Ray scaffold things together before it completes? Just finished a run with the quality trimmed data and as I delve into it, I can see that we get both a Contigs.fasta and a Scaffolds.fasta of the same size:

                      [jpummil@razor-l1 RayOut57-TRIM-ALL-454]$ ls -lh
                      total 2.3G
                      drwxrwx--- 7 jpummil jpummil 32K May 16 14:58 BiologicalAbundances
                      -rw-rw-r-- 1 jpummil jpummil 13M May 16 14:07 ContigLengths.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.2G May 16 12:59 Contigs.fasta
                      -rw-rw-r-- 1 jpummil jpummil 371 May 14 21:11 CoverageDistributionAnalysis.txt
                      -rw-rw-r-- 1 jpummil jpummil 275K May 14 21:11 CoverageDistribution.txt
                      -rw-rw-r-- 1 jpummil jpummil 696 May 15 04:21 degreeDistribution.txt
                      -rw-rw-r-- 1 jpummil jpummil 2.2K May 16 14:58 ElapsedTime.txt
                      -rw-rw-r-- 1 jpummil jpummil 884 May 14 16:05 FilePartition.txt
                      -rw-rw-r-- 1 jpummil jpummil 2.6K May 15 04:20 GraphPartition.txt
                      -rw-rw-r-- 1 jpummil jpummil 28K May 15 17:35 LibraryData.xml
                      -rw-rw-r-- 1 jpummil jpummil 2.0K May 15 17:35 LibraryStatistics.txt
                      -rw-rw-r-- 1 jpummil jpummil 151 May 16 14:58 NeighbourhoodRelations.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.9K May 14 15:56 NetworkTest.txt
                      -rw-rw-r-- 1 jpummil jpummil 3.0K May 14 16:05 NumberOfSequences.txt
                      -rw-rw-r-- 1 jpummil jpummil 468 May 16 14:58 OutputNumbers.txt
                      -rw-rw-r-- 1 jpummil jpummil 15M May 16 12:28 ParallelPaths.txt
                      drwxrwxr-x 3 jpummil jpummil 32K May 16 13:36 quast_results
                      -rw-rw-r-- 1 jpummil jpummil 1.5K May 14 15:56 RayCommand.txt
                      -rw-rw-r-- 1 jpummil jpummil 18 May 14 15:56 RayPlatform_Version.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.5K May 14 15:56 RaySmartCommand.txt
                      -rw-rw-r-- 1 jpummil jpummil 10 May 14 15:56 RayVersion.txt
                      -rw-rw-r-- 1 jpummil jpummil 25M May 16 14:07 ScaffoldComponents.txt
                      -rw-rw-r-- 1 jpummil jpummil 11M May 16 14:07 ScaffoldLengths.txt
                      -rw-rw-r-- 1 jpummil jpummil 4.2M May 16 14:07 ScaffoldLinks.txt
                      -rw-rw-r-- 1 jpummil jpummil 1.2G May 16 14:44 Scaffolds.fasta
                      -rw-rw-r-- 1 jpummil jpummil 27K May 15 17:25 SeedLengthDistribution.txt
                      -rw-rw-r-- 1 jpummil jpummil 2.1K May 14 16:05 SequencePartition.txt

                      Running Quast on both, the Scaffolds.fasta seems to have slightly better N50, a much larger "longest contig", and substantially fewer contigs:

                      Contigs.fasta
                      Assembly Contigs
                      # contigs (>= 0 bp) 598776
                      # contigs (>= 1000 bp) 348277
                      Total length (>= 0 bp) 1143773762
                      Total length (>= 1000 bp) 999982483
                      # contigs 490368
                      Largest contig 38923
                      Total length 1105527612
                      GC (%) 38.05
                      N50 3183
                      N75 1701
                      L50 98949
                      L75 218345
                      # N's per 100 kbp 0.00

                      Scaffolds.fasta
                      Assembly Scaffolds
                      # contigs (>= 0 bp) 536335
                      # contigs (>= 1000 bp) 297797
                      Total length (>= 0 bp) 1165606230
                      Total length (>= 1000 bp) 1031474385
                      # contigs 429232
                      Largest contig 64271
                      Total length 1127878012
                      GC (%) 38.05
                      N50 4279
                      N75 2048
                      L50 73691
                      L75 168757
                      # N's per 100 kbp 1936.87

                      Obviously I am confused over some details again...what am I neglecting?

                      Comment


                      • #41
                        Scaffolding will find read pairs (typically from the LMP library) that map to different contigs, and merge those contigs together. The overall number of bases should change very little, aside from the addition of a bunch of Ns that fill gaps between contigs. So, after scaffolding you would expect fewer contigs (since some got merged) that are, on average, longer, and with more Ns. So, your results look like what I would expect, although often scaffolding can have a much greater impact than it did in this case.
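                        A quick way to see the gap Ns that scaffolding introduced (the same thing behind the jump in "# N's per 100 kbp" in the QUAST output above); the function name is just for illustration:

                        ```shell
                        # Count N characters in the sequence lines of a FASTA
                        count_ns() {
                            grep -v '^>' "$1" | tr -cd 'Nn' | wc -c
                        }
                        # e.g.: compare count_ns Scaffolds.fasta against count_ns Contigs.fasta
                        ```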

                        Comment


                        • #42
                          Finally got a run completed with Velvet on a big SGI:

                          Job Name: Velvet
                          resources_used.memused=1491094532kb
                          resources_used.walltime=212:08:33

                          Used all the Illumina data (trimmed of adapters, but not yet quality trimmed), kmer set at 57, and insert lengths I attained by the mapping process Brian suggested.

                          Assembly contigs
                          # contigs (>= 0 bp) 675115
                          # contigs (>= 1000 bp) 334078
                          Total length (>= 0 bp) 1573635522
                          Total length (>= 1000 bp) 1378785964
                          # contigs 523061
                          Largest contig 334422
                          Total length 1514624042
                          GC (%) 38.38
                          N50 6382
                          N75 2114
                          L50 59468
                          L75 166760
                          # N's per 100 kbp 24159.38

                          A couple of other metrics from Velvet:
                          Median coverage depth = 15.875000
                          Using 517419173/737118134 reads (just 2/3 of the reads utilized???)


                          I wonder if I could improve things a bit by setting a value for -cov_cutoff and -exp_cov rather than using "auto"?
                          ./velvetg Snakev2-k57 -cov_cutoff auto -exp_cov auto -min_contig_lgth 300 -ins_length 479 -ins_length2 150 -ins_length3 150 -ins_length4 150 -ins_length5 6000 -shortMatePaired5 yes

                          Also ran KmerGenie against all the sequencer data (except the 454 stuff) and it came up with a kmer estimate of 65. I think my next run with Velvet may be to select 65 for my kmer estimate and use the data with both adapters removed AND quality trimmed.

                          Thought I might also re-run FastQC to double-check for highly repetitive data as well. Haven't checked since I used BBDuk and BBTrim. I might have missed some stuff, as I only specified one adapter to remove...
                          Attached Files
                          Last edited by jpummil; 05-23-2014, 07:26 AM.

                          Comment


                          • #43
                            Originally posted by jpummil View Post
                            A couple of other metrics from Velvet:
                            Median coverage depth = 15.875000
                            Using 517419173/737118134 reads (just 2/3 of the reads utilized???)

                            I wonder if I could improve things a bit by setting a value for -cov_cutoff and -exp_cov rather than using "auto"?
                            I can only suggest that you try assembling various different ways and see how it affects the assembly, in things like N50 and mapping rates and mapping error rates based on mapping the source reads back to the assembly.

                            Also ran KmerGenie against all the sequencer data (except the 454 stuff) and it came up with a kmer estimate of 65. I think my next run with Velvet may be to select 65 for my kmer estimate and use the data with both adapters removed AND quality trimmed.
                            As long as you have sufficient coverage, and sufficiently regular coverage, increasing the kmer length should ALWAYS give you a better assembly until you approach the read length. The optimal kmer length can be predicted by various tools, but again, the only way to know for sure is to actually assemble with different lengths.

                            I do suggest quality-trimming, though, at minimum to Q6. Going up from there may improve things, but again, that's just another variable to test.

                            It's usually best to only vary one parameter at a time when optimizing assemblies, though, keeping the others fixed.
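                            The N50 comparisons between trial assemblies don't need the assembler re-run; a small function over any FASTA does it (handles line-wrapped sequences; written here as an illustration, not the QUAST algorithm verbatim):

                            ```shell
                            # N50: the contig length at which the cumulative sum (largest first)
                            # passes half the total assembly length
                            n50() {
                                awk '/^>/ { if (len) print len; len = 0; next }
                                     { len += length($0) }
                                     END { if (len) print len }' "$1" |
                                sort -rn |
                                awk '{ lens[NR] = $1; total += $1 }
                                     END { half = total / 2; run = 0
                                           for (i = 1; i <= NR; i++) {
                                               run += lens[i]
                                               if (run >= half) { print lens[i]; exit }
                                           } }'
                            }
                            ```

                            e.g. `n50 Contigs.fasta` after each run while varying one parameter at a time.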

                            Comment


                            • #44
                              So, my LMP data is still "dirty", and while I am re-working it, I wanted to ask a few additional questions. First, "what's in a name"? In other words, do the names of files coming from your sequencing provider lend any hints about the actual process? For example, my Illumina LMP data was named as follows:

                              Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq
                              Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq

                              Next, how does one determine whether or not the files need to be reverse complemented? Is it possible to just reference something in the files that acts as a flag to alert you as to whether or not this additional step needs to be taken? I had been doing it "just because" someone had told me it probably needed it...rather than asking for the details. Now I am curious...

                              So, as I wanted to re-trace the whole process "to be sure", I went back and ran FastQC on both of the files again. Interestingly, they have a different set of "overrepresented sequences" from file to file:

                              GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC 12525355 14.98623291791357 TruSeq Adapter, Index 11 (100% over 50bp)
                              AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATG 1466041 1.7540765825168967 TruSeq Adapter, Index 11 (100% over 49bp)

                              and

                              GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG 10203340 12.20800765971617 Illumina Single End PCR Primer 1 (100% over 50bp)
                              AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCC 1435340 1.717343704541553 Illumina Single End PCR Primer 1 (100% over 50bp)
                              GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG 437070 0.5229418903841435 Illumina Single End PCR Primer 1 (98% over 50bp)
                              GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGGCGCCG 111902 0.13388757731660014 Illumina Single End PCR Primer 1 (98% over 50bp)
                              GATCGGAAGAGCGTCGTGTAGGGGAAGAGTGTAGATCTCGGTGGTCGCCG 96938 0.11598357464492669 Illumina Single End PCR Primer 1 (98% over 50bp)

                              These are identified as adapters on the first file, and single end PCR primers on the second file. Can I just add all of these to the trim list in bbtrim and do away with them to start with?

                              Comment


                              • #45
                                Originally posted by jpummil View Post
                                So, my LMP data is still "dirty", and while I am re-working it, I wanted to ask a few additional questions. First, "what's in a name"? In other words, do the names of files coming from your sequencing provider lend any hints about the actual process? For example, my Illumina LMP data was named as follows:

                                Rattle_Snake__RS_212M__L3_GGCTAC_L003_R1_001.fastq
                                Rattle_Snake__RS_212M__L3_GGCTAC_L003_R2_001.fastq
                                No, someone named them that. GGCTAC is probably a bar code and L3 probably means they came from lane 3; for the others, you could contact whoever gave you the data, as they probably have some conventions. For LMP it is usually important to know the protocol, as different protocols have different adapter types and orientations.

                                Next, how does one determine whether or not the files need to be reverse complemented? Is it possible to just reference something in the files that acts as a flag to alert you as to whether or not this additional step needs to be taken? I had been doing it "just because" someone had told me it probably needed it...rather than asking for the details. Now I am curious...
                                You should map the reads to a draft assembly and see what their orientations are; different methods give different orientations - not simply reverse-complemented, but things like same direction, innie, outie, etc. Also, assemblers will often let you specify an orientation rather than requiring you to change the reads yourself.

                                So, as I wanted to re-trace the whole process "to be sure", I went back and ran FastQC on both of the files again. Interestingly, they have a different set of "overrepresented sequences" from file to file:

                                GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC 12525355 14.98623291791357 TruSeq Adapter, Index 11 (100% over 50bp)
                                AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATG 1466041 1.7540765825168967 TruSeq Adapter, Index 11 (100% over 49bp)

                                and

                                GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG 10203340 12.20800765971617 Illumina Single End PCR Primer 1 (100% over 50bp)
                                AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCC 1435340 1.717343704541553 Illumina Single End PCR Primer 1 (100% over 50bp)
                                GATCGGAAGAGCGTCGTGTAGGGAAAGAGGGTAGATCTCGGTGGTCGCCG 437070 0.5229418903841435 Illumina Single End PCR Primer 1 (98% over 50bp)
                                GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGGCGCCG 111902 0.13388757731660014 Illumina Single End PCR Primer 1 (98% over 50bp)
                                GATCGGAAGAGCGTCGTGTAGGGGAAGAGTGTAGATCTCGGTGGTCGCCG 96938 0.11598357464492669 Illumina Single End PCR Primer 1 (98% over 50bp)

                                These are identified as adapters on the first file, and single end PCR primers on the second file. Can I just add all of these to the trim list in bbtrim and do away with them to start with?
                                Probably... depending on the orientation. As long as the bases to the left of these sequences are 'good' and the ones to the right are 'bad' (which is the case for some common LMP protocols), then yes, just don't reverse-complement read 2 beforehand.
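                                A sketch of that, building the command without running it here (filenames are placeholders; literal=, ktrim, k, mink, and hdist are documented BBDuk options, but verify them against your version's shell script help):

                                ```shell
                                # The two dominant FastQC hits, passed to BBDuk as literal kmers for right-trimming
                                a1=GATCGGAAGAGCACACGTCTGAACTCCAGTCACGGCTACATCTCGTATGC   # TruSeq Adapter, Index 11
                                a2=GATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCG   # Illumina SE PCR Primer 1
                                cmd="bbduk.sh in1=LMP_R1.fastq in2=LMP_R2.fastq out1=LMP_clean_R1.fastq out2=LMP_clean_R2.fastq literal=$a1,$a2 ktrim=r k=23 mink=11 hdist=1"
                                echo "$cmd"
                                ```

                                mink=11 lets shorter adapter fragments at read ends match, and hdist=1 tolerates one mismatch, which covers the 98%-identity variants FastQC reported.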

                                Comment
