  • [help] de novo Genome Assembly : beginner

    Hi there.

    I am completely new to the world of (de novo) genome assembly and I don't know where to begin. When I asked for help at the department they said "go to seqanswers", so here I am, looking for some help...

    I have been given some sequencing data from an insect (colza pollen beetle) and have to produce a genome assembly. This is Illumina paired-end data.

    There are 3 FASTQ files:
    - lane 5/1 : 11 423 167 reads of length 76
    - lane 5/2 : 11 423 167 reads of length 76
    - lane 7 : 9 294 857 reads of length 152

    An average beetle genome size is said to be about 650Mbp.

    Apparently "we" have a server with 192GB RAM where SOAPdenovo is/will be installed.

    I have been told to first check the sequence quality, so after a little searching I found "FASTQC" (with a good YouTube tutorial). I don't know what to do after that... at all.

    I am not here to ask you to do the job for me, and I know I will have a lot of reading and research to do, but I would like to know the main guidelines to follow, the things to keep in mind, the traps to avoid, etc.

    Thank you in advance for any kind of help,

    M.

    (PS: according to the FASTQC tutorial, the data quality is quite poor; I can post the output on request)
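
A quick sanity check on the numbers in this post is the expected depth of coverage, i.e. total sequenced bases divided by genome size. The sketch below uses only the read counts and lengths listed above; the 650 Mbp genome size is the "average beetle" figure from the post, not a measurement for this species, so the result is only a rough estimate.

```python
# Rough depth-of-coverage estimate from the figures quoted in the post.
# Assumption: 650 Mbp is a generic beetle genome size, not a measured value.

libraries = [
    # (label, number of reads, read length in bases)
    ("lane 5/1", 11_423_167, 76),
    ("lane 5/2", 11_423_167, 76),
    ("lane 7", 9_294_857, 152),
]
genome_size_bp = 650_000_000

# Total bases = sum over files of (reads x read length).
total_bases = sum(n_reads * read_len for _, n_reads, read_len in libraries)
coverage = total_bases / genome_size_bp

print(f"total bases: {total_bases:,}")
print(f"estimated coverage: {coverage:.1f}x")
```

With these inputs the total is 3,149,139,648 bases, or roughly 4.8x coverage. That is low for a short-read de novo assembly, so it may be worth asking whether more data exist for this sample.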

  • #2
    Hey Meli,

    The first thing would be to trim the primers/adapters/barcodes. I do this by mapping the known sequences (primers/adapters/barcodes) to the reads and then trimming them with a Perl script or something similar.
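
To illustrate the trimming step, here is a minimal sketch in Python rather than Perl. It is not the poster's script: it only handles exact matches of a 3' adapter (complete, or cut off at the end of the read), and the adapter string in the usage note is a placeholder. Dedicated trimming tools also tolerate mismatches, which this does not.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 5) -> str:
    """Remove an exact-match 3' adapter from a read.

    Handles a complete adapter anywhere in the read, or a partial adapter
    (a prefix of at least `min_overlap` bases) hanging off the 3' end.
    """
    # Case 1: the full adapter is present; keep only what precedes it.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]

    # Case 2: the read ends partway through the adapter.
    # Try the longest possible overlap first.
    longest = min(len(adapter), len(read))
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:-overlap]

    # No adapter found: return the read unchanged.
    return read
```

For example, with the placeholder adapter `AGATCGGAAGAGC`, both `trim_adapter("ACGTACGTAGATCGGAAGAGCTTTT", "AGATCGGAAGAGC")` and `trim_adapter("ACGTACGTAGATCGG", "AGATCGGAAGAGC")` return `ACGTACGT`. Use the sequences supplied by your sequencing facility in place of the placeholder, and remember to trim the quality string to the same length.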

    Next would be to get the closest possible reference sequence (if known and/or available) and map your paired reads to it to filter out the good, the bad, and the ugly. If no reference is known, or none is close enough, it may be worthwhile to skip this step.

    After that I generally filter my reads based on quality score. Trimming the actual reads to a shorter length has also produced very good results, so if you're not getting the assemblies you want with the full-length reads, I STRONGLY recommend trying it.
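
A minimal sketch of quality-based trimming, assuming the simplest possible rule: drop bases from the 3' end until one reaches a minimum Phred score, then discard reads that end up too short. The thresholds and function names are illustrative, not the poster's settings. The quality `offset` is an assumption too: it is 33 for Sanger-style FASTQ and 64 for some older Illumina pipelines, so check which one your files use.

```python
def trim_low_quality_tail(seq, qual, min_q=20, offset=33):
    """Drop bases from the 3' end until one has Phred quality >= min_q.

    `qual` is the FASTQ quality string; `offset` is the ASCII offset of
    the encoding (33 or 64, depending on the pipeline that wrote the file).
    """
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=30):
    """Filter step: keep only reads that are still long enough after trimming."""
    return len(seq) >= min_len
```

For example, `trim_low_quality_tail("ACGTACGT", "IIIII###")` returns `("ACGTA", "IIIII")`, because `#` encodes Phred 2 at offset 33. For paired-end data, a read and its mate should be kept or dropped together so the two files stay in sync.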

    SOAPdenovo is a good program, but there are many others and, as is usually the case, you really have to pick an assembler that fits your data. I would recommend Velvet and ABySS for starters. There are also a lot of good papers about how assemblers perform. Here are some of my favorites:

    New sequencing technology has dramatically altered the landscape of whole-genome sequencing, allowing scientists to initiate numerous projects to decode the genomes of previously unsequenced organisms. The lowest-cost technology can generate deep coverage of most species, including mammals, in just …

    The whole-genome sequence assembly (WGSA) problem is among one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the information about the correctness of the assembly, comparisons performed on simulated dataset, on the other hand, can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality against their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the “excess-dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers performances. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performances of different assemblers. 
    Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art simulators, lead to not-so-realistic results.

    A new generation of sequencing technologies is revolutionizing molecular biology. Illumina's Solexa and Applied Biosystems' SOLiD generate gigabases of nucleotide sequence per week. However, a perceived limitation of these ultra-high-throughput technologies is their short read-lengths. De novo assem …


    Also, it would be beneficial to install AMOScmp so that you can use its tools to help analyze your assemblies. This technique has a learning curve, but it's so fun! Be patient and don't be afraid to ask questions... there's a lot of data out there.

    I hope this helps and good luck!

    • #3
      Thank you very much. I will take a look at all this tomorrow, because at the moment I am a bit upset about all that :/

      • #4
        Hi,
        1). You can use Sickle (https://github.com/najoshi/sickle) for data preprocessing, and then view the data statistics with FastQC or the FASTX-Toolkit (http://hannonlab.cshl.edu/fastx_toolkit/).
        2). You can try the Velvet assembler (http://www.ebi.ac.uk/~zerbino/velvet/) with k-mer sizes from 21 to 65 in increments of 2. For the expected coverage you can use "auto" or calculate it using R as explained in the manual, and try coverage cutoffs from 2 to 15. Or try other assemblers like SOAPdenovo or ABySS.
        3). Choose the assembly with the best N50 and other metrics (genome size, largest contig, reads used, number of contigs).
        4). Use Minimus2 or Minimus2_blat (http://sourceforge.net/apps/mediawik...etting_Started) for merging assemblies, and Bambus2 or SSPACE (http://www.baseclear.com/landingpages/sspacev12/) for scaffolding. SSPACE is very easy to use, with very simple input options.
        5). Check the completeness of the genome using the CEGMA pipeline (http://korflab.ucdavis.edu/Datasets/cegma/).
        6). Use RepeatMasker (http://www.repeatmasker.org/) or other tools for repeat element prediction, and AUGUSTUS (http://augustus.gobics.de/) or other tools such as GENSCAN or GeneID for gene prediction.
        7). Finally, use MUMmer (http://mummer.sourceforge.net/) for comparative analysis.
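
Steps 2 and 3 above can be sketched as follows: enumerate the odd k-mer sizes from 21 to 65, compute N50 for each resulting assembly, and compare. The contig lengths below are made-up toy values, not real results; in practice they would come from the contig file each assembler run writes. As the abstract quoted in #2 points out, N50 measures contiguity only and says nothing about correctness, so it should be read alongside the other metrics listed in step 3.

```python
def n50(contig_lengths):
    """Largest length L such that contigs of length >= L hold at least
    half of the total assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


# Odd k-mer sizes from 21 to 65 inclusive, as suggested in step 2.
kmer_sizes = list(range(21, 66, 2))

# Hypothetical contig lengths for three of those runs (toy numbers).
assemblies = {
    21: [500, 400, 300, 200, 100],
    31: [900, 400, 100, 100],
    41: [700, 700, 100],
}

for k, lengths in sorted(assemblies.items()):
    print(f"k={k}: contigs={len(lengths)}, largest={max(lengths)}, N50={n50(lengths)}")

# Pick the run with the highest N50 (one criterion among several).
best_k = max(assemblies, key=lambda k: n50(assemblies[k]))
```

With these toy values the N50s are 400, 900 and 700, so `best_k` is 31.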

        Best Wishes,
        Rahul
        Rahul Sharma,
        Ph.D
        Frankfurt am Main, Germany

        • #5
          OK, thanks for all this help!

          I asked for the primer and adapter sequences in order to trim them off (I thought this had already been done when I received the FASTQ files, but I actually have such a high percentage of sequence duplication (92%!!) that I suppose they are still in the reads).

          I have been told to try to find a reference genome close enough to rely on for the assembly.
          I am currently searching the NCBI Taxonomy Browser, but I still can't find anything close to any insect.

          The software suggested for this kind of assembly is:
          - Velvet
          - Mira
          - SOAPdenovo
          - Bowtie (?)

          I am looking into installing them.
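
One way to look into the 92% duplication figure mentioned in this post is to count exact duplicate reads directly and list the most frequent sequences, which can then be compared with the adapter and primer sequences once they arrive. This is a simple exact-match sketch, not FastQC's own method, so the percentage will not necessarily match its report. It assumes plain four-line FASTQ records and holds every distinct sequence in memory.

```python
from collections import Counter


def sequences_from_fastq(lines):
    """Yield the sequence line of each record (assumes 4-line FASTQ records)."""
    for i, line in enumerate(lines):
        if i % 4 == 1:
            yield line.strip()


def duplication_summary(sequences, top=5):
    """Return (fraction of reads that are exact copies of another read,
    list of the `top` most common sequences with their counts)."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0, []
    duplicated_fraction = 1 - len(counts) / total
    return duplicated_fraction, counts.most_common(top)
```

A possible usage, with a hypothetical file name: `with open("lane5_1.fastq") as fh: rate, top = duplication_summary(sequences_from_fastq(fh))`.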

          • #6
            Why don't you have a look at the wiki?

            • #7
              +1, thank you

              • #8
                I am waiting to receive the primer and adapter sequences, as well as the access codes for the remote server (a kind of supercomputer: UPPMAX, UPNEXT).

                In the meantime I am a bit "stuck". What other kinds of information (apart from the quality analyses provided by the FASTX-Toolkit and FASTQC) can I get from my "plain" FASTQ files?

                Thanks again for your help
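
As one possible answer to the question in this post, a few basic statistics can be pulled straight from a FASTQ file without any extra tool: number of reads, read-length distribution, GC content, and the fraction of uncalled bases (N). The sketch below assumes plain four-line FASTQ records; `fastq_stats` is an illustrative name, not part of any toolkit.

```python
from collections import Counter


def fastq_stats(lines):
    """Summarise a FASTQ file given as an iterable of lines
    (assumes 4 lines per record: header, sequence, '+', qualities)."""
    lengths = Counter()      # read length -> number of reads
    base_counts = Counter()  # base letter -> occurrences
    n_reads = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the sequence line of each record
            seq = line.strip().upper()
            n_reads += 1
            lengths[len(seq)] += 1
            base_counts.update(seq)
    total = sum(base_counts.values())
    gc = (base_counts["G"] + base_counts["C"]) / total if total else 0.0
    n_fraction = base_counts["N"] / total if total else 0.0
    return {
        "reads": n_reads,
        "lengths": dict(lengths),
        "gc": gc,
        "n_fraction": n_fraction,
    }
```

A possible usage, with a hypothetical file name: `with open("lane5_1.fastq") as fh: print(fastq_stats(fh))`. A GC content far from what is expected for the organism, or a high N fraction, can point to contamination or a poor run.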

                • #9
                  Oops, I just realized I wrote that in French, sorry. Anyway, it was not important.

                  I just cannot understand why all reads are EXACTLY the same length (76).
                  The reads come from lane 5, but I have files "lane-5-1" and "lane-5-2". Why is this split in two? Because of the paired-end sequencing? I mean, one is 5'-3' and the other 3'-5'?
                  All reads from lane 5-1 and lane 5-2 have the same length, and the numbers of reads are equal...?

                  • #10
                    Originally posted by Meligethes

                    I just cannot understand why all reads are EXACTLY the same length (76).
                    The reads come from lane 5, but I have files "lane-5-1" and "lane-5-2". Why is this split in two? Because of the paired-end sequencing? I mean, one is 5'-3' and the other 3'-5'?
                    All reads from lane 5-1 and lane 5-2 have the same length, and the numbers of reads are equal...?

                    Illumina (and SOLiD) technology inherently generates reads of exactly the same length, unless you have trimmed them. The machine reads the data in cycles, and each cycle acquires one and only one base.

                    If the two files are paired ends, then the identifiers should be the same or very similar (perhaps differing only by /1 and /2 or similar); look at the first read identifier in each file.
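
The check described above can be automated over the whole of both files rather than just the first record. This is a minimal sketch assuming four-line FASTQ records and mate suffixes of the form `/1` and `/2` (or a space-separated mate field, which is dropped); `base_id` and `files_are_paired` are hypothetical helper names.

```python
from itertools import zip_longest


def base_id(header):
    """Read name without the mate marker: '@r1/1' and '@r1/2' both give 'r1'."""
    name = header.strip().lstrip("@").split()[0]  # drop anything after a space
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def headers(lines):
    """Yield the header line of each record (assumes 4-line FASTQ records)."""
    for i, line in enumerate(lines):
        if i % 4 == 0:
            yield line


def files_are_paired(lines1, lines2):
    """True if both files have the same number of records and the read
    names match, record by record, once the mate marker is removed."""
    for h1, h2 in zip_longest(headers(lines1), headers(lines2)):
        if h1 is None or h2 is None:  # one file is longer than the other
            return False
        if base_id(h1) != base_id(h2):
            return False
    return True
```

A possible usage, with hypothetical file names: `with open("lane-5-1.fastq") as a, open("lane-5-2.fastq") as b: print(files_are_paired(a, b))`.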

                    • #11
                      OK, thank you, I got it. But how does the machine manage to know that "sequence xx" at this position is the same as "sequence xx" at this other position on the other lane??

                      I searched the internet, but it didn't help me with this...

                      • #12
                        Originally posted by Meligethes

                        OK, thank you, I got it. But how does the machine manage to know that "sequence xx" at this position is the same as "sequence xx" at this other position on the other lane??

                        I searched the internet, but it didn't help me with this...

                        Optically: the system uses high-precision imaging and aligns the images between the first read and the second read. Indeed, it takes a set of images for each cycle and must align these just to call the bases for a single end.

                        • #13
                          Do you mean that the machine has 2 main phases:
                          1 ) only forward cycles at each cluster position
                          2 ) only reverse cycles at each cluster position

                          Then it "aligns" the images, and the same points are from the same cluster, and so from the same fragment?

                          Sorry, I feel like a bit of an idiot about this, but I really can't figure out how this works, and "because this is the paired-end technique" or "because these are high-end optical lasers" is really not sufficient for me.
                          Last edited by Meligethes; 03-19-2012, 01:48 PM.

                          • #14
                            Yes.

                            The system runs through all of read 1. Then there is a clever molecular biology scheme which flips things around and then read 2 is generated.

                            Bridge amplification and clustering, followed by sequencing by synthesis. (Genome Analyzer / HiSeq / MiSeq)
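
The link between read 1 and read 2 is also visible in the read names, because the cluster position is written into each identifier. The sketch below assumes the older Illumina identifier layout `@instrument:lane:tile:x:y#index/mate`; newer pipelines use a longer layout, so compare it with your own headers before relying on it. `cluster_key` is a hypothetical helper and the identifier in the usage note is only an example.

```python
import re

# Older Illumina identifier layout (an assumption about these files):
#   @instrument:lane:tile:x:y#index/mate
OLD_ILLUMINA_ID = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):(?P<x>\d+):(?P<y>\d+)"
    r"(?:#(?P<index>[^/]+))?(?:/(?P<mate>[12]))?$"
)


def cluster_key(header):
    """Return ((lane, tile, x, y), mate) for an old-style Illumina read name."""
    match = OLD_ILLUMINA_ID.match(header.strip())
    if match is None:
        raise ValueError(f"not an old-style Illumina identifier: {header!r}")
    position = tuple(int(match[field]) for field in ("lane", "tile", "x", "y"))
    return position, match["mate"]
```

For the example identifier `@HWUSI-EAS100R:6:73:941:1973#0/1` this gives `((6, 73, 941, 1973), "1")`, and the `/2` mate gives the same position with mate `"2"`. Both reads come from the same cluster on the same lane and tile; only the sequencing pass differs.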

                            • #15
                              Hello,


                              You may want to try Ray, an easy-to-use distributed assembler.
