  • Celera Assembler (WGS) - splice site file?

    Hi,

    I want to use the Celera Assembler (WGS) in my assembly pipeline in order to compare the results to Phred / Phrap. I read that to vector / quality trim my reads, I should use Lucy, but on this point I am confused.

    What is the "sequence of the vector splice site"?


    I am reading this: http://www.cbcb.umd.edu/research/CeleraAssembler.shtml

    "Each vector file [one per vector] must be accompanied by a splice site file containing the sequence within the vector that is adjacent to the splice sites used in the project. In case your project uses an adapter it should be included in the splice file. ... The vector file must contain a single FASTA-formatted sequence representing the entire sequencing vector. The splice file contains 4 FASTA records corresponding to approximately 200 bp flanking either side of the splice site, presented in both the forward and reverse-complemented orientation."


    Unfortunately I don't understand what this means. Specifically, what is the splice site file, and how do I identify the splice sites? Will this typically refer to the sequencing vector or the cloning vector (BAC)?

    The project uses the pSMART-HCKan (AF532107) sequencing vector from the Lucigen CLONESMART Blunt Cloning Kit ... does that mean anything to anyone?

    Should I just use the 200 bp either side of the primer sites?


    Sorry for the potentially very dumb question!

    Dan.
    Homepage: Dan Bolser
    MetaBase the database of biological databases.

  • #2
    Since I at least have something working for this question, I thought I'd update the thread. No clear answers exactly, but I got something that seemed to work (hopefully useful for someone) ...

    Some of what I eventually worked out on this topic is described here:





    And here is some info from an email exchange with Sven Klages (user 'sven').

    > What is the "sequence of the vector splice site"?

    The flanking bases of the cloning site, e.g. pUC19/SmaI:
    Figure
    ======



    ----f2------------------------->
    ----f1------------------------->
    |========================= GGG/CCC =========================|
    <-------------------------r1----
    <-------------------------r2----


    f1 = for.begin
    f2 = for.end
    r1 = rev.begin
    r2 = rev.end

    OVERLAPS f1/f2 and/or r1/r2 ~ 50bp

    So your splice site file could look like this (sequences
    shortened, [...]):

    >pUC19.for.begin
    attcgccattcaggctgcgcaactgttgggaagggcgatcggtgcgggcctcttcgctat
    [...]
    >pUC19.for.end
    tttcccagtcacgacgttgtaaaacgacggccagtgaattcgagctcggtaCCCGGGgat
    [...]
    >pUC19.rev.begin
    gggcagtgagcgcaacgcaattaatgtgagttagctcactcattaggcaccccaggcttt
    [...]
    >pUC19.rev.end
    aggaaacagctatgaccatgattacgccaagcttgcatgcctgcaggtcgactctagagg
    [...]

    "man lucy" will tell you more (after compiling).



    But I still didn't understand! Sven continued...

    roughly, you take the 5' flanking sequence,
    CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT

    and the 3' flanking sequence,
    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG

    and join it to form

    >pSMART-HCAmp.for.begin
    CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT
    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG
    >pSMART-HCAmp.for.end
    CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT
    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG

    Which is pretty much the same for 'begin' and 'end' ..
    This is not what is proposed, but it should work.

    You should "reverse complement" if you need reverse clipping
    as well.

    >pSMART-HCAmp.rev.begin
    [sequence]
    >pSMART-HCAmp.rev.end
    [sequence]
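Sven's shortcut above can be sketched in Python. This is a minimal illustration, not something from the thread: the helper names (`reverse_complement`, `joined_splice_records`, `to_fasta`) are made up here, and it assumes the two `rev` records are simply the reverse complement of the same joined sequence, which the email does not spell out. The two flank sequences are copied from the text above.

```python
# Flanks quoted in the thread for the pSMART cloning site.
FLANK_5 = "CAGTCCAGTTACGCTGGAGTCTGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT"
FLANK_3 = "GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAG"

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (other symbols, e.g. N, are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def joined_splice_records(name, flank5, flank3):
    """Build four (header, sequence) records from the joined flanks.

    Assumption: 'begin' and 'end' are identical, as in Sven's example,
    and the rev records are the reverse complement of the joined flanks.
    """
    joined = flank5 + flank3
    rc = reverse_complement(joined)
    return [
        (f"{name}.for.begin", joined),
        (f"{name}.for.end", joined),
        (f"{name}.rev.begin", rc),
        (f"{name}.rev.end", rc),
    ]


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA, wrapping sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_fasta(joined_splice_records("pSMART-HCAmp", FLANK_5, FLANK_3)), end="")
```

Whether lucy clips correctly with such a file still has to be checked with the '-debug' flag mentioned below.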

    lucy is pretty "tolerant" ...

    Just use 'lucy' with the flag '-debug FILENAME' to see if clipping
    was successful.


    If you're expecting any adaptors, they should be included in
    the sequence as they are read by sequencing:

    Vector-Adaptor-(INSERT)-Adaptor-Vector



    So I said...

    Thanks Sven, it's all clear now. Just to make sure I understand though,
    the GenBank sequence for this pSMART vector (pSMART-HCKan, AF532107.1)
    just 'happens' to start with:

    GACGAATTCTCTAGATATCGCTCAATACTGACCATTTAAATCATACCTGACCTCCATAGCAGAAAGTCAA


    and just 'happens' to end with:

    TGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT


    but actually, I need some detailed knowledge of where on the vector
    sequence the sequence 'insert site' (or splice site) is before I can
    create what you did above?
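If a flank of the cloning site is already known, its position in the GenBank record can be looked up rather than guessed. The sketch below is only an illustration under stated assumptions: the function name `find_insert_site` is hypothetical, the plasmid is treated as circular, and the toy vector is made-up filler between the two real end fragments quoted above.

```python
def find_insert_site(vector, flank5):
    """Return the 0-based index of the first base after `flank5`.

    The vector is treated as circular, so a flank that runs up to (or
    across) the end of the linear record is still found.
    """
    vector = vector.upper()
    flank5 = flank5.upper()
    n = len(vector)
    doubled = vector + vector  # lets a match wrap around the origin
    hit = doubled.find(flank5)
    if hit == -1 or hit >= n:
        raise ValueError("flank not found in vector")
    second = doubled.find(flank5, hit + 1)
    if second != -1 and second < n:
        raise ValueError("flank occurs more than once; use a longer flank")
    return (hit + len(flank5)) % n


# Toy record mimicking the layout described above: it starts with the
# 3' flank and ends with the 5' flank. The middle is made-up filler.
START = "GACGAATTCTCTAGATATCGCTCAATACTGACC"
END = "TGAGGCTCGTCCTGAATGATATCAAGCTTGAATTCGTT"
TOY_VECTOR = START + "ACGT" * 25 + END

if __name__ == "__main__":
    # The 5' flank ends at the last base, so the site wraps to index 0.
    print(find_insert_site(TOY_VECTOR, END))
```

For a record laid out like this one, the insert site sits at the origin of the linear sequence, which is why the start and end of the GenBank entry match the two flanks.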



    And Sven said...

    Yes, you should know about the insert location.
    But that's easy, isn't it?

    If you have the whole sequence you should design the splice file as
    mentioned.


    ----f2------------------------->
    ----f1------------------------->
    |========================= INSERT =========================|

    <-------------------------r1----
    <-------------------------r2----


    f1 = for.begin
    f2 = for.end
    r1 = rev.begin
    r2 = rev.end

    OVERLAPS f1/f2 and/or r1/r2 ~ 50bp, individual length of f1,f2,r1,r2 ~150bp.
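Given the insert position, the layout in the diagram can be generated mechanically. The following is a rough sketch, not lucy's documented procedure: all function names are hypothetical, `site` is assumed to be the 0-based index of the first vector base after the insert point, the vector is treated as circular, and the windows stop at the site rather than running through it as the pUC19 example does. Lengths of 150 bp and an overlap of 50 bp follow Sven's figures.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse-complement a DNA string (other symbols, e.g. N, are kept)."""
    return seq.translate(_COMPLEMENT)[::-1]


def _upstream(seq, stop, length):
    """Return `length` bases ending just before index `stop`, wrapping circularly."""
    n = len(seq)
    return "".join(seq[(stop - length + i) % n] for i in range(length))


def flank_windows(seq, site, length=150, overlap=50):
    """Return (begin, end): two windows upstream of `site` sharing `overlap` bases."""
    step = length - overlap
    end = _upstream(seq, site, length)           # reads right up to the site
    begin = _upstream(seq, site - step, length)  # starts `step` bases earlier
    return begin, end


def splice_records(name, vector, site, length=150, overlap=50):
    """Build for.begin/for.end/rev.begin/rev.end records around `site`."""
    fwd_begin, fwd_end = flank_windows(vector, site, length, overlap)
    # On the reverse strand the downstream flank becomes the upstream one.
    rc = reverse_complement(vector)
    rev_begin, rev_end = flank_windows(rc, len(vector) - site, length, overlap)
    return [
        (f"{name}.for.begin", fwd_begin),
        (f"{name}.for.end", fwd_end),
        (f"{name}.rev.begin", rev_begin),
        (f"{name}.rev.end", rev_end),
    ]


def to_fasta(records, width=60):
    """Format (header, sequence) pairs as FASTA, wrapping sequence lines."""
    lines = []
    for header, seq in records:
        lines.append(f">{header}")
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"
```

Calling `to_fasta(splice_records("myvector", vector_sequence, site))` would give a four-record file in the same shape as the pUC19 example. Any adaptor sequence would still need to be added by hand, and the result should be checked with lucy's '-debug' output.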



    • #3
      keep in mind that you should use a non-proportional font (fixed) so that it makes sense.

      btw, it's not really clear to me what is unclear to you ... ;-)

      Sven
      Last edited by sklages; 09-28-2009, 01:52 AM. Reason: .. rethinking ..



      • #4
        It's unclear to me how, given an arbitrary vector sequence, one generates the associated .splice file.

        Given the position of the splice site, I guess it's straightforward.

        Could you demo some simple script for doing this?



        • #5
          Script in terms of "perl script"? I never do this automatically ..

          You need to know your 5' vector/adaptor sequences, the restriction sites if applicable, and the 3' vector/adaptor/whatever sequences ... and then create a multi-FASTA file as mentioned before.

          Code:
                                            ----f2------------------------->
                                ----f1------------------------->
          |======================[]=====================|
                                            <-------------------------r1----
                               <-------------------------r2----
          I am afraid I am missing something?

          cheers,
          Sven
