
  • sff_extract with titanium linker

    A quick question...

    I have some sff files from a paired end 454 run using the titanium linker. When I extract the fastq data from the sff files using:
    sff_extract.py -Q -l linker.fasta *.sff

    Do I need to include the reverse complement in the linker.fasta file like:
    >titanium_linker_seq
    TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
    >titanium_linker_seq_rc
    CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
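    A minimal Python sketch for generating such a two-record linker file from the forward sequence alone. The helper names (reverse_complement, linker_fasta) are made up for illustration and are not part of sff_extract:

```python
# Complement table for the four unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def linker_fasta(name, seq):
    """Return FASTA text holding the linker and its reverse complement."""
    return (
        f">{name}\n{seq}\n"
        f">{name}_rc\n{reverse_complement(seq)}\n"
    )


# Forward Titanium linker as posted above.
fasta_text = linker_fasta(
    "titanium_linker_seq",
    "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG",
)
print(fasta_text, end="")
```

    The second record it prints is identical to the titanium_linker_seq_rc sequence shown above, so the two posted sequences really are reverse complements of each other.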


    I'm using sff_extract 0.2.8

    Cheers,
    Nathan

  • #2
    Apologies for replying to my own thread, but I thought it'd help with future replies and archiving purposes.

    According to section 5.5.4.3 Extracting paired-end data from SFF (pg 82-83) of the "Sequence assembly with MIRA3 - The Definitive Guide":

    The paired-end protocol of 454 will generate reads which contain the forward and reverse direction in one read, separated by a
    linker. You have to know the linker sequence! Ask your sequencing provider to give it to you. If standard protocols were used,
    then the linker sequence for GS20 and FLX will be
    >flxlinker
    GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

    while for Titanium data, you need to use two linker sequences
    >titlinker1
    TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG
    >titlinker2
    CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
    So I guess I do need the reverse complement of the titanium linker after all.
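    A quick way to see why one linker is enough for GS20/FLX but two are listed for Titanium: the FLX linker is its own reverse complement, while the Titanium linker is not. The check below is only an illustrative sketch; the function and variable names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


# Linker sequences exactly as quoted from the MIRA guide above.
linkers = {
    "flxlinker": "GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC",
    "titlinker1": "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG",
    "titlinker2": "CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA",
}

# A sequence equal to its own reverse complement reads the same on both strands.
self_complementary = {
    name: reverse_complement(seq) == seq for name, seq in linkers.items()
}

for name, flag in self_complementary.items():
    print(name, "is its own reverse complement:", flag)

# The two Titanium records are reverse complements of each other.
titanium_pair_ok = reverse_complement(linkers["titlinker1"]) == linkers["titlinker2"]
print("titlinker2 is the reverse complement of titlinker1:", titanium_pair_ok)
```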



    • #3
      Hello,

      I will be using MIRA too for performing 454 assembly.
      I would be glad if you could answer my questions.

      I am confused about what an insert actually is, and what an insert size is.
      I am dealing with 454 paired-end data. How is a linker different from an adaptor?

      I will be using MIRA's sff_extract to extract the FASTA and QUAL files from the SFF files.
      So do the SFF files hold the read info this way:

      |-----75----|------------------------100-----------------|-----75-----|

      i.e - Seq.forward - Linker - Seq.reverse ??

      In the above, what is the insert, and what is its size?
      So does each of the reads in an SFF file have the above format?

      So in the MIRA's mates.file setup to perform scaffolding, can I just use the following default format for the mate pairs?

      pair (.*)\.f (.*)\.r
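      To make the pattern concrete, here is a small Python sketch of how two such regular expressions can pair read names by their shared prefix. The read names are invented, and exactly how a scaffolder applies the patterns internally is an assumption here:

```python
import re

# The two patterns from the "pair" line: forward and reverse read names.
FORWARD = re.compile(r"(.*)\.f")
REVERSE = re.compile(r"(.*)\.r")


def pair_reads(names):
    """Group read names into (forward, reverse) pairs by shared prefix."""
    forward, reverse = {}, {}
    for name in names:
        match = FORWARD.fullmatch(name)
        if match:
            forward[match.group(1)] = name
            continue
        match = REVERSE.fullmatch(name)
        if match:
            reverse[match.group(1)] = name
    # Keep only prefixes for which both mates were seen.
    return [(forward[p], reverse[p]) for p in forward if p in reverse]


# Hypothetical read names: one complete pair, one orphan, one unpaired read.
names = ["read001.f", "read001.r", "read002.f", "read003"]
pairs = pair_reads(names)
print(pairs)
```

      Reads whose names match neither pattern, or whose mate is missing, simply drop out of the pair list.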

      Please help me out with my confusions:

      Thanks

      Aarthi



      • #4
        Hi Aarthi,

        454 inserts - Please see the blue box on the 2nd page of the following 454 flyer for a diagrammatic explanation of what inserts are, how the sequences are generated, and how the linker/adaptor can sit anywhere in the read (or, in fact, nowhere):
        http://454.com/downloads/De-Novo-Com...omes-Flyer.pdf

        Standard 454 protocols can generate 3kb, 8kb, 20kb paired end libraries. Essentially you have the linker/adapter flanked by some of the DNA from your organism that was approx 3kb, 8kb or 20kb apart in your original organism.

        MIRA's sff_extract will take care of reorientating the sequences to a more standard format expected by most assembly software.

        The default format for 454 pairs is as you describe and MIRA will take care of this. If sff_extract doesn't find the linker/adaptor sequence in a read, it will treat it as a single end (shotgun) read without a pair, which maximises the use of the raw data.

        The sff_extract (v0.2.8) command I use for creating a FASTQ file from the SFF file from a 20kb paired end library is as follows:
        sff_extract -Q -c -l 454_titanium_linker.fasta -i "insert_size:20000,insert_stdev:5000" sff_file.sff
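        If you script this for several libraries, a sketch like the following builds the same call as an argument list, which avoids shell quoting issues altogether. Only the flags used in this thread (-Q, -c, -l, -i) appear; the function name is hypothetical, and the behaviour of the flags should be checked with sff_extract -h for your version:

```python
import shlex


def sff_extract_command(sff_files, linker_fasta, insert_size, insert_stdev):
    """Build the argument list for an sff_extract call with the thread's flags."""
    insert_info = f"insert_size:{insert_size},insert_stdev:{insert_stdev}"
    return [
        "sff_extract",
        "-Q",                # sequences and qualities in one FASTQ file
        "-c",
        "-l", linker_fasta,  # FASTA file holding the linker sequence(s)
        "-i", insert_info,   # insert size info for the traceinfo XML
        *sff_files,
    ]


cmd = sff_extract_command(
    ["sff_file.sff"], "454_titanium_linker.fasta", 20000, 5000
)
# Show the command as it would be typed in a shell.
print(shlex.join(cmd))
```

        To actually run it, the list can be handed to subprocess.run(cmd, check=True) on a machine where sff_extract is installed.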

        Hope this helps.
        Nathan



        • #5
          Thank you so much

          A company called 'Seqwright' sequenced the data for us and provided us with 3kb, 8kb and 20kb libraries. So now I understand that the insert sizes are 3kb, 8kb and 20kb respectively.

          But they had done an initial assembly with newbler for us and in the newbler metrics, in the paired read status section a 'pairDistanceAvg' is given. So is that the insert size ?

          They have given 2 sff files per library, since they say- 'reason you have two files per run is because it’s sequenced on the DNA chip with two regions'.

          e.g for the 3kb library for sff file 1 - the pairdistanceavg = 2247.3, pairDistdev=561.8 and for sff file 2 pairdistavg is 2254.9 and pairDistdev 563.7.
          Why are these not 3kb ? And to enter the stddev, do i sum up both of them or do I take the average ??

          And since they are 2 sff files per run, for 'sff_extract' can I give in the 2 sff files as input along with the script u mentioned above and will it output the fasta, qual and xml files into just one file ??



          • #6
            And I would like to add about using shotgun sff files in the bambus scaffolding step.
            Please let me know if I got this right.

            When the sheared DNA fragments are circularized with an adaptor/linker, they are fragmented again. And some of these fragments will have the adaptor flanked by read pairs approx 150bp on each side, and there will be some other fragments with NO adaptor in between them obviously. So these frags with no adaptors are the shotgun sequences ?
            Which is why you provide the shotgun sequence sff files to bambus so that it will not miss out on that data ?
            Did I get it correct ?



            • #7
              Also for the mates.file I am required to provide the minimum insert size (which is mean insert size - stddev) and maximum insert size (which is mean insert size + stddev). I hope I got these right ?

              So for example to consider the mean insert size of the 3kb run, would it just be 3000 or would it be the average of the numbers 2247.3 and 2254.9 of the 2 sff files that I mentioned earlier ??
              Same is applied for the standard deviations. Do i again consider the averages ? ((561.8+563.7)/2) ??

              Did you set up a mates file yet for scaffolding ? If yes may I know how u set it up with respect to the naming convention?
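              For what it is worth, the arithmetic being asked about can be sketched as below. Whether averaging the two regions is the right thing to do is not settled in this thread, so treat that step as an assumption; note also that the sum has to be parenthesised before dividing by 2.

```python
# Newbler values reported for the two SFF files of the 3kb library.
region_means = [2247.3, 2254.9]   # pairDistanceAvg per file
region_stdevs = [561.8, 563.7]    # pairDistanceDev per file


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Assumption: simply average the per-file values into one library estimate.
insert_mean = mean(region_means)
insert_stdev = mean(region_stdevs)

# Minimum is mean - stddev, maximum is mean + stddev.
min_insert = insert_mean - insert_stdev
max_insert = insert_mean + insert_stdev

print(f"mean={insert_mean:.2f} stdev={insert_stdev:.2f}")
print(f"min={min_insert:.2f} max={max_insert:.2f}")
```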



              • #8
                Originally posted by aarthi.talla View Post
                But they had done an initial assembly with newbler for us and in the newbler metrics, in the paired read status section a 'pairDistanceAvg' is given. So is that the insert size ?
                Once Newbler does the assembly and generates contigs, it calculates the average distance between reads in a pair to derive these statistics. NOTE: This is/can only be done for pairs that map to the same contig. Therefore it is an estimate of the actual distance separating read pairs in that library. It should be similar to the size of the library that was being prepared.

                Originally posted by aarthi.talla View Post
                for the 3kb library for sff file 1 - the pairdistanceavg = 2247.3, pairDistdev=561.8 and for sff file 2 pairdistavg is 2254.9 and pairDistdev 563.7.
                Why are these not 3kb ? And to enter the stddev, do i sum up both of them or do I take the average ??
                I don't have first hand experience with Newbler, but I'd think your pairdistanceavg and pairDistdev values are in the ball-park region for a 3kb library - maybe someone else will correct me?

                Originally posted by aarthi.talla View Post
                And since they are 2 sff files per run, for 'sff_extract' can I give in the 2 sff files as input along with the script u mentioned above and will it output the fasta, qual and xml files into just one file ??
                Have an experiment with sff_extract and the different command line arguments. Also run sff_extract -h to view a list of options in your version of sff_extract. In the command I provided, the -Q option specifies that you want the sequence and qualities in a single FASTQ file - by default this info goes into 2 files: sequences into a FASTA file and the qualities into a QUAL file. I'd probably use your pairdistanceavg and pairDistdev values for each library as values for the insert_size and insert_stdev part of the -i option to sff_extract. This will add an estimate of your library's insert size and SD to the traceinfo XML file, which is used by some assemblers.

                Some links you, or readers of this post, might find useful:



                • #9
                  Originally posted by aarthi.talla View Post
                  And I would like to add about using shotgun sff files in the bambus scaffolding step.
                  Please let me know if I got this right.

                  When the sheared DNA fragments are circularized with an adaptor/linker, they are fragmented again. And some of these fragments will have the adaptor flanked by read pairs approx 150bp on each side, and there will be some other fragments with NO adaptor in between them obviously. So these frags with no adaptors are the shotgun sequences ?
                  Which is why you provide the shotgun sequence sff files to bambus so that it will not miss out on that data ?
                  Did I get it correct ?
                  Almost, but there is some misunderstanding. There are different library preps for creating shotgun and paired end libraries. Have a look at the 454 documentation on the creation of paired end libraries and you'll find that there is a step to enrich for those biotinylated DNA fragments containing linkers using Streptavidin beads. However, you will still get DNA fragments that contain no linker. A read that does not contain your linker is not part of a pair, but it can still be used as a single end read in an assembly. Some assembly software may handle these reads in the same FASTQ file as the pairs, but others may need you to pull them out into a separate FASTQ file - look at the documentation for your chosen assembler.
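                  As a rough illustration of pulling the unpaired reads out, the sketch below splits FASTQ records into paired and single reads based on the .f/.r name suffixes. It assumes plain 4-line FASTQ records and invented read names; check what your assembler actually expects before using anything like it.

```python
def parse_fastq(text):
    """Parse 4-line FASTQ records into (name, sequence, quality) tuples."""
    lines = text.strip().splitlines()
    records = []
    for i in range(0, len(lines), 4):
        header, seq, _plus, qual = lines[i:i + 4]
        # The read name is the first word after the leading '@'.
        records.append((header[1:].split()[0], seq, qual))
    return records


def split_paired_and_single(records):
    """Separate reads that have a .f/.r mate from reads that do not."""
    names = {name for name, _, _ in records}
    paired, single = [], []
    for record in records:
        stem, _, suffix = record[0].rpartition(".")
        mate_suffix = {"f": "r", "r": "f"}.get(suffix)
        if stem and mate_suffix and f"{stem}.{mate_suffix}" in names:
            paired.append(record)
        else:
            single.append(record)
    return paired, single


# Hypothetical example: one pair plus one read in which no linker was found.
example = """@read001.f
ACGT
+
IIII
@read001.r
TTGA
+
IIII
@read002
GGCC
+
IIII
"""
paired, single = split_paired_and_single(parse_fastq(example))
print([name for name, _, _ in paired])
print([name for name, _, _ in single])
```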

                  Some links you might find useful



                  • #10
                    Originally posted by aarthi.talla View Post
                    Also for the mates.file I am required to provide the minimum insert size (which is mean insert size - stddev) and maximum insert size (which is mean insert size + stddev). I hope I got these right ?

                    So for example to consider the mean insert size of the 3kb run, would it just be 3000 or would it be the average of the numbers 2247.3 and 2254.9 of the 2 sff files that I mentioned earlier ??
                    Same is applied for the standard deviations. Do i again consider the averages ? ((561.8+563.7)/2) ??

                    Did you set up a mates file yet for scaffolding ? If yes may I know how u set it up with respect to the naming convention?
                    The size SDs are not actually calculated but are simply 0.25 * average insert size. Sorry, I can't be much help with this. However, you may find the following links useful:
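                    The 0.25 factor can be checked against the numbers quoted earlier in the thread. This is just a sanity-check sketch with made-up variable names:

```python
# (pairDistanceAvg, pairDistanceDev) as reported for the two 3kb SFF files.
reported = {
    "sff file 1": (2247.3, 561.8),
    "sff file 2": (2254.9, 563.7),
}

for label, (avg, dev) in reported.items():
    # Compare a quarter of the average distance with the reported deviation.
    print(f"{label}: 0.25 * {avg} = {0.25 * avg:.3f}, reported dev = {dev}")

# True when every reported deviation agrees with 0.25 * average to one decimal.
matches_quarter_rule = all(
    abs(0.25 * avg - dev) < 0.05 for avg, dev in reported.values()
)
print(matches_quarter_rule)
```

                    The 20000/5000 values in the sff_extract command earlier in the thread follow the same ratio.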



                    • #11
                      Thankyou very much ! that was really helpful.

                      I am sorry to bother you with all the questions.
                      Can I ask you one last question.

                      For the scaffolding with bambus is it necessary that we provide the mates file ?
                      If yes, since we cannot read the sff file, do all the sff's contain the format that i mentioned ? (.*)\.f (.*)\.r (with an 'f' and an 'r' to it) ? And can I just blindly assume to give in this ??

                      may I know if you have provided the mates and the conf file ?

                      Thanks !!



                      • #12
                        Originally posted by aarthi.talla View Post
                        Thankyou very much ! that was really helpful.

                        I am sorry to bother you with all the questions.
                        Can I ask you one last question.
                        Yep, no worries!

                        Originally posted by aarthi.talla View Post
                        For the scaffolding with bambus is it necessary that we provide the mates file ?
                        I have no experience with BAMBUS so can't really comment. However, I can point you to the online manual.


                        Originally posted by aarthi.talla View Post
                        If yes, since we cannot read the sff file, do all the sff's contain the format that i mentioned ? (.*)\.f (.*)\.r (with an 'f' and an 'r' to it) ? And can I just blindly assume to give in this ??

                        may I know if you have provided the mates and the conf file ?

                        Thanks !!
                        I think there is some confusion about the SFF file. The Standard Flowgram Format (SFF) is a binary file containing the raw basecall information and quality values for reads. Generally speaking, you would have to extract the sequence data and associated quality values from the SFF file into a plain text format such as a FASTQ file or FASTA+QUAL files - these latter file formats are more of a standard. In doing so, you'll also want to split the read into the pairs where you find the linker sequence. The tool sff_extract does this all for you and generates individual sequences for the paired ends, appending .f and .r to the end of the sequence name so that other software can easily identify which sequences are paired.
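                        To illustrate the idea of splitting a read at the linker, here is a deliberately simplified Python sketch. It is not how sff_extract is implemented: it uses exact string matching only, does no reorientation of the two halves, and the choice of which half gets .f or .r is an assumption made for the example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def split_on_linker(name, seq, linker):
    """Split a read at an exact linker match, or return it unchanged.

    Returns [(name.f, left), (name.r, right)] when the linker (in either
    orientation) is found, otherwise [(name, seq)] as a single read.
    """
    for candidate in (linker, reverse_complement(linker)):
        pos = seq.find(candidate)
        if pos != -1:
            left = seq[:pos]
            right = seq[pos + len(candidate):]
            return [(name + ".f", left), (name + ".r", right)]
    return [(name, seq)]


TITANIUM_LINKER = "TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG"

# Hypothetical reads: one containing the linker, one without it.
with_linker = split_on_linker("r1", "AAAACCCC" + TITANIUM_LINKER + "GGGGTTTT", TITANIUM_LINKER)
without_linker = split_on_linker("r2", "AAAACCCCGGGGTTTT", TITANIUM_LINKER)
print(with_linker)
print(without_linker)
```

                        A read with no linker comes back as a single unpaired record, which matches the single end handling described earlier in the thread.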

                        e.g. you would have a simplified workflow something like this:
                        Code:
                        file.sff ----> sff_extract ----> file.fastq ----> chosen_assembly_tool ----> assembly_output
                        However, as I said, I don't know about BAMBUS specifically.

                        Here's some more resources you may find useful:



                        • #13
                          Thank you

                          Do we have to perform the step of appending the shotgun files to the extracted paired end fasta files ? Or can we do the assembly of the extracted files by sff extract directly ?

                          Because when I appended the shotgun files to the extracted paired end fasta files and performed the assembly , it shows memory allocation problem !! The memory of my linux machine is 7GB.. isnt that enough ? how much memory does mira require to perform the assembly ?

                          may I know how u performed your assembly ? did u append the shotgun files or just did an assembly of the paired ends extracted fasta files ??

                          Thanks



                          • #14
                            Have you used MIRA to perform assembly ? If not, then which would you suggest ? Since you have converted the sff's to fastq, I assume you used an Illumina de novo software ??



                            • #15
                              Originally posted by aarthi.talla View Post
                              Thank you

                              Do we have to perform the step of appending the shotgun files to the extracted paired end fasta files ? Or can we do the assembly of the extracted files by sff extract directly ?

                              Because when I appended the shotgun files to the extracted paired end fasta files and performed the assembly , it shows memory allocation problem !! The memory of my linux machine is 7GB.. isnt that enough ? how much memory does mira require to perform the assembly ?

                              may I know how u performed your assembly ? did u append the shotgun files or just did an assembly of the paired ends extracted fasta files ??

                              Thanks
                              I'm not sure exactly what you're trying to do. It would be helpful if you posted the MIRA commands you have tried and the exact error returned. That way there are no misinterpretations in communicating your questions.

                              MIRA is an Overlap/Layout/Consensus (OLC) type assembler. OLC assemblers inherently require lots of memory for all but the smallest genomes. Try using the miramem command to estimate what the memory requirement is likely to be.

