  • 454 output files

    Hello, I'm just starting out in transcriptome analysis. My question is about 454 files. I have two folders with two SFF files each. The SFF files in the first folder contain about 300,000 reads, whereas the SFF files in the second folder contain about 160,000 reads.

    As I understand it, two reads cannot have the same name, yet the files in the second folder contain reads with the same names as in the first, but with different lengths. For example:

    >EVADNQG01CUUPQ
    tcagTCAGAAACCGCTTCGATAAGAGAGACCCACTGGGCCAAAGTTACATCACATACTATTAACTTGCGTTGAACCACAGGTTCGCATCAAGTATATGTTCACATc

    >EVADNQG01CUUPQ
    tcagTCAGAAACCGCTTCGATAAGAGAGACC ACTGG CAAAGT ACATCACATACTAT AACT GCGT GAA CACAG TCGCAt nagtatatgtcacatc

    I guess that reads were eliminated according to some criteria and the other two files were created with fewer reads, but why are there gaps rather than just N in the intermediate regions?

    Thanks

  • #2
    What have you done? What folders are you referring to? From which files are the read examples you showed? What is the problem? ;-)

    Sven


    • #3
      Hello Sven, some time ago we sent samples for transcriptome sequencing, and the service sent back two SFF files (about 300,000 reads) in a folder called "one". Because nobody in the laboratory knew bioinformatics, the service also performed the assembly. So far so good.

      But now I'm training for a new transcriptome analysis, so I'm checking everything the sequencing and assembly service delivered, and I found a folder called "two" with two SFF files (about 160,000 reads). Comparing the files in both folders, I found reads with the same name. The reads above are an example: the first comes from the files with 300,000 reads, while the second comes from the files with 160,000 reads.

      My hypothesis is that they eliminated low-quality reads, reads contaminated with vector, etc., so that only 160,000 reads remained, and that the people in charge of the work created a new folder ("two"). Given this hypothesis, my question is: why don't the reads (as in the example above) have the same length, and why are there gaps instead of masking with N?


      • #4
        Hard to guess ... maybe one has been amplicon processed, the other shotgun?
        If you have access to the SFF tools, sffinfo -m might help.

        But the simplest and most effective solution is to just ask your sequencing provider about the data provided. I'd not rely on a "hypothesis" ;-)

        Sven
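
Before contacting the provider, it may help to quantify how many read names the two folders share and how often the sequences differ. The following is only a minimal sketch, assuming the reads from each folder have already been exported to FASTA text; the helper names `read_fasta` and `compare_shared_reads` are hypothetical and not part of any SFF tool.

```python
def read_fasta(lines):
    """Parse FASTA lines into {read_name: sequence}.

    Whitespace inside sequence lines is dropped, so gapped reads such as
    the second example in this thread are compared on bases only.
    """
    reads = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            reads[name] = ""
        elif name is not None:
            reads[name] += "".join(line.split())
    return reads


def compare_shared_reads(reads_one, reads_two):
    """Return {name: (len_one, len_two)} for shared names whose sequences differ."""
    shared = reads_one.keys() & reads_two.keys()
    return {
        name: (len(reads_one[name]), len(reads_two[name]))
        for name in sorted(shared)
        if reads_one[name].upper() != reads_two[name].upper()
    }
```

With real data, `read_fasta(open(path))` can be called once per exported file; the size of the returned dictionary relative to the number of shared names indicates whether the reads in folder "two" are a reprocessed subset rather than a simple filter of folder "one".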


        • #5
          What strikes me is that the second sequence is different, in particular some of the homopolymers are one base shorter, and there is some more trimming at the 3' end. It might indeed be amplicon versus shotgun analysis/basecalling, but the homopolymer shortening still surprises me (I need to check our own amplicon runs to see if they show the same pattern).
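
One way to check the homopolymer observation systematically is to collapse each read into runs of identical bases and compare run lengths. This is a rough sketch under stated assumptions: it ignores case and whitespace, compares runs in order, and stops at the first position where the two reads disagree on the base itself (for example at an N), so it only describes the shared prefix. The function names are hypothetical.

```python
from itertools import groupby


def run_lengths(seq):
    """Collapse a read into (base, run_length) pairs, ignoring case and whitespace."""
    cleaned = "".join(seq.split()).upper()
    return [(base, len(list(group))) for base, group in groupby(cleaned)]


def homopolymer_differences(seq_a, seq_b):
    """List (run_index, base, len_a, len_b) for runs whose lengths differ.

    Runs are compared pairwise in order; the comparison stops as soon as
    the two reads disagree on the base of a run.
    """
    diffs = []
    pairs = zip(run_lengths(seq_a), run_lengths(seq_b))
    for index, ((base_a, len_a), (base_b, len_b)) in enumerate(pairs):
        if base_a != base_b:
            break
        if len_a != len_b:
            diffs.append((index, base_a, len_a, len_b))
    return diffs


# Toy example: both homopolymers are one base shorter in the second read.
print(homopolymer_differences("AACCCGT", "ACC GT"))
# prints [(0, 'A', 2, 1), (1, 'C', 3, 2)]
```

Applied to the two EVADNQG01CUUPQ reads from the first post, the same function would list each run that differs in length up to the first base-level disagreement.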


          • #6
            Back when the amplicon processing pipeline first became available, I tried reprocessing amplicon data with the shotgun pipeline to get more data. I found that there was indeed more data, but no more useful data. The most noticeable differences were in homopolymer regions, where the shotgun-processed data tended to have more variation in length. The amplicon-processed data was usually correct in the homopolymers, whereas the shotgun-processed data would have a number of reads with an extra base or, less commonly, a missing base in homopolymers of about 4 or more.

            Long story short, yes, the difference here could be amplicon vs. shotgun processing, as I have seen some homopolymer shortening using the amplicon pipeline. However, I never saw it as extensive as in this example, especially in 2-base homopolymers, so the extent of it here surprises me and makes me think it may be something else. I don't have any other ideas, though.


            • #7
              Apparently this is rare; I'll try to get a response from the sequencing service.
              I do not understand what you mean by amplicon run.

              Thanks


              • #8
                Originally posted by Avila
                I do not understand what you mean by amplicon run.
                When you sequence PCR products (amplicons) on the GS FLX, you need to use a different basecalling pipeline than for shotgun runs. This has to do with the usually much stronger sequencing signals from PCR products.
