  • Sorting and removing MIDs from fastq file (Roche 454)

    Hi everybody,

    I have some 454 data in FNA/QUAL format and .fastq (I don't have access to the original sff file). The run was multiplexed using 3' and 5' MIDs and I'm now trying to sort these apart. I need the resulting files to have quality associated. Anyone know of good tools for this??

    Thanks!!
    Lizzy

  • #2
    I'd do this with the FASTQ files (easier than filtering the FASTA and QUAL files and keeping them synchronized).

    I'm familiar with 5' MID tags, but I don't quite know what you mean by 3' MID tags.

    Have you checked if your reads have had the Roche quality trimming applied or not?

    I would take the MIDs, compute all variants with one (or maybe two) base changes, and look for sequences which start with them (i.e. start with a desired 5' MID). Personally I'd use Biopython for this.

    However, by far the easiest way would be to do this with the raw SFF file and the Roche off instrument application tools which include MID handling. Always ask for the SFF file if doing 454 analysis - it gives you far more options.
    Last edited by maubp; 12-30-2010, 06:29 PM. Reason: typo
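
The variant-matching idea from this post (accept each 5' MID plus everything within one base change of it, then test what each read starts with) can be sketched as follows. This is plain Python rather than Biopython, and it is only a sketch under stated assumptions: the function names are made up for illustration, all MIDs are assumed to have the same length, and each read is assumed to begin directly with its MID (no leading key sequence, no quality trimming that removed the tag). Reads with an N inside the MID are not handled.

```python
def one_mismatch_variants(mid):
    """Return the MID plus every sequence that differs from it at one base."""
    variants = {mid}
    for i, original in enumerate(mid):
        for base in "ACGT":
            if base != original:
                variants.add(mid[:i] + base + mid[i + 1:])
    return variants


def build_lookup(mids):
    """Map each accepted 5' prefix to a MID name, dropping ambiguous prefixes."""
    lookup = {}
    ambiguous = set()
    for name, seq in mids.items():
        for variant in one_mismatch_variants(seq):
            if variant in lookup and lookup[variant] != name:
                # The same prefix is within one change of two different MIDs.
                ambiguous.add(variant)
            lookup[variant] = name
    for variant in ambiguous:
        del lookup[variant]
    return lookup


def assign_read(sequence, lookup, mid_length):
    """Return the MID name matching the start of the read, or None."""
    return lookup.get(sequence[:mid_length].upper())


# Two 5' MIDs taken from the RLMIDs listing posted in this thread.
MIDS = {"RL1": "ACACGACGACT", "RL2": "ACACGTAGTAT"}
LOOKUP = build_lookup(MIDS)
```

Calling `assign_read(read_sequence, LOOKUP, 11)` on each FASTQ record then gives the bin to write that record to, with `None` meaning unmatched. Building the lookup once keeps the per-read work to a single dictionary access.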



    • #3
      Originally posted by maubp View Post
      I'm familiar with 5' MID tags, but I don't quite know what you mean by 3' MID tags.
      This is the info that I had on the MIDs used in our run... I assumed it meant there was one at 5' and another at 3'. Is that what these are?
      Code:
      RLMIDs
      {
       mid = "RL1", "ACACGACGACT", 2, "AGTCGTGGTGT";
       mid = "RL2", "ACACGTAGTAT", 2, "ATACTAGGTGT";
       mid = "RL3", "ACACTACTCGT", 2, "ACGAGTGGTGT";
       mid = "RL4", "ACGACACGTAT", 2, "ATACGTGGCGT";
       mid = "RL5", "ACGAGTAGACT", 2, "AGTCTACGCGT";
       mid = "RL6", "ACGCGTCTAGT", 2, "ACTAGAGGCGT";
       mid = "RL7", "ACGTACACACT", 2, "AGTGTGTGCGT";
       mid = "RL8", "ACGTACTGTGT", 2, "ACACAGTGCGT";
       mid = "RL9", "ACGTAGATCGT", 2, "ACGATCTGCGT";
       mid = "RL10", "ACTACGTCTCT", 2, "AGAGACGGAGT";
       mid = "RL11", "ACTATACGAGT", 2, "ACTCGTAGAGT";
       mid = "RL12", "ACTCGCGTCGT", 2, "ACGACGGGAGT";
      }
      Originally posted by maubp View Post
      However, by far the easiest way would be to do this with the raw SFF file and the Roche off instrument application tools which include MID handling.
      I'd love to, but this sequencing was done a while ago and somehow that file has gone astray.

      I also found this other thread where folks have been discussing this...
      http://seqanswers.com/forums/showthr...highlight=mids



      • #4
        Originally posted by ewilbanks View Post
        This is the info that I had on the MIDs used in our run... I assumed it meant there was one at 5' and another at 3'. Is that what these are?
        I guess so, in which case RL is probably short for rapid library preparation method. We've only ever used 5' MID tags so I can't give you any first hand advice, but the thread you mention looks useful.
        Originally posted by ewilbanks View Post
        I'd love to, but this sequencing was done a while ago and somehow that file has gone astray.
        That's a shame.

        Originally posted by ewilbanks View Post
        I also found this other thread where folks have been discussing this...
        http://seqanswers.com/forums/showthr...highlight=mids
        Post #7 by kmcarr looks particularly helpful.



        • #5
          Ah, RL = rapid library! Thanks!! Yeah, I'm just sorting on the 5' MID (FASTX-Toolkit) and then I'll trim off any 3' MIDs hanging around. Thanks again for your help!
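
Trimming a leftover 3' MID after the 5' sort can be sketched in a few lines. This is a hedged illustration, not what the FASTX-Toolkit does internally: `trim_three_prime` is a hypothetical helper, it only finds exact copies of the 3' sequence, and it cuts at the last occurrence, so a partial MID at the very end of a short read or a chance match inside the insert would need extra handling.

```python
def trim_three_prime(seq, qual, three_prime_mid):
    """Cut the read and its qualities at the last exact 3' MID match, if any."""
    pos = seq.upper().rfind(three_prime_mid)
    if pos == -1:
        # No 3' MID found: return the read unchanged.
        return seq, qual
    return seq[:pos], qual[:pos]


# 3' sequence for RL1, copied from the RLMIDs listing in this thread.
RL1_THREE_PRIME = "AGTCGTGGTGT"
trimmed_seq, trimmed_qual = trim_three_prime(
    "GGGTTT" + RL1_THREE_PRIME, "I" * 17, RL1_THREE_PRIME
)
```

The sequence and quality strings are cut at the same index so that they stay the same length, which FASTQ requires.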



          • #6
            help!!

            hi..
            I am very new to NGS data analysis.
            I am trying to sort my FASTQ files based on MID tags using fastx_barcode_splitter, but it generates text files without any content in them, and the unmatched output gets all the FASTQ contents copied into it.
            Earlier I was successful when I tried sorting only the FASTA files, but this time with FASTQ it's showing some issues.
            Any suggestions on how to get it working?



            • #7
              If you have the SFF files, I'd use them with the Roche tools to split on the MID barcodes.



              • #8
                hi there..
                I finally managed to get my FASTQ files MID-sorted and MID-trimmed. I used fastx_barcode_splitter (for sorting) and fastx_trimmer (for removing the MID tags), but I had to do some prior manipulation of my FASTQ files first,
                like converting all the lower-case bases to upper case and removing the 'tcag' primer from the beginning of the sequence lines that carry the MID tags!
                Thanks anyway!!
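
The two preprocessing steps described in this post (upper-casing the bases and removing the leading 'tcag', which is the standard 454 key sequence) can be sketched as below. This is an assumption-laden illustration rather than the poster's actual commands: the helper names are invented, records are assumed to be plain four-line FASTQ with no wrapped sequence lines, and the quality string is cut by the same four characters so it stays in step with the sequence.

```python
import io


def read_fastq(handle):
    """Yield (title, sequence, quality) from four-line FASTQ records."""
    while True:
        title = handle.readline().rstrip("\n")
        if not title:
            return  # end of file (or a blank line, treated as the end)
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield title[1:], seq, qual


def strip_key(seq, qual, key="TCAG"):
    """Upper-case the read and drop a leading key sequence if present."""
    seq = seq.upper()
    if seq.startswith(key):
        return seq[len(key):], qual[len(key):]
    return seq, qual


def preprocess(in_handle, out_handle, key="TCAG"):
    """Write a cleaned copy of every record from in_handle to out_handle."""
    for title, seq, qual in read_fastq(in_handle):
        seq, qual = strip_key(seq, qual, key)
        out_handle.write(f"@{title}\n{seq}\n+\n{qual}\n")


# Small in-memory demonstration with one made-up read.
source = io.StringIO("@r1\ntcagACACGTaa\n+\nIIIIHHHHHHGG\n")
cleaned = io.StringIO()
preprocess(source, cleaned)
```

For real data the two `io.StringIO` objects would be replaced by files opened with `open(...)`; the cleaned file is then what gets passed to the barcode splitter.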



                • #9
                  Help understanding how to use the sfffile program on the command line

                  I am using sfffile program on command line to sort my sff files by MIDs and remove MIDs. I think I am going wrong somewhere.
                  sfffile -o roche454_new.sff -e mid.lst -nmft sff/roche454.sff

                  Also, I am a little confused about using options (-s) and (-i).
                  Can anyone please suggest how to do that??



                  • #10
                    Originally posted by prisnirath View Post
                    I am using sfffile program on command line to sort my sff files by MIDs and remove MIDs. I think I am going wrong somewhere.
                    sfffile -o roche454_new.sff -e mid.lst -nmft sff/roche454.sff

                    Also, I am a little confused about using options (-s) and (-i).
                    Can anyone please suggest how to do that??
                    If you just want to split your SFF files according to one of the standard sets of MIDs defined in your system MID file (MIDConfig.parse), you can just use:

                    Code:
                    sfffile -s RLMIDs MY_SFF_FILE.sff
                    If you have a custom MID file with a MID group named "SPC_MIDs"

                    Code:
                    sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff
                    The '-i'/'-e' options are just for including/excluding certain reads (by accession).

                    hth,
                    Sven
                    Last edited by sklages; 05-26-2011, 04:06 AM.



                    • #11
                      Thank you!!
                      I understand it now!
                      But still a little confused...
                      I have got my MID files in CSV format and I have converted this file into txt, tab delimited and fasta file.
                      Which format should I be using here?



                      • #12
                        Originally posted by prisnirath View Post
                        Thank you!!
                        I understand it now!
                        But still a little confused...
                        I have got my MID files in CSV format and I have converted this file into txt, tab delimited and fasta file.
                        Which format should I be using here?
                        Now, I am confused :-)

                        You should have your data in SFF files and your MIDs in the Roche-conformant "parse" format, e.g.
                        Code:
                        CUSTOM_MULTIPLEX
                        {
                            mid = "MID4000", "ACACGT", 0;
                            mid = "MID4001", "ACGTAC", 0;
                            mid = "MID4002", "ACTGCA", 0;
                            mid = "MID4003", "AGAGTC", 0;
                        }
                        where '0' stands for the allowed number of mismatches for a MID to be still valid.

                        If you use "Rapid Libraries" you might want to check 3' ends as well,
                        Code:
                        RLMIDs
                        {
                            mid = "RL1",   "ACACGACGACT", 1, "AGTCGTGGTGT";
                            mid = "RL2",   "ACACGTAGTAT", 1, "ATACTAGGTGT";
                            mid = "RL3",   "ACACTACTCGT", 1, "ACGAGTGGTGT";
                            mid = "RL4",   "ACGACACGTAT", 1, "ATACGTGGCGT";
                        }
                        Again, the number stands for allowed mismatches in MID recognition.
                        The second sequence in this format has no influence on splitting; it just gets trimmed (if found). Splitting is done exclusively on the MIDs present at the 5' end.

                        hth,
                        Sven
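
For anyone starting from a CSV list of MIDs, a small script can write out a group in the layout shown in this post. This is a sketch only: the CSV column order (name, 5' sequence, optional 3' sequence) is an assumption, the function names are made up, and the output simply mirrors the examples above. It has not been checked against sfffile itself, so the result should be compared with the MIDConfig.parse shipped with the installed Roche software.

```python
import csv
import io


def read_mid_rows(handle):
    """Read (name, 5' sequence[, 3' sequence]) tuples from CSV, skipping blanks."""
    rows = []
    for row in csv.reader(handle):
        fields = [field.strip() for field in row if field.strip()]
        if fields:
            rows.append(tuple(fields))
    return rows


def format_mid_group(group_name, rows, mismatches=0):
    """Render rows as a MID group in the layout shown in the post above."""
    lines = [group_name, "{"]
    for name, five_prime, *three_prime in rows:
        fields = [f'"{name}"', f'"{five_prime}"', str(mismatches)]
        if three_prime:
            # Optional 3' sequence, written as the fourth field.
            fields.append(f'"{three_prime[0]}"')
        lines.append("    mid = " + ", ".join(fields) + ";")
    lines.append("}")
    return "\n".join(lines) + "\n"


# In-memory demonstration using two MIDs from the example above.
example_csv = io.StringIO("MID4000,ACACGT\nMID4001,ACGTAC\n")
group_text = format_mid_group("CUSTOM_MULTIPLEX", read_mid_rows(example_csv))
```

Writing `group_text` to a file gives something that could be passed with `-mcf`, using the group name as the argument to `-s`, as in the commands earlier in the thread. A row with fewer than two fields raises an error rather than producing a malformed entry.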



                        • #13
                          i have got my SFF files...true!!
                          I got a MID file in csv format.
                          And I have parsed it to a tab-delimited file.
                          My question is while using ::
                          sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff

                          MyMIDfile.parse :: MID file (right??)

                          ...what file format should I be using to get it into an acceptable format?

                          I took suggestions from the thread http://seqanswers.com/forums/showthread.php?t=10825
                          and I am getting an error!!
                          sfffile -s Y -mcf file2.txt -o reg1 GGDP4G001.sff >MIDyieldR1.txt
                          Error: Invalid file format 2: file2.txt



                          • #14
                            ACGAGTGCGTGTAGCGCGACGGCCAGT
                            ACGAGTGCGTCAGGGCGCAGCGATGAC
                            ACGCTCGACAGTAGCGCGACGGCCAGT
                            ACGCTCGACACAGGGCGCAGCGATGAC
                            AGACGCACTCGTAGCGCGACGGCCAGT
                            AGACGCACTCCAGGGCGCAGCGATGAC
                            AGCACTGTAGGTAGCGCGACGGCCAGT
                            AGCACTGTAGCAGGGCGCAGCGATGAC
                            ;
                            ;
                            ;
                            this is the format of my txt MID file



                            • #15
                              Originally posted by prisnirath View Post
                              i have got my SFF files...true!!
                              I got a MID file in csv format.
                              And I have parsed it to a tab-delimited file.
                              My question is while using ::
                              sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff

                              MyMIDfile.parse :: MID file (right??)

                              ...what file format should I be using to get it into an acceptable format?

                              I took suggestions from the thread http://seqanswers.com/forums/showthread.php?t=10825
                              and I am getting an error!!
                              sfffile -s Y -mcf file2.txt -o reg1 GGDP4G001.sff >MIDyieldR1.txt
                              Error: Invalid file format 2: file2.txt
                              Have you read my post? I have described the format you should use for sfffile to split SFFs according to their MIDs ...

                              One more thing: the output of sfffile is a new SFF file, so there is no need to redirect it (to a text file).

                              hth,
                              Sven
                              Last edited by sklages; 05-26-2011, 04:37 AM. Reason: typo
