
  • Sorting and removing MIDs from fastq file (Roche 454)

    Hi everybody,

    I have some 454 data in FNA/QUAL and FASTQ format (I don't have access to the original SFF file). The run was multiplexed using 3' and 5' MIDs, and I'm now trying to sort the reads by MID. I need the resulting files to keep their quality scores. Does anyone know of good tools for this?


  • #2
    I'd do this with the FASTQ files (easier than filtering the FASTA and QUAL files and keeping them synchronized).

    I'm familiar with 5' MID tags, but I don't quite know what you mean by 3' MID tags.

    Have you checked if your reads have had the Roche quality trimming applied or not?

    I would take the MIDs, compute all variants with one (or maybe two) base changes, and look for sequences which start with them (i.e. start with a desired 5' MID). Personally I'd use Biopython for this.
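A minimal sketch of that idea in plain Python (standard library only rather than Biopython, to keep it self-contained). The three MIDs are the first RL entries listed in post #3; the function names, the optional `key` argument, and the decision to drop ambiguous variants are assumptions made for illustration, not an established tool.

```python
BASES = "ACGT"

# 5' MIDs copied from the RL list in post #3 (first three only).
MIDS = {
    "RL1": "ACACGACGACT",
    "RL2": "ACACGTAGTAT",
    "RL3": "ACACTACTCGT",
}


def one_mismatch_variants(seq):
    """Return a set holding seq and every sequence one substitution away."""
    variants = {seq}
    for i, original in enumerate(seq):
        for base in BASES:
            if base != original:
                variants.add(seq[:i] + base + seq[i + 1:])
    return variants


def build_lookup(mids):
    """Map each allowed variant to its MID name; drop variants shared by two MIDs."""
    lookup = {}
    ambiguous = set()
    for name, seq in mids.items():
        for variant in one_mismatch_variants(seq):
            if variant in lookup and lookup[variant] != name:
                ambiguous.add(variant)
            lookup[variant] = name
    for variant in ambiguous:
        del lookup[variant]
    return lookup


def assign_mid(read_seq, lookup, mid_length, key=""):
    """Return the MID name the read starts with (after an optional key), or None."""
    seq = read_seq.upper()
    key = key.upper()
    if key and seq.startswith(key):
        seq = seq[len(key):]
    return lookup.get(seq[:mid_length])


lookup = build_lookup(MIDS)
mid_length = len(MIDS["RL1"])  # assumes all MIDs share one length
```

With Biopython, the reads could come from `SeqIO.parse(handle, "fastq")`, passing `str(record.seq)` to `assign_mid` and writing each record to a per-MID output file so the qualities stay attached.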

    However, by far the easiest way would be to do this with the raw SFF file and the Roche off instrument application tools which include MID handling. Always ask for the SFF file if doing 454 analysis - it gives you far more options.
    Last edited by maubp; 12-30-2010, 06:29 PM. Reason: typo


    • #3
      Originally posted by maubp View Post
      I'm familiar with 5' MID tags, but I don't quite know what you mean by 3' MID tags.
      This is the info that I had on the MIDs used in our run... I assumed it meant there was one at 5' and another at 3'. Is that what these are?
       mid = "RL1", "ACACGACGACT", 2, "AGTCGTGGTGT";
       mid = "RL2", "ACACGTAGTAT", 2, "ATACTAGGTGT";
       mid = "RL3", "ACACTACTCGT", 2, "ACGAGTGGTGT";
       mid = "RL4", "ACGACACGTAT", 2, "ATACGTGGCGT";
       mid = "RL5", "ACGAGTAGACT", 2, "AGTCTACGCGT";
       mid = "RL6", "ACGCGTCTAGT", 2, "ACTAGAGGCGT";
       mid = "RL7", "ACGTACACACT", 2, "AGTGTGTGCGT";
       mid = "RL8", "ACGTACTGTGT", 2, "ACACAGTGCGT";
       mid = "RL9", "ACGTAGATCGT", 2, "ACGATCTGCGT";
       mid = "RL10", "ACTACGTCTCT", 2, "AGAGACGGAGT";
       mid = "RL11", "ACTATACGAGT", 2, "ACTCGTAGAGT";
       mid = "RL12", "ACTCGCGTCGT", 2, "ACGACGGGAGT";
      Originally posted by maubp View Post
      However, by far the easiest way would be to do this with the raw SFF file and the Roche off instrument application tools which include MID handling.
      I'd love to-- but this sequencing was done a while ago and somehow that file has gone astray

      I also found this other thread where folks have been discussing this...


      • #4
        Originally posted by ewilbanks View Post
        This is the info that I had on the MIDs used in our run... I assumed it meant there was one at 5' and another at 3'. Is that what these are?
        I guess so, in which case RL is probably short for rapid library preparation method. We've only ever used 5' MID tags so I can't give you any first hand advice, but the thread you mention looks useful.
        Originally posted by ewilbanks View Post
        I'd love to-- but this sequencing was done a while ago and somehow that file has gone astray
        That's a shame.

        Originally posted by ewilbanks View Post
        I also found this other thread where folks have been discussing this...
        Post #7 by kmcarr looks particularly helpful.


        • #5
          Ah, RL = rapid library! Thanks!! Yeah, I'm just sorting on the 5' MID (FASTX-Toolkit) and then I'll trim out any 3' MIDs still hanging around. Thanks again for your help!
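For the 3' trimming step, a rough sketch in Python, assuming exact matches only and that everything from the 3' tag onwards should be discarded. The function name `trim_three_prime` is hypothetical, and cutting at the last occurrence of the tag is a design assumption rather than what any particular tool does.

```python
def trim_three_prime(seq, qual, tag):
    """Cut seq and qual at the last exact occurrence of tag.

    Returns both unchanged when the tag is absent. qual must have the same
    length as seq (a FASTQ quality string or a list of scores).
    """
    pos = seq.upper().rfind(tag.upper())
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]
```

Because the sequence and quality are sliced at the same position, the two stay synchronized, which is the main thing to get right when working from FASTQ.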


          • #6

            I am very new to NGS data analysis.
            I am trying to sort my FASTQ files by MID tag using fastx_barcode_splitter, but it generates text files without any content, and all of the FASTQ content ends up in the unmatched output.
            Earlier I was successful when I sorted only the FASTA files, but this time, with FASTQ, it is showing some issues.
            Any suggestions on how to get it working?


            • #7
              If you have the SFF files, I'd use them with the Roche tools to split on the MID barcodes.


              • #8
                Hi there,
                I finally managed to get my FASTQ files sorted by MID and also MID-trimmed. I used fastx_barcode_splitter (for sorting) and fastx_trimmer (for removing the MID tags), but I had to do some prior manipulation of my FASTQ files:
                converting all lower-case bases to upper case and removing the 'tcag' primer from the start of the sequence lines, ahead of the MID tags!
                Thanks anyway!!
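The two clean-up steps described above could be sketched in Python roughly as follows. This assumes plain four-line FASTQ records (no wrapped sequence lines) and that the leading 'tcag' is exactly four bases at the start of the read; the function names are hypothetical.

```python
def clean_record(header, seq, plus, qual, key="TCAG"):
    """Upper-case the sequence and strip a leading key from sequence and quality."""
    seq = seq.upper()
    if seq.startswith(key):
        seq = seq[len(key):]
        qual = qual[len(key):]  # keep quality in step with the sequence
    return header, seq, plus, qual


def clean_fastq(lines, key="TCAG"):
    """Yield cleaned FASTQ lines (without newlines) from an iterable of lines."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups lines into 4-line records
    for header, seq, plus, qual in zip(*[stripped] * 4):
        yield from clean_record(header, seq, plus, qual, key)
```

The quality line is trimmed by the same number of characters as the sequence; skipping that step gives records whose sequence and quality lengths disagree.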


                • #9
                  Help understanding the sfffile program on the command line

                  I am using sfffile program on command line to sort my sff files by MIDs and remove MIDs. I think I am going wrong somewhere.
                  sfffile -o roche454_new.sff -e mid.lst -nmft sff/roche454.sff

                  Also, I am a little confused about using options (-s) and (-i).
                  Can anyone please suggest how to do that??


                  • #10
                    Originally posted by prisnirath View Post
                    I am using sfffile program on command line to sort my sff files by MIDs and remove MIDs. I think I am going wrong somewhere.
                    sfffile -o roche454_new.sff -e mid.lst -nmft sff/roche454.sff

                    Also, I am a little confused about using options (-s) and (-i).
                    Can anyone please suggest how to do that??
                    If you just want to split your SFF files according to one of the standard sets of MIDs defined in your system MID file (MIDConfig.parse), you just need:

                    sfffile -s RLMIDs MY_SFF_FILE.sff
                    If you have a custom MID file with a MID group named "SPC_MIDs"

                    sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff
                    The '-i' and '-e' options are just for including or excluding certain reads (acc).

                    Last edited by sklages; 05-26-2011, 04:06 AM.


                    • #11
                      Thank you!!
                      I understand it now!
                      But still a little confused...
                      I have got my MID files in CSV format and I have converted this file into txt, tab delimited and fasta file.
                      Which format should I be using here?


                      • #12
                        Originally posted by prisnirath View Post
                        Thank you!!
                        I understand it now!
                        But still a little confused...
                        I have got my MID files in CSV format and I have converted this file into txt, tab delimited and fasta file.
                        Which format should I be using here?
                        Now, I am confused :-)

                        You should have your data in SFF files and your MIDs in Roche-conformant "parse" format, e.g.
                            mid = "MID4000", "ACACGT", 0;
                            mid = "MID4001", "ACGTAC", 0;
                            mid = "MID4002", "ACTGCA", 0;
                            mid = "MID4003", "AGAGTC", 0;
                        where '0' is the number of mismatches allowed for a MID to still count as valid.

                        If you use "Rapid Libraries" you might want to check 3' ends as well,
                            mid = "RL1",   "ACACGACGACT", 1, "AGTCGTGGTGT";
                            mid = "RL2",   "ACACGTAGTAT", 1, "ATACTAGGTGT";
                            mid = "RL3",   "ACACTACTCGT", 1, "ACGAGTGGTGT";
                            mid = "RL4",   "ACGACACGTAT", 1, "ATACGTGGCGT";
                        Again, the number stands for allowed mismatches in MID recognition.
                        The second sequence in this format has no influence on splitting, it just gets trimmed (if found). Splitting is exclusively done on MIDs present at the 5' end.
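For MIDs that are sitting in a CSV file, a small Python sketch can produce `mid = ...;` lines in the shape shown above. The CSV column layout (name, 5' sequence, optional 3' sequence) and the function name are assumptions. Only the `mid` lines are generated, since the enclosing group declaration is not shown in this thread.

```python
import csv
import io


def mid_lines_from_csv(handle, mismatches=1):
    """Yield one 'mid = ...;' line per CSV row.

    Assumed (hypothetical) columns: name, 5' sequence, optional 3' sequence.
    """
    for row in csv.reader(handle):
        fields = [field.strip() for field in row if field.strip()]
        if len(fields) < 2:
            continue  # skip blank or incomplete rows
        name, five_prime = fields[0], fields[1].upper()
        line = f'mid = "{name}", "{five_prime}", {mismatches}'
        if len(fields) > 2:
            line += f', "{fields[2].upper()}"'
        yield line + ";"


example = io.StringIO("RL1,ACACGACGACT,AGTCGTGGTGT\nMID4000,acacgt\n")
for mid_line in mid_lines_from_csv(example):
    print(mid_line)
```

The generated lines would still need to be placed inside a named MID group in the file passed with `-mcf` before `sfffile -s` can refer to that group.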



                        • #13
                          I have got my SFF files... true!!
                          I got a MID file in CSV format,
                          and I have converted it to a tab-delimited file.
                          My question is about this command:
                          sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff

                          MyMIDfile.parse is the MID file (right??)

                          ...what file format should I use to get it into an acceptable format?

                          I took suggestions from the thread
                          and I am getting an error!!
                          sfffile -s Y -mcf file2.txt -o reg1 GGDP4G001.sff >MIDyieldR1.txt
                          Error: Invalid file format 2: file2.txt


                          • #14
                            this is the format of my txt MID file


                            • #15
                              Originally posted by prisnirath View Post
                              I have got my SFF files... true!!
                              I got a MID file in CSV format,
                              and I have converted it to a tab-delimited file.
                              My question is about this command:
                              sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff

                              MyMIDfile.parse is the MID file (right??)

                              ...what file format should I use to get it into an acceptable format?

                              I took suggestions from the thread
                              and I am getting an error!!
                              sfffile -s Y -mcf file2.txt -o reg1 GGDP4G001.sff >MIDyieldR1.txt
                              Error: Invalid file format 2: file2.txt
                              Have you read my post? I have described the format sfffile needs in order to split SFFs according to their MIDs ...

                              One more thing: the output of sfffile is a new SFF file, so there is no need to redirect it to a text file.

                              Last edited by sklages; 05-26-2011, 04:37 AM. Reason: typo

