  • Sorting and removing MIDs from fastq file (Roche 454)

    Hi everybody,

    I have some 454 data in FNA/QUAL and FASTQ format (I don't have access to the original SFF file). The run was multiplexed using 3' and 5' MIDs, and I'm now trying to sort the reads apart. I need the resulting files to keep their associated quality scores. Does anyone know of good tools for this?

    Thanks!!
    Lizzy

  • #2
    I'd do this with the FASTQ files (easier than filtering the FASTA and QUAL files and keeping them synchronized).

    I'm familiar with 5' MID tags, but I don't quite know what you mean by 3' MID tags.

    Have you checked if your reads have had the Roche quality trimming applied or not?

    I would take the MIDs, compute all variants with one (or maybe two) base changes, and look for sequences which start with them (i.e. start with a desired 5' MID). Personally I'd use Biopython for this.
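
The variant-matching idea in the post above can be sketched in plain Python (no Biopython required). This is only an illustration under stated assumptions: the function names are made up for this example, all MIDs have the same length, only substitutions are considered (insertions and deletions are not handled), and any leading key sequence has already been removed from the read.

```python
def one_mismatch_variants(mid, alphabet="ACGT"):
    """Return the MID itself plus every sequence one substitution away."""
    variants = {mid}
    for i, original in enumerate(mid):
        for base in alphabet:
            if base != original:
                variants.add(mid[:i] + base + mid[i + 1:])
    return variants


def build_lookup(mids):
    """Map each variant to its MID name; variants shared by two MIDs are dropped."""
    lookup, ambiguous = {}, set()
    for name, seq in mids.items():
        for variant in one_mismatch_variants(seq):
            if variant in lookup and lookup[variant] != name:
                ambiguous.add(variant)
            lookup[variant] = name
    for variant in ambiguous:
        del lookup[variant]
    return lookup


def assign_mid(read_seq, lookup, mid_length):
    """Return the name of the MID that starts the read, or None if no match."""
    return lookup.get(read_seq[:mid_length].upper())


# Two of the RL MIDs quoted later in this thread (5' sequences only).
mids = {"RL1": "ACACGACGACT", "RL2": "ACACGTAGTAT"}
lookup = build_lookup(mids)
print(assign_mid("ACACGACGACAGGGTTTACG", lookup, 11))  # one mismatch vs RL1
```

To split a FASTQ file this way, one would call `assign_mid` on each record's sequence and write the whole record (sequence and quality together) to a per-MID output file.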

    However, by far the easiest way would be to do this with the raw SFF file and the Roche off-instrument application tools, which include MID handling. Always ask for the SFF file if you are doing 454 analysis; it gives you far more options.
    Last edited by maubp; 12-30-2010, 06:29 PM. Reason: typo


    • #3
      Originally posted by maubp View Post
      I'm familiar with 5' MID tags, but I don't quite know what you mean by 3' MID tags.
      This is the info that I had on the MIDs used in our run... I assumed it meant there was one at 5' and another at 3'. Is that what these are?
      Code:
      RLMIDs
      {
       mid = "RL1", "ACACGACGACT", 2, "AGTCGTGGTGT";
       mid = "RL2", "ACACGTAGTAT", 2, "ATACTAGGTGT";
       mid = "RL3", "ACACTACTCGT", 2, "ACGAGTGGTGT";
       mid = "RL4", "ACGACACGTAT", 2, "ATACGTGGCGT";
       mid = "RL5", "ACGAGTAGACT", 2, "AGTCTACGCGT";
       mid = "RL6", "ACGCGTCTAGT", 2, "ACTAGAGGCGT";
       mid = "RL7", "ACGTACACACT", 2, "AGTGTGTGCGT";
       mid = "RL8", "ACGTACTGTGT", 2, "ACACAGTGCGT";
       mid = "RL9", "ACGTAGATCGT", 2, "ACGATCTGCGT";
       mid = "RL10", "ACTACGTCTCT", 2, "AGAGACGGAGT";
       mid = "RL11", "ACTATACGAGT", 2, "ACTCGTAGAGT";
       mid = "RL12", "ACTCGCGTCGT", 2, "ACGACGGGAGT";
      }
      Originally posted by maubp View Post
      However, by far the easiest way would be to do this with the raw SFF file and the Roche off-instrument application tools, which include MID handling.
      I'd love to, but this sequencing was done a while ago and somehow that file has gone astray.

      I also found this other thread where folks have been discussing this...
      http://seqanswers.com/forums/showthr...highlight=mids


      • #4
        Originally posted by ewilbanks View Post
        This is the info that I had on the MIDs used in our run... I assumed it meant there was one at 5' and another at 3'. Is that what these are?
        I guess so, in which case RL is probably short for rapid library preparation method. We've only ever used 5' MID tags so I can't give you any first hand advice, but the thread you mention looks useful.
        Originally posted by ewilbanks View Post
        I'd love to, but this sequencing was done a while ago and somehow that file has gone astray.
        That's a shame.

        Originally posted by ewilbanks View Post
        I also found this other thread where folks have been discussing this...
        http://seqanswers.com/forums/showthr...highlight=mids
        Post #7 by kmcarr looks particularly helpful.


        • #5
          Ah, RL = rapid library! Thanks!! Yeah, I'm just sorting on the 5' MID (FASTX-Toolkit) and then I'll trim off any 3' MIDs still hanging around. Thanks again for your help!
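
The 3' clean-up step mentioned in the post above could look roughly like the sketch below. It is an assumption-laden illustration, not what FASTX-Toolkit does internally: the function name is hypothetical, only an exact match to the 3' MID is recognized, and everything from the match to the end of the read is removed along with the matching quality characters.

```python
def trim_three_prime_mid(seq, qual, mid3):
    """Cut the read at the last exact occurrence of the 3' MID, if present.

    Sequence and quality strings are trimmed to the same length so the
    record stays a valid FASTQ entry. Reads without the MID are returned
    unchanged.
    """
    pos = seq.upper().rfind(mid3.upper())
    if pos == -1:
        return seq, qual
    return seq[:pos], qual[:pos]


# Example with the RL1 3' sequence quoted earlier in this thread.
seq = "GGGG" + "AGTCGTGGTGT" + "CC"
qual = "I" * len(seq)
print(trim_three_prime_mid(seq, qual, "AGTCGTGGTGT"))
```

Because the match is exact, reads with a sequencing error inside the 3' MID would pass through untrimmed; a mismatch-tolerant search would be needed to catch those.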


          • #6
            Help!!

            Hi,
            I am very new to NGS data analysis.
            I am trying to sort my FASTQ files by MID tag using fastx_barcode_splitter, but it generates text files without any content, and all the FASTQ reads end up in the unmatched output.
            Earlier I succeeded when sorting only the FASTA files, but this time with FASTQ it is showing some issues.
            Any suggestions on how to get it working?


            • #7
              If you have the SFF files, I'd use them with the Roche tools to split on the MID barcodes.


              • #8
                Hi there,
                I finally managed to get my FASTQ files sorted by MID and also MID-trimmed. I used fastx_barcode_splitter (for sorting) and fastx_trimmer (for removing the MID tags), but I first had to do some manipulation of my FASTQ files:
                converting all lowercase bases to uppercase and removing the 'tcag' key sequence from the start of the sequence lines that carry the MID tags.
                Thanks anyway!!
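
The two preprocessing steps described in the post above (uppercasing and removing a leading 'tcag') can be sketched as follows. This is a minimal example with hypothetical function names; it assumes unwrapped four-line FASTQ records, and it trims the quality string by the same amount so sequence and quality stay the same length.

```python
def clean_record(header, seq, plus, qual, key="TCAG"):
    """Uppercase the sequence and strip a leading key, trimming quality to match."""
    seq = seq.upper()
    if seq.startswith(key):
        seq = seq[len(key):]
        qual = qual[len(key):]
    return header, seq, plus, qual


def clean_fastq(lines, key="TCAG"):
    """Yield cleaned lines from an iterable of four-line FASTQ records."""
    stripped = (line.rstrip("\n") for line in lines)
    # zip over the same iterator four times groups the lines into records.
    for header, seq, plus, qual in zip(stripped, stripped, stripped, stripped):
        yield from clean_record(header, seq, plus, qual, key)


example = ["@read1\n", "tcagACACgt\n", "+\n", "IIIIHHHHGG\n"]
for line in clean_fastq(example):
    print(line)
```

With the key removed, the MID sits at the very start of each read, which is where a barcode splitter that matches at the beginning of the sequence expects to find it.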


                • #9
                  Help understanding how to use the sfffile program on the command line

                  I am using the sfffile program on the command line to sort my SFF files by MID and remove the MIDs. I think I am going wrong somewhere.
                  sfffile -o roche454_new.sff -e mid.lst -nmft sff/roche454.sff

                  Also, I am a little confused about using the options -s and -i.
                  Can anyone please suggest how to do that?


                  • #10
                    Originally posted by prisnirath View Post
                    I am using the sfffile program on the command line to sort my SFF files by MID and remove the MIDs. I think I am going wrong somewhere.
                    sfffile -o roche454_new.sff -e mid.lst -nmft sff/roche454.sff

                    Also, I am a little confused about using the options -s and -i.
                    Can anyone please suggest how to do that?
                    If you just want to split your SFF files according to one of the standard MID sets defined in your system MID file (MIDConfig.parse), use:

                    Code:
                    sfffile -s RLMIDs MY_SFF_FILE.sff
                    If you have a custom MID file with a MID group named "SPC_MIDs", use:

                    Code:
                    sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff
                    The '-i'/'-e' options are just for including/excluding certain reads (by accession).

                    hth,
                    Sven
                    Last edited by sklages; 05-26-2011, 04:06 AM.


                    • #11
                      Thank you!!
                      I understand it now!
                      But still a little confused...
                      I have my MIDs in a CSV file, and I have converted this file into plain-text, tab-delimited, and FASTA versions.
                      Which format should I be using here?


                      • #12
                        Originally posted by prisnirath View Post
                        Thank you!!
                        I understand it now!
                        But still a little confused...
                        I have my MIDs in a CSV file, and I have converted this file into plain-text, tab-delimited, and FASTA versions.
                        Which format should I be using here?
                        Now, I am confused :-)

                        You should have your data in SFF files and your MIDs in the Roche-conformant "parse" format, e.g.
                        Code:
                        CUSTOM_MULTIPLEX
                        {
                            mid = "MID4000", "ACACGT", 0;
                            mid = "MID4001", "ACGTAC", 0;
                            mid = "MID4002", "ACTGCA", 0;
                            mid = "MID4003", "AGAGTC", 0;
                        }
                        where '0' is the number of mismatches allowed for a MID to still be recognized.

                        If you use "Rapid Libraries" you might want to check the 3' ends as well:
                        Code:
                        RLMIDs
                        {
                            mid = "RL1",   "ACACGACGACT", 1, "AGTCGTGGTGT";
                            mid = "RL2",   "ACACGTAGTAT", 1, "ATACTAGGTGT";
                            mid = "RL3",   "ACACTACTCGT", 1, "ACGAGTGGTGT";
                            mid = "RL4",   "ACGACACGTAT", 1, "ATACGTGGCGT";
                        }
                        Again, the number stands for allowed mismatches in MID recognition.
                        The second sequence in this format has no influence on splitting; it just gets trimmed (if found). Splitting is done exclusively on the MIDs present at the 5' end.
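
For anyone starting from a spreadsheet or CSV of MIDs, a small helper along these lines could write out a group in the layout shown above. It is only a sketch: the function name is hypothetical, the layout is copied from the examples in this post, and whether sfffile accepts every file generated this way has not been verified here.

```python
def format_mid_group(group, mids):
    """Render a MID group in the brace-delimited layout shown in this post.

    mids is an iterable of tuples, either
    (name, five_prime_seq, mismatches) or
    (name, five_prime_seq, mismatches, three_prime_seq).
    """
    lines = [group, "{"]
    for name, seq5, mismatches, *rest in mids:
        fields = [f'"{name}"', f'"{seq5}"', str(mismatches)]
        if rest:  # optional 3' sequence
            fields.append(f'"{rest[0]}"')
        lines.append("    mid = " + ", ".join(fields) + ";")
    lines.append("}")
    return "\n".join(lines) + "\n"


text = format_mid_group(
    "RLMIDs",
    [
        ("RL1", "ACACGACGACT", 1, "AGTCGTGGTGT"),
        ("RL2", "ACACGTAGTAT", 1, "ATACTAGGTGT"),
    ],
)
print(text, end="")
```

The resulting text can be saved to a file and passed to sfffile with the `-mcf` option, using the group name given to `-s`, as in the commands earlier in this thread.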

                        hth,
                        Sven


                        • #13
                          I have got my SFF files, true!!
                          I got a MID file in CSV format,
                          and I have converted it to a tab-delimited file.
                          My question is about this command:
                          sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff

                          MyMIDfile.parse is the MID file, right?

                          What do I need to do to get my file into the format it accepts?

                          I took suggestions from the thread http://seqanswers.com/forums/showthread.php?t=10825
                          and I am getting an error!!
                          sfffile -s Y -mcf file2.txt -o reg1 GGDP4G001.sff >MIDyieldR1.txt
                          Error: Invalid file format 2: file2.txt


                          • #14
                            ACGAGTGCGTGTAGCGCGACGGCCAGT
                            ACGAGTGCGTCAGGGCGCAGCGATGAC
                            ACGCTCGACAGTAGCGCGACGGCCAGT
                            ACGCTCGACACAGGGCGCAGCGATGAC
                            AGACGCACTCGTAGCGCGACGGCCAGT
                            AGACGCACTCCAGGGCGCAGCGATGAC
                            AGCACTGTAGGTAGCGCGACGGCCAGT
                            AGCACTGTAGCAGGGCGCAGCGATGAC
                            ;
                            ;
                            ;
                            This is the format of my txt MID file.


                            • #15
                              Originally posted by prisnirath View Post
                              I have got my SFF files, true!!
                              I got a MID file in CSV format,
                              and I have converted it to a tab-delimited file.
                              My question is about this command:
                              sfffile -s SPC_MIDs -mcf MyMIDfile.parse MY_SFF_FILE.sff

                              MyMIDfile.parse is the MID file, right?

                              What do I need to do to get my file into the format it accepts?

                              I took suggestions from the thread http://seqanswers.com/forums/showthread.php?t=10825
                              and I am getting an error!!
                              sfffile -s Y -mcf file2.txt -o reg1 GGDP4G001.sff >MIDyieldR1.txt
                              Error: Invalid file format 2: file2.txt
                              Have you read my post? I have described the format you should use for sfffile to split SFFs according to their MIDs ...

                              One more thing: the output of sfffile is a new SFF file, so there is no need to redirect it to a text file.

                              hth,
                              Sven
                              Last edited by sklages; 05-26-2011, 04:37 AM. Reason: typo
