  • Olalla
    Member
    • Aug 2014
    • 11

    Problem with MIRA

    Hello all

    I am new to bioinformatics and Linux, and right now I am starting my "training" with some 454 RNAseq data. The starting data are 454 RNA sequencing reads from 10 different individuals (3 runs each). So far I have converted my fasta/qual files to fastq and then collapsed the fastq files from the different runs of each individual into a single one, before proceeding with the quality control analysis. All went smoothly and the outputs seem to be ok. Now I want to proceed with the assembly, and to do this I plan to use MIRA as implemented in Geneious. I have uploaded the trimmed/clipped fastq files, and when I select them and try to do an assembly with the default parameters I get the following error message:

    Fatal error (may be due to problems of the input data or parameters):

    ********************************************************************************
    * Some read names were found more than once (see log above). This usually *
    * hints to a serious problem with your input and should really, really be *
    * fixed. You can choose to ignore this error with , but this will *
    * almost certainly lead to problems with result files (ACE and CAF for sure, *
    * maybe also SAM) and probably to other unexpected effects. *

    I have already checked whether I could have put together fastq files more than once, but when I looked at my scripts there are no errors. I tried assembling files in pairs to see which are the problematic ones, and I get this error message with a few of them, so now I need to check where the problem is and whether it is actually true that I have repeated read names in these files (although this shouldn't be the case). I would like to find a script that lets me extract just the read names from these files, so I can then compare them and check if there are repeated read names between files, but I cannot find anything useful anywhere. Does anyone know how I can do this, and does anyone have a guess as to why MIRA is reporting this error?

    Thanks in advance

    Olalla
  • JohnN
    Member
    • Jan 2011
    • 31

    #2
    I'm wondering if the trimming process may not have clipped the read names - are you able to check that?

    Also, mira works better if you do NOT preprocess your data - as mira will trim internally itself.

    HTH


    • Olalla
      Member
      • Aug 2014
      • 11

      #3
      Hello John

      Thanks for your reply. The read names are not clipped. And I guess they shouldn't have the same names. I would just need a way to retrieve the list of read names (or at least those matching between files), so I can then check where the problem is. Any idea which command I can use for that, or whether there is a script/program that does it?

      I will also try to run MIRA on the non preprocessed samples, and see if the results are different.
      Thanks for the suggestion

      Thanks again


      • JohnN
        Member
        • Jan 2011
        • 31

        #4
        I'm not sure how to answer. Your read names should not have any overlaps. How did you generate your fastq files? I tend to use mira's sff_extract. But there are lots of good sff extractors out there.
        Last edited by JohnN; 10-28-2014, 07:42 AM. Reason: Added sentences about sff_extract


        • Olalla
          Member
          • Aug 2014
          • 11

          #5
          Well, I started from the fna and qual files. Basically, I have ten individuals, and for each one I have results from three runs, so I first did the conversion from fna to fastq, then concatenated all files from the same individual into a single fastq file, and then I did the QC analysis on those concatenated fastq files. As I told you, everything went fine until I got to the assembly step. Unless this is a problem with the MIRA plugin in Geneious, I also do not see why I should have repeated read names, as they should be unique strings of characters. By now I have extracted just the read names from the files and I am going to compare them in pairs so I can really see whether there are actual repeated read names.... I will see.


          • JohnN
            Member
            • Jan 2011
            • 31

            #6
            Looking at your steps, the only place where the read names could be messed up could be in your fna/qual to fastq conversion...

            Could you not take the SFF files from the 454 assembly and convert them directly to fastq using sff_extract or another tool?


            • Olalla
              Member
              • Aug 2014
              • 11

              #7
              Yes, that is exactly what I thought, but the scripts are correct (no names messed up), so the only possibility is some mistake when converting the sff files into fna and qual. At the moment I do not have access to these files, as the data are not mine but my supervisor's (I am just doing the analysis at the moment), but I could get them (I guess).

              Thanks a lot for your suggestions

              Olalla


              • Yves
                Junior Member
                • Jul 2012
                • 3

                #8
                MIRA simply does not manage long header lines. So, rewrite your fastq headers as follows:

                @M00266:130:000000000-A334F:1:1101:15377:1607 1:N:0:5

                to :

                @1:1101:15377:1607/1

                the missing first part of the headline should be common to all sequences of your file.
                I have a parser for that if you need it, but just a few sed command lines will do the job quickly.
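For illustration, the header rewrite shown above could look roughly like this. It is only a sketch, assuming four-line FASTQ records (no wrapped sequence lines) and Illumina-style headers with seven colon-separated fields before the space; it uses awk rather than sed so that only header lines are touched, and `shorten_headers` is a made-up name. Headers that do not fit the pattern are left unchanged, and dropping the prefix is only safe when it really is identical for every read in the file.

```shell
# Shorten Illumina-style headers on the first line of every four-line record:
#   @instrument:run:flowcell:lane:tile:x:y read:filtered:control:index
# becomes
#   @lane:tile:x:y/read
shorten_headers() {
    awk 'NR % 4 == 1 {
        n = split($1, f, ":")   # f[4]..f[7] = lane, tile, x, y
        split($2, g, ":")       # g[1] = read number (1 or 2)
        if (n == 7 && g[1] != "") {
            $0 = "@" f[4] ":" f[5] ":" f[6] ":" f[7] "/" g[1]
        }
    }
    { print }' "$1"
}
```

Usage would be something like `shorten_headers in.fastq > out.fastq` (file names are placeholders).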


                • JohnN
                  Member
                  • Jan 2011
                  • 31

                  #9
                  Or use:

                  parameters = COMMON_SETTINGS -NW:mrnl=0

                  to parse long read names


                  • Olalla
                    Member
                    • Aug 2014
                    • 11

                    #10
                    Ok, I think that I have found the source of the problem, finally

                    So, the read names in my fastq files appear as follows:
                    GIMXFMA02G21Y1 length=60 xy=2788_0299 region=2 run=R_2010_06_09_09_17_36_ that is, long names with spaces. So I think that what MIRA is doing is just taking the first 14 characters as the read name (e.g. GIMXFMA02G21Y1), and these are in fact repeated among many of the files that I have (I found common lines when comparing files that contain only these names). However, when I search for common lines in files that contain the full headers (like GIMXFMA02G21Y1 length=60 xy=2788_0299 region=2 run=R_2010_06_09_09_17_36_), the search finds no common lines between any of the files, so I think that I should first eliminate the spaces, maybe replacing them with ":". Any suggestion/script on how to do this would be much appreciated.
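The two comparisons described above can be sketched in shell. This is a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines); all four function names are made up for illustration. `read_ids` keeps only the part of each header before the first whitespace, `read_headers` keeps the whole header line, and the `shared_*` functions list what two files have in common.

```shell
# Print the read ID (header text up to the first whitespace, without "@")
# for every record of a four-line FASTQ file.
read_ids() {
    awk 'NR % 4 == 1 { sub(/^@/, "", $1); print $1 }' "$1"
}

# Print the full header line (without the leading "@") of every record.
read_headers() {
    awk 'NR % 4 == 1 { print substr($0, 2) }' "$1"
}

# List the IDs present in both files; each shared ID is printed once.
shared_ids() {
    { read_ids "$1" | sort -u; read_ids "$2" | sort -u; } | sort | uniq -d
}

# Same comparison, but on the full header lines.
shared_headers() {
    { read_headers "$1" | sort -u; read_headers "$2" | sort -u; } | sort | uniq -d
}
```

For example, `shared_ids indiv1.fastq indiv2.fastq` (placeholder file names) would print every short name that occurs in both files, while `shared_headers` on the same pair prints nothing if the full headers never match.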

                    Again, many thanks for your comments and suggestions

                    Olalla


                    • JohnN
                      Member
                      • Jan 2011
                      • 31

                      #11
                      Or just run it including the parameters in my previous post above, and MIRA will accept the long read names.


                      • WhatsOEver
                        Senior Member
                        • Apr 2012
                        • 215

                        #12
                        What you could also do is simply:

                        Code:
                        cat ./orginalFile.fastq | sed -e 's/ /_/g' > ./formattedFile.fastq
                        This will replace every space with an underscore (or whatever you prefer).
                        The command doesn't actually distinguish between header, sequence, comment or qual lines. You should, however, be safe to ignore this, as your sequence and qual lines must not contain any spaces, and for the comment line it doesn't really matter.
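If you would rather leave the other lines untouched anyway, the same replacement can be limited to the header lines. A minimal sketch, assuming four-line FASTQ records (no wrapped sequence lines); `underscore_headers` is a name made up here, not part of any tool.

```shell
# Replace spaces with underscores only on header lines (lines 1, 5, 9, ...)
# of a four-line FASTQ file; the other three lines of each record
# are printed unchanged.
underscore_headers() {
    awk 'NR % 4 == 1 { gsub(/ /, "_") } { print }' "$1"
}
```

It would be called the same way as the sed version, e.g. `underscore_headers ./orginalFile.fastq > ./formattedFile.fastq`.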


                        • Olalla
                          Member
                          • Aug 2014
                          • 11

                          #13
                          Hello John

                          I already did that, but the error message persists... As I told you, I think the problem in my case is with the whitespace. The program seems to stop at the first white space in the read name (after the 14-character string that is common in many cases). So what I need to do now is replace the white spaces in the read names with colons, and then maybe use the option that you suggest so the program ignores long read names.

                          So what I need to find now is how to delete these white spaces in the read name lines and substitute them... that is where I am stuck now :/ The problem when you start with Linux is that it takes lots of time to find adequate commands and/or scripts to do whatever you need

                          Thanks

                          Olalla


                          • Olalla
                            Member
                            • Aug 2014
                            • 11

                            #14
                            Hey WhatsOEver... thanks for this!! I will try it now


                            • maubp
                              Peter (Biopython etc)
                              • Jul 2009
                              • 1544

                              #15
                              The read name GIMXFMA02G21Y1 looks like a Roche 454 read name, but it should be unique and only occur once in your FASTQ file. If you are saying it appears several times then it makes sense that MIRA is complaining about duplicates.

