  • Wiseone
    Junior Member
    • Apr 2010
    • 7

    File Conversion / Usage with Windows and linux

    Hi All,
    For the past few months I have been using CLC Workbench on Windows 7 to do de novo assembly of transcriptome sequences. Now that I am happy with my contigs, I want to move on to other tasks that are primarily done in Linux. I have had my computer reformatted as a dual-boot system to run Ubuntu, but I have run into the problem that the FASTA file generated in CLC (Windows) will not work in Linux.
    For example, I am unable to blast my FASTA contig file against other databases. If I run formatdb I get an error message saying the file can't be opened. I am guessing this has to do with the differences between DOS and Unix file formatting. I tried the dos2unix command in Linux but I still cannot use the file. Does anyone have a solution whereby I can make my FASTA file from Windows usable for a blast in Linux? As a last resort I can switch CLC to Linux, but this will require me to change from Ubuntu to Red Hat, and I have just finished installing quite a bit of software in Ubuntu. I should say that I am completely new to Linux.
    Thanks!
  • Bruins
    Member
    • Feb 2010
    • 78

    #2
    Hi,

    I noticed this post because someone in my office had a similar problem (miRNAkey on biolinux, basically Ubuntu, refused to open fasta files created by CLC on Windows).

    So the problem might be with the end-of-line character used, with the text encoding, or with the way CLC writes a fasta file and the way formatdb reads it. In theory, dos2unix should take care of the first point. You could also try opening the file in gedit (or another Linux text editor). I know that TextWrangler on my OS X machine has an option to show 'hidden characters'; I don't know whether gedit does. You could try opening the file in gedit and then saving it explicitly in the correct encoding (UTF-8, I think).
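
    To check the first point without an editor, one option is to count the line endings directly. This is only a minimal sketch: the sub name `count_line_endings` and the file name in the usage comment are made up for illustration, and it says nothing about the text encoding.

```perl
use strict;
use warnings;

# Count line-ending styles in a string.
# Returns (CRLF count, bare LF count, bare CR count).
sub count_line_endings {
    my ($text) = @_;
    my $crlf = () = $text =~ /\r\n/g;        # DOS/Windows endings
    my $lf   = () = $text =~ /(?<!\r)\n/g;   # Unix endings
    my $cr   = () = $text =~ /\r(?!\n)/g;    # carriage return on its own
    return ($crlf, $lf, $cr);
}

# Usage (file name is a placeholder): perl check_eol.pl contigs.fasta
if (@ARGV) {
    open(my $in, '<:raw', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
    my @total = (0, 0, 0);
    while (my $line = <$in>) {               # read line by line, not all at once
        my @n = count_line_endings($line);
        $total[$_] += $n[$_] for 0 .. 2;
    }
    close $in;
    print "CRLF (DOS/Windows): $total[0]\n";
    print "LF (Unix): $total[1]\n";
    print "CR only: $total[2]\n";
}
```

    A file written on Windows would be expected to show mostly CRLF; after a successful conversion the CRLF count should be zero.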

    In the end, the problem with miRNAkey had nothing to do with newlines or encoding. It expected each sequence to be on a single line (which makes sense, they're miRNAs). CLC, on the other hand, used a more 'standard' way of writing fasta files: a newline after so many bases (75? 60? dunno). However, I doubt that formatdb has a problem with this. Check the docs to be safe.
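
    If you want to know which of the two layouts a file uses, something like the following could tell you. It is a rough sketch under my own assumptions: `seq_lines_per_record` is a hypothetical helper, and it simply counts how many sequence lines follow each header.

```perl
use strict;
use warnings;

# For each FASTA record read from a filehandle, count its sequence lines.
# A count above 1 means the sequence is wrapped over several lines.
sub seq_lines_per_record {
    my ($fh) = @_;
    my @counts;
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # ignore DOS or Unix line endings
        if ($line =~ /^>/) {
            push @counts, 0;               # header: start a new record
        } elsif (@counts && length $line) {
            $counts[-1]++;                 # one more sequence line
        }
    }
    return @counts;
}

# Usage (file name is a placeholder): perl check_wrap.pl contigs.fasta
if (@ARGV) {
    open(my $in, '<', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
    my @counts = seq_lines_per_record($in);
    close $in;
    my $wrapped = grep { $_ > 1 } @counts;
    print scalar(@counts), " records, $wrapped wrapped over several lines\n";
}
```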

    Another problem this person faced was the way the files were copied. He used VMware instead of a dual boot, basically running Linux inside Windows. When he copied the files directly from Windows to Linux, he couldn't open them. When he copied them to a USB stick, remounted the USB stick in Linux and then copied them, he was fine.

    Hope that helps,
    cheers


    • mfursov
      Junior Member
      • Dec 2009
      • 6

      #3
      Originally posted by Wiseone
      Does anyone have a solution whereby I can make my FASTA file from Windows usable for a blast in Linux?
      Thanks!
      Could you share your FASTA file with us? The problem looks interesting to me (I'm one of the UGENE developers), and I think that by investigating it I will be able both to test the tool and to help you solve your problem.
      ---
      http://ugene.unipro.ru


      • RDW
        Member
        • Oct 2008
        • 63

        #4
        Just to rule out any issues that aren't OS-specific, is the file compatible with the Windows version of formatdb?

        ftp://ftp.ncbi.nlm.nih.gov/blast/exe...elease/LATEST/


        • Wiseone
          Junior Member
          • Apr 2010
          • 7

          #5
          So, everything is now working. Bruins was correct. By opening the fasta file in gedit and saving it with Linux line endings, I was able to use the file.


          • boetsie
            Senior Member
            • Feb 2010
            • 245

            #6
            I see you fixed your problem; however, I don't think you want to open a huge file in an editor if you have a lot of contigs. I also had this problem before and used a Perl script to solve it. I converted my file using the s/\r\n/\n/ substitution. Here is a Perl script to convert your contig file (it also joins each sequence onto a single line):

            Code:
            use strict;
            use warnings;

            my $contigfile = $ARGV[0];

            open(my $in, '<', $contigfile) || die "Can't open $contigfile -- fatal\n";
            my ($seq, $prevhead) = ('','');
            while(<$in>){
              s/\r\n/\n/;   # DOS line ending -> Unix line ending
              chomp;
              if (/^>/){
                # new header: print the previous record, sequence on one line
                print "$prevhead\n$seq\n" if($seq ne "");
                $prevhead = $_;
                $seq = '';
              }else{
                $seq .= $_;
              }
            }
            # print the last record
            print "$prevhead\n$seq\n" if($seq ne "");
            close $in;
            Boetsie.
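
            If you only want to change the line endings and keep the line wrapping exactly as CLC wrote it, a plain streaming filter may be enough. This is a minimal sketch, not a tested replacement for the script above; `to_unix_line` and the file names in the usage comment are hypothetical.

```perl
use strict;
use warnings;

# Strip a DOS or Unix line ending and put back a single Unix newline.
sub to_unix_line {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;   # drop "\r\n" or "\n" at the end of the line
    $line =~ s/\r\z//;      # drop a stray "\r" on a last line without "\n"
    return "$line\n";
}

# Usage (file names are placeholders):
#   perl to_unix.pl windows.fasta > unix.fasta
if (@ARGV) {
    open(my $in, '<:raw', $ARGV[0]) or die "Can't open $ARGV[0]: $!\n";
    binmode(STDOUT);        # write bytes as-is, no newline translation
    while (my $line = <$in>) {
        print to_unix_line($line);
    }
    close $in;
}
```

            The file is read one line at a time, so its size should not matter. Note that a last line without a newline gets one added.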


            • skycreative
              Member
              • Jan 2010
              • 33

              #7
              Originally posted by boetsie
              I see you fixed your problem; however, I don't think you want to open a huge file in an editor if you have a lot of contigs. I also had this problem before and used a Perl script to solve it. I converted my file using the s/\r\n/\n/ substitution. Here is a Perl script to convert your contig file (it also joins each sequence onto a single line):

              Code:
              use strict;
              use warnings;

              my $contigfile = $ARGV[0];

              open(my $in, '<', $contigfile) || die "Can't open $contigfile -- fatal\n";
              my ($seq, $prevhead) = ('','');
              while(<$in>){
                s/\r\n/\n/;   # DOS line ending -> Unix line ending
                chomp;
                if (/^>/){
                  # new header: print the previous record, sequence on one line
                  print "$prevhead\n$seq\n" if($seq ne "");
                  $prevhead = $_;
                  $seq = '';
                }else{
                  $seq .= $_;
                }
              }
              # print the last record
              print "$prevhead\n$seq\n" if($seq ne "");
              close $in;
              Boetsie.
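              It is simple, and it would be more useful with an output function that wraps each sequence at 50 bases per line:

              Code:
              my $txt = '';
              for (my $i = 0; $i * 50 < length($seq); $i++) {
                $txt .= substr($seq, $i * 50, 50) . "\n";   # next 50 bases
              }
              print $prevhead, "\n";   # the header that belongs to $seq
              print $txt;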
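
              The same wrapping idea can be written as a small stand-alone sub, which makes it easier to try out on its own. This is just a sketch: the name `wrap_seq`, the optional width argument and the `>contig_1` header are made up for illustration.

```perl
use strict;
use warnings;

# Break a sequence string into lines of at most $width characters.
sub wrap_seq {
    my ($seq, $width) = @_;
    $width ||= 50;                        # default: 50 bases per line
    my $txt = '';
    for (my $i = 0; $i < length($seq); $i += $width) {
        $txt .= substr($seq, $i, $width) . "\n";
    }
    return $txt;
}

# Example: a 120-base sequence printed under a placeholder header.
print ">contig_1\n", wrap_seq('ACGT' x 30, 50);
```

              With a width of 50, the 120-base example comes out as two full lines and one line of 20 bases.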
