  • chip_seq
    Member
    • Mar 2011
    • 11

    SRA to .csfasta

    Hi All,
    Does anyone know how to convert .sra files into .csfasta?
  • simonandrews
    Simon Andrews
    • May 2009
    • 870

    #2
    You'd need to use abi-dump from the sra-toolkit.


    • chip_seq
      Member
      • Mar 2011
      • 11

      #3
      Hi Simon,
      Thank you very much.
      When I tried abi-dump I got a .csfasta file in the following format:

      >SRR089316.sra.1 1_18_263_F3
      T2203022222121332.03122.0.30.1.03100330.2010101.000
      >SRR089316.sra.2 1_18_325_F3
      T1222000000310122.13222.2.23.0.22030010.1100120.000
      >SRR089316.sra.3 1_18_483_F3
      T3211330120000113.00231.0.20.2.30013200.1121300.100

      As you can see, after the > there is the file name plus a space, which I removed later. You can also see that the sequences, for example (T1222000000310122.13222.2.23.0.22030010.1100120.000), contain dots. I removed those as well, but there is still a problem in mapping. Do you have any idea?

      Thanks in advance


      • simonandrews
        Simon Andrews
        • May 2009
        • 870

        #4
        You don't want to remove the dots. Those are locations in your read where the color could not be determined (equivalent to an N in base space). Removing the dots will create deletions which won't help your efforts to map the data.
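        To make the point about dots concrete, here is a minimal Python sketch (an illustration only, not part of the sra-toolkit or Corona Lite; the function name is made up for this example). It takes one of the reads posted above and compares its length before and after the dots are stripped. Every removed dot shortens the read and shifts all later colors one position to the left, which is what a mapper would see as a deletion.

```python
def summarize_missing_calls(read):
    """Return (colors, dots, colors_left_after_removing_dots) for one csfasta read.

    `read` is a colorspace string such as 'T0123.01': one primer base
    followed by color calls (0-3), with '.' marking an undetermined color.
    """
    colors = read[1:]                    # drop the leading primer base
    dots = colors.count(".")             # positions with no color call
    stripped = colors.replace(".", "")   # what is left if the dots are deleted
    return len(colors), dots, len(stripped)


# Read copied from the post above.
read = "T1222000000310122.13222.2.23.0.22030010.1100120.000"
n_colors, n_dots, n_left = summarize_missing_calls(read)
print(f"{n_colors} colors, {n_dots} undetermined, {n_left} left after removing dots")
```

        If the mapper is told a fixed read length, reads shortened this way no longer have the length it expects, on top of the shifted positions.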

        You'll need to be a bit more specific about what problems you're having in mapping. What program are you using? What command are you running and what do you get?


        • chip_seq
          Member
          • Mar 2011
          • 11

          #5
          Hi simon,

          Thank you.
          After using abi-dump I got a .csfasta file in the following format:
          >SRR089316.sra.1 1_18_263_F3
          T2203022222121332.03122.0.30.1.03100330.2010101.000
          >SRR089316.sra.2 1_18_325_F3
          T1222000000310122.13222.2.23.0.22030010.1100120.000

          When I map using Corona Lite, I run this command:

          matching_large_genomes_cmap_save_script.pl -csfasta data_F3.csfasta -dir out_dir_path -cmap cmap -t 35 -e 2 -z 10

          Name "Template::Filters::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::Context::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::Service::BASEARGS" used only once: possible typo at path/Base.pm line 49.
          Name "Template::Provider::BASEARGS" used only once: possible typo at path/pathBase.pm line 49.
          Name "Template::Plugins::BASEARGS" used only once: possible typo at path/Base.pm line 49.

          Read Length Specified: 35, Read Length Detected: 35
          Note, tempdir /scratch not found. Make sure it exists on executing nodes.

          You have 4 seconds to proofread and CTRL-C if appropriate...
          1,2,3,4.
          Making scripts for the following:
          ALIGN_1_1 ALIGN_2_1 ALIGN_3_1 ALIGN_4_1 ALIGN_5_1 ALIGN_6_1 ALIGN_7_1 ALIGN_8_1 ALIGN_9_1 ALIGN_10_1 ALIGN_11_1 ALIGN_12_1 ALIGN_13_1 ALIGN_14_1 ALIGN_15_1 ALIGN_16_1 ALIGN_17_1 ALIGN_18_1 POST_MATCHING_BY_SETS_1 POST_MATCHING_BY_CHR_1 POST_MATCHING_BY_CHR_2 POST_MATCHING_BY_CHR_3 POST_MATCHING_BY_CHR_4 POST_MATCHING_BY_CHR_5 POST_MATCHING_BY_CHR_6 POST_MATCHING_BY_CHR_7 POST_MATCHING_BY_CHR_8 POST_MATCHING_BY_CHR_9 POST_MATCHING_BY_CHR_10 POST_MATCHING_BY_CHR_11 POST_MATCHING_BY_CHR_12 POST_MATCHING_BY_CHR_13 POST_MATCHING_BY_CHR_14 POST_MATCHING_BY_CHR_15 POST_MATCHING_BY_CHR_16 POST_MATCHING_BY_CHR_17 POST_MATCHING_BY_CHR_18 POST_MATCHING_CONCAT_MATCH_FILESstats_flag = 0
          POST_MATCHING_FINAL POST_MATCHING_MAKING_INDEX

          In out_dir
          scripts have been made. Use submit_scripts_to_XXX.pl to submit to a cluster.

          After running the scripts I got:

          S[START]: 2011-04-20 17:32:44.326588000
          StartTime is Wed Apr 20 17:32:44 JST 2011
          Directory is /out_dir
          Running on host
          Job - in Queue
          Preparing out_dir/scripts/output_ALIGN_1_1.txt
          CORONAROOT=/path
          TS[JOB_START]: 2011-04-20 17:32:44.340211000

          genome_file = /home/path/Validated/chrI.fa
          reads_file = path/SRR089316.sra_F3.csfasta
          output_directory = /out_dir/chrI
          tag_length = 50
          number_of_errors = 2
          schema_file = /path/schemas/DBschema
          start = 0
          adj_errors = 0
          maximum_hits = 10
          reference option = 0
          offset = 0

          [WARNING]: Unable to find scratch directory (/scratch).
          *** mapreads will run in current directory ('/out_dir/chrI').
          *** It may run very slowly. matching reads to the genome ...
          running mapreads /path/SRR089316.sra_F3.csfasta /path_of_cmap/Validated/chrI.fa M=2 S=0 u=2 L=50 T=/path/schemas/DBschema A=0 O=0 Z=10 R=0 I=0 q=1 r=1 > /outdir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp
          if [ ! $? -eq 0 ]
          then echo `date` FAILURE. Making SRR089316.sra_F3.csfasta.ma.50.2.tmp failed. >&2;rm /out_dir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp;exit 1
          else mv out_dir/chrI/SRR089316.sra_F3.csfasta.ma.50.2.tmp /out_dir/testmap_16_wed/chrI/SRR089316.sra_F3.csfasta.ma.50.2; echo `date` Making of SRR089316.sra_F3.csfasta.ma.50.2 sucessful.>&2
          fi;

          map start run No. 1
          reads file format is wrong, expecting > sign
          fail to execute command:
          /path/bin/map /out_dir/SRR089316.sra_F3.csfasta / path/Validated/chrI.fa T=20 L=49 C=1 E=.Tmpfile1303288364cKkjWT F=0 D=1 np=1 V=15.000000 u=1 r=0 n=1 Z=10 P="1111111111111100000000000000000000000000000000000" M=0 U=0.000000 H=0 B=1 m=0 | gzip -3 -c -f > .Tmpfile1303288364cKkjWT.out.1 ; exit ${PIPESTATUS[0]}
          Wed Apr 20 17:32:46 JST 2011 FAILURE. Making SRR089316.sra_F3.csfasta.ma.50.2.tmp failed.

          ERROR: mapreads failed


          Thank you in advance.


          • simonandrews
            Simon Andrews
            • May 2009
            • 870

            #6
            Originally posted by chip_seq:
            reads file format is wrong, expecting > sign
            This seems to be the relevant error. The program doesn't like the format of your csfasta file. This could be something as simple as there being a blank line somewhere in the file, it could be that you have odd line endings, or there could be some other formatting problem.

            I'd start by creating a small file out of the first few hundred lines of your csfasta file and checking through it for any formatting problems. If that's OK then run that through your mapping pipeline - if it works then you know that there's a formatting problem elsewhere in your file which you can track down. If it still fails then there's something more fundamentally wrong.
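            As a rough illustration of that kind of check, here is a minimal Python sketch (hypothetical helper, not part of any toolkit). It looks at the first few hundred lines and reports blank lines, carriage returns, headers containing unexpected characters, and lines that are not valid colorspace sequence. The header pattern (letters, digits, underscore, dot and space only) and the treatment of lines starting with `#` as comments are assumptions; adjust them to whatever your mapper actually accepts.

```python
import re
from itertools import islice

# Assumed header alphabet: '>' followed by word characters, dots and spaces.
HEADER_RE = re.compile(r">[\w. ]+", re.ASCII)
# One primer base followed by color calls (0-3) or '.' for an undetermined color.
SEQ_RE = re.compile(r"[ACGT][0-3.]+")


def check_csfasta_lines(lines, limit=400):
    """Return a list of (line_number, problem) for the first `limit` lines."""
    problems = []
    expect_header = True
    for number, raw in enumerate(islice(lines, limit), start=1):
        line = raw.rstrip("\n")
        if line.endswith("\r"):
            problems.append((number, "carriage return line ending"))
            line = line.rstrip("\r")
        if line == "":
            problems.append((number, "blank line"))
            continue
        if line.startswith("#"):
            continue  # assumed to be a comment line
        if line.startswith(">"):
            if not expect_header:
                problems.append((number, "header where a sequence was expected"))
            if not HEADER_RE.fullmatch(line):
                problems.append((number, "unexpected characters in header"))
            expect_header = False
        elif expect_header:
            problems.append((number, "expected a header line"))
        else:
            if not SEQ_RE.fullmatch(line):
                problems.append((number, "malformed sequence line"))
            expect_header = True
    return problems
```

            To run it on a real file, something like `check_csfasta_lines(open(path, encoding="latin-1", newline=""))` should work: `latin-1` never fails to decode stray bytes, and `newline=""` keeps any carriage returns visible to the check.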


            • chip_seq
              Member
              • Mar 2011
              • 11

              #7
              Thank you very much.
              Waiting for your answer.


              • simonandrews
                Simon Andrews
                • May 2009
                • 870

                #8
                Originally posted by chip_seq:
                Waiting for your answer.
                Did you see the note I posted yesterday? There's not much else anyone here can do: you need to figure out what the formatting problem in your csfasta file is. Try starting with a small section from the top of the file which you can manually review, and then move on from there depending on what you find.


                • chip_seq
                  Member
                  • Mar 2011
                  • 11

                  #9
                  I see. Thank you very much.


                  • chip_seq
                    Member
                    • Mar 2011
                    • 11

                    #10
                    Hi Simon,
                    I found this formatting error:
                    >SRR089306.sra.55 3_31_1136^P_F3
                    T20320322233120100222232221320320221322203222222223
                    >SRR089306.sra.56 3_32_245D�^Y_F3
                    T30013201101131222330001113030201223332222222222323
                    >SRR089306.sra.57 3_32_290_F3
                    T03100031011311322322323133331003223002320022233232
                    >SRR089306.sra.58 3_32_337@oT^Y_F3
                    T03321131302130332121103032223221222312223122222222
                    >SRR089306.sra.59 3_32_1472_F3
                    T00101003220302223100012023300321020222220120220222
                    >SRR089306.sra.60 3_32_1533oT^Y_F3
                    T00010310223113300302102232302301222012223122222222

                    Do you know why I got this formatting error and how to fix it?
                    Thanks in advance


                    • simonandrews
                      Simon Andrews
                      • May 2009
                      • 870

                      #11
                      You could try the following script (only lightly tested) which should find any oddly formatted entries in your file and remove them. Hopefully it should leave you with a file which you can process.

                      Code:
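                      #!/usr/bin/perl
                      use warnings;
                      use strict;

                      # Copy well-formed csfasta entries to a new file, skipping odd ones.
                      my ($infile,$outfile) = @ARGV;

                      die "Usage is fix_csfasta.pl [input file] [output file]\n" unless ($outfile);

                      open (my $in,'<',$infile) or die "Can't read $infile: $!";
                      open (my $out,'>',$outfile) or die "Can't write to $outfile: $!";

                      while (my $line = <$in>) {

                        if ($line =~ /^>/) {
                          # Strip line endings and any character not expected in a header
                          my $header = $line;
                          $header =~ s/[\r\n]//g;
                          $header =~ s/[^>\w\. ]//g;

                          # The line after a header should be the colorspace sequence
                          my $seq = <$in>;
                          unless (defined $seq) {
                            warn "Skipping header with no sequence '$header'\n";
                            last;
                          }
                          $seq =~ s/[\r\n]//g;
                          unless ($seq =~ /^T[0123\.]+$/) {
                            warn "Skipping odd looking sequence '$seq'\n";
                            next;
                          }

                          print $out "$header\n$seq\n";

                        }
                        else {
                          warn "Skipping unexpected line : $line";
                        }

                      }

                      close $in;
                      close $out or die "Can't write to $outfile: $!";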


                      • chip_seq
                        Member
                        • Mar 2011
                        • 11

                        #12
                        Thank you very much.
                        However, I got many skipped lines. Will those skipped lines affect the output?
                        Skipping odd looking sequence 'Q{???_F3'
                        Skipping unexpected line : T03101002001200001210100000100020001210222303123002
                        Skipping odd looking sequence 'fj?_F3'
                        Skipping unexpected line : T00012002231322013012032211220223110033322330033030
                        Skipping odd looking sequence 'fj?_F3'
                        Skipping unexpected line : T21330231213330011101102123131102012101033000313322
                        Skipping odd looking sequence '_F3'
                        Skipping unexpected line : T22013201203033023103231220203232200101112233003222
                        Skipping odd looking sequence '_F3'
                        Skipping unexpected line : T33022110112231122002232221332332220102223320303320

                        Do you know why I got those odd-looking sequences?
                        Thank you very much for your help.


                        • simonandrews
                          Simon Andrews
                          • May 2009
                          • 870

                          #13
                          It looks like you have a load of lines where there is an extra line break in the header line. This will cause the next line (which should be the sequence) to actually be the second part of the header, and the actual sequence will be skipped as the program searches for the next valid line.

                          Have a look and see how many of your sequences are affected. If it's only a small proportion then don't worry about it and just use the cleaned file. If it's a high proportion of your original file then you'd need to do a more sensitive extraction of the useful data (probably by looking for lines which look like valid sequence and using those, whilst discarding the existing headers altogether).
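                          A minimal Python sketch of that more sensitive extraction might look like the following (hypothetical helper, not a tested tool). It keeps every line that looks like a valid colorspace sequence, throws away all existing headers, and counts the stray lines that are neither header nor sequence, which gives a rough idea of how many headers were broken. The replacement header names (`>read_1_F3`, `>read_2_F3`, ...) are invented for illustration; whether the mapper accepts such names is not confirmed anywhere in this thread.

```python
import re

# One primer base followed by color calls (0-3) or '.' for an undetermined color.
SEQ_RE = re.compile(r"[ACGT][0-3.]+")


def salvage_reads(lines, prefix="read"):
    """Return (records, stray_count).

    records     : list of (new_header, sequence) for every valid sequence line
    stray_count : lines that are neither a header, a sequence, nor blank,
                  e.g. the second half of a header split by a line break
    """
    records = []
    stray_count = 0
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line or line.startswith(">"):
            continue  # blank lines and existing headers are discarded
        if SEQ_RE.fullmatch(line):
            # Hypothetical naming scheme; original read names are lost.
            records.append((f">{prefix}_{len(records) + 1}_F3", line))
        else:
            stray_count += 1
    return records, stray_count


def write_csfasta(records, path):
    """Write the salvaged records as header/sequence line pairs."""
    with open(path, "w", encoding="ascii") as out:
        for header, seq in records:
            out.write(f"{header}\n{seq}\n")
```

                          Comparing `stray_count` with `len(records)` gives an estimate of the affected proportion, on the assumption that each stray line corresponds to one split header. Note that the original read names (and therefore any bead coordinates they encode) are not preserved by this approach.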


                          • chip_seq
                            Member
                            • Mar 2011
                            • 11

                            #14
                            Thank you very much for your kind help


                            • chip_seq
                              Member
                              • Mar 2011
                              • 11

                              #15
                              Hi Simon,

                              Thank you for your help previously.
                              After I removed the strange characters from the sequence files and mapped them to the genome, I got 0% coverage, which suggests a severe problem, although I'm using Corona Lite with almost the same parameters as before.
                              Any idea?

                              Thank you in advance
