  • SRA Toolkit and Conversion to Illumina Fastq Format

    Hi Seqers,

    I am trying to convert an SRA ChIP-Seq file (from the Sequence Read Archive) to Illumina FASTQ format. I ran illumina-dump -A <Accession Number> <filename> and got more than 100 qcal and seq files. Now I would like to know what my input file should be for the ELAND_standalone.pl aligner program.

    Do I have to concatenate all 100 seq files into one file and then run ELAND_standalone.pl?

    Any help/hints/suggestions/advice would be highly appreciated.

    Thanks.

  • #2
    You could use EMBOSS seqret (or BioPerl or Biopython or ...) to convert from Sanger FASTQ encoding to the old Illumina encoding - but you might also need to massage the record names to suit ELAND.


    • #3
      EMBOSS seqret command for converting Sanger FASTQ to old Illumina FASTQ

      I am trying to find which Unix command I can use to convert FASTQ from the new QC format (Sanger) to the old Illumina QC format. I have PE data. Thanks
      Last edited by mathew; 05-17-2012, 05:55 AM.


      • #4
        Originally posted by mathew:
        I am trying to find which Unix command I can use to convert FASTQ from the new QC format (Sanger) to the old Illumina QC format. I have PE data. Thanks
        What do you mean by QC?

        EMBOSS seqret (mentioned earlier) can interconvert Sanger FASTQ (used for Illumina 1.8+), the original Solexa to Illumina 1.2 FASTQ (which did not use PHRED scores), and the Illumina 1.3 to 1.7 FASTQ variants.
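
To tell these three variants apart in practice, the range of ASCII characters on the quality lines is a useful hint. The sketch below is only a heuristic, not something EMBOSS or Illumina define: the function name guess_fastq_encoding is made up for illustration, and it assumes the sample contains at least some low-quality bases. The returned strings are the format names that EMBOSS seqret and Biopython use.

```python
def guess_fastq_encoding(quality_lines):
    """Guess the FASTQ variant from the lowest ASCII code on the quality lines.

    Heuristic only: a Sanger file that happens to contain nothing but
    high-quality bases would be misreported as one of the offset-64 variants.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest = min(codes)
    if lowest < 59:
        # Only Sanger (offset 33) uses characters below ';' (ASCII 59).
        return "fastq-sanger"
    if lowest < 64:
        # Solexa scores can be as low as -5, i.e. ';' with offset 64.
        return "fastq-solexa"
    # Illumina 1.3 to 1.7: PHRED scores with offset 64, starting at '@'.
    return "fastq-illumina"


print(guess_fastq_encoding(["#1=DDDFFHGHHGH"]))  # fastq-sanger
```

Feeding it a few thousand quality lines rather than a single read makes the guess far more reliable.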


        • #5
          For example, if your input Sanger-encoded FASTQ file is called input.fastq and you want to turn it into Illumina 1.3+ encoded FASTQ, try:

          seqret -sequence=input.fastq -sformat=fastq-sanger -osformat=fastq-illumina -outseq=output.fastq
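
For readers without EMBOSS installed, the same re-encoding can be sketched in plain Python. This is not how seqret is implemented; it is a minimal illustration that assumes strict four-line FASTQ records (no wrapped sequence lines), and the names sanger_to_illumina_quality and convert_fastq are hypothetical. Sanger uses an ASCII offset of 33 and Illumina 1.3+ an offset of 64, so each quality character moves up by 31.

```python
SANGER_OFFSET = 33
ILLUMINA_OFFSET = 64


def sanger_to_illumina_quality(quality):
    """Re-encode one Sanger quality string as Illumina 1.3+ (offset 64)."""
    shift = ILLUMINA_OFFSET - SANGER_OFFSET  # 31
    converted = []
    for ch in quality:
        code = ord(ch) + shift
        if code > 126:
            # PHRED scores above 62 do not fit in printable ASCII with
            # offset 64. Raising here is a choice made for this sketch.
            raise ValueError(f"quality character {ch!r} is too high to re-encode")
        converted.append(chr(code))
    return "".join(converted)


def convert_fastq(in_path, out_path):
    """Copy a four-line-per-record FASTQ file, re-encoding every quality line."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line_number, line in enumerate(src):
            if line_number % 4 == 3:  # fourth line of each record
                line = sanger_to_illumina_quality(line.rstrip("\n")) + "\n"
            dst.write(line)
```

Calling convert_fastq("input.fastq", "output.fastq") would mirror the file names used in the seqret command.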


          • #6
            Conversion to Illumina 1.3 FASTQ

            Hi maubp,

            Thanks for your help. I ran the command and it did not give me any error. Here is part of the file before running the command (before, Sanger) and after running the command (after, Illumina). I don't see a difference. Did I miss something or do something wrong?
            I just inserted the input and output file names. Any advice, please?

            (Before, Sanger)

            @HWI-ST413:193:D092FACXX:1:1101:1180:1912 1:N:0:
            NATGTACCTGACGAAGCAGCTACCATCTCAGCAGTTGCTGGTCACTGTGCAGTGGAAAAGAGAGAAGTGCATGAAGTCAGCAATTATACTTGGCCTGGAAG
            +
            #1=DDDFFHGHHGHAHGIIIIEIGHGHGIIGDIACDHIIIIHBDHHGIEHIIFHIGCHG@@FGIEHI=CHGEFCB?DCDFAECCECCDACCCC>>A@BBCC
            @HWI-ST413:193:D092FACXX:1:1101:1225:1915 1:N:0:
            NAACAGAATAAAGATTATAATTACATTTGATTTAGTTCCAAAAACGGAGTCAAAAATCTTAACCTTTGACAAGACCTGTGTAAAGAAGCTGAGGTAAGCAT
            +
            #1:BDAABFDB9E:E:<?IECFEABA<EEF@@C?B?<FFGII<0?09BGFEC>8=)=B8B4=)7.)7=2=D))7)=@B@BBB96>6;ABB5;>@BBB
            @HWI-ST413:193:D092FACXX:1:1101:1201:1926 1:N:0:
            NAGGCTTCCTTCATCTCTCCTCTACACAATCTCTTCCTAGTCTTGCTATAGCCAAATTTGTCTCCTTGCTGTTTGTGAAGAAGCCAAACATATTTCTACCT
            +
            #1=DADDFHHHHDGIIFCHIGGHGIDHEIJIGEIGIIIFGGHIJIJIJJJDIIEFGIIJJCHIIIIJGGHCHJJIGGIIGHGFEE>CEFECCEEEEECCC>
            @HWI-ST413:193:D092FACXX:1:1101:1176:1929 1:N:0:
            NCATCTCCAAGTTGCTAAAGCCTAATGAGAAAAAAAAATGGTAAATATCCATATCATCTCTTATGATGAAAAGCTATTATGTTTTCAAAACTTAACTAAAC
            +
            #11AB;BDDDBDBEEEBEAEEBEFIIIEEIIIIIIDIIIEIEEEEEEIEEECCEEII;7?;;ACCDDD;?D@A96(;>D>A>AD?A>A>:9AAAAAAAAA9
            @HWI-ST413:193:D092FACXX:1:1101:1249:1946 1:N:0:
            NAATTTAACCAACAAGGTGAAATATCTGTTATACCAAAAATTATAAAACATTGAGGAAATTGCCGATGACACAAATAAGTGGAAAGGTATCCCATGTTCAT
            +
            #11ADDDDHDHDHDAFA2<?EDFFF<BHHE@EFEEFEDEHHFBFFCFCGIIIFII9BHCGHGECGGIEDHCCEHDBEEEEDAC@CAB>@@CCCAAC>CD@>
            @HWI-ST413:193:D092FACXX:1:1101:1227:1952 1:N:0:
            NTCTGCCTTTACCTTCAAAGTCTGAGCAAATATGATTTTATATCTTTTTAATTAGAGATTCTTTTAAAGACCAAGTTACTGCAGTCCTGTCTTGTTCTTCT
            +
            #1=DDDFFHHGGHJJGIJGHEHHHDFHFHFHGIIFFIIIJJCHEIIJIGICHEIIFGIJJJIGGGIGCHCGIIJIHEHCHIGEHHCEHFBE>@DEECDAC@
            @HWI-ST413:193:D092FACXX:1:1101:1157:1988 1:N:0:
            CNAGAAGCGCTAACAATTATTTTGTATGATCAATAGAGAATTGCAACAGTTTTTGTTGTGTTGATACTCAATGACTTATGATGCTGAAAAACTAGTGAGGA
            +
            @#1ADDDDGFFHHJJJJJIJIIJICGIIHIGIEGCGGHHEHGIDHIEHI@FIHJIIIJGHGGIEGICHEHEEHCB;CFEF@CEECCACDCDDCDDCCCCAA
            @HWI-ST413:193:D092FACXX:1:1101:1225:2000 1:N:0:
            TTTGTTTACATTCTATTCGATTCCATTCCATTTGAATCAATTATATTGCAATTTATTGCATTGGAGTCCGTTCAAATGCACTCCATACCGTTCCATTCCAT
            +

            (After, Illumina)

            @HWI-ST413:193:D092FACXX:1:1101:1180:1912 1:N:0:
            NATGTACCTGACGAAGCAGCTACCATCTCAGCAGTTGCTGGTCACTGTGCAGTGGAAAAGAGAGAAGTGCATGAAGTCAGCAATTATACTTGGCCTGGAAG
            +
            BP\ccceegfggfg`gfhhhhdhfgfgfhhfch`bcghhhhgacggfhdghheghfbgf__efhdgh\bgfdeba^cbce`dbbdbbc`bbbb]]`_aabb
            @HWI-ST413:193:D092FACXX:1:1101:1225:1915 1:N:0:
            NAACAGAATAAAGATTATAATTACATTTGATTTAGTTCCAAAAACGGAGTCAAAAATCTTAACCTTTGACAAGACCTGTGTAAAGAAGCTGAGGTAAGCAT
            +
            BPYac``aYcecaXdYdY[^hdbed`a`[dde__b^a^[eefhh[OYH^OXafedb]W\H\aWaS\HVMHV\Q\cHHVH\_a_aaaXU]UZ`aaTZ]_aaa
            @HWI-ST413:193:D092FACXX:1:1101:1201:1926 1:N:0:
            NAGGCTTCCTTCATCTCTCCTCTACACAATCTCTTCCTAGTCTTGCTATAGCCAAATTTGTCTCCTTGCTGTTTGTGAAGAAGCCAAACATATTTCTACCT
            +
            BP\c`cceggggcfhhebghffgfhcgdhihfdhfhhheffghihihiiichhdefhhiibghhhhiffgbgiihffhhfgfedd]bdedbbdddddbbb]
            @HWI-ST413:193:D092FACXX:1:1101:1176:1929 1:N:0:
            NCATCTCCAAGTTGCTAAAGCCTAATGAGAAAAAAAAATGGTAAATATCCATATCATCTCTTATGATGAAAAGCTATTATGTTTTCAAAACTTAACTAAAC
            +
            BPP`aZacccacadddad`ddadehhhddhhhhhhchhhdhddddddhdddbbddhhZV^ZZ`bbcccZ^c_`XUGZ]c]`]`c^`]`]YX`````````X
            @HWI-ST413:193:D092FACXX:1:1101:1249:1946 1:N:0:
            NAATTTAACCAACAAGGTGAAATATCTGTTATACCAAAAATTATAAAACATTGAGGAAATTGCCGATGACACAAATAAGTGGAAAGGTATCCCATGTTCAT
            +
            BPP`ccccgcgcgc`e`Q[^dceee[aggd_deddedcdggeaeebebfhhhehhXagbfgfdbffhdcgbbdgcaddddc`b_b`a]__bbb``b]bc_]
            @HWI-ST413:193:D092FACXX:1:1101:1227:1952 1:N:0:
            NTCTGCCTTTACCTTCAAAGTCTGAGCAAATATGATTTTATATCTTTTTAATTAGAGATTCTTTTAAAGACCAAGTTACTGCAGTCCTGTCTTGTTCTTCT
            +
            BP\ccceeggffgiifhifgdgggcegegegfhheehhhiibgdhhihfhbgdhhefhiiihfffhfbgbfhhihgdgbghfdggbdgead]_cddbc`b_
            @HWI-ST413:193:D092FACXX:1:1101:1157:1988 1:N:0:
            CNAGAAGCGCTAACAATTATTTTGTATGATCAATAGAGAATTGCAACAGTTTTTGTTGTGTTGATACTCAATGACTTATGATGCTGAAAAACTAGTGAGGA
            +
            _BP`ccccfeeggiiiiihihhihbfhhghfhdfbffggdgfhcghdgh_ehgihhhifgffhdfhbgdgddgbaZbede_bddbb`bcbccbccbbbb``
            @HWI-ST413:193:D092FACXX:1:1101:1225:2000 1:N:0:
            TTTGTTTACATTCTATTCGATTCCATTCCATTTGAATCAATTATATTGCAATTTATTGCATTGGAGTCCGTTCAAATGCACTCCATACCGTTCCATTCCAT
            +
            bbbeceeegggggiiiiiiiiiiiiiihiiiiiiifiiihiifhiiiihiiiiihiiihiiiiiiiiiighhiiiiiiiiiiiihgggggeeeedceeddd
            @HWI-ST413:193:D092FACXX:1:1101:1361:1913 1:N:0:
            NTCACAGTCCCAGTGGGCCTTGTCTGTCACTGAGTTACAAGCCACACTCAATCCCTGGAGATGCTGAGTGCTGTTAATGGACACGTGATGCCGGCTAAACA
            +
            BP\accdeac`gagfafff`fdg`eadghhh]df^b[cffh`g_efdfhbcgfhffbgggefffgghfbbgggeg_db]ababdb_aabbbb__aY[G]]b
            @HWI-ST413:193:D092FACXX:1:1101:1439:1915 1:N:0:
            NCATGTCAACTACTTGTGATGAGTTTCTGAGTCTAGCAAAGTCCGTAAACCCTAGTATTTCTCTCCTTTTTTCCCTGCAGAAAGGATCTTGCTCTGTGGCC
            +
            BPYc^caccaea^ef``ba[ba[[^ecgbagd`eadc[ee_fdfdedeccede_^e^_ecaaeffedfdefede_V^^_a_cR]`a`aaa``aaa`]`[_^
            @HWI-ST413:193:D092FACXX:1:1101:1383:1918 1:N:0:
            NAGTGATCCTCTTAACTAATGCTTAAGCTCCAATTTCTTGCCATAGTGCTTATCACAGATTGTACTCCTAAGACTGACCTCCAGATTTATCTCCTGAAGCA
            +


            • #7
              Originally posted by mathew:
              Hi maubp,

              Thanks for your help. I ran the command and it did not give me any error. Here is part of the file before running the command (before, Sanger) and after running the command (after, Illumina). I don't see a difference. Did I miss something or do something wrong?
              I just inserted the input and output file names. Any advice, please?
              That has changed the data - look at the first record for instance,
              (Before, Sanger)
              Code:
              @HWI-ST413:193:D092FACXX:1:1101:1180:1912 1:N:0:
              NATGTACCTGACGAAGCAGCTACCATCTCAGCAGTTGCTGGTCACTGTGCAGTGGAAAAGAGAGAAGTGCATGAAGTCAGCAATTATACTTGGCCTGGAAG
              +
              #1=DDDFFHGHHGHAHGIIIIEIGHGHGIIGDIACDHIIIIHBDHHGIEHIIFHIGCHG@@FGIEHI=CHGEFCB?DCDFAECCECCDACCCC>>A@BBCC
              (After, Illumina)
              Code:
              @HWI-ST413:193:D092FACXX:1:1101:1180:1912 1:N:0:
              NATGTACCTGACGAAGCAGCTACCATCTCAGCAGTTGCTGGTCACTGTGCAGTGGAAAAGAGAGAAGTGCATGAAGTCAGCAATTATACTTGGCCTGGAAG
              +
              BP\ccceegfggfg`gfhhhhdhfgfgfhhfch`bcghhhhgacggfhdghheghfbgf__efhdgh\bgfdeba^cbce`dbbdbbc`bbbb]]`_aabb
              The fourth line, which holds the qualities, has changed. I've not double-checked, but it looks OK.
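
A quick way to do that double check is to decode both quality lines into integer scores and compare them. The snippet below is a small sketch using the first record quoted in this post; the helper name decode_quality is made up for illustration. Sanger FASTQ stores PHRED + 33 and Illumina 1.3+ stores PHRED + 64, so the decoded lists should be identical if the conversion worked.

```python
def decode_quality(quality, offset):
    """Turn an ASCII quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality]


# Quality lines of the first record, copied from the before/after samples.
before = "#1=DDDFFHGHHGHAHGIIIIEIGHGHGIIGDIACDHIIIIHBDHHGIEHIIFHIGCHG@@FGIEHI=CHGEFCB?DCDFAECCECCDACCCC>>A@BBCC"
after = r"BP\ccceegfggfg`gfhhhhdhfgfgfhhfch`bcghhhhgacggfhdghheghfbgf__efhdgh\bgfdeba^cbce`dbbdbbc`bbbb]]`_aabb"

sanger_scores = decode_quality(before, 33)    # Sanger: ASCII offset 33
illumina_scores = decode_quality(after, 64)   # Illumina 1.3+: ASCII offset 64

print(sanger_scores[:5])                 # [2, 16, 28, 35, 35]
print(sanger_scores == illumina_scores)  # True
```

The two lines look different as text but carry the same scores, which is what a correct re-encoding should produce.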


              • #8
                SRA database fastq format

                Hello, I want to ask a question: when I download FASTQ format directly from the SRA database, it looks like the following. How can I convert it into data that I can analyse directly? I have no idea how to deal with it. Can anybody help me? Thank you!

                @SRR031126.1.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:41.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.1.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:41.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.2.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:69.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.2.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:69.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.3.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:129.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.3.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:129.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.4.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:154.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.4.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:154.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.5.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:171.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.5.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:171.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.6.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:273.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.6.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:273.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.7.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:374.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
                +SRR031126.7.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:374.1 length=76
                !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
                @SRR031126.8.1 SOLEXA-GA02_SRi_AK_BN_test:1:1:0:404.1 length=76
                NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN


                • #9
                  Something is very wrong with that data - all the bases are N and all the qualities are zero (the "!" is ASCII 33, which in Sanger FASTQ encodes PHRED zero). Perhaps this is just an edge effect - the first and last reads on a Solexa/Illumina run are not as good as those in the middle of the slide.

                  How exactly did you get this data from the SRA?
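
To see whether the all-N reads are only an edge effect or affect the whole file, one could count them. The following is a rough sketch that assumes four-line FASTQ records; summarize_fastq is a hypothetical helper name and the file name in the usage note is only an example.

```python
def summarize_fastq(lines):
    """Return (total reads, reads whose bases are all N) for 4-line FASTQ text."""
    total = 0
    all_n = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # second line of each record holds the bases
            total += 1
            bases = line.strip().upper()
            if bases and set(bases) == {"N"}:
                all_n += 1
    return total, all_n


# "!" is ASCII 33, so with the Sanger offset of 33 it decodes to PHRED 0.
print(ord("!") - 33)  # 0
```

Passing an open file handle works, for example summarize_fastq(open("SRR031126.fastq")). If only a small share of the reads at the start and end are all N, the edge-effect explanation is plausible; if nearly every read is, the download itself is suspect.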


                  • #10
                    Originally posted by maubp:
                    Something is very wrong with that data - all the bases are N and all the qualities are zero (the "!" is ASCII 33, which in Sanger FASTQ encodes PHRED zero). Perhaps this is just an edge effect - the first and last reads on a Solexa/Illumina run are not as good as those in the middle of the slide.

                    How exactly did you get this data from the SRA?
                    I downloaded the data by selecting the filtered download option and then choosing FASTQ format. By the way, there are two ways to get FASTQ files: one is to download FASTQ directly from the SRA, as I did; the other is to download the .sra files first and then convert them to FASTQ. Does anyone know how the FASTQ files from the two routes differ? Thank you!


                    • #11
                      Can it go wrong if I do

                      fastq-dump --split-3 --gzip SRR012345.sra

                      ??
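
One way to check whether that command went wrong is to count the records it wrote. The sketch below assumes paired-end data, for which fastq-dump --split-3 is expected to write SRR012345_1.fastq.gz and SRR012345_2.fastq.gz (plus SRR012345.fastq.gz for reads without a mate). The helper name count_fastq_records is hypothetical, and the file names simply follow the placeholder accession in the post.

```python
import gzip


def count_fastq_records(path):
    """Count records in a gzipped FASTQ file with four lines per record."""
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)
    if line_count % 4 != 0:
        # A truncated download or conversion often shows up here first.
        raise ValueError(f"{path}: {line_count} lines is not a multiple of 4")
    return line_count // 4


# Example use (file names are assumptions based on the command in the post):
# mate1 = count_fastq_records("SRR012345_1.fastq.gz")
# mate2 = count_fastq_records("SRR012345_2.fastq.gz")
# print(mate1 == mate2)  # the two mate files should hold the same number of reads
```

Unequal counts between the two mate files, or a line count that is not a multiple of four, would suggest the conversion or the download was incomplete.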
