  • Data format of NGS file

    Hello,
    I have been dealing with some strange data recently. I suspect it is output from Illumina's older CASAVA pipeline (pre-1.8) on a GAIIx. Unfortunately, the lab where I work has no information about this data, and the service provider has shut down. Can anybody tell me, from experience, what data format this is? I assumed it was the older Solexa format and tried to convert it to FASTQ using maq sol2sanger, but I get the following error:

    "Inconsistent sequence name: HWUSI-EAS1642R_0000:5:1:2782:993#GATCAG/1. Continue anyway.
    Segmentation fault (core dumped)"


    Sample file:

    @HWUSI-EAS1642R_0000:5:1:2782:993#GATCAG/1
    NCATTAAAGCAATCATCCATCCTGACCAAGTAGGTTTTATTCCAGGAATGCAGGGATGGTTTAATATACGAAAATCCATCAATGTAATCCACTATATAAA
    +HWUSI-EAS1642R_000:5:1:2782:993#GATCAG/1
    BGIFGHIHHI[W[YYVYYYYY[Y[[YYYYY[[[Y[VVPVOQQ____________TTVVTVYYYYYYYYYRWWWWW___Y_____PYYYYYWWPWWYYVYR
    @HWUSI-EAS1642R_0000:5:1:3250:998#GATCAG/1
    NACTATCCAGAGTCACTCAAAAGGGAGACAAGCACTTGGTGCCACATCAACACAAAACTCATAAGAGCTAGAAACACACTCAAAATTGATCATTAATATA
    +HWUSI-EAS1642R_000:5:1:3250:998#GATCAG/1
    BIGKMLLQRN_________WV___________________NRRTR_Y_________QQY________________[Y[[PWWWWW[YY[[WYYWY_____


    Any input will be much appreciated. Thanks!
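
One thing worth checking in the sample as posted: the "@" lines read HWUSI-EAS1642R_0000 while the "+" lines read HWUSI-EAS1642R_000. If the real files have the same mismatch (and it is not just a copy-paste slip), that may be what the "Inconsistent sequence name" message refers to. Below is a minimal Python sketch, assuming plain 4-line FASTQ records, that lists such mismatches. The function name is made up for illustration, and the sequence and quality strings are shortened placeholders, not the full reads.

```python
from itertools import islice


def mismatched_plus_lines(lines):
    """Yield (record_number, at_name, plus_name) for records whose
    non-empty '+' line does not repeat the '@' name exactly.

    Assumes strict 4-line records (no wrapped sequence lines).
    """
    it = iter(lines)
    record = 0
    while True:
        chunk = [line.rstrip("\r\n") for line in islice(it, 4)]
        if len(chunk) < 4:
            break
        record += 1
        header, _seq, plus, _qual = chunk
        at_name, plus_name = header[1:], plus[1:]
        # A bare "+" is fine; only a repeated name can disagree.
        if plus_name and plus_name != at_name:
            yield record, at_name, plus_name


# First record from the post, with sequence/quality shortened for brevity.
sample = [
    "@HWUSI-EAS1642R_0000:5:1:2782:993#GATCAG/1",
    "NCATT",
    "+HWUSI-EAS1642R_000:5:1:2782:993#GATCAG/1",
    "BGIFG",
]

for rec, at_name, plus_name in mismatched_plus_lines(sample):
    print(f"record {rec}: '@{at_name}' vs '+{plus_name}'")
```

A bare "+" on the third line is valid FASTQ, so rewriting mismatched "+" lines to a single "+" is one possible workaround if a strict parser rejects the file for this reason.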

  • #2
    A Google search for this error turns up the following thread, but I am not sure how well fiddling with the original sequencer output files, or the raw FASTQ files, will work out.

    http://sourceforge.net/p/maq/mailman...u-tokyo.ac.jp/

    Comment


    • #3
      Originally posted by ron128 View Post
      A Google search for this error turns up the following thread, but I am not sure how well fiddling with the original sequencer output files, or the raw FASTQ files, will work out.

      http://sourceforge.net/p/maq/mailman...u-tokyo.ac.jp/
      Hmm, I would suggest not converting the file format; in my experience it is of no use in this case. I had the same problem: after converting to FASTQ, I got runs of N bases ("NNNNN") all over my reads, so it is quite risky. I suggest you get hold of the FASTQ files for any further work; otherwise there is little point. Good luck!

      Comment


      • #4
        Originally posted by paa6 View Post
        Hmm, I would suggest not converting the file format; in my experience it is of no use in this case. I had the same problem: after converting to FASTQ, I got runs of N bases ("NNNNN") all over my reads, so it is quite risky. I suggest you get hold of the FASTQ files for any further work; otherwise there is little point. Good luck!
        Thanks a lot for your insight, sir! However, as I mentioned, the service provider has folded, and this is the only data we have access to. Is there any way I can convert this data to FASTQ format? What is this data format, anyway? Thanks for taking the time to reply. Much appreciated.

        Comment


        • #5
          At least from the 2 reads you posted, it looks like phred+64, so some variant of Illumina before 1.8. You can get an idea of whether it's Solexa by grepping the quality lines for ">" or "?". Solexa scores can be negative, so characters below "@" can show up in Solexa-scaled files, but they won't be present in any of the Illumina phred+64 formats.

          BTW, it's already a FASTQ file; you just need to tell your aligner how the quality scores are encoded (though many of the newer aligners only support phred+33).
          Last edited by dpryan; 03-26-2014, 02:46 AM.
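
The grep check described above can also be sketched in Python by looking at the range of characters in the quality lines. The thresholds used here are the commonly cited ones (phred+33 starts at "!", Solexa+64 can go down to ";", Illumina 1.3+ phred+64 starts at "@"). The function name is made up for illustration, and a small sample can be ambiguous, so treat the result as a guess and not a verdict.

```python
def guess_quality_encoding(quality_lines):
    """Guess the FASTQ quality encoding from the observed character range."""
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        raise ValueError("no quality characters supplied")
    lowest, highest = min(codes), max(codes)
    if lowest < 59:       # below ';' : only possible with a +33 offset
        return "phred+33"
    if lowest < 64:       # ';' to '?' : negative scores, so Solexa scaling
        return "solexa+64"
    if highest <= 74:     # '@' to 'J' could be high phred+33 or low phred+64
        return "ambiguous"
    return "phred+64"


# Quality lines of the two reads posted in this thread.
posted = [
    "BGIFGHIHHI[W[YYVYYYYY[Y[[YYYYY[[[Y[VVPVOQQ____________TTVVTVYYYYYYYYYRWWWWW___Y_____PYYYYYWWPWWYYVYR",
    "BIGKMLLQRN_________WV___________________NRRTR_Y_________QQY________________[Y[[PWWWWW[YY[[WYYWY_____",
]
print(guess_quality_encoding(posted))
```

On a real file it is worth feeding in many more quality lines than two, since low-quality characters may simply not occur in the first few reads.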

          Comment


          • #6
            Originally posted by ron128 View Post
            Thanks a lot for your insight, sir! However, as I mentioned, the service provider has folded, and this is the only data we have access to. Is there any way I can convert this data to FASTQ format? What is this data format, anyway? Thanks for taking the time to reply. Much appreciated.
            Ah, no need to be formal, by the way. And I am not "sir", I am "ma'am". I have analysed your file, and it is a FASTQ file. Use FastQC to analyse your sequences before doing anything further.

            Comment


            • #7
              Short-read alignment tools?

              Hi, my research is on short-read sequence alignment. How can I get online short-read alignment tools, and what input data format do they use?

              Comment


              • #8
                Originally posted by rajajjcet View Post
                Hi, my research is on short-read sequence alignment. How can I get online short-read alignment tools, and what input data format do they use?
                The most prevalent input data format is FASTQ: http://en.wikipedia.org/wiki/FASTQ_format

                List of NGS tools (look for aligners): http://seqanswers.com/wiki/Software/list

                Data can be downloaded from: http://www.ncbi.nlm.nih.gov/sra

                Comment


                • #9
                  Originally posted by dpryan View Post
                  At least from the 2 reads you posted, it looks like phred+64, so some variant of Illumina before 1.8. You can get an idea of whether it's Solexa by grepping the quality lines for ">" or "?". Solexa scores can be negative, so characters below "@" can show up in Solexa-scaled files, but they won't be present in any of the Illumina phred+64 formats.

                  BTW, it's already a FASTQ file; you just need to tell your aligner how the quality scores are encoded (though many of the newer aligners only support phred+33).
                  @ Mr Ryan: I might be mistaken, but I was fairly certain that this is not a FASTQ file. If you look at the fourth line, it has these weird stretches of "________" characters. What gives? With my limited expertise, I have not come across a FASTQ file like this.

                  Besides this, when I use a QC program like PRINSEQ to perform basic QC, it gives me the following error:
                  "ERROR: input file for -fastq is in UNKNOWN format not in FASTQ format. Exit program."

                  I wonder what the issue could be. At first I thought this might be a line-ending issue, so I ran dos2unix on the files. However, PRINSEQ still does not recognize them. Thanks for your input, Mr Ryan.

                  Comment


                  • #10
                    "_" is a valid quality character (ASCII 95, i.e. Q31 with a +64 offset).

                    Comment
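
To make the arithmetic behind that statement concrete, here is a small Python sketch that decodes a quality string into numeric scores by subtracting the offset. It assumes the phred+64 encoding suggested earlier in the thread; the function name is illustrative.

```python
def decode_quality(quality, offset=64):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(ch) - offset for ch in quality]


print(decode_quality("_"))     # [31]
print(decode_quality("BGIF"))  # [2, 7, 9, 6]
```

So the long runs of "_" in the posted reads are simply stretches of bases that all scored 31, assuming the +64 offset is right.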


                    • #11
                      Originally posted by paa6 View Post
                      Ah, no need to be formal, by the way. And I am not "sir", I am "ma'am". I have analysed your file, and it is a FASTQ file. Use FastQC to analyse your sequences before doing anything further.
                      http://www.bioinformatics.babraham.a...ojects/fastqc/

                      Ah, sorry for the confusion, ma'am! lol. Anyway, I wanted to use PRINSEQ with this data (I prefer it over FastQC because of its DUST low-complexity filter and its plots, which I find more intuitive). However, that is where the trouble started in the first place: on running PRINSEQ, I got an error that the files were not in FASTQ format. I tried running dos2unix, since these files were previously on a Windows system and our current server runs Unix. Thanks for your suggestion! Looks like I am still stuck at phase one of my project.

                      Comment


                      • #12
                        @ron128 if this is GAIIx data then it most certainly is in the phred+64 format. If you do want to run PRINSEQ then use the "-phred64" option to account for that format.
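
For tools that only accept phred+33, another route is to re-encode the quality lines. The following is a minimal Python sketch under two assumptions: the file really is phred+64 (not Solexa-scaled, where a plain shift would not be correct) and every record is exactly 4 lines. Both function names are made up for illustration.

```python
def phred64_to_phred33(quality):
    """Shift a phred+64 quality string to phred+33 (subtract 31 per character)."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64
        if score < 0:
            # Characters below '@' mean this is not plain phred+64.
            raise ValueError(f"character {ch!r} is below the phred+64 range")
        converted.append(chr(score + 33))
    return "".join(converted)


def convert_fastq_lines(lines):
    """Yield FASTQ lines with every 4th line (the qualities) re-encoded."""
    for number, line in enumerate(lines):
        line = line.rstrip("\r\n")
        yield phred64_to_phred33(line) if number % 4 == 3 else line


print(phred64_to_phred33("B_"))  # "#@"
```

The sketch raises an error on characters below "@" on purpose, so a Solexa-scaled file is not silently mis-converted.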

                        Comment


                        • #13
                          Originally posted by GenoMax View Post
                          @ron128 if this is GAIIx data then it most certainly is in the phred+64 format. If you do want to run PRINSEQ then use the "-phred64" option to account for that format.
                          Dear Sir/Madam, I have no clue which platform was used to sequence this data. The only thing I have been told is that the data is from NIH3T3 cell lines :/ The sequencing company has shut up shop, and unfortunately there is no way to be sure. I already tried the -phred64 option in PRINSEQ, but I got the same error saying that the input file is not in FASTQ format. Thanks for your thoughts on this.

                          Comment


                          • #14
                            That's an error in prinseq, then. Try some other program (e.g., fastQC). Alternatively, perform the grep command that I suggested earlier.

                            Comment


                            • #15
                              Originally posted by ron128 View Post
                              Ah, sorry for the confusion, ma'am! lol. Anyway, I wanted to use PRINSEQ with this data (I prefer it over FastQC because of its DUST low-complexity filter and its plots, which I find more intuitive). However, that is where the trouble started in the first place: on running PRINSEQ, I got an error that the files were not in FASTQ format. I tried running dos2unix, since these files were previously on a Windows system and our current server runs Unix. Thanks for your suggestion! Looks like I am still stuck at phase one of my project.
                              It is possible that the file is corrupted.

                              I have gone through your file again, and the read name has every element of the Illumina (pre-1.8) FASTQ naming scheme:
                              HWUSI-EAS1642R_0000 = instrument name (the _0000 suffix is presumably a run identifier)
                              5 = flowcell lane
                              1 = tile number within the flowcell lane
                              2782 = x coordinate
                              993 = y coordinate
                              #GATCAG = index sequence
                              /1 = member of a pair (read 1)
                              That is why I said it is FASTQ format. If programs still reject it, then something else in the file must be wrong. My expertise is limited to Illumina data.
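
For reference, the older (pre-1.8) Illumina read names are usually described as instrument:lane:tile:x:y#index/member. Here is a minimal Python sketch that splits a name along those lines. The regular expression and function name are illustrative assumptions, not part of any Illumina tool, and the sketch does not try to interpret the "_0000" suffix, which stays part of the instrument field.

```python
import re

# Assumed layout: instrument:lane:tile:x:y#index/member
OLD_ILLUMINA_NAME = re.compile(
    r"^@?(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+)"
    r":(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)/(?P<member>[12])$"
)


def parse_read_name(name):
    """Split an old-style Illumina read name into a dict of fields."""
    match = OLD_ILLUMINA_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not an old-style Illumina read name: {name!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y", "member"):
        fields[key] = int(fields[key])
    return fields


print(parse_read_name("@HWUSI-EAS1642R_0000:5:1:2782:993#GATCAG/1"))
```

Read this way, the first posted read comes from lane 5, tile 1, at cluster coordinates (2782, 993), with index GATCAG, and is read 1.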

                              Comment
