  • Sample FASTA data and questions about paired end reads

    Hello,

    I have some basic questions I think.

    1. Where can I find some sample FASTA files? I mean something like a file with the complete sequence (the result) and a file containing the reads that would yield the complete sequence through proper assembly. It shouldn't be that big. I want to play with it.

    2. Is it common to represent paired end reads in the FASTA format? How?

    3. I'm not sure if I understand it correctly but are there only two paired end reads? The paired end reads are from the ends of a DNA molecule (http://seqanswers.com/forums/showthread.php?t=503). Therefore we have two paired end reads and a whole bunch of other "normal" reads? Am I correct?


    Btw I'm no bioinformatician so I apologize for the stupid questions in advance.

  • #2
    Fasta file format is meant for plain sequence files (without quality information). There may be extensions of Fasta format but the normal usage is for plain sequence.

    What you are looking for are Fastq format files, which have become the de facto standard for NGS data.

    You can get fastq data files from the NCBI Short Read Archive (SRA) (http://www.ncbi.nlm.nih.gov/sra). A utility called the SRA Toolkit is needed to retrieve the data. Fastq files can be several gigabytes in size.


    • #3
      1. Probably JGI. They seem to have a LOT of bacterial datasets, which should be more tractable.
      2. No, that would be incredibly unusual. They're usually stored in fastq, typically in separate files.
      3. Each fragment sequenced produces two reads, one from each end. So if you sequence 100 million fragments, you'll have 200 million reads (100 million pairs). This is as opposed to single-end reads, where you just sequence one end of each fragment.


      • #4
        2. fastq format


        • #5
          Thank you for your answers.
          So I have to look up fastq and write a parser. I had hoped to avoid that.


          • #6
            2. Some assemblers require input in fasta format. For example, IDBA wants pairs to be consecutive sequences in a fasta file, and it comes bundled with a small script for converting from fastq to fasta.
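To illustrate the "pairs as consecutive sequences" layout mentioned above, here is a minimal Python sketch that merges two paired fastq files into one interleaved fasta stream. It is not the script bundled with IDBA; the function names `fastq_records` and `interleave_to_fasta` are made up for this example, and it assumes 4-line fastq records with both files listing the reads in the same order.

```python
import io
from itertools import zip_longest
from typing import Iterator, TextIO, Tuple


def fastq_records(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) from a fastq stream, assuming 4 lines per record."""
    while True:
        lines = [handle.readline() for _ in range(4)]
        if not lines[0]:
            return  # clean end of file
        header, seq, plus, _qual = (line.rstrip("\r\n") for line in lines)
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed fastq record near {header!r}")
        yield header[1:], seq


def interleave_to_fasta(r1: TextIO, r2: TextIO, out: TextIO) -> int:
    """Write mate 1 then mate 2 of each pair as consecutive fasta entries.

    Returns the number of pairs written. Quality strings are dropped,
    because plain fasta has no place for them.
    """
    pairs = 0
    for first, second in zip_longest(fastq_records(r1), fastq_records(r2)):
        if first is None or second is None:
            raise ValueError("the two files contain different numbers of reads")
        for name, seq in (first, second):
            out.write(f">{name}\n{seq}\n")
        pairs += 1
    return pairs


if __name__ == "__main__":
    # Tiny made-up example: one fragment, two mates.
    mate1 = io.StringIO("@frag1/1\nACGT\n+\nIIII\n")
    mate2 = io.StringIO("@frag1/2\nTTAA\n+\nIIII\n")
    merged = io.StringIO()
    interleave_to_fasta(mate1, mate2, merged)
    print(merged.getvalue(), end="")
```

With real files you would pass handles from `open(...)` instead of `io.StringIO`. Check the assembler's own documentation for any naming or ordering requirements, since this sketch only covers the general layout.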

            Also, don't reinvent the wheel. I'm sure there are many OSS fastq parsers available. A good place to start could be https://github.com/samtools/htslib
            savetherhino.org


            • #7
              HTSlib doesn't have a fastq parser. Anyway, with any modern data it's fine to assume that fastq entries are always 4 lines, so a parser is then trivial to write.
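As a rough illustration of how small such a parser can be, here is a sketch in Python that relies on the 4-lines-per-entry assumption. The names `FastqRecord` and `read_fastq` are invented for this example, and the checks are minimal; older multi-line fastq files would not parse with it.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str      # header line without the leading "@"
    sequence: str  # base calls
    quality: str   # encoded quality string, same length as sequence


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a fastq stream, assuming exactly 4 lines per entry."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between or after records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of header, got {header!r}")
        if not plus.startswith("+"):
            raise ValueError(f"expected '+' separator line, got {plus!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, qual)


if __name__ == "__main__":
    # Made-up two-read example.
    demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGG\n+\n##\n")
    for record in read_fastq(demo):
        print(record.name, record.sequence, record.quality)
```

For a file on disk, `read_fastq(open("reads.fastq"))` would work the same way (the file name here is only a placeholder).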


              • #8
                If you're using Python, there's a decent parser already at https://scipher.wordpress.com/2010/0...-fastq-parser/
                Scott Monsma
                Sr Scientist at Lucigen
