Announcement

Collapse

Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

Questions about .fastq format and size

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Questions about .fastq format and size

    Hello,

    I am working a .fastq file. I would like to use TopHat to convert .fastq to BAM files, and then convert the BAM files to count data to find differentially expressed genes. Upon looking at the .fastq files, I have derived two questions:

    1) An example from my .fastq file is below:

    @SRR452349.7 solid0196_2009082_cho_WT_lib_A_bcSample1_1_21_724_F3 length=50
    T32312021301103013102021230203203000020001100110.00
    +SRR452349.7 solid0196_2009082_cho_WT_lib_A_bcSample1_1_21_724_F3 length=50
    !)?1333=4;;*</&99>([email protected]/(9:=64/:;4-55<6453>241&8!.7

    This looks different than the .fastq file posted on Wiki:

    @SEQ_ID
    GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
    +
    !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

    I am concerned about the second line. My second line has numbers instead of characters, and also contains dots. First, is it okay that it contains numbers instead of characters? Second, is it okay that it contains dots?

    It has been recommended to me to replace each dot of this line with "N". However, I worry that doing so will also replace dots from other lines (such as the dot in the first line @SRR452349.7) . So, is it really necessary for me to do so, and if so, is there a safe script to use?

    2) I have several of these .fastq files to compare, and each is about 222629380 lines (~15G). I am worried about time constraints, and wonder if it would be suitable for me to only use the same first X lines of the .fastq files? If so, what would be a means to determine an appropriate X value? That is, what would be too few lines to extract?

    Also, as I am using TopHat2 to convert the .fastq files (and in so doing, I am aligning the .fastq files with the reference genome), it would not be a problem for me to only include some lines, right?

    *If you see any problems with my general pipeline of converting .fastq to BAM files via TopHat2, and then converting BAM files to count tables, please let me know!

    Thank you!

  • #2
    First off "convert" isn't really the right word for fastq to bam; you "map" or "align" it to a reference. Second, the amount of computation that requires depends on the amount of reads, the error rate of the reads, and the size of the reference.

    They're both fastq, but yours is Solid Colorspace. You have to map it with a colorspace mapper to a colorspace index. BBMap used to handle Solid but it doesn't anymore, sorry. You can always use the Bioscope software, and I think bwa handles colorspace. Not sure about TopHat2, though IIRC TopHat did support colorspace at one time.

    You can throw away as many reads as you want. It's best to do it randomly, though, rather than using only the first X reads, because especially with Solid, the first X reads (say, 1 million) are often very low quality compared to the "middle" reads. But that reduces sensitivity and accuracy. Don't do it unless you have very limited computational resources, which, if you do, then you might need a different plan that does not involve processing NGS data. A lot of genes are expressed at VERY low levels and the difference between 10 reads mapping to a gene and 0 reads is often the difference between being able to determine the differential expression or not, depending on the analysis software. For example, I've read that cufflinks tends to utterly fail in any case where one sample has 0 expression, probably because you can't calculate the log of 0.
    Last edited by Brian Bushnell; 04-28-2014, 04:49 PM.

    Comment


    • #3
      Brian:

      Thank you for the useful information!

      1) If the "mapping" of .fastq to BAM files depends on the reference genome size, then would it even help much for me to only select X lines from the .fastq file, as the reference genome size remains constant?

      2) Thanks for the information. Upon a Google search, it seems TopHat2 does support colorspace, but only with the older version of Bowtie (http://seqanswers.com/forums/showthread.php?t=19096). This seems a bit complicated because I am using a remote server and cannot install things there, and I just think it sounds pretty difficult to run both TopHat2 with an older version of Bowtie. I do not have any experience with TopHat2 and wonder if it is too finicky.

      When you said "You have to map it with a colorspace mapper to a colorspace index.", do you mean before inputting it to TopHat2?

      3) I am beginning to worry this dataset is too large and too esoteric (being a "colorspace"). I am doing this for a quick project. I wonder if anyone knows of a dataset with control and variable (exposed to uv or not, nutrient deprived or not, etc.) for a much smaller transcriptome (say e.coli instead of human) that is not colorspace. I might try to find differentially expressed genes from something like that instead. But where is a trustworthy place to search?

      Sorry for so many questions here!!

      Comment


      • #4
        Originally posted by lanner View Post
        Brian:

        Thank you for the useful information!

        1) If the "mapping" of .fastq to BAM files depends on the reference genome size, then would it even help much for me to only select X lines from the .fastq file, as the reference genome size remains constant?
        Yes, the time will be proportional to (# reads) x (function(size of reference)). What the function is depends on the mapping program. So mapping time is linearly proportional to the number of reads, but also related to the size of the reference (and how repetitive it is). For example, a bacteria might be 3 million bases, while human is 3 billion bases, a factor of 1000 difference. But mapping to a bacteria might only be 20x as fast as mapping to human, for the same number of reads.

        2) Thanks for the information. Upon a Google search, it seems TopHat2 does support colorspace, but only with the older version of Bowtie (http://seqanswers.com/forums/showthread.php?t=19096). This seems a bit complicated because I am using a remote server and cannot install things there, and I just think it sounds pretty difficult to run both TopHat2 with an older version of Bowtie. I do not have any experience with TopHat2 and wonder if it is too finicky.

        When you said "You have to map it with a colorspace mapper to a colorspace index.", do you mean before inputting it to TopHat2?
        No, in this case TopHat2 would be the colorspace mapper. You don't NEED to use a splice-aware aligner for RNA-seq data if all you care about is gene expression quantification. You could directly use Bowtie or bwa, instead. Reads that cross introns will not be mapped, but you can still quantify expression from the reads that don't span introns, with some induced error due to different exon lengths (not good enough for a publication, but good enough for experimenting - incidentally, BBMap is splice-aware and can correctly handle DNA and RNA data). Also, you can map to the transcriptome rather than the genome, which in mammals is maybe 20x smaller (it excludes introns and intra-gene areas). It's less accurate, less repeatable, and highly subjective, though, so I would never put transcriptome-mapped data in a peer-reviewed paper. But for a computationally-limited quick-and-dirty scenario, when you don't care about alternative splicing, transcriptome-mapping should usually give a good approximation of reality, particularly when using reads that are much shorter than exon length.

        If you use bacteria, that's even better! You can use whatever DNA mapper you want, because they have very few introns, so they don't need splice-aware alignment. Of course prokaryotes DO have splicing, it's just rare enough that for a non-publication-quality experiment, you can ignore it.

        3) I am beginning to worry this dataset is too large and too esoteric (being a "colorspace"). I am doing this for a quick project. I wonder if anyone knows of a dataset with control and variable (exposed to uv or not, nutrient deprived or not, etc.) for a much smaller transcriptome (say e.coli instead of human) that is not colorspace. I might try to find differentially expressed genes from something like that instead. But where is a trustworthy place to search?

        Sorry for so many questions here!!
        There is a lot of public sequence data available for all kinds of organisms, at places like NCBI's short read archive. Unfortunately, NCBI has decided to store them in a non-standard "sra" format, probably to make life difficult for everyone, but they do provide a tool that can convert it to fastq. They have data from various platforms - you DO want to avoid Solid data; colorspace is hard to deal with, and it has a very high error rate with very short reads. Any other platform is fine. Most of the data might be human and mouse, but they do have various organisms.
        Last edited by Brian Bushnell; 04-28-2014, 05:55 PM.

        Comment


        • #5
          Thanks, Brian!

          I plan to use DESeq2 for my analysis. I just need to eventually generate a count table. This is taken from their vignette:

          ###################

          For example, the paired-end RNA-Seq reads for the parathyroidSE package were aligned using TopHat2 with 8 threads, with the call:

          tophat2 -o file_tophat_out -p 8 genome file_1.fastq file_2.fastq
          samtools sort -n file_tophat_out/accepted_hits.bam _sorted

          The second line sorts the reads by name rather than by genomic position, which is necessary for counting paired-end reads within Bioconductor. This command uses the SAMtools software3. The BAM files for a number of sequencing runs can then be used to generate count matrices, as described in the following section.

          ###################

          I have the .fastq files (in colorspace). So do you believe I could just use bowtie instead of tophat2 (without having to worry about the .fastq being colorspace and with the benefit of bowtie being faster than tophat2), followed by samtools?

          I am interested in speeding this process up, so I think I will use the human reference transcriptome. Does this seem reasonable:

          mrna.fa.gz - Human mRNA from GenBank (from the website http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/). Mapping the .fastq file to this transcriptome will not likely cause problems with using bowtie and samtools?

          And you think bowtie will be faster than tophat2, right? I am more concerned about speed than accuracy, so thank you for these recommendations! I am also thinking of only keeping 1/10 of the .fastq lines.

          Also, one question: You do not think my .fastq files (the second line) has an issue to have dots there? I know you explained that it being "colorspace" is why it is numbers, not letters, but the dots are also okay?

          Many many many thanks!!

          Comment


          • #6
            Originally posted by lanner View Post
            I have the .fastq files (in colorspace). So do you believe I could just use bowtie instead of tophat2 (without having to worry about the .fastq being colorspace and with the benefit of bowtie being faster than tophat2), followed by samtools?
            That should be fine; and yes, Bowtie is faster than Tophat; and Bowtie is also faster than Bowtie2, which is faster than Tophat2. But you DO have to worry about colorspace - be sure to use the colorspace flags when indexing the reference and when mapping.

            I am interested in speeding this process up, so I think I will use the human reference transcriptome. Does this seem reasonable:

            mrna.fa.gz - Human mRNA from GenBank (from the website http://hgdownload.cse.ucsc.edu/goldenPath/hg19/bigZips/). Mapping the .fastq file to this transcriptome will not likely cause problems with using bowtie and samtools?
            That looks fine. Yes, bowtie should be able to map the majority of short colorspace reads to that reference.

            And you think bowtie will be faster than tophat2, right?
            It is; I've tested them. Bowtie1 is way faster than tophat2 (but, of course, much less sensitive and less accurate; it can't handle indels).

            I am more concerned about speed than accuracy, so thank you for these recommendations! I am also thinking of only keeping 1/10 of the .fastq lines.
            That's fine, as long as the lines you keep are evenly distributed throughout the file - for example, 10% at random, or precisely every 10th line.

            Also, one question: You do not think my .fastq files (the second line) has an issue to have dots there? I know you explained that it being "colorspace" is why it is numbers, not letters, but the dots are also okay?

            Many many many thanks!!
            In Solid colorspace, the dot is a "no-call", where the sequencer cannot determine what the sequence was. This is equivalent to an 'N' in a normal base-space read. Any program that can handle colorspace data will automatically convert a '.' to a no-call.

            Comment


            • #7
              Brian:

              Thanks so much again! You have helped me greatly with all your answers...

              Comment


              • #8
                note that if you have single-end reads (only one FASTQ file), you don't need to use the samtools line to sort the reads by name.

                Comment


                • #9
                  Michael: Thanks! I did not realize that!

                  Comment


                  • #10
                    I am trying to use the htseq-count command with the data I have, but am getting zero counts. I am wondering if it is because the mrna.fa file (listed earlier in this thread) is not compatible with the hg19.gtf file I inputted to htseq-count.

                    I wrote more detail about this problem here:

                    http://seqanswers.com/forums/showthread.php?t=43056

                    Comment


                    • #11
                      gtf files are generally designed to be used with the whole genome fasta, meaning the transcriptome fasta is not needed for mapping or analysis.

                      Comment


                      • #12
                        Thanks, Brian! These past few days )

                        Once I have the count table, I will feel much better, because I am actually familiar with what to do from there!

                        I keep feeling like I am almost done, but now I wonder if I need to go back and redo things?

                        I am not sure what your last comment means? Are you suggesting that I download a different .gtf file for htseq-count? I don't think I can remap because I have to present something about this on Tuesday (

                        The details of my problem are here:

                        http://seqanswers.com/forums/showthread.php?t=43056

                        I read on FAQ of htseq-count that the .gtf file from UCSC should not be used. I am thinking the problem with my work is that mrna.fa that was used to map and create SAM has different names than the .gtf file I use for htseq-count. But I have been searching the internet for four hours, and am not sure what to use for the .gtf.

                        Any ideas greatly appreciated....

                        Comment


                        • #13
                          Originally posted by lanner View Post
                          I am thinking the problem with my work is that mrna.fa that was used to map and create SAM has different names than the .gtf file I use for htseq-count.
                          Yes - analysis downstream of mapping is dependent on the names all matching. The gtf file was probably made to annotate the genome, and thus has chromosome names, while the mrna fasta file would have gene names.

                          If all you need is counts per gene, and you mapped to the transcriptome, then that's fairly easy to accomplish. For example, BBTools has a program called "pileup", that you can run on the raw sam file:

                          pileup.sh -Xmx1g -da in=mapped.sam out=stats.txt

                          This will output information in 6 columns:

                          "ID Length Ref_GC Avg_fold Base_Coverage Read_GC"

                          (The GC ones are probably irrelevant to most people, but the people at JGI like them)

                          Anyway, that will give you, for each gene (named in the ID column), its length and average fold coverage. Using those two things you could calculate the total number of reads mapped to the gene or RPKM. I'm sure you could accomplish similar things with something like samtools or bedtools, but I don't know the command; but personally, I prefer my pileup program because you don't need to convert things to a sorted bam file first.

                          P.S. Although now that I think about it, I have no idea if it will process a colorspace sam file, or throw an exception... with the "-da" flag (disable assertions) in the command line, it will probably work.
                          Last edited by Brian Bushnell; 05-04-2014, 08:34 PM.

                          Comment


                          • #14
                            Brian:

                            Thank you so much! I had never heard of pileup before! And, I searched on Galaxy, and they have the tool there. Which is very ideal for me, as I have trouble installing my own software (I plan to take a course in the fall for that - as it is a must with NGS!)

                            Anyway, I found a program on Galaxy called "Generate pileup from BAM dataset". And I have my BAM files.

                            To test it, I used one BAM file and the mrna.fa file as inputs.

                            It generated a 6-column dataset that is of this type:

                            Six column pileup:

                            1 2 3 4 5 6
                            ---------------------------------
                            chrM 412 A 2 ., II
                            chrM 413 G 4 ..t, IIIH
                            chrM 414 C 4 ...a III2
                            chrM 415 C 4 TTTt III7
                            where:

                            Column Definition
                            ------- ----------------------------
                            1 Chromosome
                            2 Position (1-based)
                            3 Reference base at that position
                            4 Coverage (# reads aligning over that position)
                            5 Bases within reads where (see Galaxy wiki for more info)
                            6 Quality values (phred33 scale, see Galaxy wiki for more)


                            Here is the first ten lines of my actual output:

                            AF116642 792 t 1 ^!. ]
                            AF116642 793 g 1 . ]
                            AF116642 794 t 1 . ]
                            AF116642 795 g 1 . ]
                            AF116642 796 t 1 . ]
                            AF116642 797 g 1 . ]
                            AF116642 798 t 1 . ]
                            AF116642 799 g 1 . ]
                            AF116642 800 t 1 . ]
                            AF116642 801 g 1 . ]

                            I notice that several lines can be for simply one chromosome (here the first ten lines are all for chromosome AF116642).

                            So, I would like to get this into the correct count table format for DESEQ2. This requires just x columns, where the first column is the name of unique chromosomes, and the x-1 columns are each the read counts for the x-1 total BAM files.

                            I am just checking that my thinking is right, but if I just:

                            1) For each BAM file, wrote a script that sums all numbers from the fourth columns corresponding to every unique type of chromosome from the first column, and outputs those two columns (chromosome, and summed read count for that chromosome)
                            2) concatenated the resulting files together into one file that had unique chromosome names as the first column, and summed read counts for that chromosome for each BAM file in all the rest of the columns.*

                            * I noticed when I compared the unique chromosome names from the output of pileup files that they are not the same for each inputted BAM file. So, I would need to simply place a "zero" if a BAM file does not include a chromosome that at least one of the other BAM files contained. Because the output of pileup does not count "zero" chromosomes, it seems...

                            I am not sure if my wording makes sense...?

                            Comment


                            • #15
                              lanner,

                              That's close, but unfortunately, not quite right, because it does not account for read length, gene length, or areas of 0 coverage. What you want is fold coverage, or RPKM/FPKM, or total number of reads aligned. But in your case, there is one line per covered position in a gene, but only if that line has any coverage - from that alone, you cannot determine how long the genes are. Since each read can cover L bases, where L is the length of the read, longer reads would inflate the scores and shorter reads would deflate them. Similarly, longer genes would appear to have higher counts than shorter genes.

                              That said, if you assume every read has the same length (which is usually the case for untrimmed Illumina reads, and typically false for any other platform, including Solid, which normally has read 2 shorter than read 1), your method would work if you divided your result by the gene length, which you can calculate from the fasta file.

                              Ideally, you need some kind of pileup tool that can just count the number of reads mapped to each scaffold, not per-base coverage.

                              You could even write a quick program yourself, parsing the sam file (it's just tab-delimited text, one line per read); just increment a counter in a hashtable for every line that maps to a particular scaffold name. The results might not be exactly right if the reads were paired or if secondary alignments were allowed, but some random tool you download might not be right in those case, either; and anyway, they're a bit subjective. If you only have a bam file, you can translate it back to sam with samtools.

                              Comment

                              Working...
                              X