Welcome to the New Seqanswers!

Welcome to the new Seqanswers! We'd love your feedback, please post any you have to this topic: New Seqanswers Feedback.
See more
See less

Questions about .fastq format and size

  • Filter
  • Time
  • Show
Clear All
new posts

  • #16

    Thanks for all your advice! It turns out I need to rerun BWA to get new SAM files, as they had no hits at all. My SAM file just had a long list of lines like this:

    @SQ SN:AF001540 LN:1781
    @SQ SN:AF001541 LN:1138
    @SQ SN:AF001542 LN:2992
    @SQ SN:AF001543 LN:903
    @SQ SN:AF001544 LN:434
    @SQ SN:AF001545 LN:370
    @SQ SN:AF001546 LN:1142
    @SQ SN:AF001547 LN:1092
    @SQ SN:AF034176 LN:7232
    @SQ SN:AF038950 LN:2384

    followed by the 11-column section like this:

    SRR452342.29913521 4 * 0 0 * * 0 0 AATCCTTTCGTGTTGCCCAAAAATACGCTCCATTTAATCCGTCTTGTCCG 5/&#0/41$-+#(#'%(#$,$$2'%'&#)%)),+&$#1&-'$/+2'#%''
    SRR452342.29913531 4 * 0 0 * * 0 0 GTCCATGTCGTCATTTTAAAGTCTTATTTTGGTAACTAAACAATTACATA &8,/<'(6:/9815-/-6868&)41)'#4#.**/%5%&(&146$*(*$/*
    SRR452342.29913541 4 * 0 0 * * 0 0 CGGGCTACCTCCGCTCCTGTTAGGCTACGCCGTCGAGGCCGTGGGCGGCG 95<5>;8<7<8,;;>98<:>9,98(;,$4#<,##*,#.''&#)+$$,,#&
    SRR452342.29913551 4 * 0 0 * * 0 0 GTCCGGTTAGCCGGGCCGTGTGGGGATTGGATAAGGTAGGGTCTTATAAT #######$%##$'##################$##########%#####'#

    However, once I get SAM files with hits, I think I will try your last suggested approach. It is the most straight-forward. However, I am a bit unclear about what you are suggesting.

    I see that the SAM file has 11-column tab-delimited format. I also see that each column has the following meanings:

    1 QNAME Query NAME of the read or the read pair
    2 FLAG bitwise FLAG (pairing, strand, mate strand, etc.)
    3 RNAME Reference sequence NAME
    4 POS 1-based leftmost POSition of clipped alignment
    5 MAPQ MAPping Quality (Phred-scaled)
    6 CIGAR extended CIGAR string (operations: MIDNSHP)
    7 MRNM Mate Reference NaMe (‘=’ if same as RNAME)
    8 MPOS 1-based leftmost Mate POSition
    9 ISIZE inferred Insert SIZE
    10 SEQ query SEQuence on the same strand as the reference
    11 QUAL query QUALity (ASCII-33=Phred base quality)

    So, when you suggest to "just increment a counter in a hashtable for every line that maps to a particular scaffold name", which columns are your referring to? I assume the scaffold name is the first column. And any line, regardless of its information in the other columns, should add a counter for that scaffold? Each line can only contribute one point for the counter, right?

    For instance, in the four consecutive lines in my .SAM file above, I should get count table like:

    SRR452342.29913521 1
    SRR452342.29913531 1
    SRR452342.29913541 1
    SRR452342.29913551 1

    Since each of those four scaffold names appeared once (even though the columns they had were often of value zero). It just seems strange because I took 1/10 lines from my .fastq file (to keep it small in size), and if I did it this way, I would jus get exactly one count for each scaffold!?

    I must be misinterpreting.

    Also, from the count table, I would like to identify the gene names of DEGs. I am a bit unclear how these scaffold names can be translated to gene names.


    • #17
      When mapping to a transcriptome, you would just use the RNAME column. The QNAME is the name of the read itself (e.g "SRR452342.29913521"); the RNAME (e.g. "AF001540", but always "*" for an unmapped read) is the name of the reference scaffold (gene) it maps to. And you'd skip the lines starting with "@" (although the "@SQ" lines will tell you the name and length of the genes, which may be useful).

      The fourth column (POS) is the location in the gene the read maps to; that's irrelevant for your purposes. Each line is a single read. So you only need to look at column 3, RNAME, and sum the number of lines for each individual RNAME.

      As for translating scaffold names into gene names... that's problematic. Hopefully wherever you got the mrna file also has a translation file. Different sources name genes in all kinds of ways; some places use (mutually incompatible) numbers, some use names (like BRCA), and so forth. It's really a nightmare. You might want to try various places (NCBI, UCSB, etc) to find a transcriptome fasta with scaffold names that are directly useful.


      • #18
        Thank you, Brian. I am glad to know someone intelligent also thinks part of the process can be a "nightmare" ). Thanks again for this information; I learned a lot from it...