Brian:
Thanks for all your advice! It turns out I need to rerun BWA to get new SAM files, as they had no hits at all. My SAM file just had a long list of lines like this:
@SQ SN:AF001540 LN:1781
@SQ SN:AF001541 LN:1138
@SQ SN:AF001542 LN:2992
@SQ SN:AF001543 LN:903
@SQ SN:AF001544 LN:434
@SQ SN:AF001545 LN:370
@SQ SN:AF001546 LN:1142
@SQ SN:AF001547 LN:1092
@SQ SN:AF034176 LN:7232
@SQ SN:AF038950 LN:2384
followed by the 11-column section like this:
SRR452342.29913521 4 * 0 0 * * 0 0 AATCCTTTCGTGTTGCCCAAAAATACGCTCCATTTAATCCGTCTTGTCCG 5/�/41$-+#(#'%(#$,$$2'%'&#)%)),+&$#1&-'$/+2'#%''
SRR452342.29913531 4 * 0 0 * * 0 0 GTCCATGTCGTCATTTTAAAGTCTTATTTTGGTAACTAAACAATTACATA &8,/<'(6:/9815-/-6868&)41)'#4#.**/%5%&(&146$*(*$/*
SRR452342.29913541 4 * 0 0 * * 0 0 CGGGCTACCTCCGCTCCTGTTAGGCTACGCCGTCGAGGCCGTGGGCGGCG 95<5>;8<7<8,;;>98<:>9,98(;,$4#<,##*,#.''&#)+$$,,#&
SRR452342.29913551 4 * 0 0 * * 0 0 GTCCGGTTAGCCGGGCCGTGTGGGGATTGGATAAGGTAGGGTCTTATAAT #######$%##$'##################$##########%#####'#
However, once I get SAM files with hits, I think I will try your last suggested approach. It is the most straight-forward. However, I am a bit unclear about what you are suggesting.
I see that the SAM file has 11-column tab-delimited format. I also see that each column has the following meanings:
1 QNAME Query NAME of the read or the read pair
2 FLAG bitwise FLAG (pairing, strand, mate strand, etc.)
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition of clipped alignment
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string (operations: MIDNSHP)
7 MRNM Mate Reference NaMe (‘=’ if same as RNAME)
8 MPOS 1-based leftmost Mate POSition
9 ISIZE inferred Insert SIZE
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33=Phred base quality)
So, when you suggest to "just increment a counter in a hashtable for every line that maps to a particular scaffold name", which columns are your referring to? I assume the scaffold name is the first column. And any line, regardless of its information in the other columns, should add a counter for that scaffold? Each line can only contribute one point for the counter, right?
For instance, in the four consecutive lines in my .SAM file above, I should get count table like:
SRR452342.29913521 1
SRR452342.29913531 1
SRR452342.29913541 1
SRR452342.29913551 1
Since each of those four scaffold names appeared once (even though the columns they had were often of value zero). It just seems strange because I took 1/10 lines from my .fastq file (to keep it small in size), and if I did it this way, I would jus get exactly one count for each scaffold!?
I must be misinterpreting.
Also, from the count table, I would like to identify the gene names of DEGs. I am a bit unclear how these scaffold names can be translated to gene names.
Thanks for all your advice! It turns out I need to rerun BWA to get new SAM files, as they had no hits at all. My SAM file just had a long list of lines like this:
@SQ SN:AF001540 LN:1781
@SQ SN:AF001541 LN:1138
@SQ SN:AF001542 LN:2992
@SQ SN:AF001543 LN:903
@SQ SN:AF001544 LN:434
@SQ SN:AF001545 LN:370
@SQ SN:AF001546 LN:1142
@SQ SN:AF001547 LN:1092
@SQ SN:AF034176 LN:7232
@SQ SN:AF038950 LN:2384
followed by the 11-column section like this:
SRR452342.29913521 4 * 0 0 * * 0 0 AATCCTTTCGTGTTGCCCAAAAATACGCTCCATTTAATCCGTCTTGTCCG 5/�/41$-+#(#'%(#$,$$2'%'&#)%)),+&$#1&-'$/+2'#%''
SRR452342.29913531 4 * 0 0 * * 0 0 GTCCATGTCGTCATTTTAAAGTCTTATTTTGGTAACTAAACAATTACATA &8,/<'(6:/9815-/-6868&)41)'#4#.**/%5%&(&146$*(*$/*
SRR452342.29913541 4 * 0 0 * * 0 0 CGGGCTACCTCCGCTCCTGTTAGGCTACGCCGTCGAGGCCGTGGGCGGCG 95<5>;8<7<8,;;>98<:>9,98(;,$4#<,##*,#.''&#)+$$,,#&
SRR452342.29913551 4 * 0 0 * * 0 0 GTCCGGTTAGCCGGGCCGTGTGGGGATTGGATAAGGTAGGGTCTTATAAT #######$%##$'##################$##########%#####'#
However, once I get SAM files with hits, I think I will try your last suggested approach. It is the most straight-forward. However, I am a bit unclear about what you are suggesting.
I see that the SAM file has 11-column tab-delimited format. I also see that each column has the following meanings:
1 QNAME Query NAME of the read or the read pair
2 FLAG bitwise FLAG (pairing, strand, mate strand, etc.)
3 RNAME Reference sequence NAME
4 POS 1-based leftmost POSition of clipped alignment
5 MAPQ MAPping Quality (Phred-scaled)
6 CIGAR extended CIGAR string (operations: MIDNSHP)
7 MRNM Mate Reference NaMe (‘=’ if same as RNAME)
8 MPOS 1-based leftmost Mate POSition
9 ISIZE inferred Insert SIZE
10 SEQ query SEQuence on the same strand as the reference
11 QUAL query QUALity (ASCII-33=Phred base quality)
So, when you suggest to "just increment a counter in a hashtable for every line that maps to a particular scaffold name", which columns are your referring to? I assume the scaffold name is the first column. And any line, regardless of its information in the other columns, should add a counter for that scaffold? Each line can only contribute one point for the counter, right?
For instance, in the four consecutive lines in my .SAM file above, I should get count table like:
SRR452342.29913521 1
SRR452342.29913531 1
SRR452342.29913541 1
SRR452342.29913551 1
Since each of those four scaffold names appeared once (even though the columns they had were often of value zero). It just seems strange because I took 1/10 lines from my .fastq file (to keep it small in size), and if I did it this way, I would jus get exactly one count for each scaffold!?
I must be misinterpreting.
Also, from the count table, I would like to identify the gene names of DEGs. I am a bit unclear how these scaffold names can be translated to gene names.
Comment