Seqanswers Leaderboard Ad



No announcement yet.
  • Filter
  • Time
  • Show
Clear All
new posts

  • #16

    Thanks for all your advice! It turns out I need to rerun BWA to get new SAM files, as they had no hits at all. My SAM file just had a long list of lines like this:

    @SQ SN:AF001540 LN:1781
    @SQ SN:AF001541 LN:1138
    @SQ SN:AF001542 LN:2992
    @SQ SN:AF001543 LN:903
    @SQ SN:AF001544 LN:434
    @SQ SN:AF001545 LN:370
    @SQ SN:AF001546 LN:1142
    @SQ SN:AF001547 LN:1092
    @SQ SN:AF034176 LN:7232
    @SQ SN:AF038950 LN:2384

    followed by the 11-column section like this:

    SRR452342.29913521 4 * 0 0 * * 0 0 AATCCTTTCGTGTTGCCCAAAAATACGCTCCATTTAATCCGTCTTGTCCG 5/&#0/41$-+#(#'%(#$,$$2'%'&#)%)),+&$#1&-'$/+2'#%''
    SRR452342.29913531 4 * 0 0 * * 0 0 GTCCATGTCGTCATTTTAAAGTCTTATTTTGGTAACTAAACAATTACATA &8,/<'(6:/9815-/-6868&)41)'#4#.**/%5%&(&146$*(*$/*
    SRR452342.29913541 4 * 0 0 * * 0 0 CGGGCTACCTCCGCTCCTGTTAGGCTACGCCGTCGAGGCCGTGGGCGGCG 95<5>;8<7<8,;;>98<:>9,98(;,$4#<,##*,#.''&#)+$$,,#&
    SRR452342.29913551 4 * 0 0 * * 0 0 GTCCGGTTAGCCGGGCCGTGTGGGGATTGGATAAGGTAGGGTCTTATAAT #######$%##$'##################$##########%#####'#

    However, once I get SAM files with hits, I think I will try your last suggested approach. It is the most straight-forward. However, I am a bit unclear about what you are suggesting.

    I see that the SAM file has 11-column tab-delimited format. I also see that each column has the following meanings:

    1 QNAME Query NAME of the read or the read pair
    2 FLAG bitwise FLAG (pairing, strand, mate strand, etc.)
    3 RNAME Reference sequence NAME
    4 POS 1-based leftmost POSition of clipped alignment
    5 MAPQ MAPping Quality (Phred-scaled)
    6 CIGAR extended CIGAR string (operations: MIDNSHP)
    7 MRNM Mate Reference NaMe (‘=’ if same as RNAME)
    8 MPOS 1-based leftmost Mate POSition
    9 ISIZE inferred Insert SIZE
    10 SEQ query SEQuence on the same strand as the reference
    11 QUAL query QUALity (ASCII-33=Phred base quality)

    So, when you suggest to "just increment a counter in a hashtable for every line that maps to a particular scaffold name", which columns are your referring to? I assume the scaffold name is the first column. And any line, regardless of its information in the other columns, should add a counter for that scaffold? Each line can only contribute one point for the counter, right?

    For instance, in the four consecutive lines in my .SAM file above, I should get count table like:

    SRR452342.29913521 1
    SRR452342.29913531 1
    SRR452342.29913541 1
    SRR452342.29913551 1

    Since each of those four scaffold names appeared once (even though the columns they had were often of value zero). It just seems strange because I took 1/10 lines from my .fastq file (to keep it small in size), and if I did it this way, I would jus get exactly one count for each scaffold!?

    I must be misinterpreting.

    Also, from the count table, I would like to identify the gene names of DEGs. I am a bit unclear how these scaffold names can be translated to gene names.


    • #17
      When mapping to a transcriptome, you would just use the RNAME column. The QNAME is the name of the read itself (e.g "SRR452342.29913521"); the RNAME (e.g. "AF001540", but always "*" for an unmapped read) is the name of the reference scaffold (gene) it maps to. And you'd skip the lines starting with "@" (although the "@SQ" lines will tell you the name and length of the genes, which may be useful).

      The fourth column (POS) is the location in the gene the read maps to; that's irrelevant for your purposes. Each line is a single read. So you only need to look at column 3, RNAME, and sum the number of lines for each individual RNAME.

      As for translating scaffold names into gene names... that's problematic. Hopefully wherever you got the mrna file also has a translation file. Different sources name genes in all kinds of ways; some places use (mutually incompatible) numbers, some use names (like BRCA), and so forth. It's really a nightmare. You might want to try various places (NCBI, UCSB, etc) to find a transcriptome fasta with scaffold names that are directly useful.


      • #18
        Thank you, Brian. I am glad to know someone intelligent also thinks part of the process can be a "nightmare" ). Thanks again for this information; I learned a lot from it...


        Latest Articles


        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin

          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM
        • seqadmin
          Strategies for Sequencing Challenging Samples
          by seqadmin

          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
          03-22-2024, 06:39 AM





        Topics Statistics Last Post
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        Last Post seqadmin  
        Started by seqadmin, 04-04-2024, 09:00 AM
        0 responses
        Last Post seqadmin