  • How can I parse lines of a huge .sam file into a data frame, table, or list faster in R?

    I have an R script that reads the lines of a .sam file after mapping. I want to parse those lines into strings so that they are easier to manipulate, either to create the wig files I want or to calculate the cov3 and cov5 values I need.
    Can you please help me make this script run faster? How can I parse the lines of a huge .sam file into a data frame, or into a list, more quickly? Here is my script:

    gc()
    rm(list=ls())

    exptPath <- "/home/dimitris/INDEX3PerfectUnique31cov5.sam"


    lines <- readLines(exptPath)
    pos = lines
    pos
    chrom = lines
    chrom
    pos = ""
    chrom = ""
    nn = length(lines)
    nn

    # parse lines of sam file into strings(this part is very very slow)
    rr = strsplit(lines,"\t", fixed = TRUE)
    rr
    trr = do.call(rbind.data.frame, rr)
    pos = as.numeric(as.character(trr[8:nn,4]))
    # for cov3
    #pos = pos+25
    #pos
    chrom = trr[8:nn,3]
    pos = as.numeric(pos)
    pos

    tab1 = table(chrom,pos, exclude="")
    tab1

    ftab1 = as.data.frame(tab1)
    ftab1 = subset(ftab1, ftab1[3] != 0)
    ftab1 = subset(ftab1, ftab1[1] != "<NA>")
    oftab1 = ftab1[ order(ftab1[,1]), ]
    final.ftab1 = oftab1[,2:3]
    write.table(final.ftab1, "ind3_cov5_wig.txt", row.names=FALSE, sep=" ", quote=FALSE)

  • #2
    R stores all variables in memory, so if you would like to parse a large file, try writing a loop around readLines() that processes only one line at a time.
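A minimal sketch of such a loop, under some assumptions: it reads a block of lines per iteration rather than strictly one (a choice made here to keep loop overhead down), it assumes a tab-delimited SAM file whose header lines start with `@`, and the function name `count_sam_positions` and its `chunk_size` parameter are hypothetical, not from the original script.

```r
# Hypothetical helper: count reads per (RNAME, POS) without loading the
# whole SAM file into memory. Only a chunk of lines is held at any time.
count_sam_positions <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()  # one small count table per chunk

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break               # end of file

    lines <- lines[!startsWith(lines, "@")]      # drop header lines
    if (length(lines) == 0L) next

    fields <- strsplit(lines, "\t", fixed = TRUE)
    chrom <- vapply(fields, `[`, character(1), 3L)  # RNAME column
    pos   <- vapply(fields, `[`, character(1), 4L)  # POS column

    # Skip malformed lines and unmapped reads (RNAME == "*").
    keep <- !is.na(chrom) & !is.na(pos) & chrom != "*"
    if (!any(keep)) next

    key <- paste(chrom[keep], pos[keep], sep = "\t")
    pieces[[length(pieces) + 1L]] <-
      as.data.frame(table(key = key), stringsAsFactors = FALSE)
  }

  if (length(pieces) == 0L) {
    return(data.frame(chrom = character(0), pos = integer(0),
                      count = integer(0), stringsAsFactors = FALSE))
  }

  # Merge the per-chunk counts into one total per (RNAME, POS) key.
  combined <- do.call(rbind, pieces)
  totals <- tapply(combined$Freq, combined$key, sum)
  parts <- strsplit(names(totals), "\t", fixed = TRUE)

  data.frame(chrom = vapply(parts, `[`, character(1), 1L),
             pos   = as.integer(vapply(parts, `[`, character(1), 2L)),
             count = as.vector(totals),
             stringsAsFactors = FALSE)
}
```

Counting on a pasted `chrom`/`pos` key avoids the full chromosome-by-position cross table that `table(chrom, pos)` builds, which is where most of the zero-count rows in the original script come from.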



    • #3
      Are you absolutely set on using R for this? R is great for many things, but reading a whole SAM/BAM file into memory (which most of the R mechanisms for dealing with SAM/BAM files entail) isn't exactly the most efficient processing mechanism. You might find that pysam/python meets your flexibility needs while still delivering increased performance.



      • #4
        You're better off using a method that is essentially a map-reduce process. Use some other quick tool (i.e. one that doesn't take much longer than just reading out the reads using cat) to pre-process the BAM files into a format that is small and easy for another program to use. I notice that you're calling strsplit on every line but only using a subset of the columns; selecting those columns first with awk would be much faster:

        Code:
        samtools view input.bam | awk -F '\t' '{print $3,$4}' | sort | uniq -c > ref_pos_table.txt
        [using samtools view will also skip the header lines, which will have a variable length, rather than the "always 7 lines" that your R code suggests]

        Then proceed from the ftab1 lines in your R script, loading ref_pos_table.txt:

        Code:
        ftab1 <- read.table("ref_pos_table.txt",col.names = c("Count","RNAME","POS"));
        ftab1 <- subset(ftab1, RNAME != "*");
        ... etc
        Last edited by gringer; 05-28-2014, 04:45 PM.
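For what it's worth, the remaining steps might look something like the sketch below. It only strings together the read.table lines above with the ordering and write.table steps from the original script; the wrapper function `finish_wig_table` and its arguments are hypothetical, and sorting by POS within RNAME is an assumption (the original script orders by chromosome only).

```r
# Hypothetical wrapper: load the pre-counted table produced by the
# samtools/awk/sort/uniq pipeline and write position/count pairs.
finish_wig_table <- function(in_path, out_path) {
  # uniq -c pads the count with leading spaces; read.table's default
  # whitespace separator handles that.
  ftab1 <- read.table(in_path,
                      col.names = c("Count", "RNAME", "POS"),
                      stringsAsFactors = FALSE)

  # Unmapped reads have RNAME "*" in SAM output; drop them.
  ftab1 <- subset(ftab1, RNAME != "*")

  # Order by reference name, then by position (assumption, see above).
  oftab1 <- ftab1[order(ftab1$RNAME, ftab1$POS), ]

  # Keep position and count, matching the columns the original script wrote.
  final.ftab1 <- oftab1[, c("POS", "Count")]
  write.table(final.ftab1, out_path,
              row.names = FALSE, sep = " ", quote = FALSE)
  invisible(final.ftab1)
}

# Example call, using the file names from this thread:
# finish_wig_table("ref_pos_table.txt", "ind3_cov5_wig.txt")
```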
