  • How can I parse lines of a huge .sam file into a data frame, table, or list faster in R?

    I have an R script that reads the lines of a .sam file after mapping. I want to parse the lines of the SAM file into strings so that they are easier to manipulate, so I can create the WIG files I want or calculate the cov3 and cov5 values I need.
    Can you help me make this script run faster? How can I parse the lines of a huge .sam file into a data frame, or into a list, more quickly? Here is my script:

    # clear the workspace, then free memory
    rm(list = ls())
    gc()

    exptPath <- "/home/dimitris/INDEX3PerfectUnique31cov5.sam"
    lines <- readLines(exptPath)
    nn <- length(lines)

    # parse lines of the SAM file into strings (this part is very, very slow)
    rr <- strsplit(lines, "\t", fixed = TRUE)
    trr <- do.call(rbind.data.frame, rr)

    # columns 3 and 4 are RNAME and POS; the first 7 lines are the header
    chrom <- trr[8:nn, 3]
    pos <- as.numeric(as.character(trr[8:nn, 4]))
    # for cov3
    # pos <- pos + 25

    tab1 <- table(chrom, pos, exclude = "")

    ftab1 <- as.data.frame(tab1)
    ftab1 <- subset(ftab1, ftab1[3] != 0)
    ftab1 <- subset(ftab1, ftab1[1] != "<NA>")
    oftab1 <- ftab1[order(ftab1[, 1]), ]
    final.ftab1 <- oftab1[, 2:3]
    write.table(final.ftab1, "ind3_cov5_wig.txt",
                row.names = FALSE, sep = " ", quote = FALSE)

  • #2
    R stores all variables in memory, so if you would like to parse a large file, try to write a loop around readLines() and process only one line. Example.
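
    A minimal sketch of that approach (the chunk size, the function name, and the tally structure are illustrative, not from the original post; header lines in SAM start with "@"):

```r
# Count (RNAME, POS) pairs while streaming the file in chunks,
# so only chunk_size lines are held in memory at once.
count_positions <- function(path, chunk_size = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  counts <- list()   # key "RNAME POS" -> count
  repeat {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0) break
    chunk <- chunk[!startsWith(chunk, "@")]   # drop header lines
    for (ln in chunk) {
      f <- strsplit(ln, "\t", fixed = TRUE)[[1]]
      key <- paste(f[3], f[4])                # RNAME and POS columns
      n <- counts[[key]]
      counts[[key]] <- if (is.null(n)) 1L else n + 1L
    }
  }
  counts
}
```

    The per-line loop is still slow in R, but memory use stays flat no matter how big the file is.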

    • #3
      Are you absolutely set on using R for this? R is great for many things, but reading a whole SAM/BAM file into memory (which is what most of the R mechanisms for dealing with SAM/BAM files entail) isn't exactly the most efficient approach. You might find that pysam/Python meets your flexibility needs while still delivering better performance.
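
      As a sketch of the streaming idea in plain Python (the same per-read tally pysam would let you build from a BAM; the field indices follow the SAM spec, and the sample reads below are invented):

```python
from collections import Counter

def count_positions(lines):
    """Tally (RNAME, POS) pairs from SAM-formatted lines, skipping the header."""
    counts = Counter()
    for line in lines:
        if line.startswith("@"):                    # header lines begin with "@"
            continue
        fields = line.rstrip("\n").split("\t")
        counts[(fields[2], int(fields[3]))] += 1    # RNAME, POS
    return counts

sam = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t60\t31M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t100\t60\t31M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr2\t205\t60\t31M\t*\t0\t0\tACGT\tIIII",
]
print(count_positions(sam))   # Counter({('chr1', 100): 2, ('chr2', 205): 1})
```

      In practice you would iterate over `open("reads.sam")` or a `pysam.AlignmentFile` instead of an in-memory list, so the file is never fully loaded.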

      • #4
        You're better off with what is essentially a map/reduce process: use some other quick tool (i.e. one that doesn't take much longer than just streaming the reads out with cat) to pre-process the BAM file into something small and easy for another program to use. I notice you're using strsplit and then keeping only a subset of the columns; subsetting the columns first with awk would be much faster:

        Code:
        samtools view input.bam | awk -F '\t' '{print $3,$4}' | sort | uniq -c > ref_pos_table.txt
        [Using samtools view also skips the header lines, which have variable length, rather than assuming the "always 7 lines" that your R code does.]

        Then proceed from the ftab1 lines in your R script, loading ref_pos_table.txt:

        Code:
        ftab1 <- read.table("ref_pos_table.txt",col.names = c("Count","RNAME","POS"));
        ftab1 <- subset(ftab1, RNAME != "*");
        ... etc
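
        To see what the pre-processing stage produces, the awk/sort/uniq reduction can be tried on a tiny hand-made SAM fragment (the file names and reads here are invented; with a real BAM you would pipe in samtools view output as above):

```shell
# Minimal SAM-like lines: columns 3 and 4 are RNAME and POS.
printf 'r1\t0\tchr1\t100\nr2\t0\tchr1\t100\nr3\t0\tchr2\t205\n' > sample.sam

# Columns-first reduction: extract RNAME and POS, then count duplicates.
awk -F '\t' '{print $3,$4}' sample.sam | sort | uniq -c > ref_pos_table.txt

cat ref_pos_table.txt   # each line: count, RNAME, POS
```

        The resulting table is tiny compared to the original file, so the R side only has to aggregate and write the WIG output.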
        Last edited by gringer; 05-28-2014, 04:45 PM.
