  • How can I parse the lines of a huge .sam file into a data frame, table, or list faster in R?

    I have an R script that reads the lines of a SAM file after mapping, and I want to parse those lines into strings so that they are easier to manipulate, either to create the WIG files I want or to calculate the cov3 and cov5 values I need.
    Can you help me make this script run faster? How can I parse the lines of a huge SAM file into a data frame faster? Or into a list? Here is my script:

    gc()
    rm(list=ls())

    exptPath <- "/home/dimitris/INDEX3PerfectUnique31cov5.sam"

    # read the whole SAM file into memory as a character vector
    lines <- readLines(exptPath)
    # placeholders (overwritten below)
    pos = lines
    pos
    chrom = lines
    chrom
    pos = ""
    chrom = ""
    nn = length(lines)
    nn

    # parse lines of the SAM file into strings (this part is very very slow)
    rr = strsplit(lines, "\t", fixed = TRUE)
    rr
    trr = do.call(rbind.data.frame, rr)
    # skip the 7 header lines and keep POS (column 4) and RNAME (column 3)
    pos = as.numeric(as.character(trr[8:nn, 4]))
    # for cov3
    #pos = pos + 25
    #pos
    chrom = trr[8:nn, 3]
    pos = as.numeric(pos)
    pos

    # count reads per (chromosome, position)
    tab1 = table(chrom, pos, exclude = "")
    tab1

    # keep non-zero, non-NA entries, order by chromosome, and write the wig-style table
    ftab1 = as.data.frame(tab1)
    ftab1 = subset(ftab1, ftab1[3] != 0)
    ftab1 = subset(ftab1, ftab1[1] != "<NA>")
    oftab1 = ftab1[ order(ftab1[,1]), ]
    final.ftab1 = oftab1[, 2:3]
    write.table(final.ftab1, "ind3_cov5_wig.txt", row.names = FALSE, sep = " ", quote = FALSE)

  • #2
    R stores all variables in memory, so if you would like to parse a large file, try writing a loop around readLines() and processing the file one line (or one small chunk of lines) at a time instead of reading it all at once. A sketch of that approach is shown below.
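    A minimal sketch of that chunked idea in R, added here for illustration (not code from the original reply): the 100,000-line chunk size and the paste()-based counting key are assumptions, and columns 3 and 4 are taken to be RNAME and POS, as in the script above.

    Code:
    con <- file("/home/dimitris/INDEX3PerfectUnique31cov5.sam", open = "r")
    counts <- list()
    repeat {
      chunk <- readLines(con, n = 100000)             # read up to 100,000 lines per iteration (assumed chunk size)
      if (length(chunk) == 0) break                   # stop at end of file
      chunk <- chunk[substr(chunk, 1, 1) != "@"]      # drop SAM header lines
      fields <- strsplit(chunk, "\t", fixed = TRUE)
      key <- vapply(fields, function(f) paste(f[3], f[4]), character(1))   # "RNAME POS" keys
      tab <- table(key)                               # per-chunk counts
      counts[[length(counts) + 1]] <- setNames(as.integer(tab), names(tab))
    }
    close(con)
    combined <- unlist(counts)                        # named vector of per-chunk counts
    total_counts <- tapply(combined, names(combined), sum)   # sum counts across chunks
    Only one chunk is held in memory at a time, so the peak memory use stays roughly constant regardless of file size.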



    • #3
      Are you absolutely set on using R for this? R is great for many things, but reading a whole SAM/BAM file into memory (which most of the R mechanisms for dealing with SAM/BAM files entail) isn't exactly the most efficient processing mechanism. You might find that pysam/python meets your flexibility needs while still delivering increased performance.



      • #4
        You're better off using a method that is essentially a map-reduce process: use some other quick tool (i.e. one that doesn't take much longer than just reading out the reads with cat) to pre-process the BAM files into a format that is small and easy for another program to use. I notice that you're using strsplit and then only a subset of the columns, where subsetting the columns first with awk would be much better:

        Code:
        samtools view input.bam | awk -F '\t' '{print $3,$4}' | sort | uniq -c > ref_pos_table.txt
        [Using samtools view will also skip the header lines, which vary in number from file to file, rather than the "always 7 lines" that your R code assumes.]

        Then proceed from the ftab1 lines in your R script, loading ref_pos_table.txt:

        Code:
        ftab1 <- read.table("ref_pos_table.txt",col.names = c("Count","RNAME","POS"));
        ftab1 <- subset(ftab1, RNAME != "*");
        ... etc
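        One way the rest might look, mirroring the ordering and write-out steps from the script in the question (a sketch only, not part of this reply; the output file name ind3_cov5_wig.txt comes from the original script):

        Code:
        # continue from the filtered count table: order by reference and position,
        # keep the position and count columns, and write the wig-style text file
        oftab1 <- ftab1[order(ftab1$RNAME, ftab1$POS), ];
        final.ftab1 <- oftab1[, c("POS", "Count")];
        write.table(final.ftab1, "ind3_cov5_wig.txt", row.names = FALSE, sep = " ", quote = FALSE);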

