  • How can I parse the lines of a huge .sam file into a data frame, table, or list faster in R?

    I have an R script that reads the lines of a .sam file produced by mapping. I want to parse those lines into strings so that they are easier to manipulate, either to create the wig files I want or to calculate the cov3 and cov5 values I need.
    Can you please help me make this script run faster? How can I parse the lines of a huge .sam file into a data frame, or into a list, more quickly? Here is my script:

    gc()
    rm(list=ls())

    exptPath <- "/home/dimitris/INDEX3PerfectUnique31cov5.sam"


    lines <- readLines(exptPath)
    pos = lines
    pos
    chrom = lines
    chrom
    pos = ""
    chrom = ""
    nn = length(lines)
    nn

    # parse lines of the sam file into strings (this part is very, very slow)
    rr = strsplit(lines,"\t", fixed = TRUE)
    rr
    trr = do.call(rbind.data.frame, rr)
    pos = as.numeric(as.character(trr[8:nn,4]))
    # for cov3
    #pos = pos+25
    #pos
    chrom = trr[8:nn,3]
    pos = as.numeric(pos)
    pos

    tab1 = table(chrom,pos, exclude="")
    tab1

    ftab1 = as.data.frame(tab1)
    ftab1 = subset(ftab1, ftab1[3] != 0)
    ftab1 = subset(ftab1, ftab1[1] != "<NA>")
    oftab1 = ftab1[ order(ftab1[,1]), ]
    final.ftab1 = oftab1[,2:3]
    write.table(final.ftab1, "ind3_cov5_wig.txt", row.names=FALSE, sep=" ", quote=FALSE)

  • #2
    R stores all variables in memory, so if you would like to parse a large file, try writing a loop around readLines() and processing only one line at a time. Example.
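
A minimal sketch of the loop described above, assuming the goal is the per-position read counts that the original script builds with table(). The function name count_sam_positions and the chunk_size argument are hypothetical. Reading a block of lines per iteration instead of a single line is a choice made here to cut per-call overhead; memory use stays bounded by the chunk size plus the running summary.

```r
# Hypothetical helper: tally reads per (chrom, pos) without loading the whole file.
count_sam_positions <- function(path, chunk_size = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  pieces <- list()
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0L) break               # end of file
    lines <- lines[!startsWith(lines, "@")]      # drop SAM header lines
    if (length(lines) == 0L) next
    fields <- strsplit(lines, "\t", fixed = TRUE)
    chrom <- vapply(fields, `[`, character(1), 3L)              # RNAME
    pos <- as.integer(vapply(fields, `[`, character(1), 4L))    # POS
    keep <- !is.na(chrom) & chrom != "*" & !is.na(pos)          # skip unmapped reads
    if (!any(keep)) next
    d <- data.frame(chrom = chrom[keep], pos = pos[keep], count = 1L,
                    stringsAsFactors = FALSE)
    # Summarise the chunk so that only counts, not raw lines, are retained.
    pieces[[length(pieces) + 1L]] <- aggregate(count ~ chrom + pos, data = d, FUN = sum)
  }
  if (length(pieces) == 0L) {
    return(data.frame(chrom = character(0), pos = integer(0), count = integer(0)))
  }
  # Merge the per-chunk summaries into one table.
  total <- aggregate(count ~ chrom + pos, data = do.call(rbind, pieces), FUN = sum)
  total[order(total$chrom, total$pos), ]
}
```

Calling count_sam_positions(exptPath) would return one row per chromosome and position with its read count, which can then be passed to write.table() as in the original script.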

    • #3
      Are you absolutely set on using R for this? R is great for many things, but reading a whole SAM/BAM file into memory (which most of the R mechanisms for dealing with SAM/BAM files entail) isn't exactly the most efficient processing mechanism. You might find that pysam/python meets your flexibility needs while still delivering increased performance.

      • #4
        You're better off using a method that is essentially a map-reduce process. Use some other quick tool (i.e. one that doesn't take much longer than just reading out the reads using cat) to pre-process the BAM files into a format that is small and easy for another program to use. I notice that you run strsplit on every line but then use only a subset of the columns; subsetting the columns first using awk would be much better:

        Code:
        samtools view input.bam | awk -F '\t' '{print $3,$4}' | sort | uniq -c > ref_pos_table.txt
        [using samtools view will also skip the header lines, which will have a variable length, rather than the "always 7 lines" that your R code suggests]

        Then proceed from the ftab1 lines in your R script, loading ref_pos_table.txt:

        Code:
        ftab1 <- read.table("ref_pos_table.txt",col.names = c("Count","RNAME","POS"));
        ftab1 <- subset(ftab1, RNAME != "*");
        ... etc
        Last edited by gringer; 05-28-2014, 04:45 PM.
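
A sketch of how the remaining steps might look once ref_pos_table.txt exists, assuming the same output layout as the original script (position and count, ordered by reference name). The function name write_position_counts and its arguments are hypothetical; the column names follow the read.table call above.

```r
# Hypothetical helper: turn the "uniq -c" table into the final two-column file.
write_position_counts <- function(in_path, out_path) {
  # quote = "" and comment.char = "" keep characters such as '#' in
  # reference names from being treated as quotes or comments.
  ftab1 <- read.table(in_path, col.names = c("Count", "RNAME", "POS"),
                      quote = "", comment.char = "", stringsAsFactors = FALSE)
  ftab1 <- ftab1[ftab1$RNAME != "*", ]                 # drop unmapped reads
  oftab1 <- ftab1[order(ftab1$RNAME, ftab1$POS), ]     # sort by reference, then position
  final.ftab1 <- oftab1[, c("POS", "Count")]
  write.table(final.ftab1, out_path, row.names = FALSE, sep = " ", quote = FALSE)
  invisible(final.ftab1)
}
```

For example, write_position_counts("ref_pos_table.txt", "ind3_cov5_wig.txt") would write the same kind of file as the last line of the original script. No zero-count filter is needed here because uniq -c only reports pairs that actually occur.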
