**TiborNagy** · 05-28-2014, 05:08 AM

R stores all variables in memory, so if you would like to parse a large file, try writing a loop around readLines() that processes only one line at a time.

**dpryan** · 05-28-2014, 05:12 AM

Are you absolutely set on using R for this? R is great for many things, but reading a whole SAM/BAM file into memory (which most of the R mechanisms for dealing with SAM/BAM files entail) isn't exactly the most efficient processing mechanism. You might find that pysam/python meets your flexibility needs while still delivering increased performance.

**gringer** · 05-28-2014, 02:48 PM

You're better off using a method that is essentially a map-reduce process. Use some other quick tool (i.e. doesn't take much longer than just reading out the reads using cat) to pre-process the BAM files into a format that is small and easy for another program to use. I notice that you're using strsplit and just using a subset of the columns, where subsetting the columns first using awk would be much better:

Code:

samtools view input.bam | awk -F '\t' '{print $3,$4}' | sort | uniq -c > ref_pos_table.txt

[using samtools view will also skip the header lines, which will have a variable length, rather than the "always 7 lines" that your R code suggests]
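If the input is already a plain-text .sam file rather than a BAM, the same table can be built without samtools, as long as the header is filtered out explicitly. A minimal sketch, assuming header lines are exactly those starting with "@" (the SAM format does not allow "@" in read names); the function name count_ref_pos and the file example.sam are made up for illustration:

```shell
# Hypothetical helper: count alignments per (RNAME, POS) in a plain-text SAM file.
count_ref_pos() {
    # $1 is the path to a SAM file; header lines start with "@" and are skipped.
    awk -F '\t' '!/^@/ {print $3, $4}' "$1" | sort | uniq -c
}

# Tiny made-up SAM file for illustration (two header lines, three alignments).
printf '@HD\tVN:1.0\tSO:unsorted\n' > example.sam
printf '@SQ\tSN:chr1\tLN:1000\n' >> example.sam
printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> example.sam
printf 'r2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n' >> example.sam
printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' >> example.sam

# Write the count table in the same layout as the samtools pipeline.
count_ref_pos example.sam > ref_pos_table.txt
```

The resulting ref_pos_table.txt has the same three whitespace-separated columns (count, RNAME, POS) as the samtools pipeline in this post, so the read.table step should not need changing.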

Then proceed from the ftab1 lines in your R script, loading ref_pos_table.txt:

Code:

ftab1 <- read.table("ref_pos_table.txt",col.names = c("Count","RNAME","POS"));
ftab1 <- subset(ftab1, RNAME != "*");
... etc
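The RNAME != "*" filter can also be pushed into the awk stage, so that reads with no reference name never reach R. A rough sketch under that assumption; filter_ref_pos is a hypothetical helper name, and it reads SAM records from standard input so it can sit behind samtools view:

```shell
# Hypothetical helper: print "RNAME POS" for records that have a reference name.
filter_ref_pos() {
    # Reads header-free SAM records on stdin; column 3 is RNAME, column 4 is POS.
    awk -F '\t' '$3 != "*" {print $3, $4}'
}

# Made-up records for illustration: two aligned to chr1, one without a reference.
{
    printf 'r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r2\t16\tchr1\t250\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
} | filter_ref_pos | sort | uniq -c > ref_pos_filtered.txt
```

With a real BAM this would be something like `samtools view input.bam | filter_ref_pos | sort | uniq -c > ref_pos_table.txt`, after which the subset() line in the R script becomes redundant.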

how can i parse lines of a huge .sam file into a data frame, table, list faster in R?
