**TiborNagy** · 05-28-2014, 05:08 AM

R stores all variables in memory, so if you want to parse a large file, try writing a loop around readLines() and processing only one line at a time.

**dpryan** · 05-28-2014, 05:12 AM

Are you absolutely set on using R for this? R is great for many things, but reading a whole SAM/BAM file into memory (which most of the R mechanisms for dealing with SAM/BAM files entail) isn't exactly the most efficient processing mechanism. You might find that pysam/python meets your flexibility needs while still delivering increased performance.
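To make the streaming idea concrete, here is a minimal Python sketch that reads SAM text one line at a time and keeps only a running count per (RNAME, POS), so memory use depends on the number of distinct positions rather than the number of reads. It uses only the standard library and plain-text SAM, not pysam, to stay self-contained; the function name `count_ref_pos` and the sample records are made up for illustration.

```python
import io
from collections import Counter


def count_ref_pos(lines):
    """Count alignments per (RNAME, POS) from an iterable of SAM text lines.

    Hypothetical helper for illustration; not part of pysam or any library.
    """
    counts = Counter()
    for line in lines:
        # Header lines start with '@'; their number varies from file to file.
        if line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        rname, pos = fields[2], int(fields[3])
        if rname == "*":
            continue  # unmapped read, no reference name
        counts[(rname, pos)] += 1
    return counts


# Tiny made-up SAM sample: two header lines, three mapped reads, one unmapped.
sample = io.StringIO(
    "@HD\tVN:1.4\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:1000\n"
    "r1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r2\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"
    "r3\t16\tchr1\t250\t60\t4M\t*\t0\t0\tTTGA\tIIII\n"
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n"
)

counts = count_ref_pos(sample)
for (rname, pos), n in sorted(counts.items()):
    print(n, rname, pos)
```

For a real file, the same function should work on any iterable of lines, for example an open file handle on a .sam file or `sys.stdin` when the script is fed from `samtools view input.bam`.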

**gringer** · 05-28-2014, 02:48 PM

You're better off using a method that is essentially a map-reduce process. Use some other quick tool (i.e. one that doesn't take much longer than just reading out the reads using cat) to pre-process the BAM files into a format that is small and easy for another program to use. I notice that you're using strsplit and then keeping only a subset of the columns; subsetting the columns first with awk would be much better:

Code:

samtools view input.bam | awk -F '\t' '{print $3,$4}' | sort | uniq -c > ref_pos_table.txt

[using samtools view will also skip the header lines, which will have a variable length, rather than the "always 7 lines" that your R code suggests]

Then proceed from the ftab1 lines in your R script, loading ref_pos_table.txt:

Code:

ftab1 <- read.table("ref_pos_table.txt",col.names = c("Count","RNAME","POS"));
ftab1 <- subset(ftab1, RNAME != "*");
... etc


how can i parse lines of a huge .sam file into a data frame, table, list faster in R?
