**Brian Bushnell** · 06-08-2015, 09:17 AM

You can randomly subsample with reformat like this:

reformat.sh in=reads.fq out=sampled.fq samplerate=0.1

...which will reduce the data to 10%, or roughly 0.5 days' worth of runtime for your script.

If your paired reads are in 2 files use the in2 and out2 flags; pairs will be kept together. Otherwise, if they're interleaved, that will be autodetected.
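
For anyone curious what rate-based subsampling amounts to, here is a minimal Python sketch. It is not how reformat.sh is implemented; it only illustrates the idea, assuming plain 4-line FASTQ records in two parallel files. One random draw is made per pair, so mates always stay together. The names `read_fastq` and `subsample_pairs` are made up for this example.

```python
import random
from itertools import islice


def read_fastq(handle):
    """Yield FASTQ records as tuples of 4 lines (assumes unwrapped sequences)."""
    while True:
        record = tuple(islice(handle, 4))
        if len(record) < 4:
            return
        yield record


def subsample_pairs(r1_records, r2_records, rate, seed=None):
    """Keep each read pair with probability `rate`, one draw per pair."""
    rng = random.Random(seed)
    for rec1, rec2 in zip(r1_records, r2_records):
        if rng.random() < rate:
            yield rec1, rec2
```

Because the decision is made per pair rather than per read, the output count is only approximately `rate` times the input, but read 1 and read 2 can never get out of sync.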

Alternatively, you may be able to filter for a specific barcode quickly using BBDuk, depending on where the barcode is. If, for example, you wanted the pairs where read1 starts with "ACGTTGCA":

bbduk.sh in=reads.fq outm=filtered.fq literal=ACGTTGCA mm=f rcomp=f k=8 skipr2 restrictleft=8
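
The logic of that filter can be sketched in Python as below. This only illustrates "keep the pair if read 1 starts with the barcode, exact match"; it says nothing about how BBDuk works internally. `starts_with_barcode` and `filter_pairs` are hypothetical names, and each read is assumed to be a (header, sequence, plus, quality) tuple.

```python
def starts_with_barcode(sequence, barcode):
    """True if the read begins with the barcode exactly (no mismatches)."""
    return sequence.upper().startswith(barcode.upper())


def filter_pairs(pairs, barcode):
    """Yield (read1, read2) pairs whose read 1 starts with `barcode`.

    Each read is a (header, sequence, plus, quality) tuple. Read 2 is
    never inspected; it simply travels with its mate.
    """
    for read1, read2 in pairs:
        if starts_with_barcode(read1[1], barcode):
            yield read1, read2
```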

**Lohman** · 06-08-2015, 07:12 PM

Hi Brian,

Thanks for the great suggestion. I used your reformat.sh without any problems, and I'm rerunning the multiplexing now.

I may further subset the data again after demultiplexing to even out the number of reads per sample, but this won't be nearly as big of an issue. It was the raw reads that were giving me trouble.
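
One way to cap every sample at the same number of reads after demultiplexing is reservoir sampling, which picks a fixed number of items uniformly at random from a stream without knowing its length in advance. The sketch below is an assumption about how that step could look, not something from the actual pipeline; `reservoir_sample` is a hypothetical helper.

```python
import random


def reservoir_sample(items, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Calling this once per sample with `k=1000` would give every sample the same depth, while samples with fewer than 1000 reads are returned whole.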

Thanks!

**dschika** · 06-09-2015, 12:41 AM

Originally posted by Lohman:

I have a working perl script for demultiplexing them, but the script takes upwards of 5 days to run.

Even though I don't know how large your sample is, 5 days for demultiplexing sounds very long, even for very large data sets. Maybe you could speed up the script?

Two questions, just out of curiosity:
- This depends on the analysis of course, but in general: isn't it possible to introduce a bias if you randomly sample reads from your raw data?
- Why do you need only 1000 reads per sample?

**Lohman** · 06-09-2015, 05:16 AM

Hi dschika,

Yes, the perl script could be faster, but it does a few other tasks while demultiplexing. It also generates an index of unique reads present in the sample, which is what really takes a long time. The longer the list of unique reads, the longer each check to see if a new read is already present in the index. I need this index for the next step.

I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.
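
A rough back-of-the-envelope check of that reasoning: if reads are treated as independent random draws and an allele accounts for a fraction f of an individual's reads, the chance of never seeing it in n reads is (1 - f)^n. Both the independence and the chosen fractions are assumptions made for illustration; real PCR amplification is rarely even. `prob_allele_missed` is a made-up helper name.

```python
def prob_allele_missed(read_fraction, n_reads):
    """Probability that an allele contributing `read_fraction` of the reads
    is absent from `n_reads` independent random reads."""
    return (1.0 - read_fraction) ** n_reads


# Ten equally represented alleles would each make up about 0.10 of the reads;
# the smaller fractions stand in for alleles that amplify poorly.
for fraction in (0.10, 0.01, 0.005):
    p = prob_allele_missed(fraction, 1000)
    print(f"fraction={fraction}: P(missed in 1000 reads) = {p:.3g}")
```

Under these assumptions an allele at 10% of the reads is essentially never missed at 1000 reads, and even one at 1% is missed only about 4 times in 100,000.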

**SNPsaurus** · 06-09-2015, 06:59 AM

Not to harp on the script, but 5 days does seem like a long time. I have a step similar to the one you describe when generating a pseudo-reference for nextRAD in large populations. In Perl-ish code, something like:

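# $indexB_length is negative, so substr() counts from the end of the string
my $indexA = substr($seq, 0, $indexA_length);               # leading barcode
my $indexB = substr($seq, $indexB_length);                  # trailing barcode
my $read   = substr($seq, $indexA_length, $indexB_length);  # grab the middle; negative length drops the tail
$unique{$indexA}{$indexB}{$read}++;                         # count of each unique read per barcode pair
$unique_all{$read}++;                                       # count of each unique read overall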

The biggest issue on a small computer is that reads with sequencing errors will make the hash of unique reads grow quite large given 100 million reads, but at 1000 reads per sample this would be done in a few minutes and not take much memory.

If you are already doing something like the above and it is taking 5 days, then you've got quite a large file indeed!
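
For comparison, here is the same idea as a Python sketch. Dictionary lookups take roughly constant time on average no matter how many unique reads have already been seen, which is what avoids the slowdown of checking each new read against a growing list. The function name `count_unique_reads` is hypothetical, and barcode lengths are passed in as plain positive numbers, which differs from the negative offset used in the Perl version.

```python
from collections import Counter, defaultdict


def count_unique_reads(sequences, index_a_length, index_b_length):
    """Split each sequence into (barcode A, insert, barcode B) and count.

    index_b_length is a positive number of bases taken from the end.
    """
    per_sample = defaultdict(Counter)  # (index A, index B) -> insert -> count
    overall = Counter()                # insert -> count across all samples
    for seq in sequences:
        split_at = len(seq) - index_b_length
        index_a = seq[:index_a_length]
        index_b = seq[split_at:]
        insert = seq[index_a_length:split_at]
        per_sample[(index_a, index_b)][insert] += 1
        overall[insert] += 1
    return per_sample, overall
```

The `per_sample` dictionary doubles as the demultiplexed result: each barcode pair maps directly to the unique reads seen for that sample and their counts.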

**dschika** · 06-09-2015, 09:37 AM

Indeed, if you are interested in only a handful of loci, that sounds reasonable. Thanks for the clarification.

Sub-setting fasta file by forward and reverse barcodes
