  • Subsetting a fasta file by forward and reverse barcodes

    Hi All,

    I have a large number of sequences in a fasta file, and each read contains both a forward and a reverse inline barcode. I have far more reads than necessary for my downstream application. I am looking for a way to subset the fasta file by both forward and reverse barcode (this is how sample ID is established). I only need 1000 reads per sample, but currently have many times that. I have a working Perl script for demultiplexing them, but the script takes upwards of 5 days to run, so I am looking for a way to subset the reads before demultiplexing. Any suggestions would be welcome.

    Thanks in advance!

  • #2
    You can randomly subsample with reformat like this:

    reformat.sh in=reads.fq out=sampled.fq samplerate=0.1


    ...which will reduce the data to 10%, or about 0.5 days' worth. If your paired reads are in 2 files, use the in2 and out2 flags; pairs will be kept together. Otherwise, if they're interleaved, that will be autodetected.
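    For example, a paired-file run would look like this (the file names here are just placeholders):

    reformat.sh in=reads_R1.fq in2=reads_R2.fq out=sampled_R1.fq out2=sampled_R2.fq samplerate=0.1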

    Alternatively, you may be able to filter for a specific barcode quickly using BBDuk, depending on where the barcode is. If, for example, you wanted the pairs where read1 starts with "ACGTTGCA":

    bbduk.sh in=reads.fq out=filtered.fq literal=ACGTTGCA mm=f rcomp=f k=8 skipr2 restrictleft=8
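    And since each of your reads also carries a reverse barcode at the other end, one possible extension is a second pass restricted to the rightmost bases with restrictright. This is just a sketch, with a made-up reverse barcode sequence:

    bbduk.sh in=filtered.fq out=both.fq literal=TTGCACGT mm=f rcomp=f k=8 skipr2 restrictright=8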



    • #3
      Hi Brian,

      Thanks for the great suggestion. I used your reformat.sh without any problems, and I'm rerunning the demultiplexing now.

      I may further subset the data again after demultiplexing to even out the number of reads per sample, but that won't be nearly as big an issue. It was the raw reads that were giving me trouble.

      Thanks!



      • #4
        Originally posted by Lohman View Post
        I have a working Perl script for demultiplexing them, but the script takes upwards of 5 days to run.
        Even though I don't know how large your sample is, 5 days for demultiplexing sounds very long, even for very large data sets. Maybe you could speed up the script?

        Two questions just for curiosity:
        - This depends on the analysis of course, but in general: is it not possible to introduce a bias if you randomly sample reads from your raw data?
        - Why do you need only 1000 reads per sample?



        • #5
          Hi dschika,

          Yes, the Perl script could be faster, but it does a few other tasks while demultiplexing. It also generates an index of unique reads present in the sample, which is what really takes a long time: the longer the list of unique reads, the longer it takes to check whether a new read is already present in the index. I need this index for the next step.

          I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.



          • #6
            Not to harp on the script, but 5 days does seem like a long time. I have a similar step to the one you describe when generating a pseudo-reference for nextRAD in large populations. In Perl-ish code, something like:

            $indexA = substr($seq, 0, $indexA_length);              # forward barcode at the start of the read
            $indexB = substr($seq, $indexB_length);                 # $indexB_length is negative, so this takes the end of the string
            $read   = substr($seq, $indexA_length, $indexB_length); # grab the middle; a negative LENGTH trims that many bases off the end
            $unique{$indexA}{$indexB}{$read}++;                     # unique reads per barcode pair
            $unique_all{$read}++;                                   # unique reads overall

            The biggest issue on a small computer is that reads containing sequencing errors will grow the hash of unique reads quite large given hundreds of millions of reads, but at 1000 reads per sample this would finish in a few minutes and not take much memory.

            If you are already doing something like the above and it is taking 5 days, then you've got quite a large file indeed!
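            For reference, here is a minimal self-contained sketch of that loop over a FASTA file. The file name and barcode lengths are placeholders, and it assumes fixed-length inline barcodes at both ends of each read:

            #!/usr/bin/env perl
            use strict;
            use warnings;

            # Tally unique inner sequences per barcode pair from a FASTA file.
            # File name and barcode lengths below are placeholders.
            my $fasta         = 'reads.fasta';
            my $indexA_length = 8;    # forward barcode length
            my $indexB_length = -8;   # negative: counts back from the end of the read

            my (%unique, %unique_all);

            sub tally {
                my ($seq) = @_;
                return if length($seq) <= $indexA_length + abs($indexB_length);
                my $indexA = substr($seq, 0, $indexA_length);              # forward barcode
                my $indexB = substr($seq, $indexB_length);                 # reverse barcode
                my $read   = substr($seq, $indexA_length, $indexB_length); # middle of the read
                $unique{$indexA}{$indexB}{$read}++;
                $unique_all{$read}++;
            }

            open my $fh, '<', $fasta or die "Cannot open $fasta: $!";
            my $seq = '';
            while (my $line = <$fh>) {
                chomp $line;
                if ($line =~ /^>/) {       # header line starts a new record
                    tally($seq) if length $seq;
                    $seq = '';
                } else {
                    $seq .= $line;         # sequence may span multiple lines
                }
            }
            tally($seq) if length $seq;    # don't forget the last record
            close $fh;

            # Report the number of distinct inner sequences per barcode pair.
            for my $a (sort keys %unique) {
                for my $b (sort keys %{ $unique{$a} }) {
                    printf "%s\t%s\t%d\n", $a, $b, scalar keys %{ $unique{$a}{$b} };
                }
            }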
            Providing nextRAD genotyping and PacBio sequencing services. http://snpsaurus.com



            • #7
              Originally posted by Lohman View Post
              I would imagine that you could introduce bias by sub-sampling too deeply, but as you point out, it depends on the analysis. In this case I'm interested in only a handful of loci (5) in hundreds of individuals (I'm not cutting with restriction enzymes, but using PCR primers which amplify alleles at multiple loci). Therefore, if there are a maximum of 10 different alleles, it shouldn't hurt anything to only use ~1000 reads per individual.
              Indeed, if you are interested in only a handful of loci, that sounds reasonable. Thanks for the clarification.

