**GenoMax** · 04-05-2016, 06:03 AM

Take a look at @Brian's post in this thread.

Edit: Looking at @Brian's post again it seems as if the paired information would not be kept. If you have original fastq files available you should be able to retrieve the other read from a pair by using repair.sh.

**fanli** · 04-05-2016, 06:47 AM

seqtk does this and will keep read pairs intact if you pass the same random seed:

GitHub - lh3/seqtk: Toolkit for processing sequences in FASTA/Q formats

https://github.com/lh3/seqtk

Code:

seqtk sample -s100 read1.fq 10000 > sub1.fq
seqtk sample -s100 read2.fq 10000 > sub2.fq
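Why the shared seed matters can be illustrated with a small Python sketch. This is not seqtk's actual implementation (seqtk uses its own random number generator and sampling routine); it only shows the general idea, and it assumes both files hold the same number of records in the same order. The read names and the helper `sample_indices` are made up for the illustration.

```python
import random

def sample_indices(n_records, k, seed):
    """Pick k record positions out of n_records using a seeded RNG."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_records), k))

# Hypothetical toy data: read names stand in for full FASTQ records.
read1 = [f"read{i}/1" for i in range(20)]
read2 = [f"read{i}/2" for i in range(20)]

# Same seed and same record count, so the same positions are drawn twice.
idx1 = sample_indices(len(read1), 5, seed=100)
idx2 = sample_indices(len(read2), 5, seed=100)

sub1 = [read1[i] for i in idx1]
sub2 = [read2[i] for i in idx2]

# Every sampled read in sub1 sits next to its mate in sub2.
pairs_intact = all(a[:-2] == b[:-2] for a, b in zip(sub1, sub2))
print(pairs_intact)
```

If the two files differ in length or order (for example after filtering only one of them), the positions no longer correspond and the pairing breaks, which is why the seed trick is only safe on untouched paired files.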

**EcoRInya** · 04-05-2016, 06:51 AM

GenoMax, thank you for your reply!
Do you know what happens to the total read count after this repair? I don't quite understand that. If I just complete all the read pairs, it will obviously change the total number of reads I am getting and skew the downsampling.

**EcoRInya** · 04-05-2016, 06:59 AM

thank you, fanli!
I am not sure that downsampling a fastq file is a good idea. I use downsampling for normalisation, and if I downsample the initial fastq files I don't know how many reads I will get back after the alignment is done, which is not suitable for normalisation. However, I could convert the bam with uniquely aligned reads (after PCR duplicate removal) to fastq and use seqtk. I am not sure that it is the optimal solution, though. But I will think about it.

**GenoMax** · 04-05-2016, 07:02 AM

"Repair.sh" (re-pair : a trick name) should only recover corresponding reads to the ones that are present in the downsampled file.

I assume you are doing this from the BAM because you only want to sample reads that aligned (your BAM does not have unmapped reads?). If you don't care about the alignments then you could downsample the original fastq files by using reformat.sh or the seqtk method @fanli posted above.

**EcoRInya** · 04-05-2016, 07:28 AM

>"Repair.sh" (re-pair : a trick name) should only recover corresponding reads to the ones that are present in the downsampled file.

But if I am doing it for several files, the ratio of incomplete pairs will be random and different for each file. It means that if I add the missing mates to the reads in the downsampled file, I will potentially get a different number of reads in each file and skew the downsampling (and the subsequent normalisation).
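A quick arithmetic sketch of this concern, with made-up numbers: if two files are downsampled to the same read count but contain different fractions of orphaned reads, restoring the missing mates leaves them at different totals. The orphan fractions and the helper `reads_after_repair` below are purely hypothetical.

```python
def reads_after_repair(n_sampled, orphan_fraction):
    """Read count once every orphan in the sample gets its mate back."""
    orphans = round(n_sampled * orphan_fraction)
    return n_sampled + orphans

target = 1_000_000  # hypothetical downsampling target, in reads

# Hypothetical orphan fractions for two files after downsampling.
orphan_fractions = {"sample_A": 0.05, "sample_B": 0.20}

totals = {
    name: reads_after_repair(target, frac)
    for name, frac in orphan_fractions.items()
}

for name, total in totals.items():
    print(name, total)
```

Both files start at the same target, yet they end up at different totals after re-pairing, so the normalisation no longer holds.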

>I assume you are doing this from the BAM because you only want to sample reads that aligned (your BAM does not have unmapped reads?).

Exactly!

**GenoMax** · 04-05-2016, 09:17 AM

Can you clarify what exactly you are trying to achieve? Are these separate files/samples, or multiple files from the same sample? Your BAM files don't have unmapped reads?

**EcoRInya** · 04-05-2016, 10:14 AM

Sure! I am downsampling bam files that contain uniquely aligned reads with PCR duplicates removed. In total I have 6 bam files (2 conditions, 3 replicates each). For each file I have a normalisation coefficient which I want to use as a downsampling factor, bringing each bam file to a specific read count. I will be using the resulting downsampled bam files for different kinds of comparative analysis between groups, for instance with the diffReps package. It takes bed files as input. For paired-end reads these bed files should contain the position of the centre of each fragment. In order to create such a file, all the reads in the bam file should be paired, which I have failed to achieve using the standard samtools view -s.
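One possible way to approach this, sketched in plain Python under stated assumptions: group reads by name, discard orphans first, then sample whole pairs until the target read count is reached, so the final count is known and every read has its mate. The tuple layout `(read_name, chrom, start, end)`, the function names, and the toy coordinates are all hypothetical; in practice the records would come from the BAM file (for example via a library such as pysam), and this is not how samtools or diffReps do it internally.

```python
import random
from collections import defaultdict

def downsample_pairs(records, target_reads, seed=0):
    """Keep complete pairs only, then sample pairs down to target_reads reads.

    records: iterable of (read_name, chrom, start, end) tuples
             (a hypothetical layout chosen for this sketch).
    """
    by_name = defaultdict(list)
    for rec in records:
        by_name[rec[0]].append(rec)
    # Drop orphans: only names seen exactly twice form a complete pair.
    complete = sorted(n for n, mates in by_name.items() if len(mates) == 2)
    n_pairs = min(target_reads // 2, len(complete))
    rng = random.Random(seed)
    chosen = rng.sample(complete, n_pairs)
    return [tuple(by_name[name]) for name in chosen]

def fragment_centre(mate_a, mate_b):
    """Midpoint of the fragment spanned by two mates on one chromosome."""
    start = min(mate_a[2], mate_b[2])
    end = max(mate_a[3], mate_b[3])
    return (start + end) // 2

# Toy records; r2 is an orphan whose mate was filtered out.
records = [
    ("r1", "chr1", 100, 150), ("r1", "chr1", 300, 350),
    ("r2", "chr1", 500, 550),
    ("r3", "chr1", 1000, 1050), ("r3", "chr1", 1180, 1230),
    ("r4", "chr2", 40, 90), ("r4", "chr2", 200, 250),
]

pairs = downsample_pairs(records, target_reads=4, seed=1)
for mate_a, mate_b in pairs:
    print(mate_a[0], mate_a[1], fragment_centre(mate_a, mate_b))
```

Because orphans are removed before sampling, the output always contains an even number of reads, at most `target_reads`. The sketch does not handle mates that map to different chromosomes.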

downsampling read pairs from a bam file
