  • Too many forward only surviving unpaired reads after trimmomatic (RNA-Seq data)

    This is a different issue than: http://seqanswers.com/forums/showthread.php?t=47604
    FastQC shows all bases are > Q30 on both forward and reverse fastq files. (See attached)

    I have some older 50 bp PE human data from a collaborator. They used CloneTech's SMARTer primers for polyA enrichment, and the data contains a lot of contaminating sequences: Illumina adapters, CloneTech oligo-dT primers, and polyA/polyT reads (all 50 bases Ts or As). After using Trimmomatic to remove all four types of contaminants, only half of my read pairs remain, while ~40% of reads are now unpaired.

    Note: no quality trimming was done. However, prior to trimming, I removed rRNA sequences using SortMeRNA (~2-3% of reads removed).

    Here are the results:
    TrimmomaticPE: Started with arguments: -threads 12 -phred33 <2 inputs> <4 outputs>
    ILLUMINACLIP:/usr/local/bioinf/trimmomatic/adapters/custom_a.fa:2:30:10 MINLEN:36

    Input Read Pairs: 55,723,516
    Both Surviving: 29,553,189 (53.04%)
    Forward Only Surviving: 19,691,236 (35.34%)
    Reverse Only Surviving: 2,350,462 (4.22%)
    Dropped: 4,128,629 (7.41%)

    It seems strange that the reverse-only surviving reads are so few compared to the forward-only surviving reads... If I remove only the CloneTech primers and adapters, leaving the polyA and polyT sequences out of the custom.fa specified for ILLUMINACLIP:

    Input Read Pairs: 55,723,516
    Both Surviving: 29,583,888 (53.09%)
    Forward Only Surviving: 22,506,092 (40.39%)
    Reverse Only Surviving: 3,170,437 (5.69%)
    Dropped: 463,099 (0.83%)

    Fewer reads are dropped and more reads are piled into the forward-only surviving reads, but the paired surviving count remains almost the same.

    What could lead to so many unpaired reads, and mostly on the forward read? I'd hate to toss out 35-40% of my data if I can help it. If I have to, is there any way I could incorporate these unpaired reads into my downstream analysis? Or is there something seriously wrong with my pipeline? Thanks in advance.
    Attached Files

  • #2
    I would say there's a problem with your methodology. You should expect adapters at the same position in read 1 as in read 2... adapter-trimming and contaminant filtering are different operations, and should be done in different steps.

    First do adapter trimming with adapter sequence, trimming to the same position in each read (you can do this with BBDuk using the "tpe" flag). Then remove all reads that contain contaminants, rather than trimming them... you can do that with BBDuk as well. I don't see much point in removing reads that are 100% poly-A or poly-T, though, as they won't map anyway.
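    The two-step workflow described above can be sketched as shell commands. This is a minimal sketch with hypothetical file names (r1.fq, adapters.fa, contaminants.fa, etc. are placeholders, not files from this thread); the flags are standard BBDuk options:

    ```shell
    # Step 1 (sketch): adapter-trim, cutting both reads of a pair to the same position.
    # "tpe" trims pairs to equal lengths; "tbo" also trims by pair overlap.
    bbduk.sh in1=r1.fq in2=r2.fq out1=trim_r1.fq out2=trim_r2.fq \
        ref=adapters.fa ktrim=r k=28 mink=12 hdist=1 tpe tbo minlen=36

    # Step 2 (sketch): filter out (rather than trim) reads matching contaminants;
    # matching reads are diverted to the outm/outm2 streams.
    bbduk.sh in1=trim_r1.fq in2=trim_r2.fq out1=clean_r1.fq out2=clean_r2.fq \
        outm=contam_r1.fq outm2=contam_r2.fq ref=contaminants.fa k=25 hdist=1
    ```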



    • #3
      Originally posted by Brian Bushnell View Post
      I would say there's a problem with your methodology. You should expect adapters at the same position in read 1 as in read 2... adapter-trimming and contaminant filtering are different operations, and should be done in different steps.

      First do adapter trimming with adapter sequence, trimming to the same position in each read (you can do this with BBDuk using the "tpe" flag). Then remove all reads that contain contaminants, rather than trimming them... you can do that with BBDuk as well. I don't see much point in removing reads that are 100% poly-A or poly-T, though, as they won't map anyway.
      Thanks for the suggestions. I'll give it a shot and report back.



      • #4
        Hi Brian, I tried your suggestions. Here are the results in case someone else has a similar issue. First, I separated the adapter-trimming step from the contaminant-removal step. Just as a comparison, I wanted to try BBDuk vs. Trimmomatic for the adapter-trimming step. I cleaned up the results a bit to make them easier to read.

        Trimmomatic adapter trimming:

        time java -Xmx27g -jar $TRIMMOMATICDIR/trimmomatic-0.32.jar PE -threads 12 -phred33 $DATADIR/test_R1.fastq $DATADIR/test_R2.fastq $DATADIR/test_trim/test_trimmomatic_R1.fastq $DATADIR/test_trim/test_trimmomatic_unpaired_R1.fastq $DATADIR/test_trim/test_trimmomatic_R2.fastq $DATADIR/test_trim/test_trimmomatic_unpaired_R2.fastq ILLUMINACLIP:"$TRIMMOMATICDIR/adapters/TruSeq3-PE-2.fa":2:30:10 MINLEN:36

        Input Read Pairs: 55723516
        Both Surviving: 55567661 (99.72%)
        Forward Only Surviving: 2203 (0.00%)
        Reverse Only Surviving: 130983 (0.24%)
        Dropped: 22669 (0.04%)

        real 4m42.470s
        user 16m35.382s
        sys 1m10.948s
        BBDuk Adapter Trimming:

        time $BBMAPDIR/bbduk.sh -Xmx27g in1=$DATADIR/test_R1.fastq in2=$DATADIR/test_R2.fastq \
        out1=$DATADIR/test_trim/test_bbduk_R1.fastq out2=$DATADIR/test_trim/test_bbduk_R2.fastq \
        ref=$BBMAPDIR/resources/truseq.fa.gz ktrim=r k=28 mink=12 hdist=1 tpe=t tbo=t minlength=36

        Input: 111447032 reads
        KTrimmed: 315462 reads (0.28%)
        Trimmed by overlap: 0 reads (0.00%)
        Result: 111135074 reads (99.72%) / 2 = 55567537

        Time: 359.990 seconds.
        Reads Processed: 111m 309.58k reads/sec
        Bases Processed: 5572m 15.48m bases/sec

        real 6m1.432s
        user 19m13.116s
        sys 1m14.501s
        Both seem to have worked great. BBDuk took slightly longer but ended up with a very similar number of reads remaining. Adapters weren't a large proportion of the overrepresented sequences (per FastQC) to begin with.

        Removing the clonetech primer contaminants

        Since Trimmomatic doesn't have an option for removing contaminating sequences, I just used BBDuk for this one.

        clonetech.fa:
        >SMARTerIIAOligonucleotide
        AAGCAGTGGTATCAACGCAGAGTACNNNNN
        >3SMARTCDSPrimerIIA
        AAGCAGTGGTATCAACGCAGAGTACNNNNN
        >3SMARTCDSPrimerIIA_consensus
        AAGCAGTGGTATCAACGCAGAGTAC
        >SmartISPCRPrimer
        AAGCAGTGGTATCAACGCAGAGT
        time $BBMAPDIR/bbduk.sh -Xmx27g in1=$DATADIR/test_trim/test_bbduk_R1.fastq in2=$DATADIR/test_trim/test_bbduk_R2.fastq out1=$DATADIR/test_rm_contam/test_bbduk_unmatched_R1.fastq out2=$DATADIR/test_rm_contam/test_bbduk_unmatched_R2.fastq outm1=$DATADIR/test_rm_contam/test_bbduk_matched_R1.fastq outm2=$DATADIR/test_rm_contam/test_bbduk_matched_R2.fastq ref=$DATADIR/test_rm_contam/clonetech.fa k=25 hdist=1 stats=$SEBADIR/test_rm_contam/rm_contam_stats.txt

        Input: 111135074 reads
        Contaminants: 50448064 reads (45.39%)
        Result: 60687010 reads (54.61%)

        Time: 350.589 seconds.
        Reads Processed: 111m 317.00k reads/sec
        Bases Processed: 5556m 15.85m bases/sec

        Statsfile:
        #Total 111135074
        #Matched 25422697 22.87549%
        #Name Reads ReadsPct
        SMARTerIIAOligonucleotide 25422697 22.87549%

        real 5m52.338s
        user 11m23.119s
        sys 1m44.755s
        I am still losing ~half my reads, but my matched forward & reverse read files are equal in size now (rather than forward-heavy), probably due to using the tpe flag. Another interesting point is that, in the stats file, only half of the removed reads actually matched the contaminants: 22.87% matched vs. 45.39% removed.

        Since these samples are enriched with oligo-dT primers, do you think that maybe only one read of the pair would contain the contaminating primers? In the FastQC report, I found overrepresented sequences that look like this:
        'AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT'
        Say these primers only attach to the polyA side of the insert; do you think the read's mate, covering the opposite end of the insert, might contain usable data?



        • #5
          @BrianU: You should be able to provide the clonetech.fa file to Trimmomatic as well. The only difference is that you will also need to include the reverse complements of the sequences in your file (see the relevant section of the Trimmomatic manual).

          Contaminant separation is not supported by Trimmomatic, but it should still be able to trim the CloneTech sequences. Isn't that correct?
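          For what it's worth, the reverse complements can be generated on the command line; a quick sketch using rev and tr on one of the primer sequences from the clonetech.fa above:

          ```shell
          # Reverse-complement a primer: reverse the string, then complement each base.
          printf 'AAGCAGTGGTATCAACGCAGAGT\n' | rev | tr 'ACGT' 'TGCA'
          # prints: ACTCTGCGTTGATACCACTGCTT
          ```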
          Last edited by GenoMax; 05-07-2015, 10:39 AM.



          • #6
            This is quite interesting; I've never seen a situation where BBDuk was slower than Trimmomatic. I imagine it's only using 1 thread or something... you can force it to use 12 threads with the "t=12" flag, but normally it autodetects the number of cores available.

            As for the 22.87% vs. 45.39%: that's because pairs are always kept together, and if one read is flagged as a contaminant, its mate is thrown away as well (the "tpe" flag is ignored in filtering; it's only used in trimming). This is controlled by the flag "removeifeitherbad" (rieb). You can set "rieb=f", and then a pair will only be removed if both reads are contaminants.

            Now that you have the contaminant pairs in the outm stream, you can process them individually (rather than as pairs) to recover just the non-contaminant singletons. But if these contaminants are in fact primers that should theoretically be attached to real genomic sequence, then you should probably be trimming them rather than removing them. BBDuk supports left-trimming and right-trimming, so if you know which part of the read should be genomic when you detect primer at some location, you should be able to salvage it. Can you explain (perhaps with a nice ASCII diagram) exactly what the structure of the molecule/read is when these primer sequences are present?
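            Both options can be sketched as commands (file names are placeholders; rieb and outm are the BBDuk flags discussed above):

            ```shell
            # Option 1 (sketch): only discard a pair if BOTH reads match a contaminant.
            bbduk.sh in1=r1.fq in2=r2.fq out1=clean_r1.fq out2=clean_r2.fq \
                ref=clonetech.fa k=25 hdist=1 rieb=f

            # Option 2 (sketch): re-run the matched (outm) files one at a time, so the
            # non-contaminant mate of each pair is recovered as a singleton.
            bbduk.sh in=matched_r1.fq out=singleton_r1.fq ref=clonetech.fa k=25 hdist=1
            bbduk.sh in=matched_r2.fq out=singleton_r2.fq ref=clonetech.fa k=25 hdist=1
            ```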



            • #7
              I'll give the other suggestions a try, but here is my attempt to explain the enrichment step; maybe it will help. From what I understand, the CloneTech SMARTer II kit is supposed to let you get more complete transcripts (an increased chance of capturing the 5' end). This is how I interpreted the manual:

              Key:
              '...' - just spacers
              Ns - any bases in the transcript
              Xs - the overhang bases (I don't know if they are polyAs too)

              First step:
              5'- ................NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA......... -3' Starting mRNA (w/ polyA tail)
              3'- ..............................................TTTTTTTTTTT[PRIMER1] -5' Prime with SMART CDS Primer containing oligo dT

              Second step:
              The kit contains a special reverse transcriptase that tacks on a few extra bases (an overhang) after it's done.
              5'- ................NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA......... -3'
              3'- .........XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5' Create 1st Strand + overhang

              Third step:
              5'- [PRIMER2]XXXXXXX.................................................. -3' Prime SMARTer II A oligo using overhang
              3'- .........XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5'

              Fourth step:
              5'- [PRIMER2]XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA[PRIMER1] -3' Complete 2nd Strand
              3'- .........XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5'

              Final step:
              5'- [PRIMER2]XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAAAAAAAAAAA[PRIMER1] -3'
              3'- [PRIMER2]XXXXXXXNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTTTTTTTTTT[PRIMER1] -5' Complete 3rd strand
              This final cDNA product is then sheered into smaller pieces and used in subsequent Illumina library prep steps. So I am not sure where does the overrepresented sequence of primer + polyT comes from?

              'AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT'
              Unless the 5' overhang (the Xs) is also polyA. Of all the CloneTech primer sequences provided, only SMARTerIIAOligonucleotide was matched (shown in the stats.txt) by BBDuk. However, the other primer sequences are very similar. I wonder if it matched mainly because it was first in the list...

              Edit: Oops, it makes sense now once you orient the primer properly. The overrepresented sequence is from the 3rd strand illustrated above.

              This: 5' -AAGCAGTGGTATCAACGCAGAGTACTTTTTTTTTTTTTTTTTTTTTTTTT- 3'
              Is actually the same as this: 3' - TTTTTTTTTTT[PRIMER1] -5'
              Last edited by BrianU; 05-08-2015, 09:31 AM. Reason: clarity



              • #8
                BBDuk will match the first sequence in the list if two sequences contain identical kmers. A related program, Seal, will match both of them instead. Seal does not do trimming, but it does do filtering and can create a stats file.

                My impression from your diagram is that when one read in a pair contains primer plus poly-A/T, the other read should probably still be useful. Furthermore, the remainder of the read containing the primer should be useful too, and you should trim it with "ktrim=l" to do a left trim, if I'm reading it correctly (but do NOT include "tpe" or "tbo"; those are just for the adapter-trimming step).



                • #9
                  Originally posted by Brian Bushnell View Post
                  My impression from your diagram is that when one read in a pair contains primer plus poly-A/T, the other read should probably still be useful. Furthermore, the remainder of the read containing the primer should be useful too, and you should trim it with "ktrim=l" to do a left trim, if I'm reading it correctly (but do NOT include "tpe" or "tbo"; those are just for the adapter-trimming step).
                  I think I may have come full circle. In my first post, I used Trimmomatic to trim both the Illumina adapters and the CloneTech primers in one step (granted, by combining both steps, I couldn't diagnose the true problem). Once I trim the CloneTech primers, and if it turns out the singleton reads are usable, what would I do with the unpaired reads? How would I incorporate them into the analysis?



                  • #10
                    If you right-trim adapter sequences, then left-trim clonetech primers in a second step, you should be left with surviving pairs, not singletons (depending on read length). Either way, mixing singletons and pairs for analysis should not really be a problem for most tools, I would think. But it depends on what analysis you are doing.
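                    One common way to mix them, sketched under the assumption that the aligner accepts unpaired input (TopHat2 does) and that samtools is available (all file names here are placeholders): align the pairs and the singletons in separate runs, then merge the resulting BAMs.

                    ```shell
                    # Align pairs and singletons separately, then merge the BAMs (sketch).
                    tophat2 -o pairs_out genome_index paired_R1.fastq paired_R2.fastq
                    tophat2 -o singles_out genome_index unpaired.fastq
                    samtools merge merged.bam pairs_out/accepted_hits.bam singles_out/accepted_hits.bam
                    ```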



                    • #11
                      That makes sense. Thanks, Brian. I'll give it a shot.



                      • #12
                        Brian, in case you were curious: here is the Illumina adapter-trimming speed of BBDuk from above, after forcing it to use 12 threads. Full output included for diagnostic purposes.

                        Original run:
                        time $BBMAPDIR/bbduk.sh -Xmx27g in1=$DATADIR/test_R1.fastq in2=$DATADIR/test_R2.fastq \
                        out1=$DATADIR/test_trim/test_bbduk_R1.fastq out2=$DATADIR/test_trim/test_bbduk_R2.fastq \
                        ref=$BBMAPDIR/resources/truseq.fa.gz ktrim=r k=28 mink=12 hdist=1 tpe=t tbo=t minlength=36

                        BBDuk version 34.92
                        maskMiddle was disabled because useShortKmers=true
                        Initial:
                        Memory: free=27348m, used=435m

                        Added 64770 kmers; time: 0.200 seconds.
                        Memory: free=26333m, used=1450m

                        Input is being processed as paired
                        Started output streams: 0.027 seconds.
                        Processing time: 359.725 seconds.

                        Input: 111447032 reads 5572351600 bases.
                        KTrimmed: 315462 reads (0.28%) 15640564 bases (0.28%)
                        Trimmed by overlap: 0 reads (0.00%) 0 bases (0.00%)
                        Result: 111135074 reads (99.72%) 5556711036 bases (99.72%)

                        Time: 359.990 seconds.
                        Reads Processed: 111m 309.58k reads/sec
                        Bases Processed: 5572m 15.48m bases/sec

                        real 6m1.432s
                        user 19m13.116s
                        sys 1m14.501s
                        Setting t = 12:

                        time $BBMAPDIR/bbduk.sh -Xmx27g t=12 in1=$DATADIR/test_R1.fastq in2=$DATADIR/test_R2.fastq \
                        out1=$DATADIR/test_trim/test_bbduk_R1.fastq out2=$DATADIR/test_trim/test_bbduk_R2.fastq \
                        ref=$BBMAPDIR/resources/truseq.fa.gz ktrim=r k=28 mink=12 hdist=1 tpe=t tbo=t minlength=36


                        BBDuk version 34.92
                        Set threads to 12
                        maskMiddle was disabled because useShortKmers=true
                        Initial:
                        Memory: free=27348m, used=435m

                        Added 64770 kmers; time: 0.575 seconds.
                        Memory: free=26478m, used=1305m

                        Input is being processed as paired
                        Started output streams: 0.197 seconds.
                        Processing time: 310.076 seconds.

                        Input: 111447032 reads 5572351600 bases.
                        KTrimmed: 315462 reads (0.28%) 15640564 bases (0.28%)
                        Trimmed by overlap: 0 reads (0.00%) 0 bases (0.00%)
                        Result: 111135074 reads (99.72%) 5556711036 bases (99.72%)

                        Time: 310.945 seconds.
                        Reads Processed: 111m 358.41k reads/sec
                        Bases Processed: 5572m 17.92m bases/sec

                        real 5m12.763s
                        user 16m1.380s
                        sys 1m9.628s



                        • #13
                          Thanks; looks like it was probably already using (approximately) 12 threads and the time difference was just due to system variability. On my PC, with 4 threads it normally runs at over 60 megabases/s with that kind of command. I suspect that you may be I/O limited.



                          • #14
                            Originally posted by Brian Bushnell View Post
                            I suspect that you may be I/O limited
                            Ah... good point. I have limited local disk space, so all my data is processed on a large mounted NFS drive.



                            • #15
                              Here is my data on trimming the CloneTech primers (rather than treating them as contaminants). I varied the minimum length filter across multiple runs to see if I would get different results. The CloneTech primers are ~25 nucleotides long and the max read length is 50, so I chose to try 25 nt, 30 nt, and 36 nt. I'd like to keep the longer reads if I can, since they will map more uniquely. I made a summary table of the results; SeqAnswers doesn't have BBCode tables enabled, so I just took a screenshot of my Excel sheet (see attached). I also tried two of the three lengths in Trimmomatic for comparison. Here are two examples of the parameters:

                              BBDuk left trim
                              time $BBMAPDIR/bbduk.sh -Xmx27g t=12 \
                              in1=$DATADIR/test_trim/test_bbduk_R1.fastq \
                              in2=$DATADIR/test_trim/test_bbduk_R2.fastq \
                              out1=$DATADIR/ct_test_trim/test_bbduk_ct36_R1.fastq \
                              out2=$DATADIR/ct_test_trim/test_bbduk_ct36_R2.fastq \
                              ref=clonetech.fa ktrim=l k=24 mink=12 hdist=1 minlength=36

                              BBDuk version 34.92
                              Set threads to 12
                              maskMiddle was disabled because useShortKmers=true
                              Initial:
                              Memory: free=27348m, used=435m

                              Added 1430 kmers; time: 0.379 seconds.
                              Memory: free=26333m, used=1450m

                              Input is being processed as paired
                              Started output streams: 0.207 seconds.
                              Processing time: 239.172 seconds.

                              Input: 111135074 reads 5556711036 bases.
                              KTrimmed: 51351140 reads (46.21%) 2566637949 bases (46.19%)
                              Result: 59808372 reads (53.82%) 2990073087 bases (53.81%)

                              Time: 239.844 seconds.
                              Reads Processed: 111m 463.36k reads/sec
                              Bases Processed: 5556m 23.17m bases/sec

                              real 4m1.715s
                              user 12m41.120s
                              sys 1m23.517s
                              Trimmomatic

                              time java -jar $TRIMMOMATICDIR/trimmomatic-0.32.jar PE -threads 12 -phred33 \
                              $DATADIR/test_trim/test_trimmomatic_R1.fastq \
                              $DATADIR/test_trim/test_trimmomatic_R2.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_R1.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_unpaired_R1.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_R2.fastq \
                              $DATADIR/ct_test_trim/test_trimmomatic_ct36_unpaired_R2.fastq \
                              ILLUMINACLIP:"clonetech.fa":2:30:10 MINLEN:36

                              Using Medium Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGT'
                              Using Long Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTACNNNNN'
                              Using Long Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTAC'
                              Skipping duplicate Clipping Sequence: 'AAGCAGTGGTATCAACGCAGAGTACNNNNN'
                              ILLUMINACLIP: Using 0 prefix pairs, 3 forward/reverse sequences,
                              0 forward only sequences, 0 reverse only sequences

                              Input Read Pairs: 55,567,661
                              Both Surviving: 30,045,460 (54.07%)
                              Forward Only Surviving: 22,460,833 (40.42%)
                              Reverse Only Surviving: 2,850,609 (5.13%)
                              Dropped: 210,759 (0.38%)
                              TrimmomaticPE: Completed successfully

                              real 3m55.694s
                              user 10m21.251s
                              sys 0m54.051s
                              Some observations:

                              As I expected, I get many more surviving reads after trimming with BBDuk at the more relaxed read-length cutoff of 25 nt, likely due to the CloneTech primer size (~25 nt). Although, checking FastQC, most of those extra surviving trimmed reads were polyTs. At minlen=36 and 30, the out1 (forward) and out2 (reverse) files are small (I lost ~half the reads) but equal in size. However, at minlen=25 both files are close to double in size, and the forward read file is bigger than the reverse.

                              Varying the minimum length between 25 and 36 didn't make a difference in Trimmomatic, but I suspect I may have run it wrong...? The file sizes turned out to be the same as the out1/out2 files from BBDuk when the minlens were stricter, at 30 and 36 nt. But the unpaired reads are both kept by Trimmomatic (the forward unpaired file is huge compared to the reverse).

                              Is a 25nt read long enough for decent mapping?

                              I aligned 3 pairs of files with TopHat2 to see the differences. I'll post that next. Thanks for your help, guys.
                              Attached Files
