Dear Experts,
Please accept my apologies if this has been posted elsewhere. I am new to the analysis of RNA-seq data, and I am confused about trimming adapters from my FASTQ files using cutadapt. I have read through some of the posts, but they have left me more confused!
The details of my RNA-seq data are as follows:
- The platform is Illumina, TruSeq
- The FASTQ files are paired-end (so I have an R1.fastq and an R2.fastq for each of my samples). It is unknown which of R1 and R2 represents the 'forward' or 'reverse' reads.
- The files have been demultiplexed, so I have a barcode per sample which matches a specific barcode in a corresponding indexed adapter.
- I have been provided with a Universal adapter and 5'-3' indexed adapters. I have checked the indexed adapters and they are all exactly identical except at the 6bp barcode in the middle of the sequence.
Please kindly help me with the following:
1. I am still trying to understand how Illumina TruSeq works but, in principle, should the trimming be done at the 3' end only, or also at the 5' end of the read? Or is it that only the Universal Adapter should be trimmed at the 5' end, and the indexed adapters at the 3' end?
NB1: Read length is 101 bp as observed in FastQC. This was expected in the experimental setup but makes me wonder if I have any adapters to begin with.
NB2: I have used FastQC to look at a sample of my data (around 198,000 sequences). I didn't find any overrepresented sequences, but I did find increased 5-mer representation in the first 10 base pairs of my reads (which I am assuming to be the 5' end?). There are also more GC fluctuations in those first 10 bp.
2. What is the minimum overlap that is effective to constitute a 'match' between the adapter and the read? Cutadapt has a default value of 3, but wouldn't that necessarily promote 'false matching' as well and lead to trimming of sequences that don't have the adapter? I am considering a higher cutoff for the overlap, say 5 bp, given the k-mer overrepresentation observed in FastQC.
3. When providing the adapter sequences, given that the indexed adapters differ only at the barcode, is it still prudent to provide the entire sequence of each indexed adapter, in addition to the entire sequence of the universal adapter? What is the bare minimum sequence people have provided for their adapters, both indexed and universal? Does it make a difference?
4. I am assuming that the same indexed 5'-3' adapter is provided when trimming from both the R1 and R2 reads. I have not attempted to trim the reverse complement or the reversed sequence from either R1 or R2. If I am mistaken in this approach please correct me!
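To put the concern in question 2 into rough numbers, here is a back-of-the-envelope sketch. It assumes every base is equally likely and independent of its neighbours, which real RNA-seq reads are not, so it is only an order-of-magnitude guide. The helper `chance_match_probability` is a hypothetical name for illustration and is not part of cutadapt.

```python
def chance_match_probability(overlap: int) -> float:
    """Probability that the last `overlap` bases of a random read are
    identical to the first `overlap` bases of an adapter.

    Assumption: A, C, G and T are equally likely and independent,
    so each position matches with probability 1/4.
    """
    if overlap < 1:
        raise ValueError("overlap must be at least 1")
    return 0.25 ** overlap


# Number of reads in the FastQC sample mentioned in NB2.
n_reads = 198_000

for overlap in (3, 5):
    p = chance_match_probability(overlap)
    expected = p * n_reads
    print(f"overlap={overlap}: p={p:.6f}, "
          f"expected chance matches in {n_reads} reads: about {expected:.0f}")
```

Under this simplified model, raising the minimum overlap from 3 to 5 reduces the chance-match rate by a factor of 16 (4 squared).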
My apologies for the multiple questions. Thank you in advance for your help with this!
Much obliged!
SEQNovice