Hi, so I'm working with some similar data. Something I found is that a lot of trimming tools aren't really set up for paired-end reads. I have a pipeline for trimming and aligning reads. It goes basically like this:
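# Start with the two paired-end Illumina files. This keeps only the reads that passed Illumina's basic quality filter (flagged N in the header) and writes them to the filtered files. The second grep drops the "--" separator lines that grep -A prints between non-adjacent matches, which would otherwise corrupt the FASTQ.
grep -A 3 '^@.* [^:]*:N:[^:]*:' "$INPUT1" | grep -v '^--$' > "$FILTERED1"
grep -A 3 '^@.* [^:]*:N:[^:]*:' "$INPUT2" | grep -v '^--$' > "$FILTERED2"
# fastq-mcf is good for dealing with paired-end reads; it's the best I could find for paired-end trimming. I don't remember all the parameters, but there's a great resource out there describing this tool. As far as I recall from its usage text: -l is the minimum read length to keep, -q the quality threshold for trimming, -w the trimming window size, -x the percentage of N calls that causes a cycle to be removed, -u turns on Illumina PF filtering, and -P sets the Phred scale.
fastq-mcf -o "$OUTPUT1" -o "$OUTPUT2" -l 16 -q 15 -w 4 -x 10 -u -P 33 "$ADAPTERS" "$FILTERED1" "$FILTERED2"
# This aligns the trimmed pairs with bowtie and writes a SAM file. $REF_GENOME is the basename of the bowtie index.
bowtie -t -p 8 --sam "$REF_GENOME" -1 "$OUTPUT1" -2 "$OUTPUT2" "$ALIGNED_OUTPUT"
# This makes a sorted, indexed BAM file from the bowtie alignment, which can be used for all sorts of things. This is the older samtools syntax, where sort takes an output prefix and adds .bam itself; newer samtools 1.x releases want "samtools sort -o $SORTED_BAM.bam -" instead.
samtools view -bS "$ALIGNED_OUTPUT" | samtools sort - "$SORTED_BAM"
samtools index "$SORTED_BAM.bam" "$SORTED_BAM.bam.bai"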
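One thing worth checking, since the two mate files are filtered separately: they should still contain the same number of reads afterwards, or the paired-end trimming and alignment steps will be working on mismatched pairs. Here is a minimal sketch of that check. It is not part of the pipeline above; `count_reads` and `check_pairs` are hypothetical helper names, and it assumes uncompressed FASTQ with exactly four lines per record.

```shell
# Hypothetical helper: number of records in a FASTQ file,
# assuming exactly four lines per record.
count_reads() {
    awk 'END { print NR / 4 }' "$1"
}

# Hypothetical helper: succeeds only if both mate files
# hold the same number of records.
check_pairs() {
    [ "$(count_reads "$1")" = "$(count_reads "$2")" ]
}
```

You could run it right after the grep step, for example `check_pairs "$FILTERED1" "$FILTERED2" || echo "mate files are out of sync" >&2`. It only compares counts; it does not confirm that read names match line by line.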
That's pretty much how I'm doing it for my data, and it works pretty well. As for those nasty overrepresented sequences: I'm guessing you're doing quality assessment with FastQC, which is a great tool. In my case, I did RNA-seq on bacterial genomes, so my read depth is really high because the genome is small. Add some highly expressed genes on top of that and you get flagged for overrepresented sequences. I'm basically ignoring them in my data, but think about what overrepresented sequences mean for your data and whether they're actually a problem or not.
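If you want to look at the most common sequences yourself rather than rely only on the QC report, something like the following works as a rough sketch. `top_sequences` is a hypothetical helper name of my own, and it assumes uncompressed FASTQ with four lines per record; it counts exact duplicate reads only, so it won't match the QC tool's numbers exactly.

```shell
# Hypothetical helper: print the N most frequent read sequences
# in a FASTQ file (default N = 10), each prefixed by its count.
top_sequences() {
    # The sequence is the second line of every four-line record.
    awk 'NR % 4 == 2' "$1" | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

For example, `top_sequences "$OUTPUT1" 20` lists the 20 most frequent trimmed reads from the first mate file. Pasting a few of them into BLAST against your reference is a quick way to tell highly expressed genes (rRNA is a usual suspect in bacterial RNA-seq) from leftover adapter.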
Hope this helps.