I'm looking for for some help with fastx clipper. Despite my best (if inexperienced) efforts, it's not doing what I want. So far it's little better than random. Worse, actually, since it's not clipping at the adapter sequence provided, but at other sequences entirely.
I've analysed my Illumina data using FastQC. This showed contamination with indexed TruSeq adapters. The universal adapter did not show up (below the 0.1% threshold I guess). It also showed up fairly high (min 30%, mostly >60%) levels of sequence redundancy in the libraries.
As part of a pipeline I used fastx_clipper to remove adapter-containing reads entirely (-C option). This removed huge numbers of reads (e.g. ~20% of all reads even for TruSeq Universal adapter - which did not show up in the FastQC analysis).
I've spent the rest of the afternoon away from the pipeline, using commandline to try figure out what it's actually doing. For this I used a single fastq file (2.fq) as a test case.
I used grep -c into the .fastq file (2.fq) to count up the instances of the Illumina universal adapter. This showed there were some, mostly at start of inserts; certainly nothing like the number fastx_clipper removed.
I ran a manual fastx_clipper (path specifications removed)
Since I couldn't understand what was going on, I repeated this but used option -c to retained only those sequences that had been clipped (ie those reads that had originally had the adapter). Numbers-wise outcome was comparable to the first run.
I extracted at random a single sequence (TGGTATTTTATTTTTCTACCTAAATTT) from this file, and grepped back into the file to recover all reads terminating with the sequence.
Then I did the same with the original file (2.fq) so I had a set of clipped and original unclipped sequences I could compare.
Comparison of the two on alternating lines below (first line clipped, second line original, etc) shows that the sequences removed by fastx clipper are not those supplied as the -a param string. If I blast the sequence removed it is contiguous sequence with the original random sequence (TGGTATTTTATTTTTCTACCTAAATTT) ... not the supplied commandline sequence string.
CTTGGTATTTTATTTTTCTACCTAAATTT
CTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTTA
TCTTGGTATTTTATTTTTCTACCTAAATTT
TCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTT
TTTCTTGGTATTTTATTTTTCTACCTAAATTT
TTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAAT
TTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
TTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGC
ATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
ATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATG
AAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
AAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATA
GGAAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
GGAAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAA
I've since repeated this with 4 addition sequences from the fastx_clipper output, with the same result. This has left me baffled. I would welcome someone pointing out my simple error (?)!
M
I've analysed my Illumina data using FastQC. This showed contamination with indexed TruSeq adapters. The universal adapter did not show up (below the 0.1% threshold I guess). It also showed up fairly high (min 30%, mostly >60%) levels of sequence redundancy in the libraries.
As part of a pipeline I used fastx_clipper to remove adapter-containing reads entirely (-C option). This removed huge numbers of reads (e.g. ~20% of all reads even for TruSeq Universal adapter - which did not show up in the FastQC analysis).
I've spent the rest of the afternoon away from the pipeline, using commandline to try figure out what it's actually doing. For this I used a single fastq file (2.fq) as a test case.
I used grep -c into the .fastq file (2.fq) to count up the instances of the Illumina universal adapter. This showed there were some, mostly at start of inserts; certainly nothing like the number fastx_clipper removed.
PHP Code:
$ grep -ce AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 2.fq
6903 # total instances
$ grep -ce ^AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT 2.fq
6329 # almost all of them at start of reads (i.e. no insert)
I ran a manual fastx_clipper (path specifications removed)
PHP Code:
$ fastx_clipper -Cva AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
-i 2.fq -o 2.noUniv.fq
Min. Length: 5
Clipped reads - discarded.
Input: 4959421 reads.
Output: 3937769 reads.
discarded 55558 too-short reads.
discarded 26892 adapter-only reads. #
discarded 936303 clipped reads. # these two lines total 963195 ...
discarded 2899 N reads. ... >>>> the 6903 from grep.
Since I couldn't understand what was going on, I repeated this but used option -c to retained only those sequences that had been clipped (ie those reads that had originally had the adapter). Numbers-wise outcome was comparable to the first run.
PHP Code:
$ fastx_clipper -cva AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
-i 2.fq -o 2_clippedonly
Min. Length: 5
Non-Clipped reads - discarded.
Input: 4959421 reads.
Output: 936149 reads. # that's 936149 reads that originally HAD adapter
discarded 55558 too-short reads.
discarded 26892 adapter-only reads.
discarded 3940668 non-clipped reads.
discarded 154 N reads.
I extracted at random a single sequence (TGGTATTTTATTTTTCTACCTAAATTT) from this file, and grepped back into the file to recover all reads terminating with the sequence.
PHP Code:
$ grep TGGTATTTTATTTTTCTACCTAAATTT 2_clippedonly | sort | uniq -c | sort -k1,1n
> 2_clipped_TGGTATTTTATTTTTCTACCTAAATTT
Then I did the same with the original file (2.fq) so I had a set of clipped and original unclipped sequences I could compare.
PHP Code:
$ grep TGGTATTTTATTTTTCTACCTAAATTT 2.fq | sort | uniq -c | sort -k1,1n
> 2_original_TGGTATTTTATTTTTCTACCTAAATTT
Comparison of the two on alternating lines below (first line clipped, second line original, etc) shows that the sequences removed by fastx clipper are not those supplied as the -a param string. If I blast the sequence removed it is contiguous sequence with the original random sequence (TGGTATTTTATTTTTCTACCTAAATTT) ... not the supplied commandline sequence string.
CTTGGTATTTTATTTTTCTACCTAAATTT
CTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTTA
TCTTGGTATTTTATTTTTCTACCTAAATTT
TCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAATTT
TTTCTTGGTATTTTATTTTTCTACCTAAATTT
TTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGCTACCTAAT
TTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
TTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATGATATTGC
ATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
ATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATAAAGAAAAAAGTCAGAAAATG
AAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
AAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAATA
GGAAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTT
GGAAAAAACAATAGTAATAGCCATATTTTTTGTTGTATTTCTTGGTATTTTATTTTTCTACCTAAATTTAAATCGTAGGTTAGCATTAAGTGTTTTTACTATGAA
I've since repeated this with 4 addition sequences from the fastx_clipper output, with the same result. This has left me baffled. I would welcome someone pointing out my simple error (?)!
M
Comment