Hello everyone,
I have some challenges for the group, and any help or suggestions are welcome:
I ran several genomes using the 8 kb paired-end protocol at one genome per lane. The bioinformatics group in my facility has little experience with this and is struggling to help with my project, so here are the problems.
A) The runs seem to be contaminated by chimeric fragments from the sequencing adapters used in making the paired-end data. Is there any software or script out there that can remove sequences matching the adapters while (and this is the key part) allowing for a certain percentage of mismatch to the adapter sequence? This is to account for chimeric multi-primer sequences.
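To make the request in A) concrete, here is a toy Python sketch of the kind of mismatch-tolerant matching I mean. It is only an illustration, far too slow for a full lane. The adapter string, the 10% mismatch threshold and the 5-base minimum overlap are made-up values, and it only counts substitutions (no indels).

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter(read, adapter, max_mismatch_frac=0.1, min_overlap=5):
    """Return (start, mismatches) of the leftmost adapter hit, or None.

    A hit is a window where the read matches the adapter (or an adapter
    prefix running off the 3' end of the read) with at most
    max_mismatch_frac * overlap substitutions.
    """
    n, m = len(read), len(adapter)
    for start in range(n - min_overlap + 1):
        overlap = min(m, n - start)  # shorter than m near the 3' end
        if overlap < min_overlap:
            continue
        mism = hamming(read[start:start + overlap], adapter[:overlap])
        if mism <= max_mismatch_frac * overlap:
            return start, mism
    return None


# Hypothetical 10-base adapter, not the real Illumina sequence.
ADAPTER = "GATTACAGGC"
hit_full = find_adapter("TTTTT" + "GATTACTGGC" + "AA", ADAPTER)  # one substitution
hit_partial = find_adapter("CCCCCCCC" + "GATTAC", ADAPTER)       # adapter prefix at 3' end
hit_none = find_adapter("CCCCCCCCCCCC", ADAPTER)
```

A real tool would also need to handle insertions/deletions and quality values, which this sketch ignores.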
B) The next problem is that the paired-end protocol also uses a central adapter, and the true paired-end data are the reads that start at both ends of the genomic fragment, far from the central adapter (see the PDF protocol for more detail: http://www.illumina.com/applications...equencing.ilmn). However, because of the random shearing steps required, the technology cannot guarantee that the central adapter sits exactly in the center, so the 42 bp adapter and the genomic sequence can come in all possible combinations, as follows:
1- sequence read + adapter read (this is the easy one, where a 3' trimming tool can do the job)
2- adapter read + sequence read (this would need a 5' trimming tool). This can be tricky when the adapter part is as short as 2 or 3 bases, because those bases will show up later in the assembly. More importantly, a read that starts with adapter followed by the actual sequence is not a true 8 kb paired end; it is actually a paired end of the 500-base fragment placed in the sequencing reaction. This data should then be trimmed and moved into a 500 b fragment file (along with its pair) and used that way, or just used as a single read.
3- adapter in the center: sequence read + adapter read + sequence read. This case should be handled by a 3' end trimmer, but the trimmer should be able to recognize the adapter in the center of the read and not only at the end of the read, which is what trimmers are usually coded for.
4- The removal tool should be able to act on pairs: e.g. if it discards one read of a pair as a chimeric primer, it should also throw away the second one (no chimeras allowed). The trimming tool should trim both reads of a pair together and place the trimmed pair as true 8 kb or 500 b, or, if one of the reads is eliminated because what is left is too small, the other read should go to a single-reads file.
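As a rough illustration of cases 1 to 3 above, this is how I picture the per-read classification, sketched in Python. It uses exact matching only to keep it short (a real tool would need the mismatch tolerance from A). The adapter string, the label names and the 3-base minimum overlap are all invented for the example. With such a short minimum overlap, some hits at the read ends will be chance matches to genomic sequence.

```python
def classify_read(read, adapter, min_overlap=3):
    """Return (label, kept_sequence) for one read. Exact matching only."""
    m = len(adapter)
    pos = read.find(adapter)
    if pos == 0:
        # Case 2: read starts with the adapter, keep what follows it.
        return "adapter_first", read[m:]
    if pos > 0:
        # Case 1 (adapter at the 3' end) or case 3 (adapter in the middle):
        # either way keep only the sequence before the adapter.
        label = "adapter_last" if pos + m == len(read) else "adapter_middle"
        return label, read[:pos]
    # No full-length hit: look for a partial adapter at either end,
    # trying the longest overlap first.
    for k in range(min(m - 1, len(read)), max(min_overlap, 1) - 1, -1):
        if read.startswith(adapter[-k:]):
            return "adapter_first", read[k:]
        if read.endswith(adapter[:k]):
            return "adapter_last", read[:-k]
    return "clean", read


# Hypothetical 10-base adapter standing in for the 42 bp one.
ADAPTER = "GATTACAGGC"
case1 = classify_read("AAAAACCCCC" + ADAPTER, ADAPTER)
case2 = classify_read(ADAPTER + "AAAAACCCCC", ADAPTER)
case3 = classify_read("AAAAA" + ADAPTER + "CCCCC", ADAPTER)
partial5 = classify_read("AGGC" + "TTTTTTTT", ADAPTER)
partial3 = classify_read("TTTTTTTT" + "GATT", ADAPTER)
```

The label is what matters for the pair-level decision in point 4: a pair with an "adapter_first" read belongs with the 500 b fragments, not the 8 kb ones.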
After all the sorting, filtering, etc., the reads should be organized in different files: true clipped 8 kb paired ends, 500 b paired ends, and single reads, all after removing the chimeric/artifact reads coming from primer dimerization. This is a complex case, and my question is which tool or set of tools you would recommend for doing all these steps, so I can use the ginormous amount of reads that have been kidnapped by all these issues.
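In case it helps to see the sorting rules written down, here is a minimal Python sketch of the pair-level decision I have in mind. It assumes each read has already been trimmed and given one of these made-up labels: "clean", "adapter_last", "adapter_middle", "adapter_first" or "chimera". The bin names and the 20-base minimum length are also just placeholders.

```python
def route_pair(read1, read2, min_len=20):
    """Decide which output file a trimmed pair belongs to.

    read1 and read2 are (label, trimmed_sequence) tuples.
    Returns (bin_name, sequences_to_write).
    """
    (label1, seq1), (label2, seq2) = read1, read2
    # No chimeras allowed: one chimeric read throws away the whole pair.
    if "chimera" in (label1, label2):
        return "discard", []
    # Reads that are too short after trimming are dropped.
    survivors = [s for s in (seq1, seq2) if len(s) >= min_len]
    if len(survivors) == 0:
        return "discard", []
    if len(survivors) == 1:
        return "single", survivors
    # A read that started with adapter means the pair spans the
    # 500 b fragment, not the 8 kb insert.
    if "adapter_first" in (label1, label2):
        return "pair_500b", [seq1, seq2]
    return "pair_8kb", [seq1, seq2]


true_pair = route_pair(("clean", "A" * 30), ("adapter_last", "C" * 25))
short_pair = route_pair(("adapter_first", "A" * 30), ("clean", "C" * 30))
orphan = route_pair(("clean", "A" * 30), ("adapter_last", "C" * 5))
chimeric = route_pair(("chimera", "A" * 30), ("clean", "C" * 30))
```

Each bin name would map to its own output file, so the 8 kb pairs, 500 b pairs and single reads stay separate.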
Any advice is welcome.
Hinsby