Is there some way to use RNA-seq and/or whole genome sequencing data (I have both for the relevant samples) to find the genomic location of an insert with an unknown location? The insert itself is of known sequence, and aligns correctly to a reference containing only itself + some minor control sequences.
I was told that one thing I might do is to align my data to the reference containing only the insert sequences, but split my (paired-end) data into two, i.e. only align one pair at a time as a single end ("..._1"-files and "..._2"-files separately). I should then take out all the reads that align (by name) and subset the other original fastq files by them so I get their mates (i.e. subset "..._2" by aligned reads in "..._1") and align those to the normal reference genome, again single-end. I would then, hopefully, get reads aligning to the same region, and I would know the location of my insert (after which I could create some PCR primers and validate the results).
I have done this with my WGS-data, but the reads map more or less randomly across all chromosomes... I feel I might be subsetting the read names wrong, somehow, mostly because I don't think I'm sure exactly how they are given names and how to find the pairs properly. At the moment, this is what I'm doing:
Am I doing something wrong with the analysis, or is the idea somehow flawed? I am being fairly stringent in the first alignment step, using the -B 40 -O 60 -E 10 options (with BWA), in order to hopefully only align more exact matches (I have also done without this stringency, with more or less the same results).
Does anybody have any idea what I'm doing wrong, what's wrong with the idea, or have any other idea on how to find an unknown insert?
I was told that one thing I might do is to align my data to the reference containing only the insert sequences, but split my (paired-end) data into two, i.e. only align one pair at a time as a single end ("..._1"-files and "..._2"-files separately). I should then take out all the reads that align (by name) and subset the other original fastq files by them so I get their mates (i.e. subset "..._2" by aligned reads in "..._1") and align those to the normal reference genome, again single-end. I would then, hopefully, get reads aligning to the same region, and I would know the location of my insert (after which I could create some PCR primers and validate the results).
I have done this with my WGS-data, but the reads map more or less randomly across all chromosomes... I feel I might be subsetting the read names wrong, somehow, mostly because I don't think I'm sure exactly how they are given names and how to find the pairs properly. At the moment, this is what I'm doing:
Code:
(... alignment with BWA) samtools view mapped.sorted.rmdup.input_1.bam | \ gawk '{print $1}' | \ sort | \ uniq > unique.txt fastqutils filter -whitelist unique.txt input_2.fastq > 1-to-2.fastq
Does anybody have any idea what I'm doing wrong, what's wrong with the idea, or have any other idea on how to find an unknown insert?
Comment