I'm writing an NGS simulator as part of a project for my school. I'm trying to simulate paired-end FASTQ data, but I'm unsure of the correct formatting. I can produce simulated single-end reads and align them successfully with BWA, but when I try to produce pairs I end up with various alignment errors. I read that for BWA, the reads in the second read file should be the reverse complement.
If I have a base sequence and simulate a pair of reads, one from each end, like this...
Code:
AAAGGGTTCTC
read AAAG
read TCTC
I'm outputting:
file1
read/1 AAAG
file2
read/2 TCTC
(along with the quality scores and the rest of the formatting)
Is that the way they should be? It doesn't seem to work, so I'm guessing not, but leaving the second read as it is (without the reverse complement) doesn't work either. I suspect I'm missing something about how pairs of reads are represented.
I may also have the naming conventions wrong. Should paired reads be split into two separate files and labeled <read-name>/1 and <read-name>/2? Ultimately they get rolled into one file, so should I be combining them into one file myself?
I'm not sure if I have a software bug and am just producing wrong data, or if I'm producing the right thing but formatting it wrong.
Does anyone have an example of correct formatting that I could use as a template? I have had trouble locating an example.
Any help would be appreciated.
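In case it helps to see it concretely, here is a minimal Python sketch of the toy example above. It is not my simulator's actual code: the function names are made up for illustration, the quality string is a constant placeholder, and it takes one read from each end of the fragment. It builds read 2 both ways, as-is and reverse-complemented; as far as I understand, the reverse-complemented form is what the usual forward/reverse paired-end convention expects, but that is exactly the part I'm unsure about.

```python
# Hypothetical helper names; a sketch of the toy example, not the real simulator.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def fastq_record(name, seq, qual_char="I"):
    """Format one four-line FASTQ record with a constant placeholder quality."""
    return f"@{name}\n{seq}\n+\n{qual_char * len(seq)}\n"


def simulate_pair(fragment, read_len, name="read"):
    """Take one read from each end of the fragment.

    Returns (record 1, record 2 as-is, record 2 reverse-complemented).
    """
    left = fragment[:read_len]
    right = fragment[-read_len:]
    return (
        fastq_record(name + "/1", left),
        fastq_record(name + "/2", right),
        fastq_record(name + "/2", reverse_complement(right)),
    )


r1, r2_forward, r2_revcomp = simulate_pair("AAAGGGTTCTC", 4)
print(r1, end="")          # record that would go into file1
print(r2_revcomp, end="")  # reverse-complemented record for file2
```

With this fragment, read/1 is AAAG, the as-is read/2 is TCTC, and the reverse-complemented read/2 is GAGA.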