Hi everyone,
I am trying to figure out which paired-read name formats are in use and worth supporting in open source alignment programs. The most common paired-read name format I encounter looks something like the following,
>READ_A/1
AAAAACCCCCCC
>READ_A/2
TTTTTGGGGGAA
Is this the only format used by the different sequencing platforms, or are there other formats, such as,
>READ_B:1
AAAAACCCCCCC
>READ_B:2
TTTTTGGGGGAA
Also, with respect to the SAM format, does anyone know what the proper query name for these reads should be?
Most aligners that I have tried seem to agree that for paired reads whose names end in "/1" or "/2", these last two characters are stripped to get the SAM query name.
For example, the first read pair would be reported as,
READ_A
But the second read pair would keep two distinct query names in the SAM file,
READ_B:1 and READ_B:2
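To make the behaviour described above concrete, here is a minimal Python sketch of the suffix stripping. It is only my reading of what those aligners appear to do, not code taken from any of them, and the function name sam_query_name is made up for illustration.

```python
def sam_query_name(header: str) -> str:
    """Derive a query name from a FASTA header (hypothetical helper).

    Assumption: only a trailing "/1" or "/2" is treated as a mate suffix;
    any other ending (such as ":1") is left untouched.
    """
    # Drop the leading ">" of a FASTA header line, if there is one.
    name = header[1:] if header.startswith(">") else header
    # Strip the two-character mate suffix when present.
    if name.endswith(("/1", "/2")):
        return name[:-2]
    return name


for header in (">READ_A/1", ">READ_A/2", ">READ_B:1", ">READ_B:2"):
    print(header, "->", sam_query_name(header))
```

With this rule READ_A/1 and READ_A/2 collapse to one name, while READ_B:1 and READ_B:2 stay distinct.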
I was thinking that maybe the safest approach would be to find the longest common prefix of the two names and truncate there to get the pair read name. But I am not sure this will always work. For example, if the read names are,
>READ_C_Z12411/1
AAAAACCCCCCC
>READ_C_Z12516/2
TTTTTGGGGGAA
The pair read name based on the longest common prefix would be,
READ_C_Z12
whereas most aligners would report two separate names,
READ_C_Z12411 and READ_C_Z12516
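The longest-common-prefix idea can be sketched in Python as below. This is only an illustration under my own assumptions: the helper names pair_name_by_prefix and is_plausible_pair are hypothetical, and I assume "/" and ":" are the only separators worth trimming from the end of the prefix. os.path.commonprefix is a standard library function that compares strings character by character.

```python
from os.path import commonprefix


def pair_name_by_prefix(name1: str, name2: str) -> str:
    """Longest common prefix of two mate names, minus a trailing "/" or ":"."""
    return commonprefix([name1, name2]).rstrip("/:")


def is_plausible_pair(name1: str, name2: str) -> bool:
    """True only if the names are identical except for a final "1" vs "2"."""
    return (
        name1[:-1] == name2[:-1]
        and name1[-1:] == "1"
        and name2[-1:] == "2"
    )


pairs = [
    ("READ_A/1", "READ_A/2"),
    ("READ_B:1", "READ_B:2"),
    ("READ_C_Z12411/1", "READ_C_Z12516/2"),
]
for first, second in pairs:
    print(first, second,
          pair_name_by_prefix(first, second),
          is_plausible_pair(first, second))
```

The third pair shows the problem: the prefix alone gives READ_C_Z12, so some extra check (here, that the names differ only in the last character) seems necessary before treating two reads as mates.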
Thanks for the help!
Misko