I am trying to extract sequences for a list of predicted genes from genomic scaffolds. The list of predicted genes with Scaffold IDs, start and end positions, and other info comes from published supplementary data. My script to extract the sequences doesn't work because for some genes, the start position is a larger number than the end position (fourth-to-last and third-to-last columns below). Here is an example (numbers have been changed from original):
I am new to working with annotated genomes. Does it make sense that some "starts" come after the "ends"? Is this because the ORF for that gene is on the opposite strand of the scaffold? If so, and I want to obtain that sequence, what's the best way to get it: should I extract the scaffold sequence between the two positions and then take its reverse complement?
Thanks for any pointers.
geneID Gene_family Class ScaffoldID start_position end_position Number_of_exons Annotation_status
CSP1 cs Protein candidate gi|294506227|gb|GL650210.1| 61498 52100 2 intact
CSP10 cs Protein candidate gi|294507212|gb|GL649715.1| 293074 297989 2 intact
CSP2 cs Protein candidate gi|294507210|gb|GL650017.1| 234944 236074 2 intact
CSP3 cs Protein candidate gi|294507295|gb|GL649612.1| 323100 323743 2 intact
CSP4 cs Protein candidate gi|294506227|gb|GL650210.1| 41911 40888 2 intact
CSP5 cs Protein candidate gi|294507205|gb|GL649712.1| 274408 272617 2 intact
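To make the question concrete, here is a minimal Python sketch of the approach I have in mind. It assumes the coordinates are 1-based and inclusive, and that start > end means the gene is on the reverse strand; neither is confirmed by the supplementary data. The function names and the toy scaffold are made up for illustration.

```python
# Complement table for DNA; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_region(scaffold_seq, start, end):
    """Extract a region from a scaffold sequence.

    ASSUMPTION: coordinates are 1-based and inclusive, and
    start > end marks a feature on the reverse strand.
    """
    lo, hi = min(start, end), max(start, end)
    region = scaffold_seq[lo - 1:hi]  # convert to 0-based, end-exclusive slice
    if start > end:
        return reverse_complement(region)
    return region


# Toy scaffold, not real data.
toy_scaffold = "AAACCCGGGTTT"
forward = extract_region(toy_scaffold, 2, 5)  # start < end: taken as-is
reverse = extract_region(toy_scaffold, 5, 2)  # start > end: reverse-complemented
print(forward, reverse)
```

Since the genes listed have two exons, I assume the extracted span would include the intron as well, so this would give the genomic region rather than the coding sequence.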