Hey guys,
I wanted to map my brand new paired-end RNA-seq data to the mouse transcriptome using the current beta (b5) of Bowtie2.
As I could not find any pre-built index for this, I built it myself, using bowtie2-build to make an index of the Ensembl transcript information.
The three mouse FASTA files for this were downloaded from the Ensembl FTP site. I wanted to get as much information as possible, so I included the cDNA-all, cDNA-abinitio and ncRNA FASTA files for indexing.
Then I mapped the paired-end RNA-seq data to this index using the following command:
./bowtie2 -p 4 -t --local -x mouse_transcriptome_ensembl-NCBI37_ncRNA_cDNAall_abinitiopredictons -1 <matepair1.fastq> -2 <matepair2.fastq> -S output.sam
So far so good, it all worked well with overall alignment rates of 80-90%.
Now, when I try to import the data into SeqMonk, after reading all the lines it tells me that it "Couldn't extract valid name for <Ensembl-transcript-ID/Genscan-ID>" and leaves me with no reads at all... Is this because there is no chromosome information, or because it is not in the expected position?
From what I could find out, the Ensembl FASTA files also contain some "supercontigs" that do not have chromosome information but an NT-xxxx ID.
But even then, there should still be reads with the correct annotation, right?
So what went wrong with my workflow here, and can I still rescue the SAM files that I have now produced?
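To make my guess about the missing chromosome information a bit more concrete, here is a rough Python sketch of how one could look at where the location sits in a FASTA header. It assumes the headers follow the pattern "ID, then a field like chromosome:assembly:name:start:end:strand"; that layout, the function name parse_header and the two example headers are my assumptions (the headers are made up, not copied from the real files).

```python
# Sketch: split an Ensembl-style FASTA header into the ID (first word)
# and the location field. The header layout is an assumption.

def parse_header(header):
    """Return (seq_id, coord_system, region_name) for one FASTA header line."""
    words = header.lstrip(">").split()
    seq_id = words[0]  # only this first word appears as SN: in my SAM header
    for word in words[1:]:
        parts = word.split(":")
        # assumed layout: <coord_system>:<assembly>:<name>:<start>:<end>:<strand>
        if parts[0] in ("chromosome", "supercontig") and len(parts) >= 3:
            return seq_id, parts[0], parts[2]
    return seq_id, None, None


# made-up example headers, for illustration only
examples = [
    ">ENSMUST00000000001 cdna:known chromosome:NCBIM37:3:100:500:-1 gene:ENSMUSG00000000001",
    ">GENSCAN00000000001 cdna:Genscan supercontig:NCBIM37:NT_000001:100:500:1",
]
for line in examples:
    print(parse_header(line))
```

If the real headers look like this, the chromosome or supercontig name is only in the second part of the header, while the reference names in my SAM file seem to be just the first word.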
btw: the SAM file header looks like this:
@HD VN:1.0 SO:unsorted
@SQ SN:GENSCAN00000015589 LN:298
@SQ SN:GENSCAN00000001573 LN:74
@SQ SN:GENSCAN00000001572 LN:260
@SQ SN:GENSCAN00000026402 LN:489
...
@SQ SN:ENSMUST00000146092 LN:216
@SQ SN:ENSMUST00000120435 LN:630
@SQ SN:ENSMUST00000118023 LN:1647
...almost endlessly...
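Since the header is far too long to read by eye, a small Python sketch like the following could summarise which kinds of reference names it contains. The function name count_reference_prefixes is just something I made up, and the sample lines are taken from the header above; for the real file one would pass in the header lines of output.sam instead.

```python
import re
from collections import Counter


def count_reference_prefixes(header_lines):
    """Count @SQ reference names by their leading letters (e.g. ENSMUST)."""
    counts = Counter()
    for line in header_lines:
        if not line.startswith("@SQ"):
            continue  # skip @HD and any other header record types
        for field in line.split()[1:]:
            if field.startswith("SN:"):
                name = field[3:]
                # leading run of letters/underscores; empty for names like "1"
                prefix = re.match(r"[A-Za-z_]*", name).group(0)
                counts[prefix or "(no letters)"] += 1
    return counts


# a few lines from the header shown above
header = [
    "@HD VN:1.0 SO:unsorted",
    "@SQ SN:GENSCAN00000015589 LN:298",
    "@SQ SN:GENSCAN00000001573 LN:74",
    "@SQ SN:ENSMUST00000146092 LN:216",
]
print(count_reference_prefixes(header))
```

If every name falls under GENSCAN or ENSMUST (or similar) and none looks like a chromosome name, that would fit my suspicion that SeqMonk simply cannot find a chromosome in the reference names.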
and then the alignments follow, which look like this:
HWI-ST933:54:C01BFACXX:3:1101:10433:5230 99 ENSMUST00000082408 34 99M = 76 169 CGAAAATCTATTTGCCTCATTCATTACCCCAACAATAATAGGATTCCCAATCGTTGTAGCCATCATTATATTTCCTTCAATCCTATTCCCATCCTCAAA CCCFFFFFHHHHHJJJJJJJJJJJJJJJJJJJJJIJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJAHHHHHFFFFFFEEEDEEDDDDDD AS:i:198 XS:i:98 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:99 YS:i:198 YT:Z:CP