Hey,
I ran TopHat for the first time on paired-end reads that are 101 bp long. Since it was a test run, I used only the first 500,000 reads from each FASTQ file. I noticed that before aligning the left and right reads to the hg19 build, TopHat discarded 19 reads from the left file and 7 reads from the right file.
In case you are wondering about the output below, 'reads_in' is 487,355 rather than 500,000 in each file because the original FASTQ files were quality filtered with the FASTX-Toolkit. The filter removes reads from each file independently, so I had to use a script to find the matching pairs among the first 500,000 reads of the two filtered FASTQ files. That left 487,355 pairs, with the remaining reads orphaned (I will map those separately later).
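A pairing step like the one described above could be sketched roughly as follows. This is not the script that was actually used, only a minimal illustration. It assumes 4-line FASTQ records, read names that match between mates once a trailing /1 or /2 is removed, and inputs small enough to hold in memory. The function names read_fastq and pair_reads are made up for this example.

```python
import io


def read_fastq(handle):
    """Yield (name, record_text) for each 4-line FASTQ record."""
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        seq = handle.readline()
        plus = handle.readline()
        qual = handle.readline()
        # Name = first token after '@', without a trailing /1 or /2.
        name = header[1:].split()[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]
        yield name, header + seq + plus + qual


def pair_reads(left_handle, right_handle):
    """Split two filtered FASTQ streams into matched pairs and orphans."""
    left = dict(read_fastq(left_handle))
    right = dict(read_fastq(right_handle))
    # Pairs are emitted in the order of the left file, so both output
    # files would list mates in the same order.
    paired = [(left[n], right[n]) for n in left if n in right]
    left_orphans = [rec for n, rec in left.items() if n not in right]
    right_orphans = [rec for n, rec in right.items() if n not in left]
    return paired, left_orphans, right_orphans
```

With real files, the two handles would come from open() on the filtered left and right FASTQ files, and the three returned lists would be written out to the paired and orphan files.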
cat left_kept_reads.info
min_read_len=101
max_read_len=101
reads_in =487355
reads_out=487336
cat right_kept_reads.info
min_read_len=101
max_read_len=101
reads_in =487355
reads_out=487348
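The counts of 19 and 7 come straight from reads_in minus reads_out in the two files above. A small sketch of that arithmetic, assuming the .info files contain only simple key=value lines as shown (parse_info and dropped are hypothetical helper names):

```python
def parse_info(text):
    """Parse 'key=value' lines into a dict of ints (spaces around '=' are tolerated)."""
    values = {}
    for line in text.splitlines():
        if "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = int(value.strip())
    return values


def dropped(text):
    """Number of reads removed: reads_in - reads_out."""
    info = parse_info(text)
    return info["reads_in"] - info["reads_out"]


left_info = """min_read_len=101
max_read_len=101
reads_in =487355
reads_out=487336"""

right_info = """min_read_len=101
max_read_len=101
reads_in =487355
reads_out=487348"""

print(dropped(left_info))   # 19
print(dropped(right_info))  # 7
```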
Does anyone know why TopHat discards these reads? Also, is there a way to identify the discarded reads, perhaps by comparing the output BAM file with the two original FASTQ files?
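One approach that might work for the second question, sketched under several assumptions: the read name and FLAG columns of the output BAM have already been dumped to a tab-separated text file (for example with samtools view piped through cut -f1,2), the discarded reads are absent from that dump, and the names in the BAM match the FASTQ names once a trailing /1 or /2 is removed. None of these is confirmed for TopHat here, and the helper names are invented for this example. The FLAG bits 0x40 (first mate) and 0x80 (second mate) are the standard SAM ones.

```python
import io


def seen_mates(name_flag_lines):
    """Collect (read_name, mate) from 'name<TAB>flag' lines."""
    seen = set()
    for line in name_flag_lines:
        line = line.strip()
        if not line:
            continue
        name, flag = line.split("\t")[:2]
        flag = int(flag)
        if flag & 0x40:      # first read in the pair (left file)
            seen.add((name, 1))
        elif flag & 0x80:    # second read in the pair (right file)
            seen.add((name, 2))
    return seen


def fastq_names(handle):
    """Yield read names from the header line of each 4-line FASTQ record."""
    for i, line in enumerate(handle):
        if i % 4 == 0 and line.strip():
            name = line[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def missing_reads(fastq_handle, mate, seen):
    """Names present in the FASTQ file but not seen for that mate in the dump."""
    return [n for n in fastq_names(fastq_handle) if (n, mate) not in seen]
```

Calling missing_reads once with the left FASTQ and mate=1 and once with the right FASTQ and mate=2 would give the two candidate lists, which should have 19 and 7 entries if the assumptions hold.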
Finally, what is typically the next step after obtaining this BAM file? This is part of an RNA-seq analysis.
Any help would be greatly appreciated.
Thanks