
**krobison** · 05-09-2013, 06:44 AM

If you'd post a little output, it should be trivial to do this -- so long as you don't have so much that you need to worry about memory management.

But I'd also recommend you revisit your decision to use BLAST rather than one of the newer algorithms designed for paired end reads -- Bowtie2, BWA, etc. They'll integrate the results for you, probably run faster and likely have other benefits.

**sp24** · 05-09-2013, 07:03 AM

I thought those programs could only be used if there is a reference genome, but I could be wrong? What if I used the transcriptome assembled from these reads as the reference?

So, if this is the output from blasting against the R1 database, I would want to pull out their partner from the output from blasting against the R2 database. First column is the query and second column is the hit.

gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923 100.00 32 0 0 439 470 96 1 3e-16 77.8
gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1304:4294:174785 100.00 32 0 0 437 468 96 1 5e-16 77.4
gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:11629:162562 100.00 32 0 0 441 472 98 3 1e-15 75.9
gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:14635:22937 96.88 32 1 0 444 475 97 2 1e-14 73.6
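
For reference, here is a minimal Python sketch of how rows like these could be parsed. It assumes the output is the standard 12-column tabular BLAST layout (query, hit, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score), which matches the number of fields shown but is not confirmed in the thread. The name `BlastHit` is made up for illustration.

```python
from collections import namedtuple

# Field names assume the standard 12-column tabular BLAST layout (an assumption).
BlastHit = namedtuple(
    "BlastHit",
    "query hit pident length mismatch gapopen qstart qend sstart send evalue bitscore",
)

def parse_hit(line):
    """Split one tabular BLAST line (tab- or space-separated) into a BlastHit."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 fields, got {len(f)}")
    return BlastHit(
        f[0], f[1], float(f[2]),
        int(f[3]), int(f[4]), int(f[5]),
        int(f[6]), int(f[7]), int(f[8]), int(f[9]),
        float(f[10]), float(f[11]),
    )

line = (
    "gi|313661399|ref|NP_001186313.1| "
    "D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923 "
    "100.00 32 0 0 439 470 96 1 3e-16 77.8"
)
hit = parse_hit(line)

# Subject start greater than subject end means the match lies on the
# reverse strand of the read.
print(hit.hit, hit.sstart > hit.send)
```

Splitting on any whitespace works here because neither the query nor the read IDs contain spaces.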

**GenoMax** · 05-09-2013, 07:34 AM

Originally posted by sp24

I thought those programs could only be used if there is a reference genome, but I could be wrong? What if I used the transcriptome assembled from these reads as the reference?

Most aligners will accept a multi-fasta format sequence file as a "reference". You will need to create indexes for the set as needed.

You are planning to use the assembled transcriptome as a "reference" for the queries with NP* or the other way around?

Was your blast search a tblastn?

**sp24** · 05-09-2013, 08:04 AM

I have a few options:

I can use that R1/R2 no-redundancy file that I asked about and use an assembly program (I'm looking at one gene at a time).

Or I can use Cap3.

Someone else in our lab assembled the transcriptome, so I think that would be an option for me to use as a reference. I'm not sure though since I have not worked on anything like this.

I just want to pull out my sequence and be able to build a phylogenetic tree for one gene, so right now I'm just practicing trying to pull out one gene from one organism. Then I can move on and do the same with the next organism.

And yes, I used tblastn.

**GenoMax** · 05-09-2013, 09:45 AM

Originally posted by sp24

I just want to pull out my sequence and be able to build a phylogenetic tree for one gene, so right now I'm just practicing trying to pull out one gene from one organism. Then I can move on and do the same with the next organism.

If you have the blast database made for your NGS data then look into using the "blastdbcmd" command from the blast manual: http://www.ncbi.nlm.nih.gov/books/NBK1763/ See the section on "Extracting data from BLAST databases with blastdbcmd".
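
As a rough illustration, a batch retrieval call might be assembled like this. `-db`, `-entry_batch` and `-out` are blastdbcmd options; the database name `R1_db` and the output file name are hypothetical, and the ID file name is borrowed from later in this thread. Retrieval by ID generally requires that the database was built with `makeblastdb -parse_seqids`, which is assumed here.

```python
import shlex

def blastdbcmd_batch(db, id_file, out_fasta):
    """Build a blastdbcmd call that fetches every ID listed in id_file as FASTA."""
    return ["blastdbcmd", "-db", db, "-entry_batch", id_file, "-out", out_fasta]

# "R1_db" and the output name are placeholders, not names from the thread.
cmd = blastdbcmd_batch("R1_db", "Gene_XXR1hit_ids", "Gene_XXR1hits.fasta")
print(shlex.join(cmd))

# To actually run it (requires BLAST+ on PATH):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Note that this returns FASTA records from the database, so quality scores are not included.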

If you want to extract the fastq format sequences then there are other threads that have suggestions.

Phylogenetic tree for one gene but from what exact sequences? Do you have individual sequence files for multiple organisms and you are looking to get the reads for a specific gene from each of these files, assemble them and then build a tree?

**krobison** · 05-10-2013, 05:42 AM

Originally posted by sp24

I thought those programs could only be used if there is a reference genome, but I could be wrong? What if I used the transcriptome assembled from these reads as the reference?

So, if this is the output from blasting against the R1 database, I would want to pull out their partner from the output from blasting against the R2 database. First column is the query and second column is the hit.

gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923 100.00 32 0 0 439 470 96 1 3e-16 77.8
gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1304:4294:174785 100.00 32 0 0 437 468 96 1 5e-16 77.4
gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:11629:162562 100.00 32 0 0 441 472 98 3 1e-15 75.9
gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:14635:22937 96.88 32 1 0 444 475 97 2 1e-14 73.6

Any prior sequence you have is a reference sequence. With transcriptomes it is a bit tricky if you have multiple isoforms, as default parameters on many of these programs do not favor queries that align in multiple locations.

In the above, is there any way to tell forward & reverse reads for the same fragment? Not being able to distinguish is not fatal, but potentially problematic.

In any case, if all you want is the list of reads that are hit twice by a given query, you can in principle do this at the command line -- though if the file is big enough, the sort step may give trouble:

# take first two columns (query id, hit id)
# report all pairs which appear two or more times
cut -f1,2 blast.table.txt | sort | uniq -d > pairs.txt
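
The same idea can be sketched in Python, which may be easier to extend later (for example, to require a minimum bit score). This is only an illustrative equivalent of the pipeline above; the function name `repeated_pairs` is made up, and it splits on any whitespace rather than strictly on tabs.

```python
from collections import Counter

def repeated_pairs(lines):
    """Return sorted (query, hit) pairs that occur on two or more lines."""
    counts = Counter(
        tuple(line.split()[:2]) for line in lines if line.strip()
    )
    return sorted(pair for pair, n in counts.items() if n >= 2)

# Shortened example rows: the first read is hit twice, the second only once.
rows = [
    "gi|313661399|ref|NP_001186313.1|\tD3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923\t100.00",
    "gi|313661399|ref|NP_001186313.1|\tD3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923\t96.88",
    "gi|313661399|ref|NP_001186313.1|\tD3NH4HQ1:107:C0LN7ACXX:1:1304:4294:174785\t100.00",
]
pairs = repeated_pairs(rows)
for query, hit in pairs:
    print(query, hit, sep="\t")
```

For a real file, pass an open file handle instead of the list, e.g. `repeated_pairs(open("blast.table.txt"))`.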

**sp24** · 05-10-2013, 07:25 AM

After blasting I decided to use the command below to pull out hit IDs.

cat Gene_XXR1.out | cut -f 2 | sort -u > Gene_XXR1hit_ids

So I would know that all the IDs in this file belong to the R1 database. It gives me the list of IDs, but nothing in an ID itself shows that it came from R1.

I tried that command, krobison. I should be clear about what I have done:
1) blast gene against R1 database
2) blast gene against R2 database
3) pull out hits ids from each blast output using:
cat Gene_XXR1.out | cut -f 2| sort -u > Gene_XXR1hit_ids
cat Gene_XXR2.out | cut -f 2| sort -u > Gene_XXR2hit_ids

OR use the command you gave. To do that, I first had to merge the two separate blast outputs from R1 and R2 using a perl script.

4) Not sure now. I've been advised to use the names of R1 to pull out reads from the original R2 fastq file and use names of R2 to pull out reads from the original R1 fastq file. I'm trying to use mirabait in Mira to do this right now. I think in order to do this it would be best if I use the cat command in step 3 in order to keep the R1 and R2 hit ID's separate.
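
One way step 4 could look in plain Python is sketched below. It assumes 4-line (unwrapped) FASTQ records and read names that differ between mates only by a trailing /1 or /2. The example read names and sequences are invented, and the function names are hypothetical. Taking the union of the R1 and R2 hit IDs and filtering both files with it gives the same result as using R1 names on the R2 file and R2 names on the R1 file, and it also keeps the reads that were hit directly.

```python
import io

def base_id(header):
    """Read name from a FASTQ header: drop '@', any comment, and a trailing /1 or /2."""
    fields = header[1:].split()
    name = fields[0] if fields else ""
    return name[:-2] if name.endswith(("/1", "/2")) else name

def fastq_records(handle):
    """Yield 4-line FASTQ records as tuples; assumes sequences are not line-wrapped."""
    while True:
        rec = [handle.readline() for _ in range(4)]
        if not rec[0].strip():  # end of file (or a blank line): stop
            return
        yield tuple(line.rstrip("\n") for line in rec)

def pull_reads(handle, wanted):
    """Keep only the records whose base read name is in the set `wanted`."""
    return [rec for rec in fastq_records(handle) if base_id(rec[0]) in wanted]

# Invented hit IDs standing in for the contents of the two hit-ID files.
r1_hits = {"RUN:1:1206:8858:106923"}
r2_hits = {"RUN:1:1304:4294:174785"}
wanted = r1_hits | r2_hits  # a hit on either mate selects the whole pair

def toy_fastq(mate):
    """Build a tiny in-memory FASTQ with three reads for the given mate number."""
    names = ["RUN:1:1206:8858:106923", "RUN:1:1304:4294:174785", "RUN:1:9999:1:1"]
    return io.StringIO("".join(f"@{n}/{mate}\nACGT\n+\nIIII\n" for n in names))

picked_r1 = pull_reads(toy_fastq(1), wanted)
picked_r2 = pull_reads(toy_fastq(2), wanted)
for a, b in zip(picked_r1, picked_r2):
    print(a[0], b[0])
```

Because both files are filtered with the same set and keep their original order, the two outputs stay in matching order, provided the input files were properly paired to begin with.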

**GenoMax** · 05-10-2013, 07:57 AM

Read 1: D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923/1
Read 2: D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923/2

It appears that the trailing /1 and /2 which indicate Read 1 and Read 2 have been removed from the identifiers that you are using in your blast search.
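
A small helper can make the matching between stripped BLAST IDs and full FASTQ headers explicit. The sketch below handles the /1 and /2 suffix shown above and, as an extra assumption not raised in this thread, the newer Illumina header style where the mate number follows a space (for example `name 1:N:0:ATCACG`). The function name `split_mate` is made up.

```python
def split_mate(header):
    """Return (read_name, mate) from a FASTQ header line or a bare read ID.

    mate is "1", "2", or None when no mate marker is present.
    """
    text = header[1:] if header.startswith("@") else header
    parts = text.split()
    if not parts:
        return "", None
    name = parts[0]
    # Older convention: name/1 or name/2
    if name.endswith(("/1", "/2")):
        return name[:-2], name[-1]
    # Newer convention (assumed): mate number starts the comment field
    if len(parts) > 1 and parts[1][:2] in ("1:", "2:"):
        return name, parts[1][0]
    return name, None

read1 = split_mate("@D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923/1")
read2 = split_mate("@D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923/2")
blast_id = split_mate("D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923")
print(read1, read2, blast_id)
```

Both mates and the stripped BLAST ID reduce to the same read name, so that name can serve as the lookup key while the mate number is kept separately.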

You should use the scripts/ideas mentioned in the thread I had included in a previous post (#6) to get the reads using the IDs you have parsed out using the two cat commands (after merging the IDs and taking the unique entries as indicated by krobison).

**sp24** · 05-16-2013, 01:45 PM

Thanks everyone for your help. GenoMax, I decided to use that python script because it was the only way I could keep the /1 and /2.

So I've figured out how to pull out the hits, but the python script is extremely slow. It's not horrible, but I have to do this for many, many genes across many transcriptomes, so it's a little painful.

**GenoMax** · 05-16-2013, 05:23 PM

Originally posted by sp24

Thanks everyone for your help. GenoMax, I decided to use that python script because it was the only way I could keep the /1 and /2.

So I've figured out how to pull out the hits, but the python script is extremely slow. It's not horrible, but I have to do this for many, many genes across many transcriptomes, so it's a little painful.

If you have access to a cluster then you could potentially start multiple jobs and work on the transcriptomes in parallel.
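
Independent of running jobs in parallel, one thing that often helps with many genes is to read each FASTQ file only once and sort reads into per-gene groups as they go by, using a dictionary lookup instead of rescanning the file for every gene. Whether rescanning is what makes the script in question slow is an assumption; this is only a sketch with hypothetical names and invented data, and it assumes 4-line FASTQ records.

```python
import io
from collections import defaultdict

def index_hits(hits_by_gene):
    """Invert {gene: set_of_read_names} into {read_name: set_of_genes}."""
    genes_for_read = defaultdict(set)
    for gene, names in hits_by_gene.items():
        for name in names:
            genes_for_read[name].add(gene)
    return genes_for_read

def split_by_gene(handle, hits_by_gene):
    """Read a FASTQ once and return {gene: [record, ...]} for all genes."""
    genes_for_read = index_hits(hits_by_gene)
    out = defaultdict(list)
    while True:
        rec = [handle.readline().rstrip("\n") for _ in range(4)]
        if not rec[0]:  # end of file (or a blank line): stop
            break
        fields = rec[0][1:].split()
        if not fields:
            continue
        name = fields[0]
        if name.endswith(("/1", "/2")):
            name = name[:-2]  # match IDs that had the mate suffix stripped
        for gene in genes_for_read.get(name, ()):
            out[gene].append(tuple(rec))
    return dict(out)

# Invented example: two genes, three reads, one read shared by both genes.
hits = {"geneA": {"r1", "r2"}, "geneB": {"r2"}}
fastq = io.StringIO(
    "@r1/1\nACGT\n+\nIIII\n"
    "@r2/1\nTTTT\n+\nIIII\n"
    "@r3/1\nGGGG\n+\nIIII\n"
)
result = split_by_gene(fastq, hits)
for gene in sorted(result):
    print(gene, [rec[0] for rec in result[gene]])
```

This keeps all selected records in memory, which is fine for a handful of hits per gene; for very large selections, writing each record to a per-gene output file as it is found would be the safer choice.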


Script to retrieve paired end data after blast?
