Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Script to retrieve paired end data after blast?

    Hi All,

    Does anyone have a script to pull out paired end data after a blast search? I created a separate blast database from each paired end read file. I used one gene as a query and did a blast search against each database with a tabular output. What I'd like to do next is use the R1 names to pull out R2 and R2 to pull out R1, and then take out any redundancy.

    I think that I can retrieve the names of the hit reads using:

    cat gene_blast_output | cut -f 2 | sort -u > contig_ids.txt

    But not sure where to go from here.

  • #2
    If you'd post a little output, it should be trivial to do this -- so long as you don't have so much that you need to worry about memory management.

    But I'd also recommend you revisit your decision to use BLAST rather than one of the newer algorithms designed for paired end reads -- Bowtie2, BWA, etc. They'll integrate the results for you, probably run faster and likely have other benefits.

    Comment


    • #3
      I thought those programs could only be used if there is a reference genome, but I could be wrong? What if I used the transcriptome assembled from these reads as the reference?

      So, if this is the output from blasting against the R1 database, I would want to pull out their partner from the output from blasting against the R2 database. First column is the query and second column is the hit.

      gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923 100.00 32 0 0 439 470 96 1 3e-16 77.8
      gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1304:4294:174785 100.00 32 0 0 437 468 96 1 5e-16 77.4
      gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:11629:162562 100.00 32 0 0 441 472 98 3 1e-15 75.9
      gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:14635:22937 96.88 32 1 0 444 475 97 2 1e-14 73.6

      Comment


      • #4
        Originally posted by sp24 View Post
        I thought those programs could only be used if there is a reference genome, but I could be wrong? What if I used the transcriptome assembled from these reads as the reference?
        Most aligners will accept a multi-fasta format sequence file as a "reference". You will need to create indexes for the set as needed.

        You are planning to use the assembled transcriptome as a "reference" for the queries with NP* or the other way around?

        Was your blast search a tblastn?
        Last edited by GenoMax; 05-09-2013, 07:38 AM.

        Comment


        • #5
          I have a few options:

          I can use that R1/R2 no-redundancy file that I asked about and use an assembly program (I'm looking at one gene at a time).

          Or I can use Cap3.

          Someone else in our lab assembled the transcriptome, so I think that would be an option for me to use as a reference. I'm not sure though since I have not worked on anything like this.

          I just want to pull out my sequence and be able to build a phylogenetic tree for one gene, so right now I'm just practicing trying to pull out one gene from one organism. Then I can move on and do the same with the next organism.

          And yes, I used tblastn.

          Comment


          • #6
            Originally posted by sp24 View Post

            I just want to pull out my sequence and be able to build a phylogenetic tree for one gene, so right now I'm just practicing trying to pull out one gene from one organism. Then I can move on and do the same with the next organism.
            If you have the blast database made for your NGS data then look into using the "blastdbcmd" command from the blast manual: http://www.ncbi.nlm.nih.gov/books/NBK1763/ See the section on "Extracting data from BLAST databases with blastdbcmd".

            If you want to extract the fastq format sequences then there are other threads that have suggestions.

            Phylogenetic tree for one gene but from what exact sequences? Do you have individual sequence files for multiple organisms and you are looking to get the reads for a specific gene from each of these files, assemble them and then build a tree?

            Comment


            • #7
              Originally posted by sp24 View Post
              I thought those programs could only be used if there is a reference genome, but I could be wrong? What if I used the transcriptome assembled from these reads as the reference?

              So, if this is the output from blasting against the R1 database, I would want to pull out their partner from the output from blasting against the R2 database. First column is the query and second column is the hit.

              gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923 100.00 32 0 0 439 470 96 1 3e-16 77.8
              gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1304:4294:174785 100.00 32 0 0 437 468 96 1 5e-16 77.4
              gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:11629:162562 100.00 32 0 0 441 472 98 3 1e-15 75.9
              gi|313661399|ref|NP_001186313.1| D3NH4HQ1:107:C0LN7ACXX:1:1208:14635:22937 96.88 32 1 0 444 475 97 2 1e-14 73.6
              Any prior sequence you have is a reference sequence. With transcriptomes it is a bit tricky if you have multiple isoforms, as default parameters on many of these programs do not favor queries that align in multiple locations.

              In the above, is there any way to tell forward & reverse reads for the same fragment? Not being able to distinguish is not fatal, but potentially problematic.

              In any case, if all you want is the list of reads that are hit twice by a given query, in theory you can do this at the command line -- though if the file is big enough it may give trouble

              # take first two columns (query id, hit id)
              # report all pairs which appear two or more times
              cut -f1,2 blast.table.txt |sort |uniq -d > pairs.txt

              Comment


              • #8
                After blasting I decided to use the below command to pull out hit ID's.

                cat Gene_XXR1out | cut -f 2| sort -u > Gene_XXR1hit_ids

                So I would know that all the ID's in this file belong to the R1 database. It gives me the list, shows me the ID but not that it's R1.

                I tried that command, krobison. I should be clear about what I have done:
                1) blast gene against R1 database
                2) blast gene against R2 database
                3) pull out hits ids from each blast output using:
                cat Gene_XXR1.out | cut -f 2| sort -u > Gene_XXR1hit_ids
                cat Gene_XXR2.out | cut -f 2| sort -u > Gene_XXR2hit_ids


                OR the command you gave. In order to do that I first had to merge the two separate blast outputs from R1 and R2 together using a perl script.

                4) Not sure now. I've been advised to use the names of R1 to pull out reads from the original R2 fastq file and use names of R2 to pull out reads from the original R1 fastq file. I'm trying to use mirabait in Mira to do this right now. I think in order to do this it would be best if I use the cat command in step 3 in order to keep the R1 and R2 hit ID's separate.

                Comment


                • #9
                  Read 1: D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923/1
                  Read 2: D3NH4HQ1:107:C0LN7ACXX:1:1206:8858:106923/2
                  It appears that the trailing /1 and /2 which indicate Read 1 and Read 2 have been removed from the identifiers that you are using in your blast search.

                  You should use the scripts/ideas mentioned in the thread I had included in a previous post (#6) to get the reads using the ID's you have parsed out using the two cat commands (after merging the ID's and taking the unique entries as indicated by krobinson).
                  Last edited by GenoMax; 05-10-2013, 08:00 AM.

                  Comment


                  • #10
                    Thanks everyone for your help. GenoMax, I decided to use that python script because it was the only way I could keep the /1 and /2.

                    So I've figured out how to pull out the hits, but the python script is extremely slow. It's not horrible, but I have to do this for many, many genes across many transcriptomes, so it's a little painful.

                    Comment


                    • #11
                      Originally posted by sp24 View Post
                      Thanks everyone for your help. GenoMax, I decided to use that python script because it was the only way I could keep the /1 and /2.

                      So I've figured out how to pull out the hits, but the python script is extremely slow. It's not horrible, but I have to do this for many, many genes across many transcriptomes, so it's a little painful.
                      If you have access to a cluster then you could potentially start multiple jobs and work on the transcriptomes in parallel.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin




                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                        04-22-2024, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 04-25-2024, 11:49 AM
                      0 responses
                      15 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-24-2024, 08:47 AM
                      0 responses
                      16 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      62 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      60 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X