
  • merging data from velvet emboss and blastp analyses

    Hi, I hope someone could help me sort this out.

    I am working with a marine viral metagenome and I am trying to extract some information from the dataset.

    So I have three files:
    1) contigs.fa: the contigs created with Velvet (176,502 contigs in total), with headers such as >NODE_77_length_69_cov_4.985507

    2) contigs.orf.fa: the ORFs created from those contigs with EMBOSS, so the node above now has these IDs:
    >NODE_77_length_69_cov_4.985507_1 [2 - 97]
    >NODE_77_length_69_cov_4.985507_2 [3 - 98]
    >NODE_77_length_69_cov_4.985507_3 [1 - 99]
    >NODE_77_length_69_cov_4.985507_4 [98 - 3] (REVERSE SENSE)
    >NODE_77_length_69_cov_4.985507_5 [97 - 2] (REVERSE SENSE)
    >NODE_77_length_69_cov_4.985507_6 [99 - 1] (REVERSE SENSE)

    3) contigs.orf.fa.blastp: the blastp output.

    What I want to do is create a spreadsheet with the node ID extracted from contigs.fa in the first column, the ORF ID in the next column (so every possible ORF for every single node), and finally the corresponding blastp result for the ORFs that got a hit.

    Is there a script or a way to do this?

    F.
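
A minimal Python sketch of the ID relationship described in this post, assuming (as in the headers quoted above) that every ORF ID is simply the contig ID with a trailing "_<number>" appended. The function name split_orf_id is made up for illustration.

```python
def split_orf_id(header):
    """Return (contig_id, orf_id) for one ORF header line.

    Assumption: the ORF ID is the contig ID plus a trailing "_<number>",
    and anything after the first whitespace (coordinates, strand) is ignored.
    """
    orf_id = header.lstrip(">").split()[0]
    contig_id, sep, orf_number = orf_id.rpartition("_")
    if not sep or not orf_number.isdigit():
        raise ValueError(f"unexpected ORF header: {header!r}")
    return contig_id, orf_id


print(split_orf_id(">NODE_77_length_69_cov_4.985507_4 [98 - 3] (REVERSE SENSE)"))
# ('NODE_77_length_69_cov_4.985507', 'NODE_77_length_69_cov_4.985507_4')
```

Because the contig ID can be recovered from the ORF ID alone, the first two spreadsheet columns only need contigs.orf.fa; contigs.fa is needed only if contigs without any ORF should be listed too.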

  • #2
    It is very unlikely that there exists a script to do what you want!

    You will need a bespoke solution. It could probably be done using a series of Unix tools, but ultimately a custom Perl/Python script would be quickest.

    The fact is, a lot of basic bioinformatics is massaging data into the form you want.


    • #3
      man grep
      man cut
      man sort
      man paste

      should take you quite far..
      savetherhino.org


      • #4
        thank you I'll try... got tons of things to learn


        • #5
          Originally posted by flacchy:
          thank you I'll try... got tons of things to learn
          I've found cut to be an especially helpful command.

          For example, cut -f 1,2,4 file.txt > output.txt writes columns 1, 2 and 4 of file.txt to output.txt, assuming tab-separated fields. With your FASTA files you'll first need to extract only the header lines, e.g. grep '^>' file.fasta > headers.txt (header lines are the only ones that start with '>'), and so on.
          savetherhino.org
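
For anyone who would rather do the header extraction in a script, here is a rough Python equivalent of the grep step, followed by keeping only the first word of each header. It is only a sketch; the function name fasta_ids is made up, and it assumes the ID is everything between '>' and the first whitespace.

```python
def fasta_ids(lines):
    """Yield the ID of every FASTA header line in an iterable of lines.

    Sequence lines are skipped; the ID is taken to be the first
    whitespace-delimited word after the leading '>'.
    """
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                yield fields[0]


example = [
    ">NODE_77_length_69_cov_4.985507_1 [2 - 97]\n",
    "MKT\n",  # placeholder sequence line, not real data
    ">NODE_77_length_69_cov_4.985507_4 [98 - 3] (REVERSE SENSE)\n",
]
for seq_id in fasta_ids(example):
    print(seq_id)
```

An open file works the same way as the list above, e.g. fasta_ids(open("contigs.orf.fa")), since Python file objects iterate line by line.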


          • #6
            So I know I can extract the IDs from contigs.fa and contigs.orf.fa.
            The problem is: how can I extract the information I want from the BLAST output?
            I have a huge file like this:

            BLASTP 2.2.25 [Feb-01-2011]


            Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
            Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
            "Gapped BLAST and PSI-BLAST: a new generation of protein database search
            programs", Nucleic Acids Res. 25:3389-3402.
            Reference for compositional score matrix adjustment: Altschul, Stephen F.,
            John C. Wootton, E. Michael Gertz, Richa Agarwala, Aleksandr Morgulis,
            Alejandro A. Schaffer, and Yi-Kuo Yu (2005) "Protein database searches
            using compositionally adjusted substitution matrices", FEBS J. 272:5101-5109.

            Query= NODE_26_length_67_cov_1.000000_6 [97 - 2] (REVERSE SENSE)
            (32 letters)

            Database: MicroB3_Viral_proteins
            120,896 sequences; 31,818,017 total letters

            Searching..................................................done



                                                                             Score    E
            Sequences producing significant alignments:                      (bits) Value

            ref|YP_214377.1|genome recombination endonuclease subunit [Proch...    60   5e-13
            ref|YP_004322284.1|genome gp47 gene product [Synechococcus phage...    47   2e-08


            How do I extract the information I want? I tried sed like this:

            $ sed -n '/node*/,/searching/p' contigs.orf.fa.blastp > output.csv
            or
            $ sed -n '/node*/,/blastp/p' contigs.orf.fa.blastp > output.csv

            (because I want to extract everything between the node name and the point where the search completes)

            but it outputs everything, and I can't figure out why.
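
A side note on the sed attempt: sed patterns are case-sensitive by default, so /searching/ cannot match the "Searching...done" line, and a range whose end pattern never matches runs to the end of the file. That may be why everything was printed. If rerunning the search is not an option, the text report can also be parsed directly. Below is a minimal Python sketch, assuming the report always looks like the excerpt above (a "Query=" line, then a "Sequences producing significant alignments" table that ends at the next blank line). The function name parse_blast_report is made up, and scores and E-values are kept as strings rather than converted.

```python
def parse_blast_report(lines):
    """Yield (query_id, hit_description, bits, evalue) from a BLAST text report.

    Assumes the layout of the excerpt above: one "Query=" line per query,
    followed by a one-line-per-hit summary table that ends at a blank line.
    """
    query = None
    in_table = False
    seen_hit = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("Query="):
            words = line[len("Query="):].split()
            query = words[0] if words else None
            in_table = False
        elif line.startswith("Sequences producing significant alignments"):
            in_table = True
            seen_hit = False
        elif in_table:
            if not line:
                if seen_hit:
                    in_table = False  # blank line after the last hit
                continue
            if line.startswith(">"):
                in_table = False  # start of the alignment section
                continue
            parts = line.rsplit(None, 2)  # description, bit score, E-value
            if len(parts) == 3:
                seen_hit = True
                yield query, parts[0], parts[1], parts[2]


report = """Query= NODE_26_length_67_cov_1.000000_6 [97 - 2] (REVERSE SENSE)
(32 letters)

Sequences producing significant alignments: (bits) Value

ref|YP_214377.1|genome recombination endonuclease subunit [Proch... 60 5e-13
ref|YP_004322284.1|genome gp47 gene product [Synechococcus phage... 47 2e-08
""".splitlines()

for hit in parse_blast_report(report):
    print("\t".join(hit))
```

Queries without hits simply produce no rows here, because the summary table header never appears for them.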


            • #7
              Which output format is that? Maybe you can change it to tabular with blast_formatter (in your blast bin)?
              savetherhino.org


              • #8
                I ran the BLAST search this way:
                $ blastall -p blastp -d MicroB3_Viral_proteins.faa -i contigs.orf.fa -o contigs.orf.fa.blastp

                Is that wrong? Should I try something different, or can I just convert the file?


                • #9
                  Originally posted by flacchy:
                  I ran the BLAST search this way:
                  $ blastall -p blastp -d MicroB3_Viral_proteins.faa -i contigs.orf.fa -o contigs.orf.fa.blastp

                  Is that wrong? Should I try something different, or can I just convert the file?
                  I always use tabular output myself. If your output is in BLAST archive format, you can convert it with blast_formatter. If it's in some other format, you'll either have to rerun your BLAST search or learn how to parse the output you have.
                  savetherhino.org


                  • #10
                    How can I run BLAST with tabular output? Can you tell me?


                    • #11
                      Originally posted by flacchy:
                      How can I run BLAST with tabular output? Can you tell me?
                      In BLAST 2.2.28+ the flags are -outfmt 6 -out output.tsv

                      Read the manual. I think in your BLAST version it's -m 8 instead of -outfmt 6, but I might be remembering wrong.
                      savetherhino.org


                      • #12
                        ok thank you so much


                        • #13
                          I checked: it is -m 8 for my blastall version... it should be easier to extract the results now ^_^
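
With tabular output (-m 8, or -outfmt 6 in BLAST+), each hit is one tab-separated line whose first two columns are the query ID and the subject ID, so the three-column table asked for at the top of the thread becomes a simple join. Below is a minimal Python sketch of that join. The function names and the output column names are made up for illustration, and it again assumes that the ORF ID is the contig ID plus a trailing "_<number>".

```python
import csv
from collections import defaultdict


def merge(orf_lines, blast_rows):
    """Yield (contig_id, orf_id, subject_id) rows.

    orf_lines:  lines of the ORF FASTA file (only '>' headers are used)
    blast_rows: tabular BLAST rows already split into columns;
                column 1 is the query (ORF) ID, column 2 the subject ID
    ORFs without any hit get one row with an empty subject_id.
    """
    hits = defaultdict(list)
    for row in blast_rows:
        if len(row) >= 2:
            hits[row[0]].append(row[1])
    for line in orf_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        if not fields:
            continue
        orf_id = fields[0]
        # Assumption: ORF ID = contig ID + "_<number>"
        contig_id = orf_id.rpartition("_")[0]
        for subject_id in hits.get(orf_id) or [""]:
            yield contig_id, orf_id, subject_id


def write_table(orf_fasta, blast_tsv, out_tsv):
    """Read the two input files and write the merged table as TSV."""
    with open(orf_fasta) as orfs, \
            open(blast_tsv, newline="") as blast, \
            open(out_tsv, "w", newline="") as out:
        reader = csv.reader(blast, delimiter="\t", quoting=csv.QUOTE_NONE)
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["contig_id", "orf_id", "blast_hit"])
        writer.writerows(merge(orfs, reader))
```

It could be called as write_table("contigs.orf.fa", "contigs.orf.fa.blastp", "merged.tsv"), where the second file is the rerun -m 8 output and merged.tsv is a hypothetical output name. The result opens directly in a spreadsheet program.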


                          • #14
                            BLAST Manual

                            BLAST Command Line Applications User Manual
