Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to proceed, BWA and now?

    Hello,

    I just found this forum and hope I can find some clues to my issue.
    Being a biologist, but not a computer scientist nor working longer than a year on NGS , I am able to use Linux. Also I have access to a nice cluster, so that computing power is not an issue.

    From a former coworker I got an external HDD with data from a HiSeq sequencing project (genome) from a Bacillus species, which was subsequently aligned to two different reference genomes using BWA after quality assessment and subsequent removal of duplicates was sucessful.

    Now I intend to get those sucessfully aligned short reads out as longer reads in form of a fasta file to subsequently assign a function using blast.

    Everything until the BWA alignment is clear (was done already) and everything after obtaining the aligned long reads is also clear.

    But with which tool can I fill the gap?

    Thanks a lot for every suggestion and sorry cause I guess this is quite an easy question.

  • #2
    I assume you have BAM files from the alignments available. You can use "samtools" to generate a consensus sequence from them by using the following command.

    Code:
    $ samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fq
    Once you have the fastq format consensus file you can use seqtk to convert the fastq to fasta format.

    Then on to blast ..

    Comment


    • #3
      Thank you very much.

      I proceeded to submit to the queue with the following file:

      Code:
      #!/bin/sh
      
      #Ask for special queue
      #BSUB -q long
      
      #Ask for processors
      #BSUB -n 1
      
      # Processors per host
      #BSUB -R "span[ptile=1]"
      
      # Wallclock time in HH:mm
      #BSUB -W 480:00
      
      #Job Name
      #BSUB -J "3056C"
      
      #Memory per MPI Task
      #BSUB -R "rusage[mem=20000]"
      
      #Output and Error files
      #BSUB -o lsf%J.o
      #BSUB -e lsf%J.e
      
      #Other Option
      #BSUB -B
      #BSUB -N
      
      cd /home01/ices/icescjho/scratch/genome
      samtools mpileup -uf CP3056.fasta C106_CP003056_sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > finalconsensus3056.fq
      The same I did for the other alignment with:

      Code:
      samtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > finalconsensus2472.fq
      Finally the 2 jobs got accepted and after 1.5 days the jobs were finished.
      The logfiles does not show any error or other messages.

      When looking for the files I actually found nothing. Now I am a bit confused.

      By the way the folder looks like this:

      28G Apr 4 2014 C106_CP002472_rmdup.sam
      6.4G Apr 4 2014 C106_CP002472_sorted.bam
      29G Apr 4 2014 C106_CP003056.rmdup.sam
      6.0G Apr 4 2014 C106_CP003056_sorted.bam
      1.1K Oct 7 06:11 consensus2472.sh
      1.1K Oct 7 06:11 consensus3056.sh
      3.0M Oct 7 02:51 CP2472.fasta
      45 Oct 7 02:53 CP2472.fasta.fai
      3.5M Oct 7 02:51 CP3056.fasta
      45 Oct 9 05:56 CP3056.fasta.fai
      998 May 18 14:10 extract.sh
      5.0M Apr 4 2014 genome2472.1.bt2
      751K Apr 4 2014 genome2472.2.bt2
      26 Apr 4 2014 genome2472.3.bt2
      751K Apr 4 2014 genome2472.4.bt2
      3.0M Apr 4 2014 genome_2472.fa
      5.0M Apr 4 2014 genome2472.rev.1.bt2
      751K Apr 4 2014 genome2472.rev.2.bt2
      21G Apr 4 2014 PR2-3-1_CP002472_rmdup.sam
      5.0G Apr 4 2014 PR2-3-1_CP002472_sorted.bam
      23G Apr 4 2014 PR2-3-1_CP003056_rmdup.sam
      4.7G Apr 4 2014 PR2-3-1_CP003056_sorted.bam
      12G Jun 10 2013 C106.fasta
      The pipeline uses as input:
      samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fq

      ref.fa --> the respective reference genomes as fasta files
      aln.bam --> the respective bam file


      What did I do wrong ?
      Any hints ?

      Comment


      • #4
        Are all the files in this directory: /home01/ices/icescjho/scratch/genome?

        Though most administrators frown on this practice when a cluster users a job scheduler (and I am not advocating that one use it either ) but can you try to start this part of the job interactively (i.e. not under the job scheduler, if you have an interactive job queue available then use that) to make sure that everything is in place (i.e. all the executables are found) and watch it for a few minutes to see if any errors are printed to the screen.
        Code:
        $ amtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam
        You can also open a separate shell on the side to ensure that the job is running (use top or equivalent method to monitor the node). Remember to kill the job after a few minutes so you don't overwhelm the head node.

        What scheduler are you using?

        Note: If some of the stuff above does not make sense then find a friend who is cluster savvy. They should know what I ask asking you to do.

        Comment


        • #5
          Yes, all files are in there (but not the initial C106.fasta sequencing file).

          Yes, to run heavy jobs on the head node is also here a no go. In the FAQs its said that such jobs will be immediately terminated.
          For trying if everything is in place, I normally run the command line on the head node for a few seconds or minutes to test if everything is in place.
          Cause the queue can take some time before its my turn, I use (if still unsure) the express queue with minimal ressources to test if all executables are found.

          Scheduler is IBM LSB (I hope this is what you mean).
          Nothing else to my knowledge available.

          Yes with the following I can see if the job ran and if it is finished.
          Code:
          $ bhist
          $ bhist -a

          With the following, I can see the current output of the job. (you meant this with interactive view ?)
          Code:
          bpeek
          For those two jobs no output during the run. Those output is actually logged in a .o file and the error like an interruption is logged in the .e file.
          But both were empty which indicates most of the time a successful run.

          I am not that cluster savy but it was also not that difficult.
          Other points are more difficult like what is allowed and what not. :-)

          ....and I do not have any friends at all here working on that.
          The support helped me with basic linux issues which was hard to conclude from the FAQs
          but when I asked if I can split my 40,000 Sequences in chunks of 100 or less and submit them in parallel to the cluster for blasting,
          I did not get an answer. So I submitted it in chunks of 500 to minimize a possible queue blockage. It went well.

          -------

          So I proceed as following:
          1. The pipeline commands on headnode without any parameters (indicates if the pathes are correct)

          Responses (short):
          Program: bcftools (Tools for data in the VCF/BCF formats)
          Version: 0.1.19-44428cd
          Program: samtools (Tools for alignments in the SAM format)
          Version: 0.1.19-44428cd
          Usage: vcfutils.pl <command> [<arguments>]
          So everything seems to be in place.
          The same I repeated by using the express queue.
          The job is not yet running....still waiting

          Then I used:
          $ samtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam > test.log

          Got direct output:
          [mpileup] 1 samples in 1 input files
          <mpileup> Set max per-file depth to 8000
          and the test.log contains: Screenshot


          Eventhough performance will go down and time surely go up a lot, maybe I can split the pipeline in 3 parts ?
          20 days left until I need to submit a extension application for the cluster and these two are the last 2 jobs I have to run.

          Comment


          • #6
            You can split the pipeline in 3 parts but since the output of step 1 is required for step 2 you can't run the jobs simultaneously. It may be better to run them separately to ensure that you have an output at each step.

            It looks like you have not indexed your reference fasta files. Can you do that? Just run this on the head node directly, since this should be a quick process.

            Code:
            $ samtools faidx CP2472.fasta
            Then try

            Code:
            $ samtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam > outfile.bcf
            Last edited by GenoMax; 10-10-2014, 09:50 AM.

            Comment


            • #7
              Originally posted by ChrisSinga View Post
              20 days left until I need to submit a extension application for the cluster and these two are the last 2 jobs I have to run.
              You could also try to run these jobs on a desktop linux/OS X machine. Your files are not very large and you would not have to wait for your turn on the cluster.

              Comment


              • #8
                Hi,

                so with the genome sequencing file I got, I tried a bit and the following result I got:

                Code:
                samtools mpileup -uf CP002472.fasta C106_CP002472_sorted.bam > 2472_outfile.bcf
                Result : 460 MB

                Code:
                bcftools view -cg 2472_outfile.bcf > bcftoolsout2472
                Result: 334 MB

                Code:
                vcfutils.pl vcf2fq bcftoolsout_2472 > consensus_2472.fq
                Result: 6 MB

                Code:
                seqtk seq -a consensus_2472.fq > consensus_2472.fasta
                Result: 4 MB

                The last one has a fasta format consisting of 2 rows...description and a looong sequence.
                I thought initially more like getting a lot of sequences :-D

                Mistakes in one of the commands?

                Comment


                • #9
                  Isn't that the consensus sequence you were looking for? How many bases are there in the final file?

                  Comment


                  • #10
                    It has roughly 3.6 million bases in there.

                    I think I got it now.

                    The initial reference genomes were the same...1 sequence.
                    By assembling it, I did not get a multifasta with all assembled sequences but also one file and one sequence. Areas which nothing was assembled to are just 'nnnnnn..'.

                    So I would need to do just one step more and just split the sequence into smaller chunks by 'nnnnn.....'.
                    After that I can try to predict ORFs and make a final fasta file like Seq1_ORF1.....etc....and those via blastx subsequently annotated.

                    As the two alignments gave a different degree of alignment, do you think it makes sense to try a de novo alignment with velvet?


                    I think for my purpose I will split now the sequence by 'nnnn..' and convert it into a blast database. That can serve me quite well as a point to know if known sequences are present.


                    Anyway thanks a lot for your patience and your help
                    Sorry for the lack of knowledge and sequencing/bioinformatic terms .. will do my best to become better.

                    Sad that I cannot transfer a beer to your email

                    Comment


                    • #11
                      That looks to be about the right size for a microbial genome then. If you are curious (and have the time) you could certainly try to do a de novo assembly with the original reads. I would suggest using SPAdes in addition to velvet.

                      In the mean time you can move forward with annotating the sequence you have. See this thread for programs that can help with that: http://seqanswers.com/forums/showthread.php?t=47297

                      Will accept a rain check on that beer

                      Comment


                      • #12
                        Curious...yes...time..no
                        So much things to learn but so little time.
                        But most certainly will have a look when I am unemployed soon :-P
                        Thanks for the link....thats what I am looking for.

                        OK then, gimme your paypal email via private message and you get your sixpack of beer.

                        Comment

                        Latest Articles

                        Collapse

                        • seqadmin
                          Current Approaches to Protein Sequencing
                          by seqadmin


                          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                          04-04-2024, 04:25 PM
                        • seqadmin
                          Strategies for Sequencing Challenging Samples
                          by seqadmin


                          Despite advancements in sequencing platforms and related sample preparation technologies, certain sample types continue to present significant challenges that can compromise sequencing results. Pedro Echave, Senior Manager of the Global Business Segment at Revvity, explained that the success of a sequencing experiment ultimately depends on the amount and integrity of the nucleic acid template (RNA or DNA) obtained from a sample. “The better the quality of the nucleic acid isolated...
                          03-22-2024, 06:39 AM

                        ad_right_rmr

                        Collapse

                        News

                        Collapse

                        Topics Statistics Last Post
                        Started by seqadmin, 04-11-2024, 12:08 PM
                        0 responses
                        25 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 10:19 PM
                        0 responses
                        28 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-10-2024, 09:21 AM
                        0 responses
                        24 views
                        0 likes
                        Last Post seqadmin  
                        Started by seqadmin, 04-04-2024, 09:00 AM
                        0 responses
                        52 views
                        0 likes
                        Last Post seqadmin  
                        Working...
                        X