Seqanswers Leaderboard Ad

**GenoMax** · 10-06-2014, 03:39 AM

I assume you have BAM files from the alignments available. You can use "samtools" to generate a consensus sequence from them by using the following command.

Code:

$ samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fq

Once you have the fastq format consensus file you can use seqtk to convert the fastq to fasta format.

Then on to blast ..

**ChrisSinga** · 10-10-2014, 05:15 AM

Thank you very much.

I proceeded to submit to the queue with the following file:

Code:

#!/bin/sh

#Ask for special queue
#BSUB -q long

#Ask for processors
#BSUB -n 1

# Processors per host
#BSUB -R "span[ptile=1]"

# Wallclock time in HH:mm
#BSUB -W 480:00

#Job Name
#BSUB -J "3056C"

#Memory per MPI Task
#BSUB -R "rusage[mem=20000]"

#Output and Error files
#BSUB -o lsf%J.o
#BSUB -e lsf%J.e

#Other Option
#BSUB -B
#BSUB -N

cd /home01/ices/icescjho/scratch/genome
samtools mpileup -uf CP3056.fasta C106_CP003056_sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > finalconsensus3056.fq

The same I did for the other alignment with:

Code:

samtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam | bcftools view -cg - | vcfutils.pl vcf2fq > finalconsensus2472.fq

Finally the 2 jobs got accepted and after 1.5 days the jobs were finished.
The logfiles does not show any error or other messages.

When looking for the files I actually found nothing. Now I am a bit confused.

By the way the folder looks like this:

28G Apr 4 2014 C106_CP002472_rmdup.sam
6.4G Apr 4 2014 C106_CP002472_sorted.bam
29G Apr 4 2014 C106_CP003056.rmdup.sam
6.0G Apr 4 2014 C106_CP003056_sorted.bam
1.1K Oct 7 06:11 consensus2472.sh
1.1K Oct 7 06:11 consensus3056.sh
3.0M Oct 7 02:51 CP2472.fasta
45 Oct 7 02:53 CP2472.fasta.fai
3.5M Oct 7 02:51 CP3056.fasta
45 Oct 9 05:56 CP3056.fasta.fai
998 May 18 14:10 extract.sh
5.0M Apr 4 2014 genome2472.1.bt2
751K Apr 4 2014 genome2472.2.bt2
26 Apr 4 2014 genome2472.3.bt2
751K Apr 4 2014 genome2472.4.bt2
3.0M Apr 4 2014 genome_2472.fa
5.0M Apr 4 2014 genome2472.rev.1.bt2
751K Apr 4 2014 genome2472.rev.2.bt2
21G Apr 4 2014 PR2-3-1_CP002472_rmdup.sam
5.0G Apr 4 2014 PR2-3-1_CP002472_sorted.bam
23G Apr 4 2014 PR2-3-1_CP003056_rmdup.sam
4.7G Apr 4 2014 PR2-3-1_CP003056_sorted.bam
12G Jun 10 2013 C106.fasta

The pipeline uses as input:
samtools mpileup -uf ref.fa aln.bam | bcftools view -cg - | vcfutils.pl vcf2fq > consensus.fq

ref.fa --> the respective reference genomes as fasta files
aln.bam --> the respective bam file

What did I do wrong ?

Any hints ?

**GenoMax** · 10-10-2014, 05:29 AM

Are all the files in this directory: /home01/ices/icescjho/scratch/genome?

Though most administrators frown on this practice when a cluster users a job scheduler (and I am not advocating that one use it either

) but can you try to start this part of the job interactively (i.e. not under the job scheduler, if you have an interactive job queue available then use that) to make sure that everything is in place (i.e. all the executables are found) and watch it for a few minutes to see if any errors are printed to the screen.

Code:

$ amtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam

You can also open a separate shell on the side to ensure that the job is running (use top or equivalent method to monitor the node). Remember to kill the job after a few minutes so you don't overwhelm the head node.

What scheduler are you using?

Note: If some of the stuff above does not make sense then find a friend who is cluster savvy. They should know what I ask asking you to do.

**ChrisSinga** · 10-10-2014, 08:04 AM

Yes, all files are in there (but not the initial C106.fasta sequencing file).

Yes, to run heavy jobs on the head node is also here a no go. In the FAQs its said that such jobs will be immediately terminated.
For trying if everything is in place, I normally run the command line on the head node for a few seconds or minutes to test if everything is in place.
Cause the queue can take some time before its my turn, I use (if still unsure) the express queue with minimal ressources to test if all executables are found.

Scheduler is IBM LSB (I hope this is what you mean).
Nothing else to my knowledge available.

Yes with the following I can see if the job ran and if it is finished.

Code:

$ bhist
$ bhist -a

With the following, I can see the current output of the job. (you meant this with interactive view ?)

Code:

bpeek

For those two jobs no output during the run. Those output is actually logged in a .o file and the error like an interruption is logged in the .e file.
But both were empty which indicates most of the time a successful run.

I am not that cluster savy but it was also not that difficult.
Other points are more difficult like what is allowed and what not. :-)

....and I do not have any friends at all here working on that.

The support helped me with basic linux issues which was hard to conclude from the FAQs
but when I asked if I can split my 40,000 Sequences in chunks of 100 or less and submit them in parallel to the cluster for blasting,
I did not get an answer. So I submitted it in chunks of 500 to minimize a possible queue blockage. It went well.

-------

So I proceed as following:
1. The pipeline commands on headnode without any parameters (indicates if the pathes are correct)

Responses (short):

Program: bcftools (Tools for data in the VCF/BCF formats)
Version: 0.1.19-44428cd

Program: samtools (Tools for alignments in the SAM format)
Version: 0.1.19-44428cd

Usage: vcfutils.pl <command> [<arguments>]

So everything seems to be in place.
The same I repeated by using the express queue.
The job is not yet running....still waiting

Then I used:
$ samtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam > test.log

Got direct output:

[mpileup] 1 samples in 1 input files
<mpileup> Set max per-file depth to 8000

and the test.log contains: Screenshot

Eventhough performance will go down and time surely go up a lot, maybe I can split the pipeline in 3 parts ?
20 days left until I need to submit a extension application for the cluster and these two are the last 2 jobs I have to run.

**GenoMax** · 10-10-2014, 09:34 AM

You can split the pipeline in 3 parts but since the output of step 1 is required for step 2 you can't run the jobs simultaneously. It may be better to run them separately to ensure that you have an output at each step.

It looks like you have not indexed your reference fasta files. Can you do that? Just run this on the head node directly, since this should be a quick process.

Code:

$ samtools faidx CP2472.fasta

Then try

Code:

$ samtools mpileup -uf CP2472.fasta C106_CP002472_sorted.bam > outfile.bcf

**GenoMax** · 10-10-2014, 09:40 AM

Originally posted by ChrisSinga View Post

20 days left until I need to submit a extension application for the cluster and these two are the last 2 jobs I have to run.

You could also try to run these jobs on a desktop linux/OS X machine. Your files are not very large and you would not have to wait for your turn on the cluster.

**ChrisSinga** · 10-13-2014, 04:14 PM

Hi,

so with the genome sequencing file I got, I tried a bit and the following result I got:

Code:

samtools mpileup -uf CP002472.fasta C106_CP002472_sorted.bam > 2472_outfile.bcf

Result : 460 MB

Code:

bcftools view -cg 2472_outfile.bcf > bcftoolsout2472

Result: 334 MB

Code:

vcfutils.pl vcf2fq bcftoolsout_2472 > consensus_2472.fq

Result: 6 MB

Code:

seqtk seq -a consensus_2472.fq > consensus_2472.fasta

Result: 4 MB

The last one has a fasta format consisting of 2 rows...description and a looong sequence.
I thought initially more like getting a lot of sequences :-D

Mistakes in one of the commands?

**GenoMax** · 10-14-2014, 02:53 AM

Isn't that the consensus sequence you were looking for? How many bases are there in the final file?

**ChrisSinga** · 10-14-2014, 05:16 AM

It has roughly 3.6 million bases in there.

I think I got it now.

The initial reference genomes were the same...1 sequence.
By assembling it, I did not get a multifasta with all assembled sequences but also one file and one sequence. Areas which nothing was assembled to are just 'nnnnnn..'.

So I would need to do just one step more and just split the sequence into smaller chunks by 'nnnnn.....'.
After that I can try to predict ORFs and make a final fasta file like Seq1_ORF1.....etc....and those via blastx subsequently annotated.

As the two alignments gave a different degree of alignment, do you think it makes sense to try a de novo alignment with velvet?

I think for my purpose I will split now the sequence by 'nnnn..' and convert it into a blast database. That can serve me quite well as a point to know if known sequences are present.

Anyway thanks a lot for your patience and your help

Sorry for the lack of knowledge and sequencing/bioinformatic terms

.. will do my best to become better.

Sad that I cannot transfer a beer to your email

**GenoMax** · 10-14-2014, 05:25 AM

That looks to be about the right size for a microbial genome then. If you are curious (and have the time) you could certainly try to do a de novo assembly with the original reads. I would suggest using SPAdes in addition to velvet.

In the mean time you can move forward with annotating the sequence you have. See this thread for programs that can help with that: http://seqanswers.com/forums/showthread.php?t=47297

Will accept a rain check on that beer

**ChrisSinga** · 10-14-2014, 05:35 AM

Curious...yes...time..no

So much things to learn but so little time.
But most certainly will have a look when I am unemployed soon :-P
Thanks for the link....thats what I am looking for.

OK then, gimme your paypal email via private message and you get your sixpack of beer.

Topics	Statistics	Last Post
Cancer Metastasis: A Deep Dive into Cellular Plasticity by seqadmin Started by seqadmin, 04-11-2024, 12:08 PM	0 responses 25 views 0 likes	Last Post by seqadmin 04-11-2024, 12:08 PM
Proteogenomic Profiles Offer New Clues in Prostate Cancer by seqadmin Started by seqadmin, 04-10-2024, 10:19 PM	0 responses 28 views 0 likes	Last Post by seqadmin 04-10-2024, 10:19 PM
Novel Diagnostic Assay Enhances Ovarian Cancer Detection by seqadmin Started by seqadmin, 04-10-2024, 09:21 AM	0 responses 24 views 0 likes	Last Post by seqadmin 04-10-2024, 09:21 AM
Evolutionary Dynamics of Centromeres: A Comparative Genomic Analysis by seqadmin Started by seqadmin, 04-04-2024, 09:00 AM	0 responses 52 views 0 likes	Last Post by seqadmin 04-04-2024, 09:00 AM

Seqanswers Leaderboard Ad

Announcement

How to proceed, BWA and now?

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News