building scaffolds using a contig and mate pair

catfisher replied

07-12-2010, 05:15 PM
grommit script failed

Marten, thanks for your quick reply. I edited my configuration file as you suggested and ran goBambus again, but it still failed.
The .conf I used is:
# Priorities
priority ALL 1
# The following lines can be un-commented to specify certain
# per-library settings

# Redundancies
# redundancy lib_some 1

# allowed error
# error MUMmer 0.5

# overlaps allowed
# overlaps MUMmer Y

# Global redundancy
redundancy 2

# min group size
mingroupsize 0

The goBambus log output is:
Parsing links out of input file
Step 100: running detective
Combining XML files
Step 200: making the xmls
starting
Done
Step 300: Preparing contig links
starting
Done
Step 400: Running scaffolder
Grommit(/home/aubsxl/bin/bambus/bin/grommit -i ctg2660_BES_mapping_704.inp -o ctg2660_BES_mapping_704.out.xml -C ctg2660_BES_mapping_704.grommit.conf --append --logfile goBambus.log --debug 1) script failed

The error reported in the goBambus.error file is:
20100712|123807| 10451| Grommit(/home/aubsxl/bin/bambus/bin/grommit -i ctg2660_BES_mapping_704.inp -o ctg2660_BES_mapping_704.out.xml -C ctg2660_BES_mapping_704.grommit.conf --append --logfile goBambus.log --debug 1) script failed

The first several lines of my mates file are:
library libname 200 500
HWUSI-EAS1665_0002:2:1:1022:18088#0/1 HWUSI-EAS1665_0002:2:1:1022:18088#0/2 libname
HWUSI-EAS1665_0002:2:1:1029:11872#0/1 HWUSI-EAS1665_0002:2:1:1029:11872#0/2 libname
HWUSI-EAS1665_0002:2:1:1029:11034#0/1 HWUSI-EAS1665_0002:2:1:1029:11034#0/2 libname
HWUSI-EAS1665_0002:2:1:1030:19457#0/1 HWUSI-EAS1665_0002:2:1:1030:19457#0/2 libname
HWUSI-EAS1665_0002:2:1:1031:12133#0/1 HWUSI-EAS1665_0002:2:1:1031:12133#0/2 libname

Marten, could you look at this information and point out what's wrong? I have no idea. Thanks a lot,

Kevin
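
A minimal sketch for sanity-checking a mates file before handing it to goBambus. It assumes the layout shown in the excerpt above: a "library <name> <min> <max>" line followed by whitespace-separated "read1 read2 library" lines. The function names are made up for illustration and are not part of Bambus.

```python
def parse_mates(text):
    """Split mates-file text into {library: (min, max)} and a list of pairs."""
    libraries, pairs = {}, []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "library" and len(fields) == 4:
            libraries[fields[1]] = (int(fields[2]), int(fields[3]))
        elif len(fields) == 3:
            pairs.append((fields[0], fields[1], fields[2]))
        else:
            # e.g. a sequence that ended up where a read name should be
            raise ValueError("unrecognized mates line: " + line)
    return libraries, pairs


def undeclared_pairs(libraries, pairs):
    """Return the pairs whose library has no 'library' line."""
    return [pair for pair in pairs if pair[2] not in libraries]
```

A check like this only confirms that the file is well formed. It does not tell you whether the read names match those in the .contig file.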
boetsie replied

07-12-2010, 01:26 AM
Hi catfisher,

I've had this error too. To solve it, set a priority in the .conf file. A file named default.conf is generated once you have run Bambus; it contains the default parameters. Edit that file so that it contains the line

priority ALL 1
If you have not run Bambus yet, create the .conf file from scratch; see the links below for more information. Once you have the .conf file, pass it on the command line, for example:
goBambus -c test.contig -m test.mates -C default.conf -o test-bambus
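
A minimal sketch for checking that a .conf file actually sets a priority before running goBambus. It assumes the "priority <library or ALL> <number>" syntax shown above and that "#" starts a comment, as in the default.conf quoted in this thread. The function name has_priority is made up for illustration.

```python
def has_priority(conf_text):
    """Return True if an uncommented line assigns a library priority."""
    for line in conf_text.splitlines():
        # Drop anything after '#', then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) == 3 and fields[0] == "priority":
            return True
    return False


# Example with inline text; read your own .conf with open(...).read().
sample = "# Priorities\npriority ALL 1\nredundancy 2\n"
print(has_priority(sample))
```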

For more information about the .conf file, see:

http://sourceforge.net/apps/mediawiki/amos/index.php?title=Bambus_Manual#The_configuration_file

For an example, see:

http://sourceforge.net/apps/mediawiki/amos/index.php?title=Bambus.conf

Marten
catfisher replied

07-11-2010, 04:55 PM
Bambus error: library priority

Boetsie and danix, I noticed that you do a lot of work with Bambus. I also have contigs generated with CLCbio. I know how to get the .contig file for Bambus, and I made a mates file following your instructions, but when I run goBambus, I get an error:
20100710|193857| 16658| Grommit(/home/aubsxl/bin/bambus/bin/grommit -i ctg2660_BES_mapping_704.inp -o ctg2660_BES_mapping_704.out.xml -C ctg2660_BES_mapping_704.grommit.conf --append --logfile goBambus.log --debug 1) script failed
20100710|204158|24277|grommit|FATAL|9: Priority not specified: at least one library must be assigned a priority

I don't know what the 'priority' is. How can I solve this problem? Could you give me any help? Thanks in advance.
danix replied

04-15-2010, 07:17 AM
Originally posted by boetsie

I have no idea... I've never used a .sff file. What does it look like? Why do you want to use it? Does it contain additional data?

If the mates present in the .contig file are all present in the two .fasta files, you can just use the two FASTA files to create the .mates file.

How do I create the .mates file? I tried the script you sent me and the output isn't right. Besides, I don't understand why FZ92HC101CZUHH.1 and FZ92HC102IDBLW.2 are on the same line. How can I tell that they are mates? I'm really lost and confused now...

FZ92HC101CZUHH.1 FZ92HC102IDBLW.2 libname
FZ92HC101DJEHD.1 FZ92HC102JYG94.2 libname
FZ92HC101DUWKQ.1 FZ92HC102HS1LU.2 libname
FZ92HC101CUUV5.1 FZ92HC102G8H4Z.2 libname
FZ92HC101EMKQX.1 FZ92HC102HOD38.2 libname
FZ92HC101CE653.1 FZ92HC102HO0J7.2 libname
FZ92HC101ECTBB.1 FZ92HC102IBNJJ.2 libname
FZ92HC101DXMSC.1 TGATCCGGCGCAGGCGTATCTGGGCTCGGATCGTGCCTGGTGCCGACGGCGATGAACGAC
libname
FZ92HC101C587C.1 FZ92HC102F3E16.2 libname
FZ92HC101BZ63S.1 CGGTCGGCCGCGGCCGATCTCGGGATTGCGCGGCGTGTGCAT
libname
FZ92HC101DEODE.1 CCGCGTGGACATGCCGTTCGAGGAACCGTGGACGCAACC
libname
FZ92HC101DP9HX.1 ATCGGCTATGCACAGGTCATCGAGTATCTCGACGGCG
libname
FZ92HC101EE90B.1 ACGTCCGACGTGATCAGGAGCGAGTCGGTGACGGCGCTTCGCACTCCGAGGG
libname
TTTGATGATCGACATCAAT GCGTTCGACTACCAGTTCGTCGGACCATCCGGGTAGCGTGTCGCAAGGGTCGGTTCCGAA
libname
CGTTCGCTGAGCACCGCCGAATCGAGCAGTTCGCGGATCTCGTCGAACGTCCNCGA FZ92HC102GE3MB.2 libname
CGTACGGATGTAGCTGGTGAAGAGGTCCCTTGCGGGCGGAGAAGTCGAGTCGTTCCGTCG TCGAGAGGCCGCGGAAGCGGCCGGAAAGGACGGCAACGATGTTTGACCGTTTCAACTCAG
libname
FZ92HC101DBOTK.1 FZ92HC102GVOHT.2 libname
FZ92HC101BEEQB.1 TCTGCGTGGAGACCGTGACGGCTGATCTACGGCCNCCTCGGCCGATGATCGCCGCCT
danix replied

04-15-2010, 07:10 AM
Hi, the 454 output is SFF (it looks like a binary file), but we use a script called sff_extract to convert this data into FASTA, XML and quality files. I was just reading that "The 454 paired-end protocol will generate reads which contain the forward and reverse direction in one read, separated by a linker."
So I think the key to generating the .mates file is the .sff file, but I don't know how.
I think it shouldn't be so complicated...
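
To illustrate the quoted sentence, here is a minimal sketch of what "forward and reverse in one read, separated by a linker" means in practice: find the linker and cut the read into two mates. The linker sequence must be supplied by the caller (the one in the example is made up), the ".1"/".2" suffixes are only an illustrative naming choice, and only exact linker matches are found. This is not how sff_extract itself works, and strand orientation is ignored.

```python
def split_at_linker(read_name, sequence, linker):
    """Return two (name, sequence) mates, or None if no usable linker is found."""
    start = sequence.find(linker)
    if start == -1:
        return None  # linker absent: treat as an unpaired read
    left = sequence[:start]
    right = sequence[start + len(linker):]
    if not left or not right:
        return None  # linker at the very start or end leaves only one half
    return (read_name + ".1", left), (read_name + ".2", right)


# Hypothetical linker, for demonstration only.
print(split_at_linker("readA", "AAAA" + "GGCC" + "TTTT", "GGCC"))
```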
boetsie replied

04-15-2010, 06:47 AM
Originally posted by danix

Hi, I forgot to mention that I also have the .sff files. If I can use them to create the .mates file, that would be great.
Can I? If so, how?

I have no idea... I've never used a .sff file. What does it look like? Why do you want to use it? Does it contain additional data?

If the mates present in the .contig file are all present in the two .fasta files, you can just use the two FASTA files to create the .mates file.
danix replied

04-15-2010, 06:38 AM
Hi, I forgot to mention that I also have the .sff files. If I can use them to create the .mates file, that would be great.
Can I? If so, how?
danix replied

04-15-2010, 05:38 AM
Hi boetsie, thanx again for your quick reply.
Here is a part of my .contig file. It was created by ace2contig (AMOS pack) and the input was the .ace that phrap generated after the assembly.
I'll try to use the script you attached.
Thank you so much again!

##Contig1 1 458 bases, 00000000 checksum.
agttcggcatggggtcaggtggttccactgcgctattgccgccaggcaaattcttcaatc
tgagaaagctgatgtaagtaattcgttcattcgctacaaggccagaaacacttcttgggt
gttgtatggttaagcctcacgggtaattagtatgggttagctcaacgtatcgctacgctt
acacaccccacctatcaacgttgtggtctccaacggccctttaggaccctcaaggggtca
gggatgactcatctcagggctcgcttcccgcttagatgctttcagcggttatcgattccg
aacttagctaccgggcagtgccactggcgtgacaacccgaacaccagaggttcgttcact
ccggtcctctcgtactaggagcaactcccttcaatcatccaacgcccacggcagataggg
accgaactgtctcacgacgttctgaacccagctcgcgt
#FZ92HC101BPK62(0) [] 458 bases, 00000000 checksum. {1 458} <1 459>
agttcggcatggggtcaggtggttccactgcgctattgccgccaggcaaattcttcaatc
tgagaaagctgatgtaagtaattcgttcattcgctacaaggccagaaacacttcttgggt
gttgtatggttaagcctcacgggtaattagtatgggttagctcaacgtatcgctacgctt
acacaccccacctatcaacgttgtggtctccaacggccctttaggaccctcaaggggtca
gggatgactcatctcagggctcgcttcccgcttagatgctttcagcggttatcgattccg
aacttagctaccgggcagtgccactggcgtgacaacccgaacaccagaggttcgttcact
ccggtcctctcgtactaggagcaactcccttcaatcatccaacgcccacggcagataggg
accgaactgtctcacgacgttctgaacccagctcgcgt
##Contig2 1 379 bases, 00000000 checksum.
ttctgagggaacacgcgttctgcgcgggttgtcttggtgctcactgttttccgccccgga
gtttgtggggtgttgggggtggtgggtgtgtgttgtttgagaagtgcatagtggatgcga
gcatctagcccggcgagttccttggtgttcttgttgggttgtgtgttctgcaatttcgat
tctggtttgtgcgatcgcgtgttgtgatcgttgatttttgtttgttgtccgcattcgcgt
ctcgggcactgtttggtgtgtggggtgtgtttgtgggtgttgttgtaagtgtttgagggc
gttcggtggatgccttggtaccaggagccgatgaaggacggccgtgcggtgggtcagtga
taaatcgacatgttaggtg
#FZ92HC101BFQDN(0) [] 379 bases, 00000000 checksum. {1 379} <1 380>
ttctgagggaacacgcgttctgcgcgggttgtcttggtgctcactgttttccgccccgga
gtttgtggggtgttgggggtggtgggtgtgtgttgtttgagaagtgcatagtggatgcga
gcatctagcccggcgagttccttggtgttcttgttgggttgtgtgttctgcaatttcgat
tctggtttgtgcgatcgcgtgttgtgatcgttgatttttgtttgttgtccgcattcgcgt
ctcgggcactgtttggtgtgtggggtgtgtttgtgggtgttgttgtaagtgtttgagggc
gttcggtggatgccttggtaccaggagccgatgaaggacggccgtgcggtgggtcagtga
taaatcgacatgttaggtg
boetsie replied

04-15-2010, 04:50 AM
Originally posted by danix

Complementing the information I gave before:
454Reads.01.MID4.fna is like this:
>FZ92HC101CZUHH length=41 xy=1111_1155 region=1 run=R_2009_08_04_12_33_02_
CGCGCGTTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
>FZ92HC101DJEHD length=46 xy=1334_0127 region=1 run=R_2009_08_04_12_33_02_
GTCTCGCGTCGTGTCTTCGCGTCGTATGCGGTACTGGTCAGGCGTT

454Reads.02.MID4.fna is like this:
>FZ92HC102IDBLW length=40 xy=3315_0370 region=2 run=R_2009_08_04_12_33_02_
CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
>FZ92HC102JYG94 length=40 xy=3966_0618 region=2 run=R_2009_08_04_12_33_02_
CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC

Can I extract any information from these FASTA files to create a .mates file?
Thanx

Hmm, I see: it's 454 data, which doesn't have a suffix like .x or /1. (Sorry, I have never worked with 454 data before.)

Can you show me what your .contig file looks like?

The read names in the mates file should match the first string after the "#" on the read lines of the .contig file. Each such line identifies a read that maps to the contig (the contig lines themselves start with "##").

So if the line with "#" starts with e.g. FZ92HC102IDBLW, followed by the offset in parentheses, like:

#FZ92HC102IDBLW(0)

you should extract the names from both FASTA files and put them in the same file.
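
The read names on the "#" lines can be pulled out of a .contig file with a short script and compared against the names you plan to put in the mates file. A minimal sketch, assuming read lines look like "#NAME(offset) ..." and contig lines start with "##", as described above; the function names are made up for illustration.

```python
import re

# One leading '#', then the read name, then '(' starting the offset.
# Contig lines begin with '##' and therefore do not match.
READ_LINE = re.compile(r"^#([^#(\s]+)\(")


def contig_read_names(contig_text):
    """Return the set of read names found on '#NAME(offset)' lines."""
    names = set()
    for line in contig_text.splitlines():
        match = READ_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_from_contig(mate_names, contig_names):
    """Return the mate names that do not occur in the .contig file, sorted."""
    return sorted(set(mate_names) - set(contig_names))
```

Names reported by missing_from_contig are reads that Bambus will not find in the .contig file.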

If this is indeed the case, you can use the script I attached.
Run it with:

perl testmates.pl file1 file2

It will generate a text file with the mates. The only thing left to do is to put the library sizes at the top of the file.

More info about the .contig file format is at http://www.cbcb.umd.edu/research/con...entation.shtml

Hope this helps.
Attached Files

testmates.pl (820 Bytes, 129 views)
Last edited by boetsie; 04-15-2010, 05:25 AM.
danix replied

04-15-2010, 03:53 AM
Originally posted by danix

Thanx boetsie for your quick answer.
But I can't use your script in this project because the 454 output files I have, 454Reads.01.MID4.fna and 454Reads.02.MID4.fna, contain sequences with different names, so every ID is unique and the script creates an empty mates.txt.
Besides, the other bacterium I'm working with has only one FASTA file from 454.

Both FASTA files look like this:
>F35ERS102DJ7GS rank=0000002 x=1343.0 y=826.0 length=56
ATCAGACACGGAGGCGTACGCGCCGCTGTTCCAGGTGATGCTGGCATTCCAGAACA
>F35ERS102DBYUE rank=0000006 x=1249.0 y=1428.0 length=69
ATCAGACACGCCGCCGGCACCTTCGCCGCTGCCGCGCTCGCCACCGGTGGCACCCGTCGT
GCTGTGGTC
>F35ERS102C47FN rank=0000036 x=1172.0 y=1361.0 length=68
ATCAGACACGAGGTGAAGACCGGTTTCCGTCGCGGCGGAGAATAGCCGAACATCAGCGCG
CGATCGGG

I'm wondering if there is a way to create the .mates file from the data I have. Any other ideas?

Thanx

Complementing the information I gave before:
454Reads.01.MID4.fna is like this:
>FZ92HC101CZUHH length=41 xy=1111_1155 region=1 run=R_2009_08_04_12_33_02_
CGCGCGTTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
>FZ92HC101DJEHD length=46 xy=1334_0127 region=1 run=R_2009_08_04_12_33_02_
GTCTCGCGTCGTGTCTTCGCGTCGTATGCGGTACTGGTCAGGCGTT

454Reads.02.MID4.fna is like this:
>FZ92HC102IDBLW length=40 xy=3315_0370 region=2 run=R_2009_08_04_12_33_02_
CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC
>FZ92HC102JYG94 length=40 xy=3966_0618 region=2 run=R_2009_08_04_12_33_02_
CGCGCGTTCTCGTACGGCTCGCTGTATCCGACNCGCGCGC

Can I extract any information from these FASTA files to create a .mates file?
Thanx
danix replied

04-15-2010, 02:53 AM
Building scaffolds using Bambus - .mates problem

Thanx boetsie for your quick answer.
But I can't use your script in this project because the 454 output files I have, 454Reads.01.MID4.fna and 454Reads.02.MID4.fna, contain sequences with different names, so every ID is unique and the script creates an empty mates.txt.
Besides, the other bacterium I'm working with has only one FASTA file from 454.

Both FASTA files look like this:
>F35ERS102DJ7GS rank=0000002 x=1343.0 y=826.0 length=56
ATCAGACACGGAGGCGTACGCGCCGCTGTTCCAGGTGATGCTGGCATTCCAGAACA
>F35ERS102DBYUE rank=0000006 x=1249.0 y=1428.0 length=69
ATCAGACACGCCGCCGGCACCTTCGCCGCTGCCGCGCTCGCCACCGGTGGCACCCGTCGT
GCTGTGGTC
>F35ERS102C47FN rank=0000036 x=1172.0 y=1361.0 length=68
ATCAGACACGAGGTGAAGACCGGTTTCCGTCGCGGCGGAGAATAGCCGAACATCAGCGCG
CGATCGGG

I'm wondering if there is a way to create the .mates file from the data I have. Any other ideas?

Thanx
boetsie replied

04-14-2010, 03:58 AM
Originally posted by danix

Hi, I'm trying to run Bambus but I don't have any .mates files. Does anyone know how I can create these files?
I have 454 output (FASTA + SFF) from a bacterial genome, which I assembled with phrap. I have already converted the .ace to .contig using ace2contig from AMOS.
Thanx!

I got this script from Sergey Koren of AMOS (and adapted it a bit):

cat my.fasta |grep ">" |sed s/\>//g |sed 's/\/1*$/./g;s/\/2*$/./g'|awk -F "." '{print $1}' |sort |uniq -c |awk '{if ($1 == 2) print $2"/1\t"$2"/2\tsmall"}' > mates.txt

Put the FASTA file containing the read names in place of 'my.fasta'.

The read names in 'my.fasta' must end with /1 and /2.
If your read names have other suffixes, like .x and .y, you should replace

sed 's/\/1*$/./g;s/\/2*$/./g'

with, for example:

sed 's/\.x$/./g;s/\.y$/./g'

in the code above.

If you have two FASTA files, just pass one of them and change
if ($1 == 2) to if ($1 == 1)
in the code; this way you only have to run it for one file.

This will print the names to 'mates.txt'. The only thing left to do is to add your library name and insert sizes at the top of this file.
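
For anyone who prefers a script to the one-liner, here is a minimal Python sketch of the same idea, which also writes the library line at the top. It assumes read names ending in /1 and /2 and a header of the form "library <name> <min> <max>". The library name "small" is taken from the one-liner above, while the sizes 200 and 500 are placeholders to replace with your own insert sizes. The function name build_mates is made up for illustration.

```python
from collections import defaultdict


def build_mates(fasta_text, library="small", min_size=200, max_size=500):
    """Return mates-file text for reads present as both NAME/1 and NAME/2."""
    seen = defaultdict(set)  # base name -> set of suffixes seen ('1', '2')
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue  # sequence lines are ignored
        fields = line[1:].split()
        if not fields:
            continue
        name = fields[0]  # drop any description after the read name
        if name.endswith("/1") or name.endswith("/2"):
            seen[name[:-2]].add(name[-1])
    lines = ["library {} {} {}".format(library, min_size, max_size)]
    for base in sorted(seen):
        if seen[base] == {"1", "2"}:
            lines.append("{0}/1\t{0}/2\t{1}".format(base, library))
    return "\n".join(lines) + "\n"
```

Unlike the one-liner, which keeps any name seen exactly twice, this sketch requires that both the /1 and the /2 read are present.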

Bambus will probably generate a lot of errors, because some names are not found in the .contig file. But this shouldn't be a problem.

Hope this works; otherwise, ask me.
danix replied

04-14-2010, 02:56 AM
Building scaffolds using Bambus - .mates problem

Hi, I'm trying to run Bambus but I don't have any .mates files. Does anyone know how I can create these files?
I have 454 output (FASTA + SFF) from a bacterial genome, which I assembled with phrap. I have already converted the .ace to .contig using ace2contig from AMOS.
Thanx!
boetsie replied

04-14-2010, 01:11 AM
Originally posted by mack

How big is your dataset? I was able to export my dataset as .ace with 17k contigs + 250k singletons.

more than 1 million contigs
mack replied

04-09-2010, 08:51 AM
Originally posted by boetsie

For large datasets, somehow no .ace files are produced.

How big is your dataset? I was able to export my dataset as .ace with 17k contigs + 250k singletons.
