Unconfigured Ad

**boetsie** · 12-27-2010, 05:08 AM

Hi bioinf,

1) They wrote "the reads are first hashed according to a predefined k-mer length." What did they mean by those words? Did they mean that for each k-mer of that read the hash table contains a record where you store the id of the first read having this k-mer and the position of that k-mer in the read? So whenever the k-mer is encountered for the second time you just do nothing and leave the hash table as it is, since it already contains that k-mer?

This simply means that they cut the sequence into k-mers, small k-size parts of the original sequence. For example if k=3, a sequence ACGTGCT will be cut as;

ACGTGCT
ACG
.CGT
..GTG
...TGC
....GCT

The same is done for the reverse complement of the read. If the k-mer is already present, the number of occurences of this k-mer is added by 1.

2) What did they mean by the words "the ordered set of original k-mers of that read is cut each time an overlap with another read begins or ends"? Did they mean the total overlap of the k-mer with k first or last characters of some other read? What is meant by "cutting" the set? Could you please show an example?

I'm not really sure, but what i think is that if there is a k-1 overlap with another read(in this example thus 3-1 = 2), the reads are cut. in the most ideal case this would be;

read1;
ACGTGCT
read2:
CTAGAAT

overlap;
ACGTGCT
.....CTAGAAT

However, the following could also occur;

read1;
ACGTGCT
read2:
GTGGGGG

overlap;
ACGTGCT
..GTGGGGG

Here, the node is cut at "GT". See the example at the poster;

Error: 500

http://www.ebi.ac.uk/~zerbino/velvet/velvet_poster.pdf

3) When we construct the nodes of the graph, how is determining of the arcs between them being done?

If there is a k-1 overlap (as in 2)), an arc is drawn between the two reads. So between read1 and read2 an arc is drawn. Same for read1 and read3

Also have a look at Daniel Zerbino's dissertation at;

EMBL-EBI Training

https://www.ebi.ac.uk/training/ftp/PhDtheses/Daniel_Zerbino.pdf

We train scientists at all levels to get the most out of publicly available biological data.

Hope it helped,
Boetsie

**bioinf** · 12-27-2010, 06:32 AM

Hi, thank you for the answer and the link.

Here, the node is cut at "GT". See the example at the poster;

This is one of the major things I don't get. Say if we have two reads, then why is it allowed to check whether the inner part of one read can overlap with any(inner/outer) part of the second?
Well, say we take k = 3 for such reads:

Read_1: DDCCTGTAC
Read_2: NMCCTKLASSA

The paper has the following words:

A second database is created with the opposite information. It records, for each read, which of its original k-mers are overlapped by subsequent reads. The ordered set of original k-mers is cut each time an overlap with another read begins or ends. For each uninterrupted sequence of original k-mers a node is created.

That's really ambiguously described. Does it mean that I take a read, look at its k-mers from top to down and throw away the part of the list where the overlapping occurs and for the rest parts I make a graph nodes? Or is it meant that the cut-out part should also be a node?

If there is a k-1 overlap (as in 2)), an arc is drawn between the two reads.

How is it possible to find the supersequences from a graph if there is a node with two outgoing arcs? (like a tree with branches)

**Zigster** · 12-27-2010, 11:30 AM

Reads are not the actors in a Velvet assembly, only kmers, so any mention of k-1 always refers to kmers (two kmers get an edge if they overlap by k-1).

So forget about reads, at least until the scaffolding part, where paired-ends are leveraged.

Once you clip the tips and compress the bubbles then traversing the graph is straightforward, but path ambiguities often do result in fragmented assemblies.

**bioinf** · 12-27-2010, 12:23 PM

Thank you. Ok, its getting much clearer now. So do I understand you correctly?
- form all possible k-mers from the set of reads
- if k-1 end characters of some k-mer X intersect with k-1 beginning characters of k-mer Y, then put an arc from X-->Y (similar for the X<--Y case)
- simplify the graph by grouping the nodes where possible
- remove tips, bubbles
- walk the resulting graph from the beginning to the end to acquire the supersequence

but path ambiguities often do result in fragmented assemblies

Could you please explain what you've meant by the possibility of path-ambiguities?

Do I understand it correctly that in the end I should not get a tree-like resulting graph, since otherwise I have to travel several times to visit all nodes and I'll get several sequences instead of one final sequence?

Also why does the paper mention that we have to have a hash table which stores for each possible k-mer a record with the information about - in which read it was seen for the first time and at which position?

**Zigster** · 12-27-2010, 12:39 PM

the repeat problem is the primary reason for fragmented assemblies

the ambiguities here cannot be resolved so the assembly will likely be four contigs with repeat sequence attached

read tracking is needed to leverage paired end information

Attached Files

Screen shot 2010-12-27 at 3.29.09 PM.png (18.9 KB, 875 views)

**bioinf** · 12-27-2010, 12:52 PM

Thx very much!

Why did you mention 4 possible assembles? I see only two. The one on the top of the picture and the same, but if you switch light green with the dark green. But I see, so the idea of the ambiguity is that depending on how we traverse we might get different contigs.

read tracking is needed to leverage paired end information

Where could I read about how this is done?

**Zigster** · 12-27-2010, 01:06 PM

The diagram on the top is the actual order (known only to the organism being sequenced)

The graph below it would represent what an assembler could piece from short reads.
Four contigs would be produced by an assembler:
Tan-Blue
Blue-DarkGreen-Blue
Blue-LightGreen-Blue
Blue-BurntOrange

With paired end that span repeats we can deduce the correct contig order:

google "paired-end" "mate pairs" "scaffolding" "contig order" to learn more

Attached Files

Screen shot 2010-12-27 at 3.47.44 PM.png (12.8 KB, 842 views)

**bioinf** · 12-27-2010, 02:28 PM

Thank you very much for the brief answers! I'm not closing the topic yet in case I have questions about the further stages like error-correction stuff.

**boetsie** · 12-27-2010, 02:32 PM

Have a look at this site for some basic information about de novo assembly;

Page not found | Center for Bioinformatics and Computational Biology

http://www.cbcb.umd.edu/research/assembly_primer.shtml

**bioinf** · 12-28-2010, 04:19 AM

Thx bötsie. One more question about the de Bruijn graphs.
Is the ability to reduce the memory consumption by just having k-mers instead of reads as nodes - the only advantage against overlap graphs?

**boetsie** · 12-28-2010, 06:24 AM

Hi,

Well, it's not the only advantage, but indeed the biggest. Other advantages;

- De Bruijn graph assemblers are much faster.
- easier to correct for errors (which come alot with short read sequences) within the sequences (see the poster).
- repeats are immediately recognizable while in an overlap graph they are more difficult to identify.

Boetsie

Originally posted by bioinf View Post

Thx bötsie. One more question about the de Bruijn graphs.
Is the ability to reduce the memory consumption by just having k-mers instead of reads as nodes - the only advantage against overlap graphs?

**bioinf** · 01-05-2011, 09:39 AM

Hi again guys. I'm almost done with the presentation. I don't get something about the complementary strand assembly. Say I have a k-mer AATGC in the complementary strand read and the 100% same looking k-mer AATGC in the primary read. How does the algorithm see the difference? Whenever you analyze the read how do you know whether it is a primary or complementary strand?

**bioinf** · 01-05-2011, 10:30 AM

Perhaps someone could explain how the nodes for complementary strand are constructed and why? It is clear that there will be some reads from the complementary strand in the set, but what if there are less complementary reads than the primary ones? We still have to attach that complementary node to the primary one although that complementary node might not exist in the set of reads. How is that handled?

**bioinf** · 01-05-2011, 11:16 AM

A... I see. So they complement each other. There might be a primary read missing whereas its complementary read exists, so the nodes will get connected via arc and thus node overlapping information will not be lost.

Topics	Statistics	Last Post
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 18 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 30 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 16 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM

Unconfigured Ad

Velvet assembler

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News