Unconfigured Ad

**Jeremy** · 01-30-2011, 06:56 PM

What you should do is take the subset of reads that were used in the assembly (i.e. remove all singletons) and then map these back against your contigs. You can then identify which isotig/isogroup the contig comes from and use that to annotate the SNPs. I did a similar thing to get differential expression values from a pair of samples where I made an assembly using both of the samples together. The number of reads listed for each contig from the assembly of both samples was in most cases the same as the combined number of reads for that contig when each sample was individually mapped back against the contigs.

However, the contig output file is incorrect and appends the sequence of the previous contig (for contigs where status=isotig), so you will need to correct this first. Read below for more information about the contig file problem

Detection of alternative splicing events from 454 output - SEQanswers

http://seqanswers.com/forums/showthread.php?t=4732

Pyrosequencing in picotiter plates, custom arrays for enrichment/decomplexing. (Roche)

You may want to consider a difference as being real if the allele frequency range is 20-80% of reads or perhaps 30-70%. I have previously sequenced some samples of gDNA that had been first sequence captured using Nimblegen. These samples had also been processed on a SNP array so I knew which sites were heterozygous. The allele sequence coverage from the 454 for heterozygosity ranged from about 20-80%. I don't know how much of that bias was from sequence capturethough, but keep in mind that its unlikely to get exactly 50:50 sequence depth.

**dschika** · 02-01-2011, 03:52 AM

Thank you for your reply, Jeremy.
I'm not sure if I understood you correctly: do you mean I should take the contigs which can be found in the isotigs as reference? But this wouldn't solve the problem, as reads can again match several contigs and are therefore rejected (or counted several times) ??

However, the contig output file is incorrect and appends the sequence of the previous contig (for contigs where status=isotig), so you will need to correct this first. Read below for more information about the contig file problem
http://seqanswers.com/forums/showthread.php?t=4732

Looks like Roche fixed this in v2.5p1.

**Jeremy** · 02-01-2011, 07:08 PM

Originally posted by dschika View Post

Thank you for your reply, Jeremy.
I'm not sure if I understood you correctly: do you mean I should take the contigs which can be found in the isotigs as reference? But this wouldn't solve the problem, as reads can again match several contigs and are therefore rejected (or counted several times) ??

I mean the 454AllContigs.fna file.
For the mapping process the long reads are split into smaller fragments called kmers, and these kmers are then mapped against the reference to determine if the whole read maps or not. Reads seem to be rejected if the kmers that they are split into each match multiple sites in the reference. So, although many reads can map to several contigs, the kmers that they are split into (the default is 12 or 16 from memory) should only match a single contig which means that the read will not be flagged as repeat sequence.

When I mapped the reads against the contigs, the read count per contig was in most cases identical to the number of reads used to produce that contig as listed in the 454AllContigs.fna file.

**dschika** · 02-02-2011, 12:02 AM

ahh, ok. That sounds interessting. I will try it! Unfortunately I will have to do something else the next weeks, but when I'm coming back to the topic, I will let you know if it worked!

**dschika** · 03-29-2011, 02:03 AM

Sorry for being away such a long time. But now I'm back to work at that topic.

I used now the contigs of the assembly as reference and mapped all reads with gsMapper against it.
Now I see something I'm not completely sure how to interprete:
Looking at the 454HCDiffs.txt I see some thousand differences where the Var Freq is greater than 90% and even a lot with Var Freq 100% which means (as far as I understood it) that no reads were mapped against the reference with the exact reference sequence. But as the reference sequence is the assembly of the reads used for mapping, does it mean there are sequencing errors? Did you therefore suggest to

Originally posted by Jeremy

consider a difference as being real if the allele frequency range is 20-80% of reads or perhaps 30-70%.

?

Btw:

Originally posted by Jeremy View Post

When I mapped the reads against the contigs, the read count per contig was in most cases identical to the number of reads used to produce that contig as listed in the 454AllContigs.fna file.

I can't see this in my data. The number of reads vary greatly.

**Jeremy** · 03-29-2011, 04:58 PM

Did you first subset the reads so that you only used the reads that were assembled? If you are using all reads then the reads that weren't assembled will be force aligned somewhere and you will get a lot of differences being identified based on incorrect alignment.

Some reads will still align incorrectly, but you can filter these out simply by comparing the number of reads mapped compared to the number of reads used to create the contig. The number of reads used to create the contig has to come from the readstatus.txt file not the number listed in the contigs.fna file. This is because the number in the contigs.fna file counts reads considered duplicates as a single read but mapping does not.

**dschika** · 03-31-2011, 06:42 AM

Thanks again, Jeremy and shame on me! I forgot to remove the not assembled reads

Now the read numbers are more equal!

But anyway I wasn't aware that duplicates are marked as assembled in the ReadStatus file. Thanks for that hint!

One more question: Is there a more direct way of getting the SNP positions in the isotigs than:
- parsing the 454HCDiffs.txt to get the SNP positions in the contigs
- next, parse the 454Isotigs of the assembly to get the contigs per isotig
- finally calculate the SNP positions in an isotig according to the SNP positions in the contigs of that isotig?

The problem is: I found that there are some discrepancies in the Newbler files:
- sometimes the length of an isotig and a corresponding contig can vary (nucleotides can be added or removed)
- sometimes an isotig consists of the reverse complement of a contig. That is in cases where the isotig consists only of one contig and the contig direction is marked '-' in 454Isotigs.txt. (At least I have seen a reversed complement only in that cases until now). Is the reverse complement of contigs used also in other cases?

Cheers!

**Jeremy** · 03-31-2011, 07:39 PM

Originally posted by dschika View Post

One more question: Is there a more direct way of getting the SNP positions in the isotigs than:
- parsing the 454HCDiffs.txt to get the SNP positions in the contigs
- next, parse the 454Isotigs of the assembly to get the contigs per isotig
- finally calculate the SNP positions in an isotig according to the SNP positions in the contigs of that isotig?

The problem is: I found that there are some discrepancies in the Newbler files:
- sometimes the length of an isotig and a corresponding contig can vary (nucleotides can be added or removed)
- sometimes an isotig consists of the reverse complement of a contig. That is in cases where the isotig consists only of one contig and the contig direction is marked '-' in 454Isotigs.txt. (At least I have seen a reversed complement only in that cases until now). Is the reverse complement of contigs used also in other cases?

Cheers!

I found the same thing and I don't think there is a more direct way than what you stated. What version of Newbler are you using? My answers address version 2.3 mostly, but I have tried 2.5 also.

The random nucleotide seems to be a nucleotide at the contig join meaning that sometimes the contigs have a one bp overlap at the join site and sometimes they don't. I am not sure that is true for all cases, but it seemed to be true for the ones that I checked, which should be easy enough to correct (this is true for v2.3, not sure if this is true for v2.5).

I haven't seen any cases where the reverse complement has been incorporated, but my samples were first strand cDNA so I would not expect to see any. I do see + and - in the 454Isotigs.txt file, but the contig sequences in the 454contigs.fna file (that i checked) are given in the same orientation as they appear in the 454isotigs.fna file (same thing happens in v2.5) so I don't really know what is going on there, but it looks like the output file for the contigs is automatically fixing the orientation to match the isotigs.

Code:

for example (not real data)
454Isotigs.fna file
>isotig00001  gene=isogroup00001  length=8  numContigs=2
ATCGAGGT

454IsotigsLayout.txt file
>isogroup00001  numIsotigs=1  numContigs=2
Length :  4        4
Contig :  00001  00002
          <<<<< >>>>>

454Contigs.fna file
>contig00001  length=4  numreads=lots  gene=isogroup00001  status=isotig
ATCG

>contig00001  length=4  numreads=lots  gene=isogroup00001  status=isotig
AGGT

**dschika** · 04-01-2011, 08:32 AM

I use version 2.5.

I now have seen also contigs with + direction to be in reveresed complement in the isotigs. That makes the identification of SNPs and SNP positions in the isotigs pretty ugly

Have you compared the SNPs in the isotigs to those in the contigs?

I altered my search: I now extract a small sequence around a SNP in the contig and search that sequence in the isotig, if necessary I search the reverse complemented sequence in the isotig.

**Jeremy** · 04-03-2011, 08:00 PM

Originally posted by dschika View Post

I now have seen also contigs with + direction to be in reveresed complement in the isotigs. That makes the identification of SNPs and SNP positions in the isotigs pretty ugly

I don't know about that. Hopefully someone else has dealt with it.

Originally posted by dschika View Post

Have you compared the SNPs in the isotigs to those in the contigs?

I only mapped against contigs because isotigs within an isogroup would cause mapping problems, so I have nothing to compare to. I have not bothered to calculate isotig positions for all the SNPs

Topics	Statistics	Last Post
UC San Diego Bioengineers Map Gene Function in Human Stem Cells by SEQadmin2 Started by SEQadmin2, 07-13-2026, 10:26 AM	0 responses 25 views 0 reactions	Last Post by SEQadmin2 07-13-2026, 10:26 AM
New Analysis Splits Leukemia Into 16 Epigenomic Subgroups by SEQadmin2 Started by SEQadmin2, 07-09-2026, 10:04 AM	0 responses 35 views 0 reactions	Last Post by SEQadmin2 07-09-2026, 10:04 AM
Genome-Wide CRISPR Screen Uncovers Unlikely Psoriasis Target by SEQadmin2 Started by SEQadmin2, 07-08-2026, 10:08 AM	0 responses 22 views 0 reactions	Last Post by SEQadmin2 07-08-2026, 10:08 AM
Engineered Protein Motor Takes Its First Steps Along DNA Track by SEQadmin2 Started by SEQadmin2, 07-07-2026, 11:05 AM	0 responses 34 views 0 reactions	Last Post by SEQadmin2 07-07-2026, 11:05 AM

Unconfigured Ad

SNP detection with gsMapper without reference

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News