-
Originally posted by shanebrubaker:
I also noticed that you say it corrects the reads to a 5% error rate, but the Schatz work seems to mention a 0.1% error rate. Is there a reason for that?
Thanks,
Shane
-
Hi Shane,
At the moment I am trying to compare LSC vs. PacBioToCA and, if time and hardware permit, SmrtPipe. I noticed that PacBioToCA reduces the dataset from 1 GB to around 400 MB. I haven't checked the error rate yet.
We have 30x coverage in short reads.
I am still struggling with LSC 0.2.1. It runs fine on a test set, but when it runs on the whole short read set (49 GB) I get a read error in awk: "awk: read error (Bad address)". It does not consistently occur at the same point in the file: sometimes it happens after processing 50 MB, and once it reached 30 GB. I computed md5 checksums of the file several times and they all match. Any suggestions are appreciated.
Thanks,
Hans Jansen
-
LSC,
I would really like to use your software, but it is extremely difficult to do so given the complete lack of documentation. The 'How it works?', 'Tutorial', 'Manual' and 'Filters' links on the website are dead. The 'FAQ' has just one line referring to SpliceMap. There isn't even a README file. Yes, I can run the program, but without any documentation I have no idea whether my results are correct or meaningful.
I installed LSC and ran it against a PacBio long read data set consisting of 100,000 reads, totaling 38 Mbp. My short read set is 20 million 100 bp Illumina reads. I ran the program with default parameters, and the output is three files: full_LR_SR.map.fa, uncorrected_LR_SR.map.fa and corrected_LR_SR.map.fa. Each file contains ~30,000 reads; the full file contains ~15 Mbp and the other two ~8 Mbp each.
What am I to make of these files? Does this output sound normal? Which output file is useful for further analysis?
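For anyone wanting to reproduce the read and base counts quoted above for each of the three output files, the records can be tallied directly. This is a minimal sketch, not part of LSC: `fasta_stats` is a hypothetical helper name, and it assumes plain, uncompressed FASTA.

```python
import io


def fasta_stats(handle):
    """Return (number of records, total bases) for a FASTA stream."""
    n_reads = 0
    n_bases = 0
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_reads += 1  # each header line starts a new record
        else:
            n_bases += len(line)  # sequence may span several lines
    return n_reads, n_bases


# Tiny in-memory example. For real data, pass an open file instead,
# e.g. open("corrected_LR_SR.map.fa").
example = io.StringIO(">read1\nACGT\nAC\n>read2\nGGG\n")
reads, bases = fasta_stats(example)
print("%d reads, %d bases" % (reads, bases))
```

Running it on all three files makes it easy to compare how many reads and bases survive in each output.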
-
I am not sure which journal this paper has been accepted at, but is it not possible by now to post something (perhaps a provisional PDF) at the "how it works" link that was included in a previous post? It is still showing a "Page not found" error.
Another link (http://www-stat.stanford.edu/~kinfai/LSC.html) also leads to a "Not found" error, and this one is not working either (http://www-stat.stanford.edu/~kinfai/LSC_download.html).
-
Sorry for the incompleteness of the website. I have been pulled into an emergency project, so I have had to postpone the release of the documentation. I hope to have time to finish the manual in a week or so. The paper is on the homepage now. Sorry again for the inconvenience.
Originally posted by kmcarr:
LSC,
I would really like to use your software, but it is extremely difficult to do so given the complete lack of documentation. The 'How it works?', 'Tutorial', 'Manual' and 'Filters' links on the website are dead. The 'FAQ' has just one line referring to SpliceMap. There isn't even a README file. Yes, I can run the program, but without any documentation I have no idea whether my results are correct or meaningful.
I installed LSC and ran it against a PacBio long read data set consisting of 100,000 reads, totaling 38 Mbp. My short read set is 20 million 100 bp Illumina reads. I ran the program with default parameters, and the output is three files: full_LR_SR.map.fa, uncorrected_LR_SR.map.fa and corrected_LR_SR.map.fa. Each file contains ~30,000 reads; the full file contains ~15 Mbp and the other two ~8 Mbp each.
What am I to make of these files? Does this output sound normal? Which output file is useful for further analysis?
-
Has anyone had any success running LSC to correct PacBio data from Gb-sized genomes? I am currently using it to try to correct 6x coverage of a >2 Gb genome with 30x short-read (SR) data. At the moment it is in the alignment stage on 40 CPUs, but I am finding it difficult to gauge how long the alignment could take.
-
LSC: beware the dinucleotide repeats
I am working with a ~1Gb genome and using 40X coverage of mer-trimmed Illumina reads. A test run on 100Mb of PacBio sequence took almost 10 days to complete on 40 cpus. As you know, LSC sorts the Illumina reads by sequence, then normalizes the data with "uniq", then splits the reads into several SR.fa.*.cps files according to the number of cpus. Each sub-file is aligned to the PacBio reads in parallel. What I learned in this test run was that 'sort' grouped reads that contained classes of dinucleotide repeats. Thus the split resulted in a few sub-files that were quite rich in CA repeats, GT repeats, etc. Those files required a few more days to complete the Novoalign step while the rest of the cpus sat idle.
Next time, I would run a small test set of PacBio reads with SR_uniq.fa and copy the .cps subfiles to a new directory as soon as they are produced, then terminate runLSC. Let's say, hypothetically, that I used 48 CPUs and sort/uniq/split resulted in four files that were rich in CA, GT, CT, and GA repeats. I would cat the 44 non-repetitive files, then re-split the result into 48 subfiles. Then I'd split each of the four repeat-rich files into 48 subfiles and add them to the non-repetitive files. I'd cat these into a single, new SR_uniq.fa file. The result should be that when LSC runs afresh on the new SR_uniq.fa, the repetitive reads would be distributed evenly among the 48 subfiles. That approach relies on only a rough estimate of where the repetitive sequences sit in the original file, and it is inelegant due to my lack of programming skill, but perhaps someone more skilled could find a way to automate the process.
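One way the rebalancing described above might be automated is to deal the sorted, de-duplicated reads into the subfiles one record at a time (round-robin) instead of cutting the sorted file into contiguous blocks. The sketch below is an illustration only, not LSC code: `read_fasta` and `deal_round_robin` are hypothetical helper names, and it assumes the reads are plain FASTA records. The thread does not document how runLSC treats an already de-duplicated input, so how the rebalanced reads are fed back in (a new SR_uniq.fa, as proposed above, or replacement subfiles) is left open.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA stream."""
    header, seq = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def deal_round_robin(records, n_chunks):
    """Deal records into n_chunks lists, one record at a time.

    Neighbours in the sorted input (for example a run of CA-rich reads)
    land in different chunks, so no chunk collects a whole repeat class.
    """
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return chunks


# Toy sorted input: three CA-rich reads sit next to each other.
sorted_reads = io.StringIO(
    ">r1\nACGTACGT\n"
    ">r2\nCACACACAGG\n"
    ">r3\nCACACACATT\n"
    ">r4\nCACACACACA\n"
    ">r5\nGTGTGTGT\n"
    ">r6\nTTGACCAG\n"
)
for n, chunk in enumerate(deal_round_robin(read_fasta(sorted_reads), 3)):
    print(n, [header for header, _ in chunk])
```

For a real short-read set, the lists would not fit comfortably in memory; the same `i % n_chunks` index could instead select one of `n_chunks` open output files and each record be written straight to it.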
-
Hello!
The names of PacBio long reads must be in the format of the following example: ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479".
The last two numbers (3441 and 3479 in this example) are the positions of the subread.
However, my new PacBio data doesn't contain the last two numbers.
For example:
>m120627_142215_42149_c100335932550000001523020209201251_s1_p0/7
>m120627_142215_42149_c100335932550000001523020209201251_s1_p0/9
How can I get the last two numbers (the subread positions, 3441 and 3479 in the example above)?
thanks
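To make the naming convention above concrete, the sketch below parses a read name into its parts and reports when the subread coordinates are missing. It is not LSC code, and `parse_read_name` and `with_placeholder_coords` are hypothetical names. The second helper is only a possible workaround: it assumes each record can be treated as one whole subread and that placeholder coordinates of the form `0_<length>` would be accepted, neither of which is confirmed in this thread. The real coordinates cannot be recovered from the name alone.

```python
import re

# movie name / hole number, optionally followed by / start_end
READ_NAME = re.compile(
    r"^>?(?P<movie>[^/\s]+)/(?P<hole>\d+)"
    r"(?:/(?P<start>\d+)_(?P<end>\d+))?$"
)


def parse_read_name(name):
    """Return (movie, hole, start, end); start/end are None if absent."""
    m = READ_NAME.match(name.strip())
    if m is None:
        raise ValueError("unrecognised read name: %r" % name)
    start, end = m.group("start"), m.group("end")
    return (
        m.group("movie"),
        int(m.group("hole")),
        int(start) if start is not None else None,
        int(end) if end is not None else None,
    )


def with_placeholder_coords(name, seq_len):
    """Append '/0_<seq_len>' when a name lacks subread coordinates.

    ASSUMPTION: the whole record is treated as a single subread; these
    are placeholders, not the true positions within the original read.
    """
    movie, hole, start, end = parse_read_name(name)
    if start is None:
        start, end = 0, seq_len
    return ">%s/%d/%d_%d" % (movie, hole, start, end)


old = ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479"
new = ">m120627_142215_42149_c100335932550000001523020209201251_s1_p0/7"
print(parse_read_name(old))
print(parse_read_name(new))
print(with_placeholder_coords(new, 1200))
```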
-
Has anyone experienced the following error when getting to the writetmp.py stage of the pipeline?
Traceback (most recent call last):
  File "/home/stby/bin/writetmp.py", line 57, in ?
    SR_cps_dict[readname] = line.strip()
MemoryError
I do have over 400Gb of memory available.
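One thing that may be worth ruling out, offered as a guess rather than a diagnosis: the `in ?` in the traceback is how old Python 2 releases (2.4 and earlier, as far as I know) label module-level code, and an old system Python may be a 32-bit build. A 32-bit process can only address a few GB regardless of how much RAM the machine has. The check below is a small sketch, unrelated to LSC itself; `pointer_bits` is a hypothetical helper name. It should be run with the same interpreter that executes writetmp.py.

```python
import struct
import sys


def pointer_bits():
    """Width of a C pointer in this interpreter build (32 or 64)."""
    return struct.calcsize("P") * 8


# A single formatted string keeps this valid in both Python 2 and 3.
print("Python %s is a %d-bit build" % (sys.version.split()[0], pointer_bits()))
```

If this reports 32, running the pipeline under a 64-bit Python would be the first thing to try. If it reports 64, a per-process memory limit set by the shell or the cluster scheduler is another possibility to check.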