**LSC** · 09-14-2012, 02:51 PM

Originally posted by shanebrubaker

Hi, I am also very interested in LSC. I would like to see the paper and manual if they are available.

Does anyone have time comparisons of LSC vs. PacBioToCA vs. SmrtPipe?

The paper was accepted last week and is off to print now. The preprint should be on the homepage next week.

**LSC** · 09-14-2012, 02:54 PM

Originally posted by shanebrubaker

I also noticed that you say it corrects the reads to a 5% error rate, but the Schatz work seems to mention a 0.1% error rate. Is there a reason for that?

Thanks,
Shane

In the paper, you will see the error rate go down to <1% when you have enough short-read (SGS) coverage. For regions without any short-read coverage, the error rate will still be high, and that pulls down the average accuracy over the whole data set. Thus, the more short reads you have, the better LSC performs.
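
To see how a sub-1% error rate in covered regions can still average out to roughly 5% overall, here is a minimal sketch of the weighted average. All three input numbers are made up for illustration (they are not taken from the LSC paper), and `overall_error_rate` is a hypothetical helper, not part of LSC.

```python
def overall_error_rate(covered_fraction, covered_error, uncovered_error):
    """Base-weighted average error rate over covered and uncovered regions."""
    if not 0.0 <= covered_fraction <= 1.0:
        raise ValueError("covered_fraction must be between 0 and 1")
    covered_part = covered_fraction * covered_error
    uncovered_part = (1.0 - covered_fraction) * uncovered_error
    return covered_part + uncovered_part


# Hypothetical inputs: 70% of long-read bases have short-read coverage and are
# corrected to 1% error; the remaining 30% keep an assumed 15% raw error.
rate = overall_error_rate(covered_fraction=0.70,
                          covered_error=0.01,
                          uncovered_error=0.15)
print(f"overall error rate: {rate:.1%}")
```

With these assumed inputs the average lands near 5%, even though every covered base is at 1%. Raising the covered fraction is what moves the overall figure.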

**ZFHans** · 09-20-2012, 12:26 AM

Hi Shane,

At the moment I am trying to compare LSC vs. PacBioToCA and, if time and hardware permit, SmrtPipe. I noticed that PacBioToCA reduces the dataset from 1 GB to around 400 MB. I haven't checked the error rate yet.
We have 30x coverage in short reads.
I am still struggling with LSC 0.2.1. It runs fine on a test set, but when it runs on the whole short-read set (49 GB) I get a read error in awk (awk: read error (Bad address)). It does not happen consistently at one point in the file: sometimes it happens after processing 50 MB, and once it reached 30 GB. I computed md5 checksums for the file several times and they are the same. Any suggestions are appreciated.

Thanks,

Hans Jansen

**kmcarr** · 09-20-2012, 06:13 AM

LSC,

I would really like to use your software, but it is extremely difficult to do so given the complete lack of documentation. The 'How it works?', 'Tutorial', 'Manual' and 'Filters' links on the website are dead. The 'FAQ' has just one line referring to SpliceMap. There isn't even a README file. Yes, I can run the program, but without any documentation I have no idea whether my results are correct or meaningful.

I installed LSC and ran it against a PacBio long-read data set consisting of 100,000 reads, totaling 38 Mbp. My short-read set is 20 million 100 bp Illumina reads. I ran the program with default parameters and the output is 3 files: full_LR_SR.map.fa, uncorrected_LR_SR.map.fa and corrected_LR_SR.map.fa. Each file contains ~30,000 reads; the full file contains ~15 Mbp and the other two ~8 Mbp each.

What am I to make of these files? Does this output sound normal? Which output file is useful for further analysis?

**GenoMax** · 09-20-2012, 06:47 AM

I am not sure which journal has accepted this paper, but is it not possible by now to post something (perhaps a provisional PDF) at the "how it works" link that was included in a previous post (and is still showing a "Page not found" error)?

Other links (http://www-stat.stanford.edu/~kinfai/LSC.html) appear to lead to a "Not found" error. This one is not working either (http://www-stat.stanford.edu/~kinfai/LSC_download.html).

**LSC** · 10-11-2012, 12:22 AM

Sorry for the incompleteness of the website. I have been pulled into an emergency project, so I have had to postpone the release of the documentation. I hope to have time to finish the manual in a week or so. The paper is on the homepage now. Sorry again for the inconvenience.


**SLB** · 12-31-2012, 10:47 AM

Has anyone had any success in running LSC to correct PacBio data from Gb-sized genomes? I am currently using it to try to correct 6x coverage of a >2 Gb genome with 30x SR data. At the moment it is in the alignment stage on 40 CPUs, but I am finding it difficult to gauge how long the alignment could take.

**Boonie** · 12-31-2012, 10:00 PM

LSC: beware the dinucleotide repeats

I am working with a ~1Gb genome and using 40X coverage of mer-trimmed Illumina reads. A test run on 100Mb of PacBio sequence took almost 10 days to complete on 40 cpus. As you know, LSC sorts the Illumina reads by sequence, then normalizes the data with "uniq", then splits the reads into several SR.fa.*.cps files according to the number of cpus. Each sub-file is aligned to the PacBio reads in parallel. What I learned in this test run was that 'sort' grouped reads that contained classes of dinucleotide repeats. Thus the split resulted in a few sub-files that were quite rich in CA repeats, GT repeats, etc. Those files required a few more days to complete the Novoalign step while the rest of the cpus sat idle.

Next time, I would run a small test set of PacBio reads with SR_uniq.fa and copy the .cps subfiles to a new directory as soon as they are produced, then terminate runLSC. Let's say, hypothetically, that I used 48 cpus and sort/uniq/split resulted in four files that were rich in CA, GT, CT, and GA repeats. I would cat the 44 non-repetitive files then re-split into 48 subfiles. Then I'd split each of the four repeat-rich files into 48 subfiles and add them to the non-repetitive files. I'd cat these into a single, new SR_uniq.fa file. The result should be that when LSC runs afresh on the new SR_uniq.fa, the repetitive reads would be distributed evenly among the 48 subfiles. That approach is only a rough estimate of where the repetitive sequences exist in the original file, and is also inelegant due to lack of programming skill but perhaps someone more skilled could find a way to automate the process.
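
One way the redistribution described above might be automated is to deal the sorted reads out round-robin and then concatenate the piles, so that neighbours in sort order (for example a run of CA-rich reads) end up spread across all sub-files. This is only a sketch under several assumptions: that LSC accepts a pre-made SR_uniq.fa and splits it into contiguous chunks, that the file is plain FASTA, and that it fits in memory (a 49 GB file would need a streaming variant that writes to temporary files). The function names are hypothetical and not part of LSC.

```python
import itertools


def read_fasta_records(lines):
    """Yield each FASTA record as a list of lines, header line first."""
    record = []
    for line in lines:
        if line.startswith(">") and record:
            yield record
            record = []
        if line.strip():  # skip blank lines
            record.append(line)
    if record:
        yield record


def round_robin_reorder(records, n_chunks):
    """Deal records into n_chunks piles like cards, then concatenate the piles.

    A later contiguous split into n_chunks pieces then gives each piece every
    n_chunks-th record of the original (sorted) order.
    """
    piles = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        piles[index % n_chunks].append(record)
    return list(itertools.chain.from_iterable(piles))


def rebalance_fasta(in_path, out_path, n_chunks):
    """Rewrite a FASTA file in round-robin order (loads the file into memory)."""
    with open(in_path) as handle:
        records = list(read_fasta_records(handle))
    with open(out_path, "w") as out:
        for record in round_robin_reorder(records, n_chunks):
            out.writelines(record)


# Example call (paths are placeholders):
# rebalance_fasta("SR_uniq.fa", "SR_uniq.rebalanced.fa", n_chunks=48)
```

Unlike the manual cat/re-split procedure, this does not need to know which sub-files are repeat-rich, because every pile samples the whole sorted file evenly.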

**SLB** · 01-01-2013, 02:54 PM

Thanks for the information. I would be interested to know how you get on with your second attempt. Out of interest, did the nature of your data set allow you to evaluate the corrected reads from your first test?

**juassis** · 01-18-2013, 06:02 AM

Hello!
The names of PacBio long reads must be in the format of the following example: ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479".
The last two numbers (3441 and 3479 in this example) are the positions of the sub reads.

However, my new PacBio data doesn't contain the last two numbers.

ex:
>m120627_142215_42149_c100335932550000001523020209201251_s1_p0/7
>m120627_142215_42149_c100335932550000001523020209201251_s1_p0/9

How can I get the last two numbers (3441 and 3479 in the example above), i.e. the positions of the subreads?
Thanks

**flxlex** · 01-24-2013, 04:22 AM

Reads in the correct form for LSC are the result of filtering and trimming by smrtpipe. Your read IDs look like those from raw reads before filtering.
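
To check quickly which form the read IDs in a file have, something along these lines could work. It is a minimal sketch based only on the two ID shapes quoted in this thread; the field labels `movie` and `hole` are my own naming for the first two parts, and `parse_read_id` is a hypothetical helper, not an LSC or smrtpipe function.

```python
import re

# <name>/<number> optionally followed by /<start>_<end>
READ_ID = re.compile(
    r"^>?(?P<movie>[^/\s]+)/(?P<hole>\d+)(?:/(?P<start>\d+)_(?P<end>\d+))?$"
)


def parse_read_id(header):
    """Split a PacBio read ID into its parts; return None if it does not match."""
    match = READ_ID.match(header.strip())
    if match is None:
        return None
    start, end = match.group("start"), match.group("end")
    return {
        "movie": match.group("movie"),
        "hole": int(match.group("hole")),
        "start": int(start) if start is not None else None,
        "end": int(end) if end is not None else None,
    }


def has_subread_coordinates(header):
    """True only for IDs that carry the trailing start_end subread positions."""
    parsed = parse_read_id(header)
    return parsed is not None and parsed["start"] is not None


filtered = ">m111006_202713_42141_c100202382555500000315044810141104_s1_p0/16/3441_3479"
raw = ">m120627_142215_42149_c100335932550000001523020209201251_s1_p0/7"
print(has_subread_coordinates(filtered), has_subread_coordinates(raw))
```

Running this over the header lines of a FASTA file shows whether any reads still lack the subread positions before handing the file to LSC.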

**SLB** · 02-13-2013, 11:42 AM

Has anyone experienced the following error when getting to the writetmp.py stage of the pipeline?

Traceback (most recent call last):
File "/home/stby/bin/writetmp.py", line 57, in ?
SR_cps_dict[readname] = line.strip()
MemoryError

I do have over 400Gb of memory available.

**SLB** · 02-18-2013, 12:12 PM


Problem solved. It was an issue with the Python version. Although I had specified a newer installation of Python in the runLSC.py script, when it called the writetmp.py script the default python path pointed to an older version. Something to bear in mind if there are multiple installations of Python on your system.
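
For anyone hitting the same thing, a general way to avoid this kind of mismatch is to launch child scripts with the interpreter that is already running, rather than whatever `python` is first on the PATH. The sketch below shows the idea only; I do not know how runLSC.py actually invokes writetmp.py, and `child_command` and `describe_interpreter` are hypothetical helpers.

```python
import sys


def child_command(script_path, *args):
    """Build a command line that runs script_path with the current interpreter."""
    return [sys.executable, script_path, *args]


def describe_interpreter():
    """Return the path and version of the interpreter running this code."""
    version = ".".join(str(part) for part in sys.version_info[:3])
    return f"{sys.executable} (Python {version})"


if __name__ == "__main__":
    # Printing this at the top of each script makes a version mismatch obvious.
    print("running under:", describe_interpreter())
    # The list can be passed to subprocess.check_call() to start the child;
    # "writetmp.py" here is just the script name from the traceback above.
    print("would launch:", child_command("writetmp.py"))
```

Printing the interpreter path from both the parent and the child script is usually enough to confirm whether they are running under the same Python.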

**weijenc** · 03-20-2013, 12:16 PM

Paired-End files

Hello,

I can't seem to find the instructions for Illumina paired-end reads. Should I first combine the two files, or is there a way to list both files in the .cfg file?

Also, is Novoalign still required to run LSC?

Thanks,

WJ

**joxcargator73** · 03-25-2013, 10:23 AM

I wonder if there is a version of LSC for Mac users.
Thanks
