-
Let us know whether the error correction actually improves your alignments compared with the uncorrected reads. Your original data may already be of good enough quality.
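One rough way to make that comparison is to align both the corrected and uncorrected reads and look at per-read identity in the two SAM files. The sketch below is only an illustration: it assumes the aligner writes the standard `NM:i:` edit-distance tag, and the function names are made up for this example.

```python
import re

# Matches one CIGAR operation, e.g. "90M" or "1000N".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def alignment_identity(cigar, nm):
    """Return 1 - NM / alignment columns, counting M, =, X, I and D.

    Introns (N) and clipping (S, H) are ignored, so spliced
    transcript alignments are not penalised for intron length.
    """
    columns = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "M=XID")
    if columns == 0:
        return None
    return 1.0 - nm / columns

def identities_from_sam(lines):
    """Yield (read_name, identity) for each mapped record with an NM tag."""
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        flag, cigar = int(fields[1]), fields[5]
        if flag & 0x4 or cigar == "*":
            continue  # unmapped read
        nm = next((int(f[5:]) for f in fields[11:] if f.startswith("NM:i:")), None)
        if nm is None:
            continue  # assumption violated: aligner did not emit NM
        yield fields[0], alignment_identity(cigar, nm)
```

Running this over both SAM files and comparing the mean identity gives a first impression of whether correction helped; it says nothing about whether the corrections themselves are right.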
-
Just writing to let all of you who helped me know that I have figured out (for the most part!) what I am doing and how to do it. I decided to use LSC for the error correction, but am running my PacBio data through the RS_IsoSeq (ICE+Quiver) pipeline first, then removing redundancy using CD-HIT, and then using 'tofu' to collapse the transcripts. After all of that I can do the error correction, and use the transcript data for further analyses.
Thank you all.
-
1. It is better if you can access SMRTPortal at least to get the CCS reads. It seems unreasonable to expect you to do all this work yet give you no computational support. Since you are not a computer science person, I would not encourage you to try to install SMRTAnalysis yourself (either on your own laptop or in the cloud).
2. Given your goal of eventually doing quantification, it seems like LSC + IDP is the fastest way for you to go. Fastest --- as in --- if you can get the author to help you with getting the software up and running, then you are very close to your end goal of analyzing the output. The lab that does LSC + IDP is here:
3. I *still* think the fastest way to solve your problem is not (2) but (1), because I consider that the path of least resistance. From what I see, the fastest thing is to get full-length CCS reads using the RS_IsoSeq classify protocol:
After this step, every sequence is a full-length, high-quality CCS read. You can tune how high you want the expected CCS quality to be. The tutorial uses a minimum predicted accuracy = 75 but for your purpose I suggest something like 90.
Once you have that, you *have* error corrected PacBio because CCS is intra-molecule consensus.
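To give a feel for what an accuracy cutoff means, here is a minimal sketch that estimates each read's expected accuracy from its Phred quality string and keeps reads at or above a threshold. This is not how RS_IsoSeq computes predicted accuracy; it assumes Phred+33 qualities in a plain 4-line FASTQ, takes the threshold as a fraction (0.90 rather than 90), and the function names are hypothetical.

```python
def expected_accuracy(quality_string, phred_offset=33):
    """Mean per-base accuracy implied by Phred scores: 1 - mean(10^(-Q/10))."""
    if not quality_string:
        return 0.0
    error_probs = [10 ** (-(ord(ch) - phred_offset) / 10) for ch in quality_string]
    return 1.0 - sum(error_probs) / len(error_probs)

def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual

def filter_by_accuracy(records, min_accuracy=0.90):
    """Keep only records whose expected accuracy is at least min_accuracy."""
    for header, seq, qual in records:
        if expected_accuracy(qual) >= min_accuracy:
            yield header, seq, qual
```

With an open file handle `fh`, `filter_by_accuracy(read_fastq(fh), 0.90)` would iterate over the reads that pass. Raising the threshold trades read count for quality, which is the same trade-off the protocol setting controls.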
Then the next step is to map the CCS reads to the genome and cut down the redundant ones (many of them may represent the same isoform). I think cuffcompare or some tools from cufflinks can do that for you (FYI: there is a PacBio way to do this but it requires SMRTAnalysis and I'm helping you avoid it as much as possible).
After cuffxxxx, you are left with a non-redundant set of PacBio full-length isoforms and can move on to the next step of quantitating using short reads.
You can use RSEM, eXpress, etc. by aligning short reads against the unique output from cuffxxxx.
Using this approach, you do not need to write a single line of code, and all the software you use has tech support (PacBio) or wide community use (cufflinks suite, RSEM, etc.).
Good luck.
-
Originally posted by Magdoll:
Hi there,
First, a comprehensive resource of PacBio Iso-Seq is here:
The section on error correction using hybrid data is here (BUT before you stop here, read below):
Given that you are not a bioinformatician by training and you have limited time to process the PacBio Iso-Seq + Illumina RNA-seq data, may I ask what the goal of trying to combine them is?
I just happen to have data sets from the same few time-points from both Illumina and PacBio Iso-Seq, so what I'd like to do is use the short reads to error correct the long reads so that in the end, I have long reads which represent full-length transcripts that are also very high-quality (higher in quality than the PacBio reads by themselves).
Originally posted by Magdoll:
If the goal is to do gene annotation and there is a reference genome available (even if it's just a related species), you can align the PacBio CCS reads alone to the genome (better yet, first go through the first part of PacBio's RS_IsoSeq protocol to get the full-length reads only). It will work. It won't be super pretty, but aligners like GMAP and STAR will be able to align most of the CCS reads (tutorial is in the wiki). This is the path of least resistance. You can also align the short reads separately using TopHat2 and the like. Then take a look at that in your genome browser (IGV?) and see what you want to do next.
If you do not have a reference genome, and you insist on getting 99-100% accuracy with the PacBio reads, you have a few choices (from least to most work):
1. Filter the CCS reads and just use those with very high quality. You can tune that by using the RS_IsoSeq protocol or just the CCS protocol (RS_ReadsOfInsert). You will lose a lot of slightly lower quality data but then you have zero extra work.
Originally posted by Magdoll:
2. Run the Iso-Seq protocol in both parts: classify and cluster. This means you must have access to SMRTPortal/SMRTAnalysis and it will be computationally intensive. However, this way you error correct PacBio with itself and you don't have to run programs yourself. If you have issues running SMRTPortal, PacBio tech support is there to help.
Originally posted by Magdoll:
3. Hybrid error correction using LSC+IDP. I don't recommend PacBioCA because the authors themselves don't use it for transcriptome (Iso-Seq) correction except for that one paper. They use it for genomes. LSC+IDP's author is at least still actively developing his software and he will respond to your emails.
For now, I'm only concentrating on error correction as this "project" is not just a part of my actual research, but a class project (class is taught by my PI). Anything beyond error correction will be a natural extension of my research, but performing the error correction is the hurdle I need to pass for now.
-
Originally posted by GenoMax:
Transcriptome is ok. What I meant was: is this data from an organism where the genome has been sequenced (i.e. you have a reference available)? If you do, then the first thing you should do is align your data (Illumina and PacBio) to get an idea of what the sequence quality (in terms of alignment) looks like. Is the transcriptome known for this organism or is that something you are trying to put together?
PacBio data is not necessarily of lower quality but there is a bigger chance that it may have more errors in it compared to Illumina data.
Did you get any kind of statistics (a report) from whoever did the PacBio sequencing? It would be useful to know "mean read length", "number of reads" etc. I don't have experience with IsoSeq data but if there are specific metrics for that you may want to post them.
I can ask my professor for the stats on the pacbio sequencing and get back to you with something more useful for you to look at.
-
Hi there,
First, a comprehensive resource of PacBio Iso-Seq is here:
The section on error correction using hybrid data is here (BUT before you stop here, read below):
Given that you are not a bioinformatician by training and you have limited time to process the PacBio Iso-Seq + Illumina RNA-seq data, may I ask what the goal of trying to combine them is?
If the goal is to do gene annotation and there is a reference genome available (even if it's just a related species), you can align the PacBio CCS reads alone to the genome (better yet, first go through the first part of PacBio's RS_IsoSeq protocol to get the full-length reads only). It will work. It won't be super pretty, but aligners like GMAP and STAR will be able to align most of the CCS reads (tutorial is in the wiki). This is the path of least resistance. You can also align the short reads separately using TopHat2 and the like. Then take a look at that in your genome browser (IGV?) and see what you want to do next.
If you do not have a reference genome, and you insist on getting 99-100% accuracy with the PacBio reads, you have a few choices (from least to most work):
1. Filter the CCS reads and just use those with very high quality. You can tune that by using the RS_IsoSeq protocol or just the CCS protocol (RS_ReadsOfInsert). You will lose a lot of slightly lower quality data but then you have zero extra work.
2. Run the Iso-Seq protocol in both parts: classify and cluster. This means you must have access to SMRTPortal/SMRTAnalysis and it will be computationally intensive. However, this way you error correct PacBio with itself and you don't have to run programs yourself. If you have issues running SMRTPortal, PacBio tech support is there to help.
3. Hybrid error correction using LSC+IDP. I don't recommend PacBioCA because the authors themselves don't use it for transcriptome (Iso-Seq) correction except for that one paper. They use it for genomes. LSC+IDP's author is at least still actively developing his software and he will respond to your emails.
If you want to do the quantification, there are some ways to go about combining the two. But since your post is currently focused on error correction and I'm trying to understand what your end goal is, I'll stop here for now.
-
Transcriptome is ok. What I meant was: is this data from an organism where the genome has been sequenced (i.e. you have a reference available)? If you do, then the first thing you should do is align your data (Illumina and PacBio) to get an idea of what the sequence quality (in terms of alignment) looks like. Is the transcriptome known for this organism or is that something you are trying to put together?
PacBio data is not necessarily of lower quality but there is a bigger chance that it may have more errors in it compared to Illumina data.
Did you get any kind of statistics (a report) from whoever did the PacBio sequencing? It would be useful to know "mean read length", "number of reads" etc. I don't have experience with IsoSeq data but if there are specific metrics for that you may want to post them.
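If no report is available, the basic numbers can be pulled from the FASTQ directly. Below is a small sketch, assuming an unwrapped 4-line-per-record FASTQ; the function name and the returned keys are invented for this example, and N50 is added only because it is commonly reported alongside mean length.

```python
def fastq_length_stats(lines):
    """Return read count, mean length and N50 from FASTQ lines.

    Assumes exactly four lines per record, so the sequence is
    always the second line of each group of four.
    """
    lengths = [len(line.strip()) for i, line in enumerate(lines) if i % 4 == 1]
    if not lengths:
        return {"num_reads": 0, "mean_length": 0.0, "n50": 0}
    lengths.sort(reverse=True)
    total = sum(lengths)
    # N50: length of the read at which the running sum (longest first)
    # reaches at least half of all sequenced bases.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {
        "num_reads": len(lengths),
        "mean_length": total / len(lengths),
        "n50": n50,
    }
```

Passing an open file object works because iterating over it yields lines, for example `fastq_length_stats(open(path))` with `path` pointing at the reads file.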
-
Originally posted by GenoMax:
@LampreyGuy: I moved this thread to PacBio sub-forum since there is a better chance that Dr. Hall from PacBio will see it. He participates in the forum but may only check threads under this sub-forum.
Is this Iso-seq sequence data? Is it of good quality? Can you provide some stats for it? Is this a known genome or an unknown one? Have you tried to align the data, if a reference is available to see what the quality of PacBio data looks like? Is there a specific reason for your supervisor to ask that you correct this data?
The two data sets that I have are both RNASeq data (mRNA that was converted into cDNA, then sequenced), so it's not a genome that I'm looking at, rather a transcriptome. One set was sequenced with Illumina HiSeq, and a later set was sequenced by a PacBio RSII (yes I believe IsoSeq). I have not tried to align the data yet, no. My understanding is that because Illumina data are of higher-quality, and PacBio of lower, both sets should be able to go right into an error correction pipeline, no?
I'm a new grad student, and I can only do animal work over the summer months. Between September and May, it's going to be all computational work. I have no background in computer science so I really have to teach myself this stuff as I go along, but I'm trying.
-
@LampreyGuy: I moved this thread to PacBio sub-forum since there is a better chance that Dr. Hall from PacBio will see it. He participates in the forum but may only check threads under this sub-forum.
Is this Iso-seq sequence data? Is it of good quality? Can you provide some stats for it? Is this a known genome or an unknown one? Have you tried to align the data, if a reference is available to see what the quality of PacBio data looks like? Is there a specific reason for your supervisor to ask that you correct this data?
-
Sorry, I forgot to mention that what I'm actually doing is correcting RNASeq data. I have Illumina HiSeq data and PacBio data on mRNA, and I'm trying to correct the PacBio reads to end up with full length, high-quality transcripts.
It's my understanding that you can only correct PacBio long read data with either Illumina, 454, or PacBio CCS data.
-
Not sure about your exact problem since I haven't used this program.
A (simpler) alternative might be https://github.com/douglasgscofield/PacBio-utilities
It's on my list but I haven't tried it yet.
The best way to correct high-coverage PacBio data is with the PacBio reads themselves. If you have 40X+ PacBio whole-genome data you can get very good results with a program like PBcR (MHAP) or Falcon. After that, you need to use Quiver to correct the remaining errors.
-
Desperate grad student trying to correct pacbio reads with illumina data
Let me preface this by saying that I am literally brand new to bioinformatics. I know almost nothing. My professor wants me to correct pacbio data with illumina short reads, and wants me to "figure it out for myself". I have googled my fingers to the bone but I just cannot figure out the answer to my problem.
I am trying to use pacBioToCA to correct my reads. My command line looks like this: ./pacBioToCA -l correctedreads -t 8 -s pacbio.spec -fastq _home_SMRT_userdata_jobs_016_016442_data_isoseq_flnc.fastq /home/codysaraceno/illuminarnaseq/embryo_rnaseq_1112/merged_1merged_2.frg
The error message that I keep getting is Error: unable to find a library to correct. Please double-check your input files and try again. at ./pacBioToCA line 1486.
What's confusing me is what they want for "-libraryname". I made a folder called "correctedreads" to be used as the library, but it keeps giving me the error message.
Other times it will tell me that "No frag files were specified" when I know for sure that they have been specified and that they exist.
I've been at this for weeks and I did everything I could before posting here but I feel frustrated and out of options. Any help would be immeasurably appreciated. It's 11:27 pm here now, so if I don't answer right away it's because I went to bed, but I will be on first thing tomorrow morning. Thank you.