Header Leaderboard Ad

Collapse

Desperate grad student trying to correct pacbio reads with illumina data

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Desperate grad student trying to correct pacbio reads with illumina data

    Let me preface this by saying that I am literally brand new to bioinformatics. I know almost nothing. My professor wants me to correct pacbio data with illumina short reads, and wants me to "figure it out for myself". I have googled my fingers to the bone but I just cannot figure out the answer to my problem.

    I am trying to use pacBioToCA to correct my reads. My command line looks like this: ./pacBioToCA -l correctedreads -t 8 -s pacbio.spec -fastq _home_SMRT_userdata_jobs_016_016442_data_isoseq_flnc.fastq /home/codysaraceno/illuminarnaseq/embryo_rnaseq_1112/merged_1merged_2.frg

    The error message that I keep getting is Error: unable to find a library to correct. Please double-check your input files and try again. at ./pacBioToCA line 1486.

    What's confusing me is what they want for "-libraryname". I made a folder called "correctedreads" to be used as the library, but it keeps giving me the error message.

    Other times it will tell me that "No frag files were specified" when I know for sure that they have been specified and that they exist.

    I've been at this for weeks and I did everything I could before posting here but I feel frustrated and out of options. Any help would be immeasurably appreciated. It's 11:27 pm here now, so if I don't answer right away it's because I went to bed, but I will be on first thing tomorrow morning. Thank you.
    Last edited by LampreyGuy; 10-19-2015, 07:28 PM. Reason: typos

  • #2
    Not sure about your exact problem since I haven't used this program.

    A (simpler) alternative might be https://github.com/douglasgscofield/PacBio-utilities
    It's on my list but I haven't tried it yet.

    The best way to correct high coverage Pacbio is with Pacbio reads themselves. If you have 40X + PAcbio whole genome data you can get very good results with a program like PBcR (MHAP) or Falcon. Thereafter you need to use Quiver to correct errors.

    Comment


    • #3
      Sorry, I forgot to mention that what I'm actually doing is correcting RNASeq data. I have Illumina HiSeq data and PacBio data on mRNA, and I'm trying to correct the PacBio reads to end up with full length, high-quality transcripts.

      It's my understanding that you can only correct PacBio long read data with either Illumina, 454, or PacBio CCS data.

      Comment


      • #4
        @LampreyGuy: I moved this thread to PacBio sub-forum since there is a better chance that Dr. Hall from PacBio will see it. He participates in the forum but may only check threads under this sub-forum.

        Is this Iso-seq sequence data? Is it of good quality? Can you provide some stats for it? Is this a known genome or an unknown one? Have you tried to align the data, if a reference is available to see what the quality of PacBio data looks like? Is there a specific reason for your supervisor to ask that you correct this data?

        Comment


        • #5
          Originally posted by GenoMax View Post
          @LampreyGuy: I moved this thread to PacBio sub-forum since there is a better chance that Dr. Hall from PacBio will see it. He participates in the forum but may only check threads under this sub-forum.

          Is this Iso-seq sequence data? Is it of good quality? Can you provide some stats for it? Is this a known genome or an unknown one? Have you tried to align the data, if a reference is available to see what the quality of PacBio data looks like? Is there a specific reason for your supervisor to ask that you correct this data?
          Thanks. I'm still new here so I'm not familiar with where things should go.

          The two data sets that I have are both RNASeq data (mRNA that was converted into cDNA, then sequenced), so it's not a genome that I'm looking at, rather a transcriptome. One set was sequenced with Illumina HiSeq, and a later set was sequenced by a PacBio RSII (yes I believe IsoSeq). I have not tried to align the data yet, no. My understanding is that because Illumina data are of higher-quality, and PacBio of lower, both sets should be able to go right into an error correction pipeline, no?

          I'm a new grad student, and I can only do animal work over the summer months. Between september and may, it's going to be all computational work. I have no background in computer science so I really have to teach myself this stuff as I go along, but I'm trying.

          Comment


          • #6
            Transcriptome is ok. What I meant was is this data from an organism where the genome has been sequenced (i.e. you have a reference available). If you do then first thing you should do is align your data (illumina and PacBio) to get an idea of what the sequence quality (in terms of alignment) looks like. Is the transcriptome known for this organism or is that something you are trying to put together?

            PacBio data is not necessarily of lower quality but there is a bigger chance that it may have more errors in it compared to Illumina data.

            Did you get any kind of statistics (a report) from whoever you got the PacBio sequence done from? It would be useful to know "mean read length", "number of reads" etc. I don't have experience with IsoSeq data but if there are specific metrics for that you may want to post them.

            Comment


            • #7
              Hi there,

              First, a comprehensive resource of PacBio Iso-Seq is here:
              https://github.com/PacificBiosciences/cDNA_primer/wiki

              The section on error correction using hybrid data is here (BUT before you stop here, read below):
              https://github.com/PacificBioscience...na-short-reads

              Given that you are not a bioinformatician by training and you have limited time to process the PacBio Iso-Seq + Illumina RNA-seq data, may I ask what the goal of trying to combine them are?

              If the goal is to do gene annotation and there is a reference genome available (even if it's just a related species), you can align the PacBio CCS reads (better yet, if you first go through the first part of PacBio's RS_IsoSeq protocol to get the full-length reads only) alone to the genome. It will work. It won't be super pretty but aligners like GMAP and STAR will be able to align most of the CCS reads (tutorial is in the wiki). This is the path of least resistance. You can also align the short reads separately using TopHat2 and the sort. Then take a look at that in your genome browser (IGV?) and see what you want to do next.

              If you do not have a reference genome, and you insist on getting 99-100% accuracy with the PacBio reads, you have a few choices (from least to most work):

              1. Filter the CCS reads, just use those with very high quality. You can tune that by using the RS_IsoSeq protocol or just the CCS protocol (RS_ReadsOfInsert). You will lose of lot of slightly lower quality data but then you have zero extra work.

              2. Run Iso-Seq protocol in both parts: classify and cluster. This means you must have access to SMRTPortal/SMRTAnalysis and it will be computationally intensive. However this way you error correct PacBio with itself and you don't have to run programs yourself. If you have issue with running SMRTPortal, PacBio tech support is there to help.

              3. Hybrid error correction using either LSC+IDP. I don't recommend PacBioCA because the authors themselves don't use it for transcriptome (Iso-Seq) correction except for that one paper. They use it for genomes. LSC+IDP's author is at least still actively developing his software and he will respond to your emails.


              If you want to the quantification, there are some ways to go about combining the two. But since your post is currently focused on error correction and I'm trying to understand what your end goal is, I'll stop here for now.

              Comment


              • #8
                Originally posted by GenoMax View Post
                Transcriptome is ok. What I meant was is this data from an organism where the genome has been sequenced (i.e. you have a reference available). If you do then first thing you should do is align your data (illumina and PacBio) to get an idea of what the sequence quality (in terms of alignment) looks like. Is the transcriptome known for this organism or is that something you are trying to put together?

                PacBio data is not necessarily of lower quality but there is a bigger chance that it may have more errors in it compared to Illumina data.

                Did you get any kind of statistics (a report) from whoever you got the PacBio sequence done from? It would be useful to know "mean read length", "number of reads" etc. I don't have experience with IsoSeq data but if there are specific metrics for that you may want to post them.
                Sorry! Yes, there is a genome that was sequenced for this organism that my PI is trying to put together. The transcriptome is something that I am trying to put together. In particular I am interested in what the transcriptome looks like during the first few days of development (which is when the RNA was extracted and sent off for sequencing).

                I can ask my professor for the stats on the pacbio sequencing and get back to you with something more useful for you to look at.

                Comment


                • #9
                  Originally posted by Magdoll View Post
                  Hi there,

                  First, a comprehensive resource of PacBio Iso-Seq is here:
                  https://github.com/PacificBiosciences/cDNA_primer/wiki

                  The section on error correction using hybrid data is here (BUT before you stop here, read below):
                  https://github.com/PacificBioscience...na-short-reads

                  Given that you are not a bioinformatician by training and you have limited time to process the PacBio Iso-Seq + Illumina RNA-seq data, may I ask what the goal of trying to combine them are?
                  Sure. Something interesting happens in the first few days of development in the organism I am studying, and in order to know what genes are active in this process, I'd like to see which transcripts are present at different time points, with the end goal of 1) quantifying the expression values of the genes 2) identifying different isoforms present 3) looking at how expression values change over time. Hopefully this will give us more insight into what we should be looking for when trying to dissect this particular phenomenon.

                  I just happen to have data sets from the same few time-points from both illumina and pacbio iso-seq, so what I'd like to do is use the short reads to error correct the long reads so that in the end, I have long reads which represent full-length transcripts that are also very high-quality higher in quality than the pacbio reads by themselves).

                  Originally posted by Magdoll View Post
                  If the goal is to do gene annotation and there is a reference genome available (even if it's just a related species), you can align the PacBio CCS reads (better yet, if you first go through the first part of PacBio's RS_IsoSeq protocol to get the full-length reads only) alone to the genome. It will work. It won't be super pretty but aligners like GMAP and STAR will be able to align most of the CCS reads (tutorial is in the wiki). This is the path of least resistance. You can also align the short reads separately using TopHat2 and the sort. Then take a look at that in your genome browser (IGV?) and see what you want to do next.
                  I don't think we have pacbio CCS reads, just single-molecule long reads, although I can double check to make sure that I am correct about that.

                  If you do not have a reference genome, and you insist on getting 99-100% accuracy with the PacBio reads, you have a few choices (from least to most work):

                  1. Filter the CCS reads, just use those with very high quality. You can tune that by using the RS_IsoSeq protocol or just the CCS protocol (RS_ReadsOfInsert). You will lose of lot of slightly lower quality data but then you have zero extra work.

                  Originally posted by Magdoll View Post
                  2. Run Iso-Seq protocol in both parts: classify and cluster. This means you must have access to SMRTPortal/SMRTAnalysis and it will be computationally intensive. However this way you error correct PacBio with itself and you don't have to run programs yourself. If you have issue with running SMRTPortal, PacBio tech support is there to help.
                  There is a postdoc in the lab who works on the SMRTPortal program, and it's only available on the computer that she uses. I don't currently have access to it and I don't think either she or my PI would want me messing around with it, unfortunately.

                  Originally posted by Magdoll View Post
                  3. Hybrid error correction using either LSC+IDP. I don't recommend PacBioCA because the authors themselves don't use it for transcriptome (Iso-Seq) correction except for that one paper. They use it for genomes. LSC+IDP's author is at least still actively developing his software and he will respond to your emails.
                  This is the route that I am trying to take, because I don't need access to the SMRTPortal and I can do this on my lab-issued laptop. I have the paper on LSC so maybe I will consider doing that instead if I can get some assistance from the author.

                  For now, I'm only concentrating on error correction as this "project" is not just a part of my actual research, but a class project (class is taught by my PI). Anything beyond error correction will be a natural extension of my research, but performing the error correction is the hurdle I need to pass for now.

                  Comment


                  • #10
                    1. It is better if you can access SMRTPortal at least to get the CCS reads. It seems unreasonable to expect you to do all this work yet give you no computational support. Since you are not a computer science person, I would not encourage you to try to install SMRTAnalysis yourself (either on your own laptop or in the cloud).

                    2. Given your goal of eventually doing quantification, it seems like LSC + IDP is the fastest way for you to go. Fastest --- as in --- if you can get the author to help you with getting the software up and running, then you are very close to your end goal of analyzing the output. The lab that does LSC + IDP is here:
                    http://www.healthcare.uiowa.edu/labs/au/IDP/

                    3. I *still* think the fastest way to your problem is not (2) but using (1) because I consider that the path of least resistance. From what I see, the fastest thing is to get full-length CCS reads using the RS_IsoSeq classify protocol:
                    https://github.com/PacificBioscience...l-length-reads

                    After this step, every sequence is a full-length, high-quality CCS read. You can tune how high the expected CCS quality you want it to be. The tutorial uses a minimum predicted accuracy = 75 but for your purpose I suggest something like 90.

                    Once you have that, you *have* error corrected PacBio because CCS is intra-molecule consensus.

                    Then the next step is to cut down the redundantly mapped CCS reads (many of them may represent the same isoform) to the genome. I think cuffcompare or some tools from cufflinks can do that for you (FYI: there is a PacBio way to do this but it requires SMRTAnalysis and I'm helping you avoid it as much as possible).

                    After cuffxxxx , you now are left with a non-redundant unique set of PacBio full-length isoforms and move on to the next step of quantitating using short reads.

                    You can use RSEM, eXpress, etc by aligning short reads against the unique output from cuffxxxx.


                    Using this approach, you do not need to write a single line of code and all the software you use have tech support (PacBio) or wide community use (cufflinks suite and RSEM etc).

                    Good luck.

                    Comment


                    • #11
                      Just writing to let all of you who helped me know that I have figured out (for the most part!) what I am doing and how to do it. I decided to use LSC for the error correction, but am running my pacbio data through the RS_IsoSeq (ICE+Quiver) pipeline first, then removing redundancy using CD-HIT, and then 'tofu' to collapse the transcripts. After all lf that I can do the error correction, and use the transcript data for further analyses.

                      Thank you all.

                      Comment


                      • #12
                        Let us know if the error correction is helping any as compared to not, when you do alignments. Your original data may be of good enough quality already.

                        Comment

                        Working...
                        X