matching up paired-end reads after fastx-toolkit filtering

  • #31
    Originally posted by Brian Bushnell View Post
    Can you post the top 16 or so lines from the input file?

    The expected input is something like this:

    @blahA /1
    ACGT
    +
    ????
    @blahA /2
    ACGT
    +
    ????
    @blahB /2
    ACGT
    +
    ????
    @blahC /1
    ACGT
    +
    ????
    @blahC /2
    ACGT
    +
    ????


    In this case, "blahA /1" and "blahA /2" would be output as a pair, as would "blahC /1" and "blahC /2", while "blahB /2" would be output to singletons.

    If the reads don't contain a " /1" and " /2" or a " 1:" and a " 2:", it won't work.
    I see, so the trimmed files have to be interleaved prior to running this pairing program? I was trying to test it out, but to be honest this makes it kind of difficult to work with if you have to run a script or format the data beforehand. I'll try to test this and report back but I don't have time to keep working on this right now. Thanks.
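As a rough illustration of the pairing behaviour described in the quoted post (adjacent "/1" and "/2" reads with the same name form a pair, anything else becomes a singleton), here is a minimal Python sketch. It is not the code of any tool in this thread; the names `read_fastq`, `split_name`, and `fix_interleaving` are made up for the example, and it assumes plain 4-line FASTQ in a single interleaved stream.

```python
import io


def read_fastq(handle):
    """Yield (header, seq, plus, qual) tuples from a 4-line FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input (or a blank line) stops the reader
        seq = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        qual = handle.readline().rstrip("\n")
        yield header, seq, plus, qual


def split_name(header):
    """Return (base_name, mate) for headers ending in '/1' or '/2', with or without a space."""
    name = header[1:]  # drop the leading '@'
    if name.endswith("/1") or name.endswith("/2"):
        return name[:-2].rstrip(), name[-1]
    return name, None


def fix_interleaving(records):
    """Pair adjacent /1 and /2 reads with the same base name; everything else is a singleton."""
    pairs, singles = [], []
    pending = None  # the previous read, still waiting for a mate
    for rec in records:
        base, mate = split_name(rec[0])
        if pending is not None:
            prev_base, prev_mate = split_name(pending[0])
            if prev_mate == "1" and mate == "2" and prev_base == base:
                pairs.append((pending, rec))
                pending = None
                continue
            singles.append(pending)
        pending = rec
    if pending is not None:
        singles.append(pending)
    return pairs, singles


EXAMPLE = (
    "@blahA /1\nACGT\n+\n????\n"
    "@blahA /2\nACGT\n+\n????\n"
    "@blahB /2\nACGT\n+\n????\n"
    "@blahC /1\nACGT\n+\n????\n"
    "@blahC /2\nACGT\n+\n????\n"
)
pairs, singles = fix_interleaving(read_fastq(io.StringIO(EXAMPLE)))
```

With the example input from the quoted post, blahA and blahC come out as pairs and "blahB /2" ends up in the singletons. Because only one pending read is held at a time, this kind of approach needs very little memory, but it only works if mates are already next to each other in the file.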
    Last edited by SES; 02-24-2014, 08:28 PM.

    • #32
      Originally posted by SES View Post
      I also struggled to get that script to work and I was a little frustrated once I did get it working. Mainly because it uses a lot of memory, as someone commented previously, but also it strips the pair information off the output and creates hardcoded file names.

      I ended up writing my own tool for pairing reads called Pairfq. The problem I kept running into is that most approaches assume 4 line Fastq as input and the sequence name has to be in a certain format. That means you have to come up with different ways to solve this simple task if you are using Fasta or your sequence names are a little different. It was my aim to try and solve these problems.

      Here is an example of the usage:

      Code:
      $ pairfq makepairs -f s_1_1_trimmed.fq \
      -r s_1_2_trimmed.fq \
      -fp s_1_1_trimmed_p.fq \
      -rp s_1_2_trimmed_p.fq \
      -fs s_1_1_trimmed_s.fq \
      -rs s_1_2_trimmed_s.fq
      My observations are that the above command uses about 43% as much memory as the Python script listed above in the thread. This command is a bit slower because it is not making any assumptions about the format (see below). It is also possible to specify that an index should be used. For example,

      Code:
      $ pairfq makepairs -f s_1_1_trimmed.fq \
      -r s_1_2_trimmed.fq \
      -fp s_1_1_trimmed_p.fq \
      -rp s_1_2_trimmed_p.fq \
      -fs s_1_1_trimmed_s.fq \
      -rs s_1_2_trimmed_s.fq \
      --index
      This will result in almost no memory being used (15 MB RAM actually). The execution will be much slower with this option, but this is the only method to my knowledge that can handle pairing really large sequence sets without a big memory machine.
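To make the memory versus speed trade-off concrete, here is a small Python sketch of pairing through an on-disk index instead of an in-memory hash. This is not Pairfq's implementation: the dependency list later in the thread suggests Pairfq relies on Berkeley DB, whereas this sketch swaps in SQLite from the Python standard library, and the function name `pair_with_disk_index` and the `(base_name, record)` input shape are assumptions made for the example.

```python
import os
import sqlite3
import tempfile


def pair_with_disk_index(forward, reverse, db_path):
    """Yield pairing results while keeping the reverse reads on disk.

    forward, reverse: iterables of (base_name, record_text) tuples.
    Assumes read names are unique within each file.
    """
    db = sqlite3.connect(db_path)
    try:
        db.execute(
            "CREATE TABLE rev (name TEXT PRIMARY KEY, record TEXT, used INTEGER DEFAULT 0)"
        )
        with db:  # one transaction for the bulk load
            db.executemany("INSERT INTO rev (name, record) VALUES (?, ?)", reverse)
        # Stream the forward reads and look each one up in the index.
        for name, record in forward:
            row = db.execute("SELECT record FROM rev WHERE name = ?", (name,)).fetchone()
            if row is None:
                yield ("forward_single", record)
            else:
                db.execute("UPDATE rev SET used = 1 WHERE name = ?", (name,))
                yield ("pair", record, row[0])
        # Whatever was never matched is an unpaired reverse read.
        for (record,) in db.execute("SELECT record FROM rev WHERE used = 0 ORDER BY rowid"):
            yield ("reverse_single", record)
    finally:
        db.close()


# Toy usage with made-up records
with tempfile.TemporaryDirectory() as tmp:
    result = list(pair_with_disk_index(
        forward=[("a", "a/1 record"), ("b", "b/1 record")],
        reverse=[("c", "c/2 record"), ("a", "a/2 record")],
        db_path=os.path.join(tmp, "index.db"),
    ))
```

Every lookup goes to disk rather than to a dictionary in RAM, which is why an indexed run is expected to be slower but to stay within a small, roughly constant memory footprint.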

      The input can be Fasta or Fastq, compressed (with gzip or bzip2) or uncompressed, and the sequence identifiers can be in Casava 1.4 or 1.8+ format as explained on the project wiki (note that pairing the reads is just one of the functions of Pairfq). The outputs are separate files of paired and unpaired forward and reverse reads (which can be optionally compressed).
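For readers unsure what the two identifier styles look like, the sketch below shows one way to reduce both to a common (base name, mate number) key in Python. It is only an illustration under stated assumptions, not Pairfq's parser: the regular expressions and the function name `parse_read_name` are hypothetical, and real data may contain variants that these patterns do not cover.

```python
import re

# Casava 1.8+ style: "@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG"
# The mate number is the first field after the whitespace.
CASAVA_18 = re.compile(r"^(\S+)\s+([12]):[YN]:\d+(?::\S*)?$")

# Casava 1.4 style: "@HWUSI-EAS100R:6:73:941:1973#0/1" (also tolerates "name /1")
CASAVA_14 = re.compile(r"^(\S+?)\s*/([12])$")


def parse_read_name(header):
    """Return (base_name, mate) where mate is 1, 2, or None if no pair info is found."""
    name = header.lstrip("@>").strip()  # accept FASTQ '@' and FASTA '>' headers
    for pattern in (CASAVA_18, CASAVA_14):
        match = pattern.match(name)
        if match:
            return match.group(1), int(match.group(2))
    return name, None
```

Two reads are mates when they share the same base name and carry mate numbers 1 and 2, regardless of which of the two header styles the file uses.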

      Hopefully, this will save you some time and help to avoid crafting custom shell commands for this task.
      SES,

      I am trying to install dependencies. I could not find the version of Berkeley DB you listed with tar -xzvf db-5.1.19.tar.gz, so I installed the next closest one, db-5.1.29.tar.gz.

      However, when I run perl Makefile.PL I get the following:
      perl Makefile.PL
      WARNING: MIN_PERL_VERSION is not a known parameter.
      WARNING: CONFIGURE_REQUIRES is not a known parameter.
      WARNING: BUILD_REQUIRES is not a known parameter.
      WARNING: LICENSE is not a known parameter.
      Checking if your kit is complete...
      Looks good
      Warning: prerequisite BerkeleyDB 0.54 not found.
      Warning: prerequisite IPC::System::Simple 1.21 not found.
      Warning: prerequisite List::MoreUtils 0.33 not found.
      'BUILD_REQUIRES' is not a known MakeMaker parameter name.
      'CONFIGURE_REQUIRES' is not a known MakeMaker parameter name.
      'LICENSE' is not a known MakeMaker parameter name.
      'MIN_PERL_VERSION' is not a known MakeMaker parameter name.
      Writing Makefile for bin/pairfq

      I still have to install the IPC::System::Simple 1.21 and the List::MoreUtils 0.33 as I did not know these were dependencies until I ran the file, but is it not finding the BerkeleyDB 0.54 because I have an updated version?

      • #33
        Hi Smiller85, the immediate problem is that your version of ExtUtils::MakeMaker is too old to recognize those parameters. From what I can tell, those features were added to EU::MM version 6.48, which is about 6 years old. You can check your version with this command:

        Code:
        perl -MExtUtils::MakeMaker -e 'print ExtUtils::MakeMaker->VERSION'
        Thanks for noting this. I have never actually seen these warnings, and I filed an issue about this on the project site. That should be a quick fix. For now, please run the same command above, but replace "ExtUtils::MakeMaker" with "BerkeleyDB" so I can see what is happening on your system. I don't think you have BerkeleyDB installed. To be clear, you need the database backend and the Perl bindings, and the message is saying you don't have the Perl package (called "BerkeleyDB") installed. I wouldn't try to do this manually; do it through the CPAN shell or, better yet, use cpanminus and it will install all the deps for you. Also, please run "perl -v" so I can see what version of Perl you have.

        Let me know if you have any other questions. Feel free to send me an email, or post an issue on the project site.
        Last edited by SES; 03-18-2014, 09:03 AM.

        • #34
          SES, right after I sent you the error I noticed the Perl version requirement. My version is 5.8.8. Also, it looks like you are right about ExtUtils::MakeMaker being too old. My version is 6.30.

          I ran perl -MExtUtils::MakeMaker -e 'print BerkeleyDB->VERSION' and did not get any output.

          I had to install Berkeley DB manually because the server does not recognize cpanminus. I downloaded db-5.1.29.tar.gz and ran the tar command. I then ran the following commands to install it:
          ..dist/configure prefix=/home/smiller/blast/bin/pipeline-work/db-5.1.29/build_unix
          make
          make install

          I also figured that since pairfq is in its own folder (home/smiller/blast/bin/pipeline-work/pairfq), maybe that is where I went wrong, but then I also noticed my outdated version of Perl, and now, from the other command, that MakeMaker is outdated too.

          My school is currently on Spring Break, so I don't know how quick of a response I will get from the administrator on updating things like perl and the ExtUtils::MakeMaker.

          • #35
            Originally posted by smiller85 View Post
            SES, right after I sent you the error I noticed the Perl version requirement. My version is 5.8.8. Also, it looks like you are right about ExtUtils::MakeMaker being too old. My version is 6.30.

            I ran perl -MExtUtils::MakeMaker -e 'print BerkeleyDB->VERSION' and did not get any output.

            I had to install Berkeley DB manually because the server does not recognize cpanminus. I downloaded db-5.1.29.tar.gz and ran the tar command. I then ran the following commands to install it:
            ..dist/configure prefix=/home/smiller/blast/bin/pipeline-work/db-5.1.29/build_unix
            make
            make install

            I also figured that since pairfq is in its own folder (home/smiller/blast/bin/pipeline-work/pairfq), maybe that is where I went wrong, but then I also noticed my outdated version of Perl, and now, from the other command, that MakeMaker is outdated too.

            My school is currently on Spring Break, so I don't know how quick of a response I will get from the administrator on updating things like perl and the ExtUtils::MakeMaker.
            Thanks for the response. The version of EUMM you have is not even on CPAN anymore, meaning it is quite old and not supported. Though, I did add a check for this to solve that issue. Also, Perl version 5.10 or greater is required at this time, sorry about that (it is documented at least, under the installation instructions). This version was first released in 2007 but I know a lot of people are stuck with really old systems in academia (I know because I am). I will think about incorporating changes to allow older versions but that creates other problems. By the way, your command above is not quite correct (you were specifying two different modules). If you want to see if a module is installed, just try:
            Code:
            perl -MBerkeleyDB -e 1
            and if it prints nothing, it is installed. If it prints "Can't locate ... in @INC ..." then the module is not installed.

            Let me know if you are able to get help from your Sys Admin. I could make a version with no requirements if this is an issue, and that may serve most use cases. Though, my original goal was to solve the problem of having to pair hundreds of millions of reads and removing the deps would not solve that issue with the current design.

            • #36
              Originally posted by smiller85 View Post
              SES. Right after I sent you the error I noticed the perl version requirement. my version is 5.8.8. Also, looks like you are right about the ExtUtils::MakeMaker being too old. My version is 6.30.
              This should not be a problem anymore because I have created a standalone script (called "pairfq_lite.pl") that has no dependencies and I have tested it with Perl 5.6.2. If this is still of interest, you may want to try this script that is now part of Pairfq. I should note that this has fewer features, mainly no indexing function, but it will still handle FASTA/FASTQ and compressed or uncompressed data. The only real limitation will be memory if you have very large read sets and little RAM available on your computer. In that case, it would be worthwhile to install the one dependency of the main application and then try to install as before. Let me know if anything is unclear or if any issues arise.
              Last edited by SES; 03-20-2014, 12:40 PM.

              • #37
                Hi everybody, to add to the discussion, my advice is NOT to use fastx_toolkit for paired-end libraries.
                According to the authors, this tool was designed for SHORT reads only (e.g. shorter than 50 bp or 100 bp, depending on your sequencer's read length).
                FASTQ/A short-reads pre-processing tools

                • #38
                  Wrong message...see below
                  Last edited by ericaramos; 04-24-2014, 11:25 AM.

                  • #39
                    Hi Carmen,
                    I'm facing the same problem when running the script. Did you receive any answer about your problem?
                    If so, could you share it with us?

                    Thanks!

                    • #40
                      Originally posted by ericaramos View Post
                      Hi Carmen,
                      I'm facing the same problem when running the script. Did you receive any answer about your problem?
                      If so, could you share it with us?

                      Thanks!
                      If you look through the discussion above you can see that a number of people had similar issues, and this script doesn't appear to be maintained. I think the best solution may be to find another approach unless you want to work on that shell/python code.

                      Did you try the tool Pairfq that was mentioned in the thread above? I'd be happy to help with this if you run into any issues. We can help with the other approach as well, but it is hard to see what the issue is and it's also a challenge to keep code updated on a forum such as this.

                      • #41
                        Originally posted by carmeyeii View Post
                        Dear btmb,
                        I'm afraid I still cannot run it. Sorry to keep bothering?

                        I have corrected tabs and spaces to avoid getting the Unexpected indent Error,

                        but now I get:



                        Thanks again for any help,

                        Carmen
                        Originally posted by SES View Post
                        If you look through the discussion above you can see that a number of people had similar issues, and this script doesn't appear to be maintained. I think the best solution may be to find another approach unless you want to work on that shell/python code.

                        Did you try the tool Pairfq that was mentioned in the thread above? I'd be happy to help with this if you run into any issues. We can help with the other approach as well, but it is hard to see what the issue is and it's also a challenge to keep code updated on a forum such as this.



                        Ok, I didn't try using Pairfq, but I will.

                        Thank you for the answer!

                        • #42
                          Originally posted by SES View Post
                          If you look through the discussion above you can see that a number of people had similar issues, and this script doesn't appear to be maintained. I think the best solution may be to find another approach unless you want to work on that shell/python code.

                          Did you try the tool Pairfq that was mentioned in the thread above? I'd be happy to help with this if you run into any issues. We can help with the other approach as well, but it is hard to see what the issue is and it's also a challenge to keep code updated on a forum such as this.
                          Pairfq worked pretty well!! Thank you!

                          • #43
                            After removing the adapters with cutadapt I got asymmetric paired-end files, so I am looking for a script that can remove the orphan reads and make the data symmetric. I wrote one using a hash, but it is very slow. The script mentioned above is showing an error.

                            • #44
                              Originally posted by ranu1 View Post
                              After removing the adapters with cutadapt I got asymmetric paired-end files, so I am looking for a script that can remove the orphan reads and make the data symmetric. I wrote one using a hash, but it is very slow. The script mentioned above is showing an error.
                              We will need some more details in order to help. For example, which script are you referring to, the Python script mentioned on the first page of this thread? If that is the script you are attempting to use, I don't think you'll be able to get it working without some code changes, as mentioned above.

                              Also, what do you mean when you say the script is showing error? It is not possible to know what the issue is based on that information alone.

                              • #45
                                BBTools has a tool to quickly re-pair arbitrarily disordered reads based on their names.

                                For interleaved reads:

                                repair.sh in=reads.fq out=fixed.fq outsingle=single.fq

                                For paired reads in two files:

                                repair.sh in1=read1.fq in2=read2.fq out1=fixed1.fq out2=fixed2.fq outsingle=single.fq
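To show what re-pairing arbitrarily disordered reads by name amounts to, here is a minimal in-memory Python sketch. It is not how repair.sh is implemented; the function name `repair_by_name` and the `(base_name, record)` input shape are assumptions for the example, and the three return values only loosely mirror the out1, out2, and outsingle outputs in the command above.

```python
def repair_by_name(reads1, reads2):
    """Re-pair two read lists by base name, whatever their order.

    reads1, reads2: lists of (base_name, record) tuples.
    Returns (fixed1, fixed2, singles); fixed1[i] and fixed2[i] are mates.
    Assumes read names are unique within each list.
    """
    lookup = dict(reads2)  # base_name -> read 2 record, held fully in memory
    fixed1, fixed2, singles = [], [], []
    for name, record in reads1:
        mate = lookup.pop(name, None)
        if mate is None:
            singles.append(record)  # read 1 with no read 2
        else:
            fixed1.append(record)
            fixed2.append(mate)
    singles.extend(lookup.values())  # read 2 records that never found a read 1
    return fixed1, fixed2, singles


# Toy usage: the second list is out of order and each list has one orphan
fixed1, fixed2, singles = repair_by_name(
    [("A", "A/1"), ("B", "B/1"), ("C", "C/1")],
    [("C", "C/2"), ("D", "D/2"), ("A", "A/2")],
)
```

Holding one whole file in a dictionary is what makes this approach memory-hungry on large read sets, in contrast to fixing reads that are already interleaved, where only neighbouring records need to be compared.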

                                You can also repair simple broken interleaving much faster and with less memory, but this will not fix arbitrarily disordered reads, just reads that were interleaved and had some of the reads thrown away:

                                bbsplitpairs.sh in=reads.fq out=fixed.fq outsingle=single.fq fixinterleaving
                                Last edited by Brian Bushnell; 02-13-2015, 10:31 AM.
