
  • Seeking advice on PathSeq

    Hi all,

    I'm interested in using the PathSeq software and I was wondering if anyone had some advice on what sort of Amazon EC2 instances result in reasonable run times for full Illumina GAIIx or HiSeq data sets.

    Thanks in advance!

  • #2
    Hi,

    We used Amazon EC2 Large instances. In our hands, total RNA sequencing data (30 to 50 million reads) from a GAIIx takes about 1 to 1.2 days to finish when run in parallel on 20 nodes.

    We are working towards reducing these run times.

    Please let me know if you need more information on Pathseq.

    Thanks
    Chandra



    • #3
      PathSeq is very slow on EC2

      I also set up PathSeq on EC2. I was able to run it, but it was very slow: a test data set of 100,000 reads took 10 hours on 10 instances. I would appreciate any advice you can give me.



      • #4
        Hi,

        Could you send me the version of Pathseq you are running?

        Also, is your dataset RNA-based or DNA-based?


        Thanks
        Chandra



        • #5
          slow PathSeq

          Hi, Chandra,
          The version of PathSeq is 5.1. The data set is a sample of 100,000 reads drawn from the sample input files provided on the PathSeq web site. My problem is that there does not seem to be any difference whether I run it on 10 nodes or on 20 nodes. Both took a long time to run. I am concerned that the Hadoop cluster is not set up correctly.
          Thanks for the prompt reply. I look forward to hearing from you.

          Yi



          • #6
            Hi Yi,

            I will re-run it on the cloud and see how much time it will take.

            In our hands, other samples with 40 million reads run in 1 to 1.2 days.

            Meanwhile, please download the latest version from our website.

            I will get back to you as soon as possible.

            I greatly appreciate your comments.

            Thanks
            Chandra



            • #7
              pathseq logs

              Hi, Chandra,
              Thanks for the advice. I don't know whether it is helpful to you or not, but here is part of the Hadoop log I captured from the master node in EC2:


              rmr: cannot remove config: No such file or directory.
              rmr: cannot remove s3config: No such file or directory.
              rmr: cannot remove load: No such file or directory.
              Master data_loader
              11/07/08 20:27:30 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
              packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3242713623624356081/] [] /tmp/streamjob4418070527408526594.jar tmpDir=null
              11/07/08 20:27:30 INFO mapred.FileInputFormat: Total input paths to process : 3
              11/07/08 20:27:31 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
              11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
              11/07/08 20:27:31 INFO streaming.StreamJob: To kill this job, run:
              11/07/08 20:27:31 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0001
              11/07/08 20:27:31 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0001
              11/07/08 20:27:32 INFO streaming.StreamJob: map 0% reduce 0%
              11/07/08 20:27:44 INFO streaming.StreamJob: map 33% reduce 0%
              11/07/08 20:27:47 INFO streaming.StreamJob: map 67% reduce 0%
              11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
              11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
              11/07/08 20:47:03 INFO streaming.StreamJob: Output: load

              real 19m33.828s
              user 0m2.365s
              sys 0m0.665s
              Master loader completed
              ERROR: Bucket 'ami-yiweijob6-stat' does not exist
              Bucket 's3://ami-yiweijob6-stat/' removed
              Bucket 's3://ami-yiweijob6-stat/' created
              ERROR: Bucket 'ami-yiweijob6-output' does not exist
              Bucket 's3://ami-yiweijob6-output/' removed
              Bucket 's3://ami-yiweijob6-output/' created
              File s3://reads-yiwei-regeneron/input1.local saved as '/usr/local/hadoop-0.19.0/input1.local' (75 bytes in 0.0 seconds, 4.17 kB/s)
              File s3://reads-yiwei-regeneron/input10.local saved as '/usr/local/hadoop-0.19.0/input10.local' (76 bytes in 0.0 seconds, 3.08 kB/s)
              File s3://reads-yiwei-regeneron/input11.local saved as '/usr/local/hadoop-0.19.0/input11.local' (76 bytes in 0.0 seconds, 2.79 kB/s)
              File s3://reads-yiwei-regeneron/input12.local saved as '/usr/local/hadoop-0.19.0/input12.local' (76 bytes in 0.0 seconds, 3.16 kB/s)
              File s3://reads-yiwei-regeneron/input2.local saved as '/usr/local/hadoop-0.19.0/input2.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
              File s3://reads-yiwei-regeneron/input3.local saved as '/usr/local/hadoop-0.19.0/input3.local' (75 bytes in 0.0 seconds, 2.59 kB/s)
              File s3://reads-yiwei-regeneron/input4.local saved as '/usr/local/hadoop-0.19.0/input4.local' (75 bytes in 0.0 seconds, 3.34 kB/s)
              File s3://reads-yiwei-regeneron/input5.local saved as '/usr/local/hadoop-0.19.0/input5.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
              File s3://reads-yiwei-regeneron/input6.local saved as '/usr/local/hadoop-0.19.0/input6.local' (75 bytes in 0.0 seconds, 3.43 kB/s)
              File s3://reads-yiwei-regeneron/input7.local saved as '/usr/local/hadoop-0.19.0/input7.local' (75 bytes in 0.0 seconds, 3.18 kB/s)
              File s3://reads-yiwei-regeneron/input8.local saved as '/usr/local/hadoop-0.19.0/input8.local' (75 bytes in 0.0 seconds, 3.25 kB/s)
              File s3://reads-yiwei-regeneron/input9.local saved as '/usr/local/hadoop-0.19.0/input9.local' (75 bytes in 0.0 seconds, 3.46 kB/s)
              rmr: cannot remove test: No such file or directory.
              rmr: cannot remove maq: No such file or directory.
              Maq alignments + Duplicate remover
              11/07/08 20:47:10 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
              packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar7827192869392733442/] [] /tmp/streamjob8065656650743401458.jar tmpDir=null
              11/07/08 20:47:10 INFO mapred.FileInputFormat: Total input paths to process : 12
              11/07/08 20:47:11 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
              11/07/08 20:47:11 INFO streaming.StreamJob: Running job: job_201107082024_0002
              11/07/08 20:47:11 INFO streaming.StreamJob: To kill this job, run:
              11/07/08 20:47:11 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0002
              11/07/08 20:47:11 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0002
              11/07/08 20:47:12 INFO streaming.StreamJob: map 0% reduce 0%
              11/07/08 20:47:24 INFO streaming.StreamJob: map 8% reduce 0%
              11/07/08 20:47:25 INFO streaming.StreamJob: map 17% reduce 0%
              11/07/08 20:47:29 INFO streaming.StreamJob: map 33% reduce 0%
              11/07/08 20:47:30 INFO streaming.StreamJob: map 42% reduce 0%
              11/07/08 20:47:34 INFO streaming.StreamJob: map 58% reduce 0%
              11/07/08 20:47:35 INFO streaming.StreamJob: map 67% reduce 0%
              11/07/08 20:47:39 INFO streaming.StreamJob: map 75% reduce 0%

              This is from running 100,000 reads on 3 instances in EC2. I had to shut it down after 2 hours, as the processing did not look like it would finish in a reasonable amount of time. I hope this log is useful for your troubleshooting. Thanks again for your help!

              Yi Wei
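For anyone comparing logs like the one in this post, the following is a minimal Python sketch that measures how long each streaming job ran, using only the "Running job" and "Job complete" lines. It assumes the timestamps are in yy/MM/dd HH:mm:ss form, as they appear in the log; the function name `job_durations` is made up for illustration and is not part of PathSeq or Hadoop.

```python
import re
from datetime import datetime

# Matches lines such as:
# 11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
EVENT = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob: "
    r"(Running job|Job complete): (job_\d+_\d+)"
)


def job_durations(log_text):
    """Return {job_id: seconds between 'Running job' and 'Job complete'}."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        match = EVENT.search(line)
        if not match:
            continue
        # Assumption: the log timestamp is year/month/day with a 2-digit year.
        stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
        event, job_id = match.group(2), match.group(3)
        if event == "Running job":
            started[job_id] = stamp
        elif job_id in started:
            durations[job_id] = (stamp - started[job_id]).total_seconds()
    return durations


# Three lines taken from the first job in the log above.
sample = """\
11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
"""

for job_id, seconds in job_durations(sample).items():
    print(f"{job_id}: {seconds / 60:.1f} min")
```

On these sample lines the first job comes out at about 19.5 minutes, even though the map progress reaches 100% within seconds of the start, so the progress percentage alone says little about how long a stage will take.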



              • #8
                Hi Yi Wei,

                Thanks for your log file.

                I am re-running Pathseq with the sample file provided with the package. This sample file contains 6 million unique reads. I will share my results with you, once it is done.

                I have looked at the log file you posted. It shows no errors, so PathSeq seems to be running fine. As you know, we run 4 MAQ alignments, 2 Megablast alignments, and 2 BlastN alignments. These take time to finish, and up to a certain extent that time is independent of the number of reads that go into them. What I mean is as follows:

                If you have 100,000 reads ---- the run may take about 5 hours to finish
                If you have 1 million reads ----- the run may take more or less the same time as for 100,000 reads
                If you have 40 million reads ----- the run may take about 16-18 hours to finish
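
As a rough illustration of the three estimates above, this small Python sketch converts them into reads processed per hour. The 17-hour figure is an assumption (the midpoint of the quoted 16-18 hours), and `reads_per_hour` is just a helper name chosen for this example.

```python
# Runtime estimates quoted above: (number of reads, hours).
# Assumption: 17 h stands in for the quoted 16-18 h range.
estimates = [
    (100_000, 5.0),
    (1_000_000, 5.0),
    (40_000_000, 17.0),
]


def reads_per_hour(reads, hours):
    """Average throughput for one run."""
    return reads / hours


for reads, hours in estimates:
    rate = reads_per_hour(reads, hours)
    print(f"{reads:>11,} reads in {hours:4.1f} h -> {rate:>12,.0f} reads/hour")
```

Under these estimates the per-hour throughput grows with input size, which is what one would expect if a large part of the run time is fixed cost that does not depend on the number of reads.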

                I will post my latest results from the 6 million reads once I have them.

                Meanwhile, please let me know what your requirements are.

                1. How many reads do you have in your real sequencing file?
                2. Are the reads from Illumina?
                3. Are you using total RNA-Seq or WGS?

                Thanks
                Chandra



                • #9
                  PathSeq questions

                  Hi, Chandra,
                  Thanks for the advice. It is very helpful. What we are trying to do is look for potential pathogen sequences from Illumina RNA-Seq data. We are probably going to get 40-80 million reads from each sample. Can you send me a copy of Hadoop log from your run of 6 million reads in the sample data file provided by the PathSeq package? I would like to run the same 6 million reads and compare the logs.
                  Best Regards,

                  Yi Wei

                  P.S.
                  1. Do you have plans to modify PathSeq so that it can be run on internal computer clusters instead of Amazon EC2?
                  2. Are you considering using Bowtie or BWA for the initial filtering step, as they are much faster than MAQ?



                  • #10
                    Hi Yi Wei,

                    Yes, we are working towards getting BWA implemented in PathSeq. You are correct, BWA is much faster than MAQ.

                    We are also working on support for Hadoop-based internal computing clusters.

                    What kind of internal computer cluster do you have? Is it LSF?

                    Thanks
                    Chandra



                    • #11
                      Hi Yi Wei and Pathseq users,

                      Here is the log file from the Pathseq runs. I just removed some lines for clarity.

                      The log file is from a PathSeq run on 6 million unique reads (the sample file provided with the PathSeq package).

                      In summary:
                      Total time in hours (for all 20 nodes): 387.7
                      Wall-clock time: ~19 hours

                      The most important thing to highlight here is:
                      That 6 million reads took 19 hours does not mean that 60 million reads will take 10 times longer. In our hands, a sequencing run of 40 million reads takes about the same 19 hours.

                      Currently, we are working towards a faster PathSeq. In preliminary runs, the newer PathSeq takes half the time of the current version. Once we are done with validation, we will make a public release.

                      Please let me know if you have more questions or need help with PathSeq installation.

                      Thanks
                      Chandra


                      Log file:
                      ******
                      Master data_loader
                      **********************************
                      11/07/19 14:33:53 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3823448146028608527/] [] /tmp/streamjob9062470838490462284.jar tmpDir=null
                      11/07/19 14:33:54 INFO mapred.FileInputFormat: Total input paths to process : 20
                      11/07/19 14:33:55 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/19 14:33:55 INFO streaming.StreamJob: Running job: job_201107191423_0001
                      11/07/19 14:33:55 INFO streaming.StreamJob: To kill this job, run:
                      11/07/19 14:33:55 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0001
                      11/07/19 14:33:55 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0001
                      11/07/19 14:33:56 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/19 14:34:09 INFO streaming.StreamJob: map 20% reduce 0%
                      11/07/19 14:34:10 INFO streaming.StreamJob: map 40% reduce 0%
                      11/07/19 14:34:11 INFO streaming.StreamJob: map 60% reduce 0%
                      11/07/19 14:34:12 INFO streaming.StreamJob: map 80% reduce 0%
                      11/07/19 14:34:13 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/19 14:34:14 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/19 15:32:58 INFO streaming.StreamJob: Job complete: job_201107191423_0001
                      11/07/19 15:32:58 INFO streaming.StreamJob: Output: load

                      real 59m5.290s
                      user 0m3.278s
                      sys 0m1.108s
                      Master loader completed



                      Maq alignments + Duplicate remover
                      **********************************
                      11/07/19 15:33:07 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar2138415996895576783/] [] /tmp/streamjob4610713994932979234.jar tmpDir=null
                      11/07/19 15:33:08 INFO mapred.FileInputFormat: Total input paths to process : 21
                      11/07/19 15:33:08 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/19 15:33:08 INFO streaming.StreamJob: Running job: job_201107191423_0002
                      11/07/19 15:33:08 INFO streaming.StreamJob: To kill this job, run:
                      11/07/19 15:33:08 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0002
                      11/07/19 15:33:08 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0002
                      11/07/19 15:33:09 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/19 15:33:21 INFO streaming.StreamJob: map 24% reduce 0%
                      11/07/19 15:33:22 INFO streaming.StreamJob: map 52% reduce 0%
                      11/07/19 15:33:23 INFO streaming.StreamJob: map 76% reduce 0%
                      11/07/19 15:33:24 INFO streaming.StreamJob: map 86% reduce 0%
                      11/07/19 15:33:25 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/19 15:33:26 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 01:14:01 INFO streaming.StreamJob: Job complete: job_201107191423_0002
                      11/07/20 01:14:02 INFO streaming.StreamJob: Output: maq

                      real 580m56.490s
                      user 0m6.135s
                      sys 0m12.924s
                      Maq alignments + Duplicate remover completed

                      Repeat masker loader
                      ********************

                      real 2m15.171s
                      user 1m11.617s
                      sys 0m12.480s
                      Repeat masker loader completed

                      Run repeat masker
                      ********************
                      11/07/20 01:16:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar6903467901556213816/] [] /tmp/streamjob3668474814406944845.jar tmpDir=null
                      11/07/20 01:16:24 INFO mapred.FileInputFormat: Total input paths to process : 60
                      11/07/20 01:16:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 01:16:24 INFO streaming.StreamJob: Running job: job_201107191423_0003
                      11/07/20 01:16:24 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 01:16:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0003
                      11/07/20 01:16:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0003
                      11/07/20 01:16:25 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 01:16:36 INFO streaming.StreamJob: map 5% reduce 0%
                      11/07/20 01:16:37 INFO streaming.StreamJob: map 10% reduce 0%
                      11/07/20 01:16:38 INFO streaming.StreamJob: map 20% reduce 0%
                      11/07/20 01:16:40 INFO streaming.StreamJob: map 32% reduce 0%
                      11/07/20 01:16:41 INFO streaming.StreamJob: map 37% reduce 0%
                      11/07/20 01:16:42 INFO streaming.StreamJob: map 43% reduce 0%
                      11/07/20 01:16:43 INFO streaming.StreamJob: map 53% reduce 0%
                      11/07/20 01:16:45 INFO streaming.StreamJob: map 65% reduce 0%
                      11/07/20 01:16:46 INFO streaming.StreamJob: map 72% reduce 0%
                      11/07/20 01:16:47 INFO streaming.StreamJob: map 77% reduce 0%
                      11/07/20 01:16:48 INFO streaming.StreamJob: map 85% reduce 0%
                      11/07/20 01:16:49 INFO streaming.StreamJob: map 88% reduce 0%
                      11/07/20 01:16:51 INFO streaming.StreamJob: map 97% reduce 0%
                      11/07/20 01:16:52 INFO streaming.StreamJob: map 98% reduce 0%
                      11/07/20 01:16:54 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 03:59:21 INFO streaming.StreamJob: Job complete: job_201107191423_0003
                      11/07/20 03:59:21 INFO streaming.StreamJob: Output: repeat

                      real 162m59.218s
                      user 0m4.786s
                      sys 0m1.192s
                      Repeat masker runs completed

                      Deleted hdfs://ip-10-118-59-251.ec2.internal:50001/user/root/load
                      Master data_loader for Post
                      ********************
                      11/07/20 03:59:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/mapper_data_postsub.py, /mnt/hadoop/hadoop-unjar4739713752539699730/] [] /tmp/streamjob4058317523841970356.jar tmpDir=null
                      11/07/20 03:59:24 INFO mapred.FileInputFormat: Total input paths to process : 20
                      11/07/20 03:59:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 03:59:24 INFO streaming.StreamJob: Running job: job_201107191423_0004
                      11/07/20 03:59:24 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 03:59:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0004
                      11/07/20 03:59:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0004
                      11/07/20 03:59:25 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 03:59:36 INFO streaming.StreamJob: map 20% reduce 0%
                      11/07/20 03:59:37 INFO streaming.StreamJob: map 45% reduce 0%
                      11/07/20 03:59:38 INFO streaming.StreamJob: map 60% reduce 0%
                      11/07/20 03:59:39 INFO streaming.StreamJob: map 80% reduce 0%
                      11/07/20 03:59:40 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/20 03:59:41 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 04:18:27 INFO streaming.StreamJob: Job complete: job_201107191423_0004
                      11/07/20 04:18:28 INFO streaming.StreamJob: Output: load

                      real 19m5.252s
                      user 0m2.360s
                      sys 0m1.129s
                      Master loader completed

                      Postsubtraction loader
                      ********************
                      real 0m27.082s
                      user 0m10.882s
                      sys 0m1.435s

                      Postsubstraction on the Unmapped reads
                      ********************
                      11/07/20 04:19:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped.py, /mnt/hadoop/hadoop-unjar3151623736871864286/] [] /tmp/streamjob6660721182687954669.jar tmpDir=null
                      11/07/20 04:19:03 INFO mapred.FileInputFormat: Total input paths to process : 40
                      11/07/20 04:19:03 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 04:19:03 INFO streaming.StreamJob: Running job: job_201107191423_0005
                      11/07/20 04:19:03 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 04:19:03 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0005
                      11/07/20 04:19:03 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0005
                      11/07/20 04:19:04 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 04:19:15 INFO streaming.StreamJob: map 5% reduce 0%
                      11/07/20 04:19:16 INFO streaming.StreamJob: map 15% reduce 0%
                      11/07/20 04:19:17 INFO streaming.StreamJob: map 18% reduce 0%
                      11/07/20 04:19:18 INFO streaming.StreamJob: map 28% reduce 0%
                      11/07/20 04:19:19 INFO streaming.StreamJob: map 48% reduce 0%
                      11/07/20 04:19:20 INFO streaming.StreamJob: map 55% reduce 0%
                      11/07/20 04:19:21 INFO streaming.StreamJob: map 65% reduce 0%
                      11/07/20 04:19:23 INFO streaming.StreamJob: map 72% reduce 0%
                      11/07/20 04:19:24 INFO streaming.StreamJob: map 92% reduce 0%
                      11/07/20 04:19:25 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/20 04:19:26 INFO streaming.StreamJob: map 97% reduce 0%
                      11/07/20 04:19:30 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 10:28:32 INFO streaming.StreamJob: Job complete: job_201107191423_0005
                      11/07/20 10:28:32 INFO streaming.StreamJob: Output: postsub

                      real 369m31.146s
                      user 0m5.041s
                      sys 0m1.540s

                      Postsubstraction on the contigs
                      ********************
                      11/07/20 10:28:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postvelvet.py, /mnt/hadoop/hadoop-unjar1426923625300485254/] [] /tmp/streamjob2059410816131312926.jar tmpDir=null
                      11/07/20 10:28:34 INFO mapred.FileInputFormat: Total input paths to process : 18
                      11/07/20 10:28:34 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 10:28:34 INFO streaming.StreamJob: Running job: job_201107191423_0006
                      11/07/20 10:28:34 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 10:28:34 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0006
                      11/07/20 10:28:34 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0006
                      11/07/20 10:28:35 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 10:28:47 INFO streaming.StreamJob: map 28% reduce 0%
                      11/07/20 10:28:48 INFO streaming.StreamJob: map 56% reduce 0%
                      11/07/20 10:28:49 INFO streaming.StreamJob: map 89% reduce 0%
                      11/07/20 10:28:50 INFO streaming.StreamJob: map 94% reduce 0%
                      11/07/20 10:28:51 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 10:31:56 INFO streaming.StreamJob: Job complete: job_201107191423_0006
                      11/07/20 10:31:56 INFO streaming.StreamJob: Output: postsubvel

                      real 3m24.070s
                      user 0m1.577s
                      sys 0m0.238s
                      Postsubtraction completed

                      File '/usr/local/hadoop-0.19.0/output/Output.tar' stored as 's3://ami-ami-QFnew-foutput/Output.tar' (106291200 bytes in 16.4 seconds, 6.19 MB/s) [1 of 1]
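
To see where the time goes, the `real` timings printed after each stage in the log above can be added up. The Python sketch below does this with the values copied from the log; the stage labels are shortened by hand and `to_seconds` is a helper name made up for this example.

```python
import re

# 'real' timings copied from the log above, in the order the stages ran.
STAGE_REAL_TIMES = [
    ("Master data_loader", "59m5.290s"),
    ("Maq alignments + Duplicate remover", "580m56.490s"),
    ("Repeat masker loader", "2m15.171s"),
    ("Run repeat masker", "162m59.218s"),
    ("Master data_loader for Post", "19m5.252s"),
    ("Postsubtraction loader", "0m27.082s"),
    ("Postsubtraction on the unmapped reads", "369m31.146s"),
    ("Postsubtraction on the contigs", "3m24.070s"),
]


def to_seconds(real):
    """Convert a shell 'real' timing such as '59m5.290s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", real)
    if match is None:
        raise ValueError(f"unrecognised timing: {real!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


total = sum(to_seconds(real) for _, real in STAGE_REAL_TIMES)

for name, real in STAGE_REAL_TIMES:
    seconds = to_seconds(real)
    share = 100 * seconds / total
    print(f"{name:<40} {seconds / 3600:6.2f} h  {share:5.1f}%")
print(f"{'Total':<40} {total / 3600:6.2f} h")
```

The stages sum to roughly 20 hours, in line with the ~19 hours quoted in the summary, and the MAQ alignment stage alone accounts for close to half of it.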

                      Results Summary:
                      *************
                      Results summary:

                      Substraction Pathseq_Cloud
                      Total number of reads 6369435
                      Total number of reads after duplicate remover 6369435
                      Total number of unmapped reads after Maq 1 alignment (Database: MAQ1) 1829265
                      Total number of unmapped reads after Maq 2 alignment (Database: MAQ2) 504427
                      Total number of unmapped reads after Maq 3 alignment (Database: MAQ3) 488954
                      Total number of unmapped reads after Maq 4 alignment (Database: MAQ4) 485479
                      Total number of unmapped reads after repeat masker 365393
                      Total number of unmapped reads after Megablast (Database: BLAST1) 70343
                      Total number of unmapped reads after Megablast (Database: BLAST2) 33808
                      Total number of unmapped reads after BlastN1 (Database: BLAST1) 33768
                      Total number of unmapped reads after BlastN2 (Database: BLAST2) 33746
                      Total number of unmapped reads 33746
                      Reads after computational subtraction (Unmapped reads) unmappedreads.fq1
                      Contigs from unmapped reads contigs.fq1
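
The results summary above can also be read as a filtering funnel. This Python sketch, using the counts copied from the summary, shows how many reads each step removed and what fraction of the input is still unmapped afterwards; the short step labels and the `funnel` helper are made up for this example.

```python
# Read counts copied from the results summary above.
COUNTS = [
    ("Input reads", 6369435),
    ("After duplicate remover", 6369435),
    ("After Maq 1", 1829265),
    ("After Maq 2", 504427),
    ("After Maq 3", 488954),
    ("After Maq 4", 485479),
    ("After repeat masker", 365393),
    ("After Megablast (BLAST1)", 70343),
    ("After Megablast (BLAST2)", 33808),
    ("After BlastN1 (BLAST1)", 33768),
    ("After BlastN2 (BLAST2)", 33746),
]


def funnel(counts):
    """Return (step, remaining, removed_by_step, fraction_of_input) rows."""
    total = counts[0][1]
    previous = total
    rows = []
    for step, remaining in counts:
        rows.append((step, remaining, previous - remaining, remaining / total))
        previous = remaining
    return rows


rows = funnel(COUNTS)
for step, remaining, removed, fraction in rows:
    print(f"{step:<28} {remaining:>9,} left  {removed:>9,} removed  {fraction:8.3%}")
```

In this run the first MAQ alignment removes the bulk of the reads, and only about half a percent of the input remains unmapped at the end.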



                      • #12
                        Hi, Chandra,
                        Thanks so much for the log! It is very helpful. I am looking forward to the new version of PathSeq. In the meantime, I will start running PathSeq with our data and keep you updated on our progress.

                        Yi Wei



                        • #13
                          Pathseq AMI problem

                          I am having problems building my own AMI as explained in the last step of the PathSeq installation.

                          If I execute ./create-Ami.com, I receive an error that the AMI is not available. I assume that I just need this command to create an instance with PathSeq installed on it?

                          I am working on developing a GUI for PathSeq, so it would be nice if you could give me the documentation for the tool, if you have any.

                          It would be great if you could help me!



                          • #14
                            Hi, Chandra,
                            The following is my experience running PathSeq with my own data:
                            I started with ~70 million human RNA-Seq 100 bp Illumina reads. I prefiltered these reads by running Bowtie against the human 37.1 reference genome on my own desktop and ended up with ~11 million reads. After running Preprocessed_Reads.com, I had ~1.6 million reads. These reads were then uploaded to S3 and PathSeq was launched on 20 nodes. PathSeq ran for more than 60 hours without finishing and I had to terminate the whole job. Here is the log I got from the master node.

                            Master data_loader
                            11/07/23 18:29:41 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar9078098862757602177/] [] /tmp/streamjob1077975618755703344.jar tmpDir=null
                            11/07/23 18:29:42 INFO mapred.FileInputFormat: Total input paths to process : 20
                            11/07/23 18:29:42 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/23 18:29:42 INFO streaming.StreamJob: Running job: job_201107231823_0001
                            11/07/23 18:29:42 INFO streaming.StreamJob: To kill this job, run:
                            11/07/23 18:29:42 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0001
                            11/07/23 18:29:42 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0001
                            11/07/23 18:29:43 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/23 18:29:55 INFO streaming.StreamJob: map 10% reduce 0%
                            11/07/23 18:29:56 INFO streaming.StreamJob: map 30% reduce 0%
                            11/07/23 18:29:57 INFO streaming.StreamJob: map 45% reduce 0%
                            11/07/23 18:29:58 INFO streaming.StreamJob: map 60% reduce 0%
                            11/07/23 18:29:59 INFO streaming.StreamJob: map 80% reduce 0%
                            11/07/23 18:30:00 INFO streaming.StreamJob: map 100% reduce 0%
                            11/07/23 18:54:42 INFO streaming.StreamJob: Job complete: job_201107231823_0001
                            11/07/23 18:54:42 INFO streaming.StreamJob: Output: load

                            real 25m1.703s
                            user 0m2.231s
                            sys 0m0.320s
                            Master loader completed

                            Maq alignments + Duplicate remover
                            11/07/23 18:54:51 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar6628816665337722828/] [] /tmp/streamjob6711648814308398287.jar tmpDir=null
                            11/07/23 18:54:52 INFO mapred.FileInputFormat: Total input paths to process : 21
                            11/07/23 18:54:52 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/23 18:54:52 INFO streaming.StreamJob: Running job: job_201107231823_0002
                            11/07/23 18:54:52 INFO streaming.StreamJob: To kill this job, run:
                            11/07/23 18:54:52 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0002
                            11/07/23 18:54:52 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0002
                            11/07/23 18:54:53 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/23 18:55:04 INFO streaming.StreamJob: map 29% reduce 0%
                            11/07/23 18:55:05 INFO streaming.StreamJob: map 43% reduce 0%
                            11/07/23 18:55:06 INFO streaming.StreamJob: map 52% reduce 0%
                            11/07/23 18:55:07 INFO streaming.StreamJob: map 67% reduce 0%
                            11/07/23 18:55:08 INFO streaming.StreamJob: map 90% reduce 0%
                            11/07/23 18:55:09 INFO streaming.StreamJob: map 100% reduce 0%
                            11/07/24 03:34:56 INFO streaming.StreamJob: Job complete: job_201107231823_0002
                            11/07/24 03:34:56 INFO streaming.StreamJob: Output: maq

                            real 520m5.778s
                            user 0m7.229s
                            sys 0m2.200s
                            Maq alignments + Duplicate remover completed

                            Run repeat masker
                            11/07/24 03:38:54 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar682608492557274042/] [] /tmp/streamjob8244164963266699673.jar tmpDir=null
                            11/07/24 03:38:55 INFO mapred.FileInputFormat: Total input paths to process : 108
                            11/07/24 03:38:56 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/24 03:38:56 INFO streaming.StreamJob: Running job: job_201107231823_0003
                            11/07/24 03:38:56 INFO streaming.StreamJob: To kill this job, run:
                            11/07/24 03:38:56 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0003
                            11/07/24 03:38:56 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0003
                            11/07/24 03:38:57 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/24 03:39:09 INFO streaming.StreamJob: map 3% reduce 0%
                            11/07/24 03:39:10 INFO streaming.StreamJob: map 7% reduce 0%
                            11/07/24 03:39:11 INFO streaming.StreamJob: map 11% reduce 0%
                            11/07/24 03:39:12 INFO streaming.StreamJob: map 17% reduce 0%
                            11/07/24 03:39:14 INFO streaming.StreamJob: map 21% reduce 0%
                            11/07/24 03:39:15 INFO streaming.StreamJob: map 26% reduce 0%
                            11/07/24 03:39:16 INFO streaming.StreamJob: map 29% reduce 0%
                            11/07/24 03:39:17 INFO streaming.StreamJob: map 35% reduce 0%
                            11/07/24 03:39:19 INFO streaming.StreamJob: map 40% reduce 0%
                            11/07/24 03:39:20 INFO streaming.StreamJob: map 44% reduce 0%
                            11/07/24 03:39:21 INFO streaming.StreamJob: map 47% reduce 0%
                            11/07/24 03:39:22 INFO streaming.StreamJob: map 53% reduce 0%
                            11/07/24 03:39:24 INFO streaming.StreamJob: map 55% reduce 0%
                            11/07/24 03:39:25 INFO streaming.StreamJob: map 56% reduce 0%
                            11/07/24 10:06:27 INFO streaming.StreamJob: map 57% reduce 0%
                            11/07/24 10:22:28 INFO streaming.StreamJob: map 58% reduce 0%
                            11/07/24 10:27:19 INFO streaming.StreamJob: map 59% reduce 0%
                            11/07/24 10:35:28 INFO streaming.StreamJob: map 60% reduce 0%
                            11/07/24 10:42:34 INFO streaming.StreamJob: map 61% reduce 0%
                            11/07/24 10:44:05 INFO streaming.StreamJob: map 62% reduce 0%
                            11/07/24 10:55:36 INFO streaming.StreamJob: map 63% reduce 0%
                            11/07/24 11:16:47 INFO streaming.StreamJob: map 64% reduce 0%
                            11/07/24 11:19:35 INFO streaming.StreamJob: map 65% reduce 0%
                            11/07/24 11:24:42 INFO streaming.StreamJob: map 66% reduce 0%
                            11/07/24 11:41:27 INFO streaming.StreamJob: map 67% reduce 0%
                            11/07/24 11:42:06 INFO streaming.StreamJob: map 68% reduce 0%
                            11/07/24 11:44:15 INFO streaming.StreamJob: map 69% reduce 0%
                            11/07/24 11:46:07 INFO streaming.StreamJob: map 70% reduce 0%
                            11/07/24 11:47:14 INFO streaming.StreamJob: map 71% reduce 0%
                            11/07/24 11:51:46 INFO streaming.StreamJob: map 72% reduce 0%
                            11/07/24 11:59:00 INFO streaming.StreamJob: map 73% reduce 0%
                            11/07/24 12:01:08 INFO streaming.StreamJob: map 74% reduce 0%
                            11/07/24 12:01:29 INFO streaming.StreamJob: map 75% reduce 0%
                            11/07/24 12:03:11 INFO streaming.StreamJob: map 76% reduce 0%
                            11/07/24 12:04:47 INFO streaming.StreamJob: map 77% reduce 0%
                            11/07/24 12:14:29 INFO streaming.StreamJob: map 78% reduce 0%
                            11/07/24 12:14:52 INFO streaming.StreamJob: map 79% reduce 0%
                            11/07/24 12:17:10 INFO streaming.StreamJob: map 80% reduce 0%
                            11/07/24 12:19:49 INFO streaming.StreamJob: map 81% reduce 0%
                            11/07/24 12:26:02 INFO streaming.StreamJob: map 82% reduce 0%
                            11/07/24 12:27:51 INFO streaming.StreamJob: map 83% reduce 0%
                            11/07/24 12:30:37 INFO streaming.StreamJob: map 84% reduce 0%
                            11/07/24 12:33:11 INFO streaming.StreamJob: map 85% reduce 0%
                            11/07/24 12:34:36 INFO streaming.StreamJob: map 86% reduce 0%
                            11/07/24 12:40:57 INFO streaming.StreamJob: map 87% reduce 0%
                            11/07/24 12:41:13 INFO streaming.StreamJob: map 88% reduce 0%
                            11/07/24 12:42:51 INFO streaming.StreamJob: map 89% reduce 0%
                            11/07/24 12:51:58 INFO streaming.StreamJob: map 90% reduce 0%
                            11/07/24 12:56:46 INFO streaming.StreamJob: map 91% reduce 0%
                            11/07/24 13:01:17 INFO streaming.StreamJob: map 92% reduce 0%
                            11/07/24 13:06:20 INFO streaming.StreamJob: map 93% reduce 0%
                            11/07/24 13:13:11 INFO streaming.StreamJob: map 94% reduce 0%
                            11/07/24 13:18:50 INFO streaming.StreamJob: map 95% reduce 0%
                            11/07/24 13:19:26 INFO streaming.StreamJob: map 96% reduce 0%
                            11/07/24 13:23:19 INFO streaming.StreamJob: map 97% reduce 0%
                            11/07/24 13:24:00 INFO streaming.StreamJob: map 98% reduce 0%
                            11/07/24 13:28:37 INFO streaming.StreamJob: map 99% reduce 0%
                            11/07/24 13:36:03 INFO streaming.StreamJob: map 100% reduce 0%
                            11/07/24 22:08:03 INFO streaming.StreamJob: Job complete: job_201107231823_0003
                            11/07/24 22:08:03 INFO streaming.StreamJob: Output: repeat

                            real 1109m9.111s
                            user 0m10.320s
                            sys 0m1.582s
                            Repeat masker runs completed


                            Postsubstraction on the Unmapped reads
                            11/07/24 22:28:29 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped.py, /mnt/hadoop/hadoop-unjar1994272368229376705/] [] /tmp/streamjob104650512695835986.jar tmpDir=null
                            11/07/24 22:28:29 INFO mapred.FileInputFormat: Total input paths to process : 40
                            11/07/24 22:28:30 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/24 22:28:30 INFO streaming.StreamJob: Running job: job_201107231823_0005
                            11/07/24 22:28:30 INFO streaming.StreamJob: To kill this job, run:
                            11/07/24 22:28:30 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0005
                            11/07/24 22:28:30 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0005
                            11/07/24 22:28:31 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/24 22:28:43 INFO streaming.StreamJob: map 5% reduce 0%
                            11/07/24 22:28:44 INFO streaming.StreamJob: map 8% reduce 0%
                            11/07/24 22:28:45 INFO streaming.StreamJob: map 15% reduce 0%
                            11/07/24 22:28:46 INFO streaming.StreamJob: map 20% reduce 0%
                            11/07/24 22:28:48 INFO streaming.StreamJob: map 38% reduce 0%
                            11/07/24 22:28:49 INFO streaming.StreamJob: map 55% reduce 0%
                            11/07/24 22:28:50 INFO streaming.StreamJob: map 58% reduce 0%
                            11/07/24 22:28:51 INFO streaming.StreamJob: map 65% reduce 0%
                            11/07/24 22:28:52 INFO streaming.StreamJob: map 72% reduce 0%
                            11/07/24 22:28:53 INFO streaming.StreamJob: map 90% reduce 0%
                            11/07/24 22:28:54 INFO streaming.StreamJob: map 100% reduce 0%

                            Job 5 ran for more than 34 hours before I terminated it.

                            From the output in S3 buckets I estimate that there were ~1 million reads after Maq subtraction and ~ 200,000 reads after repeat masking and Blast. PathSeq ran much slower than I expected and I don’t know what I did wrong. Can you take a look at the logs and let me know what you think?
                            Thanks so much for your help!

                            Yi Wei
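
One way to see where the time goes in a log like the one above is to measure the gap between consecutive progress lines. In that log, the repeat masker job sat at map 56% from 03:39:25 to 10:06:27, and then took roughly another 8.5 hours between reporting map 100% and "Job complete". The following is a minimal sketch of that measurement, not part of PathSeq or Hadoop: it assumes the `yy/MM/dd HH:mm:ss` timestamp format seen in the log, and the names `parse_events` and `largest_gaps` are made up for this illustration.

```python
import re
from datetime import datetime

# Matches the two kinds of StreamJob lines we care about in the log:
# "map N% reduce M%" progress updates and the final "Job complete" line.
PROGRESS = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob:\s+"
    r"(?:map (\d+)% reduce (\d+)%|Job complete: (\S+))"
)


def parse_events(lines):
    """Return (timestamp, label) pairs for progress and completion lines."""
    events = []
    for line in lines:
        m = PROGRESS.match(line.strip())
        if not m:
            continue  # skip WARN lines, packageJobJar lines, timings, etc.
        # Assumption: timestamps are year/month/day, as the job IDs suggest.
        ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
        if m.group(2) is not None:
            label = f"map {m.group(2)}%"
        else:
            label = f"complete {m.group(4)}"
        events.append((ts, label))
    return events


def largest_gaps(events, top=3):
    """Return the `top` longest intervals between consecutive events."""
    gaps = [
        (later[0] - earlier[0], earlier[1], later[1])
        for earlier, later in zip(events, events[1:])
    ]
    return sorted(gaps, key=lambda g: g[0], reverse=True)[:top]


# Four lines copied from the repeat masker job in the log above.
sample = """\
11/07/24 03:39:25 INFO streaming.StreamJob: map 56% reduce 0%
11/07/24 10:06:27 INFO streaming.StreamJob: map 57% reduce 0%
11/07/24 13:36:03 INFO streaming.StreamJob: map 100% reduce 0%
11/07/24 22:08:03 INFO streaming.StreamJob: Job complete: job_201107231823_0003
""".splitlines()

for gap, before, after in largest_gaps(parse_events(sample)):
    print(f"{gap} between '{before}' and '{after}'")
```

If the whole master log is fed in at once, the intervals between one job's "Job complete" line and the next job's first progress line will show up as gaps too, so it may be easier to run this per job.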



                            • #15
                              Same Problem

                              Hi,

                              I just wanted to tell you that I have the same problem.
                              I looked through the log files and saw that there is an exception, and I guess this is the reason the results are not good.

                              I tested it with some sequenced data from Illumina, but unfortunately I get no reliable results. The output file always shows that no reads were identified as human or as well-known pathogens, which is not possible.
                              It also took very long, although I had used only a small amount of data.

                              Here is the exception I found:

                              ///////////////////////////////////////////////////////////////
                              Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
                              at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)
                              at org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(GangliaContext.java:138)
                              at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(GangliaContext.java:123)
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(AbstractMetricsContext.java:304)
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(AbstractMetricsContext.java:290)
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(AbstractMetricsContext.java:50)
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext$1.run(AbstractMetricsContext.java:249)
                              at java.util.TimerThread.mainLoop(Unknown Source)
                              at java.util.TimerThread.run(Unknown Source)

                              /////////////////////////////////////////////////////////////////////////

                              I guess there is a problem with the Hadoop cluster. I am now trying a newer version of Hadoop; maybe this will change something.

                              But I am quite sure that this is not a config problem.

                              I will tell you if I find a solution!


                              with best regards,
                              Tomi
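
When triaging a trace like the one above, it can help to pull out which thread threw the exception and what the innermost frame was. The sketch below is only an illustration (the function name `summarize_trace` is hypothetical, not a Hadoop or PathSeq utility). For this trace it reports the thread "Timer thread for monitoring dfs" and an innermost frame in `GangliaContext`, so the exception comes from the metrics reporting code. The trace alone does not establish whether it is related to the slow runs or the empty results.

```python
import re

# Header line of a Java uncaught-exception report, and one stack frame line.
HEADER = re.compile(r'^Exception in thread "([^"]+)" (\S+)')
FRAME = re.compile(r"^at ([\w.$]+)\.([\w$<>]+)\(")


def summarize_trace(text):
    """Extract thread name, exception class, and innermost frame from a trace."""
    thread = exc = None
    frames = []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            thread, exc = header.group(1), header.group(2)
            continue
        frame = FRAME.match(line)
        if frame:
            # Store as "fully.qualified.Class.method".
            frames.append(f"{frame.group(1)}.{frame.group(2)}")
    return {
        "thread": thread,
        "exception": exc,
        "top_frame": frames[0] if frames else None,
    }


# First three lines of the trace posted above.
trace = """\
Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(GangliaContext.java:195)
at org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(GangliaContext.java:138)
"""

print(summarize_trace(trace))
```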

                              Originally posted by yiweiny View Post
                              Hi, Chandra,
                              The following is my experience running PathSeq with my own data:
                              I started with ~70 million human RNA-Seq 100 bp Illumina reads. I prefiltered these reads by running Bowtie against human 37.1 reference genome in my own desktop and ended up with ~11 million reads. After running Preprocessed_Reads.com, I got ~1.6 million reads. These reads were then uploaded onto S3 and PathSeq was launched on 20 nodes. PathSeq ran for more than 60 hour without finishing and I had to terminate the whole job. Here is the log I got from the master node.

                              Master data_loader
                              11/07/23 18:29:41 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                              packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar9078098862757602177/] [] /tmp/streamjob1077975618755703344.jar tmpDir=null
                              11/07/23 18:29:42 INFO mapred.FileInputFormat: Total input paths to process : 20
                              11/07/23 18:29:42 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                              11/07/23 18:29:42 INFO streaming.StreamJob: Running job: job_201107231823_0001
                              11/07/23 18:29:42 INFO streaming.StreamJob: To kill this job, run:
                              11/07/23 18:29:42 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
                              -kill job_201107231823_0001
                              11/07/23 18:29:42 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0001
                              11/07/23 18:29:43 INFO streaming.StreamJob: map 0% reduce 0%
                              11/07/23 18:29:55 INFO streaming.StreamJob: map 10% reduce 0%
                              11/07/23 18:29:56 INFO streaming.StreamJob: map 30% reduce 0%
                              11/07/23 18:29:57 INFO streaming.StreamJob: map 45% reduce 0%
                              11/07/23 18:29:58 INFO streaming.StreamJob: map 60% reduce 0%
                              11/07/23 18:29:59 INFO streaming.StreamJob: map 80% reduce 0%
                              11/07/23 18:30:00 INFO streaming.StreamJob: map 100% reduce 0%
                              11/07/23 18:54:42 INFO streaming.StreamJob: Job complete: job_201107231823_0001
                              11/07/23 18:54:42 INFO streaming.StreamJob: Output: load

                              real 25m1.703s
                              user 0m2.231s
                              sys 0m0.320s
                              Master loader completed

                              Maq alignments + Duplicate remover
                              11/07/23 18:54:51 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                              packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MA
                              Qunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar6628816665337722828/] [] /tmp/streamjob6711648814308398287.jar tmpDir=null
                              11/07/23 18:54:52 INFO mapred.FileInputFormat: Total input paths to process : 21
                              11/07/23 18:54:52 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                              11/07/23 18:54:52 INFO streaming.StreamJob: Running job: job_201107231823_0002
                              11/07/23 18:54:52 INFO streaming.StreamJob: To kill this job, run:
                              11/07/23 18:54:52 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
                              -kill job_201107231823_0002
                              11/07/23 18:54:52 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0002
                              11/07/23 18:54:53 INFO streaming.StreamJob: map 0% reduce 0%
                              11/07/23 18:55:04 INFO streaming.StreamJob: map 29% reduce 0%
                              11/07/23 18:55:05 INFO streaming.StreamJob: map 43% reduce 0%
                              11/07/23 18:55:06 INFO streaming.StreamJob: map 52% reduce 0%
                              11/07/23 18:55:07 INFO streaming.StreamJob: map 67% reduce 0%
                              11/07/23 18:55:08 INFO streaming.StreamJob: map 90% reduce 0%
                              11/07/23 18:55:09 INFO streaming.StreamJob: map 100% reduce 0%
                              11/07/24 03:34:56 INFO streaming.StreamJob: Job complete: job_201107231823_0002
                              11/07/24 03:34:56 INFO streaming.StreamJob: Output: maq

                              real 520m5.778s
                              user 0m7.229s
                              sys 0m2.200s
                              Maq alignments + Duplicate remover completed

                              Run repeat masker
                              11/07/24 03:38:54 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                              packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root
                              /mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar682608492557274042/] [] /tmp/streamjob8244164963266699673.jar tmpDir=null
                              11/07/24 03:38:55 INFO mapred.FileInputFormat: Total input paths to process : 108
                              11/07/24 03:38:56 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                              11/07/24 03:38:56 INFO streaming.StreamJob: Running job: job_201107231823_0003
                              11/07/24 03:38:56 INFO streaming.StreamJob: To kill this job, run:
                              11/07/24 03:38:56 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
                              -kill job_201107231823_0003
                              11/07/24 03:38:56 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0003
                              11/07/24 03:38:57 INFO streaming.StreamJob: map 0% reduce 0%
                              11/07/24 03:39:09 INFO streaming.StreamJob: map 3% reduce 0%
                              11/07/24 03:39:10 INFO streaming.StreamJob: map 7% reduce 0%
                              11/07/24 03:39:11 INFO streaming.StreamJob: map 11% reduce 0%
                              11/07/24 03:39:12 INFO streaming.StreamJob: map 17% reduce 0%
                              11/07/24 03:39:14 INFO streaming.StreamJob: map 21% reduce 0%
                              11/07/24 03:39:15 INFO streaming.StreamJob: map 26% reduce 0%
                              11/07/24 03:39:16 INFO streaming.StreamJob: map 29% reduce 0%
                              11/07/24 03:39:17 INFO streaming.StreamJob: map 35% reduce 0%
                              11/07/24 03:39:19 INFO streaming.StreamJob: map 40% reduce 0%
                              11/07/24 03:39:20 INFO streaming.StreamJob: map 44% reduce 0%
                              11/07/24 03:39:21 INFO streaming.StreamJob: map 47% reduce 0%
                              11/07/24 03:39:22 INFO streaming.StreamJob: map 53% reduce 0%
                              11/07/24 03:39:24 INFO streaming.StreamJob: map 55% reduce 0%
                              11/07/24 03:39:25 INFO streaming.StreamJob: map 56% reduce 0%
                              11/07/24 10:06:27 INFO streaming.StreamJob: map 57% reduce 0%
                              11/07/24 10:22:28 INFO streaming.StreamJob: map 58% reduce 0%
                              11/07/24 10:27:19 INFO streaming.StreamJob: map 59% reduce 0%
                              11/07/24 10:35:28 INFO streaming.StreamJob: map 60% reduce 0%
                              11/07/24 10:42:34 INFO streaming.StreamJob: map 61% reduce 0%
                              11/07/24 10:44:05 INFO streaming.StreamJob: map 62% reduce 0%
                              11/07/24 10:55:36 INFO streaming.StreamJob: map 63% reduce 0%
                              11/07/24 11:16:47 INFO streaming.StreamJob: map 64% reduce 0%
                              11/07/24 11:19:35 INFO streaming.StreamJob: map 65% reduce 0%
                              11/07/24 11:24:42 INFO streaming.StreamJob: map 66% reduce 0%
                              11/07/24 11:41:27 INFO streaming.StreamJob: map 67% reduce 0%
                              11/07/24 11:42:06 INFO streaming.StreamJob: map 68% reduce 0%
                              11/07/24 11:44:15 INFO streaming.StreamJob: map 69% reduce 0%
                              11/07/24 11:46:07 INFO streaming.StreamJob: map 70% reduce 0%
                              11/07/24 11:47:14 INFO streaming.StreamJob: map 71% reduce 0%
                              11/07/24 11:51:46 INFO streaming.StreamJob: map 72% reduce 0%
                              11/07/24 11:59:00 INFO streaming.StreamJob: map 73% reduce 0%
                              11/07/24 12:01:08 INFO streaming.StreamJob: map 74% reduce 0%
                              11/07/24 12:01:29 INFO streaming.StreamJob: map 75% reduce 0%
                              11/07/24 12:03:11 INFO streaming.StreamJob: map 76% reduce 0%
                              11/07/24 12:04:47 INFO streaming.StreamJob: map 77% reduce 0%
                              11/07/24 12:14:29 INFO streaming.StreamJob: map 78% reduce 0%
                              11/07/24 12:14:52 INFO streaming.StreamJob: map 79% reduce 0%
                              11/07/24 12:17:10 INFO streaming.StreamJob: map 80% reduce 0%
                              11/07/24 12:19:49 INFO streaming.StreamJob: map 81% reduce 0%
                              11/07/24 12:26:02 INFO streaming.StreamJob: map 82% reduce 0%
                              11/07/24 12:27:51 INFO streaming.StreamJob: map 83% reduce 0%
                              11/07/24 12:30:37 INFO streaming.StreamJob: map 84% reduce 0%
                              11/07/24 12:33:11 INFO streaming.StreamJob: map 85% reduce 0%
                              11/07/24 12:34:36 INFO streaming.StreamJob: map 86% reduce 0%
                              11/07/24 12:40:57 INFO streaming.StreamJob: map 87% reduce 0%
                              11/07/24 12:41:13 INFO streaming.StreamJob: map 88% reduce 0%
                              11/07/24 12:42:51 INFO streaming.StreamJob: map 89% reduce 0%
                              11/07/24 12:51:58 INFO streaming.StreamJob: map 90% reduce 0%
                              11/07/24 12:56:46 INFO streaming.StreamJob: map 91% reduce 0%
                              11/07/24 13:01:17 INFO streaming.StreamJob: map 92% reduce 0%
                              11/07/24 13:06:20 INFO streaming.StreamJob: map 93% reduce 0%
                              11/07/24 13:13:11 INFO streaming.StreamJob: map 94% reduce 0%
                              11/07/24 13:18:50 INFO streaming.StreamJob: map 95% reduce 0%
                              11/07/24 13:19:26 INFO streaming.StreamJob: map 96% reduce 0%
                              11/07/24 13:23:19 INFO streaming.StreamJob: map 97% reduce 0%
                              11/07/24 13:24:00 INFO streaming.StreamJob: map 98% reduce 0%
                              11/07/24 13:28:37 INFO streaming.StreamJob: map 99% reduce 0%
                              11/07/24 13:36:03 INFO streaming.StreamJob: map 100% reduce 0%
                              11/07/24 22:08:03 INFO streaming.StreamJob: Job complete: job_201107231823_0003
                              11/07/24 22:08:03 INFO streaming.StreamJob: Output: repeat

                              real 1109m9.111s
                              user 0m10.320s
                              sys 0m1.582s
                              Repeat masker runs completed


                              Postsubstraction on the Unmapped reads
                              11/07/24 22:28:29 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                              packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java,
                              /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped
                              .py, /mnt/hadoop/hadoop-unjar1994272368229376705/] [] /tmp/streamjob104650512695835986.jar tmpDir=null
                              11/07/24 22:28:29 INFO mapred.FileInputFormat: Total input paths to process : 40
                              11/07/24 22:28:30 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                              11/07/24 22:28:30 INFO streaming.StreamJob: Running job: job_201107231823_0005
                              11/07/24 22:28:30 INFO streaming.StreamJob: To kill this job, run:
                              11/07/24 22:28:30 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002
                              -kill job_201107231823_0005
                              11/07/24 22:28:30 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0005
                              11/07/24 22:28:31 INFO streaming.StreamJob: map 0% reduce 0%
                              11/07/24 22:28:43 INFO streaming.StreamJob: map 5% reduce 0%
                              11/07/24 22:28:44 INFO streaming.StreamJob: map 8% reduce 0%
                              11/07/24 22:28:45 INFO streaming.StreamJob: map 15% reduce 0%
                              11/07/24 22:28:46 INFO streaming.StreamJob: map 20% reduce 0%
                              11/07/24 22:28:48 INFO streaming.StreamJob: map 38% reduce 0%
                              11/07/24 22:28:49 INFO streaming.StreamJob: map 55% reduce 0%
                              11/07/24 22:28:50 INFO streaming.StreamJob: map 58% reduce 0%
                              11/07/24 22:28:51 INFO streaming.StreamJob: map 65% reduce 0%
                              11/07/24 22:28:52 INFO streaming.StreamJob: map 72% reduce 0%
                              11/07/24 22:28:53 INFO streaming.StreamJob: map 90% reduce 0%
                              11/07/24 22:28:54 INFO streaming.StreamJob: map 100% reduce 0%

                              Job 5 ran for more than 34 hours before I terminated it.

                              From the output in S3 buckets I estimate that there were ~1 million reads after Maq subtraction and ~ 200,000 reads after repeat masking and Blast. PathSeq ran much slower than I expected and I don’t know what I did wrong. Can you take a look at the logs and let me know what you think?
                              Thanks so much for your help!

                              Yi Wei
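
For anyone comparing runs like this one, here is a minimal sketch of how the map-phase timing could be pulled out of StreamJob progress lines such as those pasted above. The function names `parse_progress` and `map_phase_seconds` are hypothetical helpers, not part of PathSeq or Hadoop, and the timestamp is assumed to be in yy/MM/dd order (which matches the job ID `job_201107231823_0005`). The sample below uses three lines copied from the job 5 log.

```python
import re
from datetime import datetime

# Matches Hadoop streaming progress lines such as:
# "11/07/24 22:28:43 INFO streaming.StreamJob: map 5% reduce 0%"
PROGRESS = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob:\s+"
    r"map (\d+)%\s+reduce (\d+)%"
)


def parse_progress(log_text):
    """Return a list of (timestamp, map_pct, reduce_pct) tuples."""
    rows = []
    for line in log_text.splitlines():
        m = PROGRESS.search(line)
        if m:
            # Assumption: the log timestamp is year/month/day, two digits each.
            ts = datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S")
            rows.append((ts, int(m.group(2)), int(m.group(3))))
    return rows


def map_phase_seconds(rows):
    """Seconds from the first progress line to the first 'map 100%' line, or None."""
    if not rows:
        return None
    start = rows[0][0]
    for ts, map_pct, _ in rows:
        if map_pct == 100:
            return (ts - start).total_seconds()
    return None


# Three lines taken from the job 5 log above.
sample = """\
11/07/24 22:28:31 INFO streaming.StreamJob: map 0% reduce 0%
11/07/24 22:28:43 INFO streaming.StreamJob: map 5% reduce 0%
11/07/24 22:28:54 INFO streaming.StreamJob: map 100% reduce 0%
"""

rows = parse_progress(sample)
print("map phase seconds:", map_phase_seconds(rows))
print("reduce % at last line:", rows[-1][2])
```

On the pasted excerpt, the map phase goes from 0% to 100% in well under a minute while reduce is still at 0%. That suggests the 34 hours were spent somewhere after the map phase, although the excerpt alone cannot show where.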
