
  • Seeking advice on PathSeq

    Hi all,

    I'm interested in using the PathSeq software and I was wondering if anyone had some advice on what sort of Amazon EC2 instances result in reasonable run times for full Illumina GAIIx or HiSeq data sets.

    Thanks in advance!

  • #2

    We used Large Amazon EC2 instances. In our hands, total RNA sequencing data (30 to 50 million reads) from a GAIIx takes about 1 to 1.2 days to finish when run on 20 nodes in parallel.

    We are working on reducing these run times.

    Please let me know if you need more information on Pathseq.



    • #3
      PathSeq is very slow on Ec2

      I also set up PathSeq on Ec2. I was able to run it but it was very slow. I tested a data set with 100,000 reads and it took 10 hours running on 10 instances. I would appreciate any advice you can give me.


      • #4

        Could you send me the version of Pathseq you are running?

        Also, is your dataset RNA-based or DNA-based?



        • #5
          slow PathSeq

          Hi, Chandra,
          The version of PathSeq is 5.1. The data set is a sample of 100,000 reads from the sample input files provided on the PathSeq web site. My problem is that there does not seem to be a difference whether I run it on 10 nodes or on 20 nodes. Both took a long time to run. I am concerned that the Hadoop cluster is not set up correctly.
          Thanks for the prompt reply; I am looking forward to hearing from you.



          • #6
            Hi Yi,

            I will re-run it on the cloud and see how much time it will take.

            In our hands, other samples with 40 million reads run in 1 to 1.2 days.

            Meanwhile, please download the latest version from our website.

            I will get back to you as soon as possible.

            I greatly appreciate your comments.



            • #7
              pathseq logs

              Hi, Chandra,
              Thanks for the advice. I don't know whether it is helpful to you or not, but here is part of the Hadoop log I captured from the master node in Ec2:

              rmr: cannot remove config: No such file or directory.
              rmr: cannot remove s3config: No such file or directory.
              rmr: cannot remove load: No such file or directory.
              Master data_loader
              11/07/08 20:27:30 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
              packageJobJar: [/root/, /mnt/hadoop/hadoop-unjar3242713623624356081/] [] /tmp/streamjob4418070527408526594.jar tmpDir=null
              11/07/08 20:27:30 INFO mapred.FileInputFormat: Total input paths to process : 3
              11/07/08 20:27:31 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred
              11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
              11/07/08 20:27:31 INFO streaming.StreamJob: To kill this job, run:
              11/07/08 20:27:31 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0001
              11/07/08 20:27:31 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-2
              11/07/08 20:27:32 INFO streaming.StreamJob: map 0% reduce 0%
              11/07/08 20:27:44 INFO streaming.StreamJob: map 33% reduce 0%
              11/07/08 20:27:47 INFO streaming.StreamJob: map 67% reduce 0%
              11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
              11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
              11/07/08 20:47:03 INFO streaming.StreamJob: Output: load

              real 19m33.828s
              user 0m2.365s
              sys 0m0.665s
              Master loader completed
              ERROR: Bucket 'ami-yiweijob6-stat' does not exist
              Bucket 's3://ami-yiweijob6-stat/' removed
              Bucket 's3://ami-yiweijob6-stat/' created
              ERROR: Bucket 'ami-yiweijob6-output' does not exist
              Bucket 's3://ami-yiweijob6-output/' removed
              Bucket 's3://ami-yiweijob6-output/' created
              File s3://reads-yiwei-regeneron/input1.local saved as '/usr/local/hadoop-0.19.0/input1.local' (75 bytes in 0.0 seconds, 4.17 kB/s)
              File s3://reads-yiwei-regeneron/input10.local saved as '/usr/local/hadoop-0.19.0/input10.local' (76 bytes in 0.0 seconds, 3.08 kB/s)
              File s3://reads-yiwei-regeneron/input11.local saved as '/usr/local/hadoop-0.19.0/input11.local' (76 bytes in 0.0 seconds, 2.79 kB/s)
              File s3://reads-yiwei-regeneron/input12.local saved as '/usr/local/hadoop-0.19.0/input12.local' (76 bytes in 0.0 seconds, 3.16 kB/s)
              File s3://reads-yiwei-regeneron/input2.local saved as '/usr/local/hadoop-0.19.0/input2.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
              File s3://reads-yiwei-regeneron/input3.local saved as '/usr/local/hadoop-0.19.0/input3.local' (75 bytes in 0.0 seconds, 2.59 kB/s)
              File s3://reads-yiwei-regeneron/input4.local saved as '/usr/local/hadoop-0.19.0/input4.local' (75 bytes in 0.0 seconds, 3.34 kB/s)
              File s3://reads-yiwei-regeneron/input5.local saved as '/usr/local/hadoop-0.19.0/input5.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
              File s3://reads-yiwei-regeneron/input6.local saved as '/usr/local/hadoop-0.19.0/input6.local' (75 bytes in 0.0 seconds, 3.43 kB/s)
              File s3://reads-yiwei-regeneron/input7.local saved as '/usr/local/hadoop-0.19.0/input7.local' (75 bytes in 0.0 seconds, 3.18 kB/s)
              File s3://reads-yiwei-regeneron/input8.local saved as '/usr/local/hadoop-0.19.0/input8.local' (75 bytes in 0.0 seconds, 3.25 kB/s)
              File s3://reads-yiwei-regeneron/input9.local saved as '/usr/local/hadoop-0.19.0/input9.local' (75 bytes in 0.0 seconds, 3.46 kB/s)
              rmr: cannot remove test: No such file or directory.
              rmr: cannot remove maq: No such file or directory.
              Maq alignments + Duplicate remover
              11/07/08 20:47:10 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
              packageJobJar: [/root/, /root/, /root/FQone, /root/, /root/, /root/MAQ, /root/, /mnt/hadoop/hadoop-unjar7827192869392733442/] [] /tmp/streamjob8065656650743401458.jar tmpDir=null
              11/07/08 20:47:10 INFO mapred.FileInputFormat: Total input paths to process : 1
              11/07/08 20:47:11 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred
              11/07/08 20:47:11 INFO streaming.StreamJob: Running job: job_201107082024_0002
              11/07/08 20:47:11 INFO streaming.StreamJob: To kill this job, run:
              11/07/08 20:47:11 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0002
              11/07/08 20:47:11 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-2
              11/07/08 20:47:12 INFO streaming.StreamJob: map 0% reduce 0%
              11/07/08 20:47:24 INFO streaming.StreamJob: map 8% reduce 0%
              11/07/08 20:47:25 INFO streaming.StreamJob: map 17% reduce 0%
              11/07/08 20:47:29 INFO streaming.StreamJob: map 33% reduce 0%
              11/07/08 20:47:30 INFO streaming.StreamJob: map 42% reduce 0%
              11/07/08 20:47:34 INFO streaming.StreamJob: map 58% reduce 0%
              11/07/08 20:47:35 INFO streaming.StreamJob: map 67% reduce 0%
              11/07/08 20:47:39 INFO streaming.StreamJob: map 75% reduce 0%

              This is from running 100,000 reads on 3 instances in EC2. I had to shut it down after 2 hours because the processing did not seem likely to finish in a reasonable amount of time. I hope this log is useful for your troubleshooting. And thanks again for your help!

              Yi Wei
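
When comparing logs like the one above, it can help to pull out how long each Hadoop streaming job actually ran, since the map percentage reaches 100% within seconds while the job itself completes much later. The following is a minimal sketch, not part of PathSeq: it assumes the log keeps the `Running job:` and `Job complete:` lines in the format shown above, and the function name `job_durations` is made up for illustration.

```python
import re
from datetime import datetime

# Matches the two StreamJob lines that bracket a job, e.g.
# "11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001"
LINE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob: "
    r"(Running job|Job complete): (job_\d+_\d+)"
)

def job_durations(log_text):
    """Return {job_id: elapsed seconds} for every job with both a start and an end line."""
    started, durations = {}, {}
    for stamp, event, job_id in LINE.findall(log_text):
        when = datetime.strptime(stamp, "%y/%m/%d %H:%M:%S")
        if event == "Running job":
            started[job_id] = when
        elif job_id in started:
            durations[job_id] = (when - started[job_id]).total_seconds()
    return durations

# A few lines copied from the log above; job 0002 has no completion line.
sample = """\
11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
11/07/08 20:47:11 INFO streaming.StreamJob: Running job: job_201107082024_0002
"""

for job_id, seconds in job_durations(sample).items():
    print(f"{job_id}: {seconds / 60:.1f} min")
```

Jobs that never log a completion line (such as the Maq job in the run that was shut down) are simply left out of the result.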


              • #8
                Hi Yi Wei,

                Thanks for your log file.

                I am re-running Pathseq with the sample file provided with the package. This sample file contains 6 million unique reads. I will share my results with you, once it is done.

                I have looked at the log file you posted. No errors were produced, and PathSeq seems to be running fine. As you know, we run 4 MAQ alignments, 2 Megablast alignments and 2 BlastN alignments. These take time to finish, and up to a certain point that time is largely independent of the number of reads that go into them. What I mean is as follows:

                If you have 100,000 reads ---- the run may take about 5 hours to finish
                If you have 1 million reads ----- the run may take more or less the same time as for 100,000 reads
                If you have 40 million reads ----- the run may take about 16-18 hours to finish

                I will post the latest results from the 6 million reads once I have them.

                Meanwhile, please let me know what your requirements are.

                1. How many reads do you have in your real sequencing file?
                2. Are the reads from Illumina?
                3. Are you using total RNA-Seq or WGS?
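
To make the scaling point in this post concrete, the rough estimates quoted above can be turned into effective throughput. This is only an illustrative sketch using those ballpark figures; it assumes "more or less the same time" means 5 hours and represents "16-18 hours" by its midpoint of 17 hours.

```python
# (reads, hours) pairs taken from the estimates quoted above.
# Assumptions: "more or less the same time" is read as 5 hours,
# and "16-18 hours" is represented by its midpoint, 17 hours.
estimates = [
    (100_000, 5.0),
    (1_000_000, 5.0),
    (40_000_000, 17.0),
]

def reads_per_hour(reads, hours):
    """Effective throughput for a whole run, fixed overhead included."""
    return reads / hours

throughputs = [reads_per_hour(r, h) for r, h in estimates]

for (reads, hours), rate in zip(estimates, throughputs):
    print(f"{reads:>10,} reads in {hours:>4.1f} h -> {rate:>12,.0f} reads/hour")
```

Effective throughput rises sharply with input size, which is what one would expect if a large fixed cost per run dominates small inputs.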



                • #9
                  pathseq questions

                  Hi, Chandra,
                  Thanks for the advice. It is very helpful. What we are trying to do is look for potential pathogen sequences in Illumina RNA-Seq data. We will probably get 40-80 million reads from each sample. Can you send me a copy of the Hadoop log from your run of the 6 million reads in the sample data file provided with the PathSeq package? I would like to run the same 6 million reads and compare the logs.
                  Best Regards,

                  Yi Wei

                  1. Do you have plans to modify PathSeq so that it can be run on internal computer clusters instead of Amazon Ec2?
                  2. Are you considering using Bowtie or Bwa for initial filtering step, as they are much faster than Maq?


                  • #10
                    Hi Yi Wei,

                    Yes, we are working on getting BWA implemented in PathSeq. You are correct that BWA is much faster than MAQ.

                    We are also working on support for Hadoop-based internal computing clusters.

                    What kind of internal computer cluster do you have? Is it LSF?



                    • #11
                      Hi Yi Wei and Pathseq users,

                      Here is the log file from the Pathseq runs. I just removed some lines for clarity.

                      The log is from a PathSeq run on 6 million unique reads (the sample file included with the PathSeq package).

                      In summary:
                      Total time across all 20 nodes: 387.7 hours
                      Wall-clock time: ~19 hours

                      The most important thing to highlight here is:
                      That 6 million reads took 19 hours does not mean that 60 million will take 10 times longer. In our hands, a 40 million read sequencing run takes about the same 19 hours.

                      We are currently working on a faster PathSeq. In preliminary runs, the newer PathSeq takes half the time of the current version. Once validation is done, we will make a public release.

                      Please let me know if you have more questions or need help with PathSeq installation.


                      Log file:
                      Master data_loader
                      11/07/19 14:33:53 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/, /mnt/hadoop/hadoop-unjar3823448146028608527/] [] /tmp/streamjob9062470838490462284.jar tmpDir=null
                      11/07/19 14:33:54 INFO mapred.FileInputFormat: Total input paths to process : 20
                      11/07/19 14:33:55 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/19 14:33:55 INFO streaming.StreamJob: Running job: job_201107191423_0001
                      11/07/19 14:33:55 INFO streaming.StreamJob: To kill this job, run:
                      11/07/19 14:33:55 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0001
                      11/07/19 14:33:55 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0001
                      11/07/19 14:33:56 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/19 14:34:09 INFO streaming.StreamJob: map 20% reduce 0%
                      11/07/19 14:34:10 INFO streaming.StreamJob: map 40% reduce 0%
                      11/07/19 14:34:11 INFO streaming.StreamJob: map 60% reduce 0%
                      11/07/19 14:34:12 INFO streaming.StreamJob: map 80% reduce 0%
                      11/07/19 14:34:13 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/19 14:34:14 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/19 15:32:58 INFO streaming.StreamJob: Job complete: job_201107191423_0001
                      11/07/19 15:32:58 INFO streaming.StreamJob: Output: load

                      real 59m5.290s
                      user 0m3.278s
                      sys 0m1.108s
                      Master loader completed

                      Maq alignments + Duplicate remover
                      11/07/19 15:33:07 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/, /root/, /mnt/hadoop/hadoop-unjar2138415996895576783/] [] /tmp/streamjob4610713994932979234.jar tmpDir=null
                      11/07/19 15:33:08 INFO mapred.FileInputFormat: Total input paths to process : 21
                      11/07/19 15:33:08 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/19 15:33:08 INFO streaming.StreamJob: Running job: job_201107191423_0002
                      11/07/19 15:33:08 INFO streaming.StreamJob: To kill this job, run:
                      11/07/19 15:33:08 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0002
                      11/07/19 15:33:08 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0002
                      11/07/19 15:33:09 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/19 15:33:21 INFO streaming.StreamJob: map 24% reduce 0%
                      11/07/19 15:33:22 INFO streaming.StreamJob: map 52% reduce 0%
                      11/07/19 15:33:23 INFO streaming.StreamJob: map 76% reduce 0%
                      11/07/19 15:33:24 INFO streaming.StreamJob: map 86% reduce 0%
                      11/07/19 15:33:25 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/19 15:33:26 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 01:14:01 INFO streaming.StreamJob: Job complete: job_201107191423_0002
                      11/07/20 01:14:02 INFO streaming.StreamJob: Output: maq

                      real 580m56.490s
                      user 0m6.135s
                      sys 0m12.924s
                      Maq alignments + Duplicate remover completed

                      Repeat masker loader

                      real 2m15.171s
                      user 1m11.617s
                      sys 0m12.480s
                      Repeat masker loader completed

                      Run repeat masker
                      11/07/20 01:16:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/, /mnt/hadoop/hadoop-unjar6903467901556213816/] [] /tmp/streamjob3668474814406944845.jar tmpDir=null
                      11/07/20 01:16:24 INFO mapred.FileInputFormat: Total input paths to process : 60
                      11/07/20 01:16:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 01:16:24 INFO streaming.StreamJob: Running job: job_201107191423_0003
                      11/07/20 01:16:24 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 01:16:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0003
                      11/07/20 01:16:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0003
                      11/07/20 01:16:25 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 01:16:36 INFO streaming.StreamJob: map 5% reduce 0%
                      11/07/20 01:16:37 INFO streaming.StreamJob: map 10% reduce 0%
                      11/07/20 01:16:38 INFO streaming.StreamJob: map 20% reduce 0%
                      11/07/20 01:16:40 INFO streaming.StreamJob: map 32% reduce 0%
                      11/07/20 01:16:41 INFO streaming.StreamJob: map 37% reduce 0%
                      11/07/20 01:16:42 INFO streaming.StreamJob: map 43% reduce 0%
                      11/07/20 01:16:43 INFO streaming.StreamJob: map 53% reduce 0%
                      11/07/20 01:16:45 INFO streaming.StreamJob: map 65% reduce 0%
                      11/07/20 01:16:46 INFO streaming.StreamJob: map 72% reduce 0%
                      11/07/20 01:16:47 INFO streaming.StreamJob: map 77% reduce 0%
                      11/07/20 01:16:48 INFO streaming.StreamJob: map 85% reduce 0%
                      11/07/20 01:16:49 INFO streaming.StreamJob: map 88% reduce 0%
                      11/07/20 01:16:51 INFO streaming.StreamJob: map 97% reduce 0%
                      11/07/20 01:16:52 INFO streaming.StreamJob: map 98% reduce 0%
                      11/07/20 01:16:54 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 03:59:21 INFO streaming.StreamJob: Job complete: job_201107191423_0003
                      11/07/20 03:59:21 INFO streaming.StreamJob: Output: repeat

                      real 162m59.218s
                      user 0m4.786s
                      sys 0m1.192s
                      Repeat masker runs completed

                      Deleted hdfs://ip-10-118-59-251.ec2.internal:50001/user/root/load
                      Master data_loader for Post
                      11/07/20 03:59:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/, /mnt/hadoop/hadoop-unjar4739713752539699730/] [] /tmp/streamjob4058317523841970356.jar tmpDir=null
                      11/07/20 03:59:24 INFO mapred.FileInputFormat: Total input paths to process : 20
                      11/07/20 03:59:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 03:59:24 INFO streaming.StreamJob: Running job: job_201107191423_0004
                      11/07/20 03:59:24 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 03:59:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0004
                      11/07/20 03:59:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0004
                      11/07/20 03:59:25 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 03:59:36 INFO streaming.StreamJob: map 20% reduce 0%
                      11/07/20 03:59:37 INFO streaming.StreamJob: map 45% reduce 0%
                      11/07/20 03:59:38 INFO streaming.StreamJob: map 60% reduce 0%
                      11/07/20 03:59:39 INFO streaming.StreamJob: map 80% reduce 0%
                      11/07/20 03:59:40 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/20 03:59:41 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 04:18:27 INFO streaming.StreamJob: Job complete: job_201107191423_0004
                      11/07/20 04:18:28 INFO streaming.StreamJob: Output: load

                      real 19m5.252s
                      user 0m2.360s
                      sys 0m1.129s
                      Master loader completed

                      Postsubtraction loader
                      real 0m27.082s
                      user 0m10.882s
                      sys 0m1.435s

                      Postsubstraction on the Unmapped reads
                      11/07/20 04:19:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /mnt/hadoop/hadoop-unjar3151623736871864286/] [] /tmp/streamjob6660721182687954669.jar tmpDir=null
                      11/07/20 04:19:03 INFO mapred.FileInputFormat: Total input paths to process : 40
                      11/07/20 04:19:03 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 04:19:03 INFO streaming.StreamJob: Running job: job_201107191423_0005
                      11/07/20 04:19:03 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 04:19:03 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0005
                      11/07/20 04:19:03 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0005
                      11/07/20 04:19:04 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 04:19:15 INFO streaming.StreamJob: map 5% reduce 0%
                      11/07/20 04:19:16 INFO streaming.StreamJob: map 15% reduce 0%
                      11/07/20 04:19:17 INFO streaming.StreamJob: map 18% reduce 0%
                      11/07/20 04:19:18 INFO streaming.StreamJob: map 28% reduce 0%
                      11/07/20 04:19:19 INFO streaming.StreamJob: map 48% reduce 0%
                      11/07/20 04:19:20 INFO streaming.StreamJob: map 55% reduce 0%
                      11/07/20 04:19:21 INFO streaming.StreamJob: map 65% reduce 0%
                      11/07/20 04:19:23 INFO streaming.StreamJob: map 72% reduce 0%
                      11/07/20 04:19:24 INFO streaming.StreamJob: map 92% reduce 0%
                      11/07/20 04:19:25 INFO streaming.StreamJob: map 95% reduce 0%
                      11/07/20 04:19:26 INFO streaming.StreamJob: map 97% reduce 0%
                      11/07/20 04:19:30 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 10:28:32 INFO streaming.StreamJob: Job complete: job_201107191423_0005
                      11/07/20 10:28:32 INFO streaming.StreamJob: Output: postsub

                      real 369m31.146s
                      user 0m5.041s
                      sys 0m1.540s

                      Postsubstraction on the contigs
                      11/07/20 10:28:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                      packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /mnt/hadoop/hadoop-unjar1426923625300485254/] [] /tmp/streamjob2059410816131312926.jar tmpDir=null
                      11/07/20 10:28:34 INFO mapred.FileInputFormat: Total input paths to process : 18
                      11/07/20 10:28:34 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                      11/07/20 10:28:34 INFO streaming.StreamJob: Running job: job_201107191423_0006
                      11/07/20 10:28:34 INFO streaming.StreamJob: To kill this job, run:
                      11/07/20 10:28:34 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0006
                      11/07/20 10:28:34 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0006
                      11/07/20 10:28:35 INFO streaming.StreamJob: map 0% reduce 0%
                      11/07/20 10:28:47 INFO streaming.StreamJob: map 28% reduce 0%
                      11/07/20 10:28:48 INFO streaming.StreamJob: map 56% reduce 0%
                      11/07/20 10:28:49 INFO streaming.StreamJob: map 89% reduce 0%
                      11/07/20 10:28:50 INFO streaming.StreamJob: map 94% reduce 0%
                      11/07/20 10:28:51 INFO streaming.StreamJob: map 100% reduce 0%
                      11/07/20 10:31:56 INFO streaming.StreamJob: Job complete: job_201107191423_0006
                      11/07/20 10:31:56 INFO streaming.StreamJob: Output: postsubvel

                      real 3m24.070s
                      user 0m1.577s
                      sys 0m0.238s
                      Postsubtraction completed

                      File '/usr/local/hadoop-0.19.0/output/Output.tar' stored as 's3://ami-ami-QFnew-foutput/Output.tar' (106291200 bytes in 16.4 seconds, 6.19 MB/s) [1 of 1]
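
As a quick cross-check of the wall-clock figure, the `real` timings printed after each stage in the log above can be added up. The sketch below is not part of PathSeq; it hard-codes the eight `real` lines from this log and assumes the `XmY.YYYs` format produced by the shell `time` command.

```python
import re

# "real" lines copied from the log above, in stage order.
real_lines = """
real 59m5.290s
real 580m56.490s
real 2m15.171s
real 162m59.218s
real 19m5.252s
real 0m27.082s
real 369m31.146s
real 3m24.070s
"""

def total_real_seconds(text):
    """Sum every 'real XmY.YYYs' timing found in the text."""
    total = 0.0
    for minutes, seconds in re.findall(r"real\s+(\d+)m([\d.]+)s", text):
        total += int(minutes) * 60 + float(seconds)
    return total

total = total_real_seconds(real_lines)
print(f"wall-clock total: {total / 3600:.2f} hours")
```

The stages sum to a little under 20 hours, in line with the roughly 19 hours reported, with the Maq stage and the post-subtraction on unmapped reads accounting for most of it.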

                      Results summary:

                      Substraction Pathseq_Cloud
                      Total number of reads 6369435
                      Total number of reads after duplicate remover 6369435
                      Total number of unmapped reads after Maq 1 alignment (Database: MAQ1) 1829265
                      Total number of unmapped reads after Maq 2 alignment (Database: MAQ2) 504427
                      Total number of unmapped reads after Maq 3 alignment (Database: MAQ3) 488954
                      Total number of unmapped reads after Maq 4 alignment (Database: MAQ4) 485479
                      Total number of unmapped reads after repeat masker 365393
                      Total number of unmapped reads after Megablast (Database: BLAST1) 70343
                      Total number of unmapped reads after Megablast (Database: BLAST2) 33808
                      Total number of unmapped reads after BlastN1 (Database: BLAST1) 33768
                      Total number of unmapped reads after BlastN2 (Database: BLAST2) 33746
                      Total number of unmapped reads 33746
                      Reads after computational subtraction (Unmapped reads) unmappedreads.fq1
                      Contigs from unmapped reads contigs.fq1
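
The results summary above is easier to read as a subtraction funnel showing how many reads each stage removed. This is a minimal sketch using only the counts from that summary; the stage labels and the helper name `removed_per_stage` are chosen here for illustration.

```python
# Unmapped-read counts copied from the results summary above, in pipeline order.
stages = [
    ("input (after duplicate remover)", 6369435),
    ("Maq 1", 1829265),
    ("Maq 2", 504427),
    ("Maq 3", 488954),
    ("Maq 4", 485479),
    ("repeat masker", 365393),
    ("Megablast BLAST1", 70343),
    ("Megablast BLAST2", 33808),
    ("BlastN1", 33768),
    ("BlastN2", 33746),
]

def removed_per_stage(stages):
    """Return (stage name, reads removed by that stage) for each step after the input."""
    return [
        (name, prev_count - count)
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    ]

removed = removed_per_stage(stages)
total_reads = stages[0][1]
for name, n in removed:
    print(f"{name:<18} removed {n:>9,} reads ({100 * n / total_reads:5.2f}% of input)")

remaining_fraction = stages[-1][1] / total_reads
print(f"remaining after subtraction: {100 * remaining_fraction:.2f}% of input")
```

In this run the first Maq alignment removes the large majority of reads, while the final BlastN passes remove only a few dozen each.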


                      • #12
                        Hi, Chandra,
                        Thanks so much for the log! It is very helpful. I am looking forward to the new version of PathSeq. In the meantime I will start running PathSeq with our data and keep you updated on our progress.

                        Yi Wei


                        • #13
                          Pathseq AMI problem

                          I am having problems building my own AMI as explained in the last step of
                          the PathSeq installation.

                          If I execute ./ I receive an error that the ami is not
                          I assume that I just need this command to create an instance with
                          installed on it?

                          I am working on developing a GUI for PathSeq, so it
                          would be helpful if you could point me to documentation (if you have any) for
                          the tool.

                          It would be great if you could help me!


                          • #14
                            Hi, Chandra,
                            The following is my experience running PathSeq with my own data:
                            I started with ~70 million human RNA-Seq 100 bp Illumina reads. I prefiltered these reads by running Bowtie against the human 37.1 reference genome on my own desktop and ended up with ~11 million reads. After running, I got ~1.6 million reads. These reads were then uploaded to S3 and PathSeq was launched on 20 nodes. PathSeq ran for more than 60 hours without finishing and I had to terminate the whole job. Here is the log I got from the master node.

                            Master data_loader
                            11/07/23 18:29:41 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/, /mnt/hadoop/hadoop-unjar9078098862757602177/] [] /tmp/streamjob1077975618755703344.jar tmpDir=null
                            11/07/23 18:29:42 INFO mapred.FileInputFormat: Total input paths to process : 20
                            11/07/23 18:29:42 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/23 18:29:42 INFO streaming.StreamJob: Running job: job_201107231823_0001
                            11/07/23 18:29:42 INFO streaming.StreamJob: To kill this job, run:
                            11/07/23 18:29:42 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0001
                            11/07/23 18:29:42 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0001
                            11/07/23 18:29:43 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/23 18:29:55 INFO streaming.StreamJob: map 10% reduce 0%
                            11/07/23 18:29:56 INFO streaming.StreamJob: map 30% reduce 0%
                            11/07/23 18:29:57 INFO streaming.StreamJob: map 45% reduce 0%
                            11/07/23 18:29:58 INFO streaming.StreamJob: map 60% reduce 0%
                            11/07/23 18:29:59 INFO streaming.StreamJob: map 80% reduce 0%
                            11/07/23 18:30:00 INFO streaming.StreamJob: map 100% reduce 0%
                            11/07/23 18:54:42 INFO streaming.StreamJob: Job complete: job_201107231823_0001
                            11/07/23 18:54:42 INFO streaming.StreamJob: Output: load

                            real 25m1.703s
                            user 0m2.231s
                            sys 0m0.320s
                            Master loader completed

                            Maq alignments + Duplicate remover
                            11/07/23 18:54:51 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/MA, /root/, /mnt/hadoop/hadoop-unjar6628816665337722828/] [] /tmp/streamjob6711648814308398287.jar tmpDir=null
                            11/07/23 18:54:52 INFO mapred.FileInputFormat: Total input paths to process : 21
                            11/07/23 18:54:52 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/23 18:54:52 INFO streaming.StreamJob: Running job: job_201107231823_0002
                            11/07/23 18:54:52 INFO streaming.StreamJob: To kill this job, run:
                            11/07/23 18:54:52 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0002
                            11/07/23 18:54:52 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0002
                            11/07/23 18:54:53 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/23 18:55:04 INFO streaming.StreamJob: map 29% reduce 0%
                            11/07/23 18:55:05 INFO streaming.StreamJob: map 43% reduce 0%
                            11/07/23 18:55:06 INFO streaming.StreamJob: map 52% reduce 0%
                            11/07/23 18:55:07 INFO streaming.StreamJob: map 67% reduce 0%
                            11/07/23 18:55:08 INFO streaming.StreamJob: map 90% reduce 0%
                            11/07/23 18:55:09 INFO streaming.StreamJob: map 100% reduce 0%
                            11/07/24 03:34:56 INFO streaming.StreamJob: Job complete: job_201107231823_0002
                            11/07/24 03:34:56 INFO streaming.StreamJob: Output: maq

                            real 520m5.778s
                            user 0m7.229s
                            sys 0m2.200s
                            Maq alignments + Duplicate remover completed

                            Run repeat masker
                            11/07/24 03:38:54 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/, /mnt/hadoop/hadoop-unjar682608492557274042/] [] /tmp/streamjob8244164963266699673.jar tmpDir=null
                            11/07/24 03:38:55 INFO mapred.FileInputFormat: Total input paths to process : 108
                            11/07/24 03:38:56 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/24 03:38:56 INFO streaming.StreamJob: Running job: job_201107231823_0003
                            11/07/24 03:38:56 INFO streaming.StreamJob: To kill this job, run:
                            11/07/24 03:38:56 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0003
                            11/07/24 03:38:56 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0003
                            11/07/24 03:38:57 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/24 03:39:09 INFO streaming.StreamJob: map 3% reduce 0%
                            11/07/24 03:39:10 INFO streaming.StreamJob: map 7% reduce 0%
                            11/07/24 03:39:11 INFO streaming.StreamJob: map 11% reduce 0%
                            11/07/24 03:39:12 INFO streaming.StreamJob: map 17% reduce 0%
                            11/07/24 03:39:14 INFO streaming.StreamJob: map 21% reduce 0%
                            11/07/24 03:39:15 INFO streaming.StreamJob: map 26% reduce 0%
                            11/07/24 03:39:16 INFO streaming.StreamJob: map 29% reduce 0%
                            11/07/24 03:39:17 INFO streaming.StreamJob: map 35% reduce 0%
                            11/07/24 03:39:19 INFO streaming.StreamJob: map 40% reduce 0%
                            11/07/24 03:39:20 INFO streaming.StreamJob: map 44% reduce 0%
                            11/07/24 03:39:21 INFO streaming.StreamJob: map 47% reduce 0%
                            11/07/24 03:39:22 INFO streaming.StreamJob: map 53% reduce 0%
                            11/07/24 03:39:24 INFO streaming.StreamJob: map 55% reduce 0%
                            11/07/24 03:39:25 INFO streaming.StreamJob: map 56% reduce 0%
                            11/07/24 10:06:27 INFO streaming.StreamJob: map 57% reduce 0%
                            11/07/24 10:22:28 INFO streaming.StreamJob: map 58% reduce 0%
                            11/07/24 10:27:19 INFO streaming.StreamJob: map 59% reduce 0%
                            11/07/24 10:35:28 INFO streaming.StreamJob: map 60% reduce 0%
                            11/07/24 10:42:34 INFO streaming.StreamJob: map 61% reduce 0%
                            11/07/24 10:44:05 INFO streaming.StreamJob: map 62% reduce 0%
                            11/07/24 10:55:36 INFO streaming.StreamJob: map 63% reduce 0%
                            11/07/24 11:16:47 INFO streaming.StreamJob: map 64% reduce 0%
                            11/07/24 11:19:35 INFO streaming.StreamJob: map 65% reduce 0%
                            11/07/24 11:24:42 INFO streaming.StreamJob: map 66% reduce 0%
                            11/07/24 11:41:27 INFO streaming.StreamJob: map 67% reduce 0%
                            11/07/24 11:42:06 INFO streaming.StreamJob: map 68% reduce 0%
                            11/07/24 11:44:15 INFO streaming.StreamJob: map 69% reduce 0%
                            11/07/24 11:46:07 INFO streaming.StreamJob: map 70% reduce 0%
                            11/07/24 11:47:14 INFO streaming.StreamJob: map 71% reduce 0%
                            11/07/24 11:51:46 INFO streaming.StreamJob: map 72% reduce 0%
                            11/07/24 11:59:00 INFO streaming.StreamJob: map 73% reduce 0%
                            11/07/24 12:01:08 INFO streaming.StreamJob: map 74% reduce 0%
                            11/07/24 12:01:29 INFO streaming.StreamJob: map 75% reduce 0%
                            11/07/24 12:03:11 INFO streaming.StreamJob: map 76% reduce 0%
                            11/07/24 12:04:47 INFO streaming.StreamJob: map 77% reduce 0%
                            11/07/24 12:14:29 INFO streaming.StreamJob: map 78% reduce 0%
                            11/07/24 12:14:52 INFO streaming.StreamJob: map 79% reduce 0%
                            11/07/24 12:17:10 INFO streaming.StreamJob: map 80% reduce 0%
                            11/07/24 12:19:49 INFO streaming.StreamJob: map 81% reduce 0%
                            11/07/24 12:26:02 INFO streaming.StreamJob: map 82% reduce 0%
                            11/07/24 12:27:51 INFO streaming.StreamJob: map 83% reduce 0%
                            11/07/24 12:30:37 INFO streaming.StreamJob: map 84% reduce 0%
                            11/07/24 12:33:11 INFO streaming.StreamJob: map 85% reduce 0%
                            11/07/24 12:34:36 INFO streaming.StreamJob: map 86% reduce 0%
                            11/07/24 12:40:57 INFO streaming.StreamJob: map 87% reduce 0%
                            11/07/24 12:41:13 INFO streaming.StreamJob: map 88% reduce 0%
                            11/07/24 12:42:51 INFO streaming.StreamJob: map 89% reduce 0%
                            11/07/24 12:51:58 INFO streaming.StreamJob: map 90% reduce 0%
                            11/07/24 12:56:46 INFO streaming.StreamJob: map 91% reduce 0%
                            11/07/24 13:01:17 INFO streaming.StreamJob: map 92% reduce 0%
                            11/07/24 13:06:20 INFO streaming.StreamJob: map 93% reduce 0%
                            11/07/24 13:13:11 INFO streaming.StreamJob: map 94% reduce 0%
                            11/07/24 13:18:50 INFO streaming.StreamJob: map 95% reduce 0%
                            11/07/24 13:19:26 INFO streaming.StreamJob: map 96% reduce 0%
                            11/07/24 13:23:19 INFO streaming.StreamJob: map 97% reduce 0%
                            11/07/24 13:24:00 INFO streaming.StreamJob: map 98% reduce 0%
                            11/07/24 13:28:37 INFO streaming.StreamJob: map 99% reduce 0%
                            11/07/24 13:36:03 INFO streaming.StreamJob: map 100% reduce 0%
                            11/07/24 22:08:03 INFO streaming.StreamJob: Job complete: job_201107231823_0003
                            11/07/24 22:08:03 INFO streaming.StreamJob: Output: repeat

                            real 1109m9.111s
                            user 0m10.320s
                            sys 0m1.582s
                            Repeat masker runs completed

                            Postsubstraction on the Unmapped reads
                            11/07/24 22:28:29 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
                            packageJobJar: [/root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/, /root/mapper_postunmapped.py, /mnt/hadoop/hadoop-unjar1994272368229376705/] [] /tmp/streamjob104650512695835986.jar tmpDir=null
                            11/07/24 22:28:29 INFO mapred.FileInputFormat: Total input paths to process : 40
                            11/07/24 22:28:30 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
                            11/07/24 22:28:30 INFO streaming.StreamJob: Running job: job_201107231823_0005
                            11/07/24 22:28:30 INFO streaming.StreamJob: To kill this job, run:
                            11/07/24 22:28:30 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-204-131-114.ec2.internal:50002 -kill job_201107231823_0005
                            11/07/24 22:28:30 INFO streaming.StreamJob: Tracking URL: http://ip-10-204-131-114.ec2.interna...107231823_0005
                            11/07/24 22:28:31 INFO streaming.StreamJob: map 0% reduce 0%
                            11/07/24 22:28:43 INFO streaming.StreamJob: map 5% reduce 0%
                            11/07/24 22:28:44 INFO streaming.StreamJob: map 8% reduce 0%
                            11/07/24 22:28:45 INFO streaming.StreamJob: map 15% reduce 0%
                            11/07/24 22:28:46 INFO streaming.StreamJob: map 20% reduce 0%
                            11/07/24 22:28:48 INFO streaming.StreamJob: map 38% reduce 0%
                            11/07/24 22:28:49 INFO streaming.StreamJob: map 55% reduce 0%
                            11/07/24 22:28:50 INFO streaming.StreamJob: map 58% reduce 0%
                            11/07/24 22:28:51 INFO streaming.StreamJob: map 65% reduce 0%
                            11/07/24 22:28:52 INFO streaming.StreamJob: map 72% reduce 0%
                            11/07/24 22:28:53 INFO streaming.StreamJob: map 90% reduce 0%
                            11/07/24 22:28:54 INFO streaming.StreamJob: map 100% reduce 0%

                            Job 5 ran for more than 34 hours before I terminated it.

                            From the output in the S3 buckets, I estimate that there were ~1 million reads after Maq subtraction and ~200,000 reads after repeat masking and BLAST. PathSeq ran much slower than I expected and I don't know what I did wrong. Can you take a look at the logs and let me know what you think?
                            Thanks so much for your help!

                            Yi Wei
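
For anyone reading logs like the ones above, a small script can make the per-stage timing easier to see. The following is only a sketch: it assumes the timestamps are in Hadoop's `yy/MM/dd HH:mm:ss` format, and the `LOG` string is an excerpt copied from the post. The function names `parse` and `summarize` are made up for illustration and are not part of PathSeq or Hadoop.

```python
import re
from datetime import datetime

# Excerpt of the StreamJob lines from the post above (jobs 0002 and 0003).
LOG = """\
11/07/23 18:54:52 INFO streaming.StreamJob: Running job: job_201107231823_0002
11/07/23 18:54:53 INFO streaming.StreamJob: map 0% reduce 0%
11/07/23 18:55:09 INFO streaming.StreamJob: map 100% reduce 0%
11/07/24 03:34:56 INFO streaming.StreamJob: Job complete: job_201107231823_0002
11/07/24 03:38:56 INFO streaming.StreamJob: Running job: job_201107231823_0003
11/07/24 03:39:25 INFO streaming.StreamJob: map 56% reduce 0%
11/07/24 10:06:27 INFO streaming.StreamJob: map 57% reduce 0%
11/07/24 13:36:03 INFO streaming.StreamJob: map 100% reduce 0%
11/07/24 22:08:03 INFO streaming.StreamJob: Job complete: job_201107231823_0003
"""

# Timestamp, log level, then the StreamJob message.
LINE = re.compile(
    r"^\s*(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) \w+ streaming\.StreamJob: (.*)$"
)


def parse(log):
    """Yield (timestamp, message) for every StreamJob line in the log text."""
    for raw in log.splitlines():
        m = LINE.match(raw)
        if m:
            yield datetime.strptime(m.group(1), "%y/%m/%d %H:%M:%S"), m.group(2)


def summarize(log):
    """Return {job_id: {'start', 'map_done', 'end'}} with datetime values."""
    jobs = {}
    current = None
    for ts, msg in parse(log):
        if msg.startswith("Running job: "):
            current = msg.split(": ", 1)[1]
            jobs[current] = {"start": ts, "map_done": None, "end": None}
        elif msg.startswith("map 100%") and current in jobs:
            # Record only the first time the map phase reports 100%.
            if jobs[current]["map_done"] is None:
                jobs[current]["map_done"] = ts
        elif msg.startswith("Job complete: "):
            job_id = msg.split(": ", 1)[1]
            if job_id in jobs:
                jobs[job_id]["end"] = ts
    return jobs


for job_id, t in summarize(LOG).items():
    to_map_done = t["map_done"] - t["start"] if t["map_done"] else None
    total = t["end"] - t["start"] if t["end"] else None
    print(job_id, "time to map 100%:", to_map_done, "total:", total)
```

On this excerpt, job 0002 reports `map 100%` 17 seconds after starting but takes about 8 h 40 min to complete, which agrees with the `real 520m5.778s` line in the log. So the map percentage is not a good indicator of how much work is left in these stages. One possible reading, not confirmed anywhere in this thread, is that the percentage tracks input handed to the streaming mappers rather than the external tools finishing.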


                            • #15
                              Same Problem


                              I just wanted to tell you that I have the same problem.
                              I looked through the log files and saw that there is an exception, which I guess is the reason the results are not good.

                              I tested it with some sequenced data from Illumina, but unfortunately I get no reliable results. The output file always shows that no reads were identified as human or as well-known pathogens, which is not possible.
                              It also took a very long time, although I used only a small amount of data.

                              Here is the exception I found:

                              Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
                              at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(
                              at org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(
                              at org.apache.hadoop.metrics.ganglia.GangliaContext.emitRecord(
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext.emitRecords(
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext.access$000(
                              at org.apache.hadoop.metrics.spi.AbstractMetricsContext$
                              at java.util.TimerThread.mainLoop(Unknown Source)
                              at Source)


                              I guess there is a problem with the Hadoop cluster. I am now trying a newer version of Hadoop; maybe that will change something.

                              But I am quite sure that this is not a config problem.

                              I will tell you if I find a solution!

                              with best regards,
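
A quick way to see where an exception like the one above comes from is to pull out the thread name and count which packages the stack frames belong to. This is a rough sketch only: `TRACE` is a subset of the lines from the post, kept truncated exactly as they appear there, and `describe` is a hypothetical helper, not a Hadoop or PathSeq function.

```python
import re

# Subset of the stack trace from the post, truncated as posted.
TRACE = '''\
Exception in thread "Timer thread for monitoring dfs" java.lang.NullPointerException
at org.apache.hadoop.metrics.ganglia.GangliaContext.xdr_string(
at org.apache.hadoop.metrics.ganglia.GangliaContext.emitMetric(
at org.apache.hadoop.metrics.spi.AbstractMetricsContext.timerEvent(
at java.util.TimerThread.mainLoop(Unknown Source)
'''

HEADER = re.compile(r'Exception in thread "([^"]+)" ([\w.$]+)')
FRAME = re.compile(r"^\s*at ([\w.$]+)\(", re.MULTILINE)


def describe(trace):
    """Summarize the thread, exception class, and origin of the stack frames."""
    header = HEADER.search(trace)
    frames = FRAME.findall(trace)
    hadoop = [f for f in frames if f.startswith("org.apache.hadoop.")]
    metrics = [f for f in hadoop if f.startswith("org.apache.hadoop.metrics.")]
    return {
        "thread": header.group(1) if header else None,
        "exception": header.group(2) if header else None,
        "hadoop_frames": len(hadoop),
        "metrics_frames": len(metrics),
    }


print(describe(TRACE))
```

In this trace every Hadoop frame is in `org.apache.hadoop.metrics`, and the thread is the metrics timer for `dfs`, so the exception is raised while reporting metrics to Ganglia rather than inside a map task. That suggests, without proving it, that this exception may not be what makes the run slow or the results empty.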


