Seeking advice on PathSeq

yiweiny replied

07-20-2011, 03:44 PM
Hi, Chandra,
Thanks so much for the log! It is very helpful. I am looking forward to the new version of PathSeq. In the meantime, I will start running PathSeq with our data and keep you updated on our progress.

Yi Wei
pcs_murali replied

07-20-2011, 07:00 AM
Hi Yi Wei and PathSeq users,

Here is the log file from the PathSeq run. I removed some lines for clarity.

The log is from a PathSeq run on 6 million unique reads (the sample file included with the PathSeq package).

In summary:
Total time summed over all 20 nodes: 387.7 hours
Wall-clock time: ~19 hours
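
The wall-clock figure can be cross-checked against the log by summing the `real` values that `time` prints after each stage. The snippet below is only a sketch: the values are copied by hand from the log in this post, and the stage labels in the comments are shorthand for the log headings, not names used by PathSeq itself.

```python
import re

# `real` values copied from the log in this post, in order of appearance.
REAL_TIMES = [
    "59m5.290s",    # Master data_loader
    "580m56.490s",  # Maq alignments + Duplicate remover
    "2m15.171s",    # Repeat masker loader
    "162m59.218s",  # Run repeat masker
    "19m5.252s",    # Master data_loader for Post
    "0m27.082s",    # Postsubtraction loader
    "369m31.146s",  # Postsubtraction on the unmapped reads
    "3m24.070s",    # Postsubtraction on the contigs
]


def to_seconds(stamp):
    """Convert a shell `time` value such as '59m5.290s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def total_hours(stamps):
    """Sum a list of `time` values and return the total in hours."""
    return sum(to_seconds(s) for s in stamps) / 3600.0


wall_hours = total_hours(REAL_TIMES)
print(f"summed wall-clock time: {wall_hours:.1f} h")
print(f"387.7 node-hours / 20 nodes: {387.7 / 20:.1f} h")
```

The summed stages come to roughly 20 hours, and 387.7 node-hours spread over 20 nodes is about 19.4 hours, so both agree with the ~19 hour figure to within rounding and the lines removed from the log.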

The most important thing to highlight here is:
The fact that 6 million reads took 19 hours does not mean that 60 million reads will take 10 times longer. In our hands, a sequencing run of 40 million reads takes about the same 19 hours.

Currently, we are working towards a faster PathSeq. In preliminary runs, the newer PathSeq takes half the time of the current version. Once we are done with validation, we will make a public release.

Please let me know if you have more questions or need help with the PathSeq installation.

Thanks
Chandra

Log file:
******
Master data_loader
**********************************
11/07/19 14:33:53 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3823448146028608527/] [] /tmp/streamjob9062470838490462284.jar tmpDir=null
11/07/19 14:33:54 INFO mapred.FileInputFormat: Total input paths to process : 20
11/07/19 14:33:55 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/19 14:33:55 INFO streaming.StreamJob: Running job: job_201107191423_0001
11/07/19 14:33:55 INFO streaming.StreamJob: To kill this job, run:
11/07/19 14:33:55 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0001
11/07/19 14:33:55 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0001
11/07/19 14:33:56 INFO streaming.StreamJob: map 0% reduce 0%
11/07/19 14:34:09 INFO streaming.StreamJob: map 20% reduce 0%
11/07/19 14:34:10 INFO streaming.StreamJob: map 40% reduce 0%
11/07/19 14:34:11 INFO streaming.StreamJob: map 60% reduce 0%
11/07/19 14:34:12 INFO streaming.StreamJob: map 80% reduce 0%
11/07/19 14:34:13 INFO streaming.StreamJob: map 95% reduce 0%
11/07/19 14:34:14 INFO streaming.StreamJob: map 100% reduce 0%
11/07/19 15:32:58 INFO streaming.StreamJob: Job complete: job_201107191423_0001
11/07/19 15:32:58 INFO streaming.StreamJob: Output: load

real 59m5.290s
user 0m3.278s
sys 0m1.108s
Master loader completed

Maq alignments + Duplicate remover
**********************************
11/07/19 15:33:07 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar2138415996895576783/] [] /tmp/streamjob4610713994932979234.jar tmpDir=null
11/07/19 15:33:08 INFO mapred.FileInputFormat: Total input paths to process : 21
11/07/19 15:33:08 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/19 15:33:08 INFO streaming.StreamJob: Running job: job_201107191423_0002
11/07/19 15:33:08 INFO streaming.StreamJob: To kill this job, run:
11/07/19 15:33:08 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0002
11/07/19 15:33:08 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0002
11/07/19 15:33:09 INFO streaming.StreamJob: map 0% reduce 0%
11/07/19 15:33:21 INFO streaming.StreamJob: map 24% reduce 0%
11/07/19 15:33:22 INFO streaming.StreamJob: map 52% reduce 0%
11/07/19 15:33:23 INFO streaming.StreamJob: map 76% reduce 0%
11/07/19 15:33:24 INFO streaming.StreamJob: map 86% reduce 0%
11/07/19 15:33:25 INFO streaming.StreamJob: map 95% reduce 0%
11/07/19 15:33:26 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 01:14:01 INFO streaming.StreamJob: Job complete: job_201107191423_0002
11/07/20 01:14:02 INFO streaming.StreamJob: Output: maq

real 580m56.490s
user 0m6.135s
sys 0m12.924s
Maq alignments + Duplicate remover completed

Repeat masker loader
********************

real 2m15.171s
user 1m11.617s
sys 0m12.480s
Repeat masker loader completed

Run repeat masker
********************
11/07/20 01:16:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar6903467901556213816/] [] /tmp/streamjob3668474814406944845.jar tmpDir=null
11/07/20 01:16:24 INFO mapred.FileInputFormat: Total input paths to process : 60
11/07/20 01:16:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 01:16:24 INFO streaming.StreamJob: Running job: job_201107191423_0003
11/07/20 01:16:24 INFO streaming.StreamJob: To kill this job, run:
11/07/20 01:16:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0003
11/07/20 01:16:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0003
11/07/20 01:16:25 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 01:16:36 INFO streaming.StreamJob: map 5% reduce 0%
11/07/20 01:16:37 INFO streaming.StreamJob: map 10% reduce 0%
11/07/20 01:16:38 INFO streaming.StreamJob: map 20% reduce 0%
11/07/20 01:16:40 INFO streaming.StreamJob: map 32% reduce 0%
11/07/20 01:16:41 INFO streaming.StreamJob: map 37% reduce 0%
11/07/20 01:16:42 INFO streaming.StreamJob: map 43% reduce 0%
11/07/20 01:16:43 INFO streaming.StreamJob: map 53% reduce 0%
11/07/20 01:16:45 INFO streaming.StreamJob: map 65% reduce 0%
11/07/20 01:16:46 INFO streaming.StreamJob: map 72% reduce 0%
11/07/20 01:16:47 INFO streaming.StreamJob: map 77% reduce 0%
11/07/20 01:16:48 INFO streaming.StreamJob: map 85% reduce 0%
11/07/20 01:16:49 INFO streaming.StreamJob: map 88% reduce 0%
11/07/20 01:16:51 INFO streaming.StreamJob: map 97% reduce 0%
11/07/20 01:16:52 INFO streaming.StreamJob: map 98% reduce 0%
11/07/20 01:16:54 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 03:59:21 INFO streaming.StreamJob: Job complete: job_201107191423_0003
11/07/20 03:59:21 INFO streaming.StreamJob: Output: repeat

real 162m59.218s
user 0m4.786s
sys 0m1.192s
Repeat masker runs completed

Deleted hdfs://ip-10-118-59-251.ec2.internal:50001/user/root/load
Master data_loader for Post
********************
11/07/20 03:59:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_postsub.py, /mnt/hadoop/hadoop-unjar4739713752539699730/] [] /tmp/streamjob4058317523841970356.jar tmpDir=null
11/07/20 03:59:24 INFO mapred.FileInputFormat: Total input paths to process : 20
11/07/20 03:59:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 03:59:24 INFO streaming.StreamJob: Running job: job_201107191423_0004
11/07/20 03:59:24 INFO streaming.StreamJob: To kill this job, run:
11/07/20 03:59:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0004
11/07/20 03:59:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0004
11/07/20 03:59:25 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 03:59:36 INFO streaming.StreamJob: map 20% reduce 0%
11/07/20 03:59:37 INFO streaming.StreamJob: map 45% reduce 0%
11/07/20 03:59:38 INFO streaming.StreamJob: map 60% reduce 0%
11/07/20 03:59:39 INFO streaming.StreamJob: map 80% reduce 0%
11/07/20 03:59:40 INFO streaming.StreamJob: map 95% reduce 0%
11/07/20 03:59:41 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 04:18:27 INFO streaming.StreamJob: Job complete: job_201107191423_0004
11/07/20 04:18:28 INFO streaming.StreamJob: Output: load

real 19m5.252s
user 0m2.360s
sys 0m1.129s
Master loader completed

Postsubtraction loader
********************
real 0m27.082s
user 0m10.882s
sys 0m1.435s

Postsubstraction on the Unmapped reads
********************
11/07/20 04:19:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped.py, /mnt/hadoop/hadoop-unjar3151623736871864286/] [] /tmp/streamjob6660721182687954669.jar tmpDir=null
11/07/20 04:19:03 INFO mapred.FileInputFormat: Total input paths to process : 40
11/07/20 04:19:03 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 04:19:03 INFO streaming.StreamJob: Running job: job_201107191423_0005
11/07/20 04:19:03 INFO streaming.StreamJob: To kill this job, run:
11/07/20 04:19:03 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0005
11/07/20 04:19:03 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0005
11/07/20 04:19:04 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 04:19:15 INFO streaming.StreamJob: map 5% reduce 0%
11/07/20 04:19:16 INFO streaming.StreamJob: map 15% reduce 0%
11/07/20 04:19:17 INFO streaming.StreamJob: map 18% reduce 0%
11/07/20 04:19:18 INFO streaming.StreamJob: map 28% reduce 0%
11/07/20 04:19:19 INFO streaming.StreamJob: map 48% reduce 0%
11/07/20 04:19:20 INFO streaming.StreamJob: map 55% reduce 0%
11/07/20 04:19:21 INFO streaming.StreamJob: map 65% reduce 0%
11/07/20 04:19:23 INFO streaming.StreamJob: map 72% reduce 0%
11/07/20 04:19:24 INFO streaming.StreamJob: map 92% reduce 0%
11/07/20 04:19:25 INFO streaming.StreamJob: map 95% reduce 0%
11/07/20 04:19:26 INFO streaming.StreamJob: map 97% reduce 0%
11/07/20 04:19:30 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 10:28:32 INFO streaming.StreamJob: Job complete: job_201107191423_0005
11/07/20 10:28:32 INFO streaming.StreamJob: Output: postsub

real 369m31.146s
user 0m5.041s
sys 0m1.540s

Postsubstraction on the contigs
********************
11/07/20 10:28:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postvelvet.py, /mnt/hadoop/hadoop-unjar1426923625300485254/] [] /tmp/streamjob2059410816131312926.jar tmpDir=null
11/07/20 10:28:34 INFO mapred.FileInputFormat: Total input paths to process : 18
11/07/20 10:28:34 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/20 10:28:34 INFO streaming.StreamJob: Running job: job_201107191423_0006
11/07/20 10:28:34 INFO streaming.StreamJob: To kill this job, run:
11/07/20 10:28:34 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0006
11/07/20 10:28:34 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0006
11/07/20 10:28:35 INFO streaming.StreamJob: map 0% reduce 0%
11/07/20 10:28:47 INFO streaming.StreamJob: map 28% reduce 0%
11/07/20 10:28:48 INFO streaming.StreamJob: map 56% reduce 0%
11/07/20 10:28:49 INFO streaming.StreamJob: map 89% reduce 0%
11/07/20 10:28:50 INFO streaming.StreamJob: map 94% reduce 0%
11/07/20 10:28:51 INFO streaming.StreamJob: map 100% reduce 0%
11/07/20 10:31:56 INFO streaming.StreamJob: Job complete: job_201107191423_0006
11/07/20 10:31:56 INFO streaming.StreamJob: Output: postsubvel

real 3m24.070s
user 0m1.577s
sys 0m0.238s
Postsubtraction completed

File '/usr/local/hadoop-0.19.0/output/Output.tar' stored as 's3://ami-ami-QFnew-foutput/Output.tar' (106291200 bytes in 16.4 seconds, 6.19 MB/s) [1 of 1]

Results Summary:
*************
Results summary:

Substraction Pathseq_Cloud
Total number of reads 6369435
Total number of reads after duplicate remover 6369435
Total number of unmapped reads after Maq 1 alignment (Database: MAQ1) 1829265
Total number of unmapped reads after Maq 2 alignment (Database: MAQ2) 504427
Total number of unmapped reads after Maq 3 alignment (Database: MAQ3) 488954
Total number of unmapped reads after Maq 4 alignment (Database: MAQ4) 485479
Total number of unmapped reads after repeat masker 365393
Total number of unmapped reads after Megablast (Database: BLAST1) 70343
Total number of unmapped reads after Megablast (Database: BLAST2) 33808
Total number of unmapped reads after BlastN1 (Database: BLAST1) 33768
Total number of unmapped reads after BlastN2 (Database: BLAST2) 33746
Total number of unmapped reads 33746
Reads after computational subtraction (Unmapped reads) unmappedreads.fq1
Contigs from unmapped reads contigs.fq1
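
To make the summary easier to read, the read counts can be turned into a per-step funnel. This is only an illustrative sketch: the counts are copied from the summary above, and the short step labels are my own shorthand rather than PathSeq output.

```python
# Read counts copied from the results summary above.
STEPS = [
    ("input reads", 6369435),
    ("after duplicate remover", 6369435),
    ("after Maq 1", 1829265),
    ("after Maq 2", 504427),
    ("after Maq 3", 488954),
    ("after Maq 4", 485479),
    ("after repeat masker", 365393),
    ("after Megablast BLAST1", 70343),
    ("after Megablast BLAST2", 33808),
    ("after BlastN1", 33768),
    ("after BlastN2", 33746),
]


def funnel(steps):
    """Return (label, count, removed_by_step, percent_of_input_left) rows."""
    total = steps[0][1]
    previous = total
    rows = []
    for label, count in steps:
        rows.append((label, count, previous - count, 100.0 * count / total))
        previous = count
    return rows


for label, count, removed, pct in funnel(STEPS):
    print(f"{label:<26}{count:>9}{removed:>9}{pct:>8.2f}%")
```

Read this way, the first Maq alignment removes about 71% of the input reads, and roughly 0.5% of the reads remain unmapped at the end.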
pcs_murali replied

07-11-2011, 08:02 AM
Hi Yi Wei,

Yes, we are working on integrating BWA into PathSeq. You are correct that BWA is much faster than MAQ.

We are also working on support for Hadoop-based internal computing clusters.

What kind of internal compute cluster do you have? Is it LSF?

Thanks
Chandra
yiweiny replied

07-11-2011, 07:50 AM
PathSeq questions

Hi, Chandra,
Thanks for the advice. It is very helpful. We are trying to look for potential pathogen sequences in Illumina RNA-Seq data, and we will probably get 40-80 million reads from each sample. Can you send me a copy of the Hadoop log from your run of the 6 million reads in the sample data file provided with the PathSeq package? I would like to run the same 6 million reads and compare the logs.
Best Regards,

Yi Wei

P.S.
1. Do you have plans to modify PathSeq so that it can be run on internal compute clusters instead of Amazon EC2?
2. Are you considering using Bowtie or BWA for the initial filtering step, as they are much faster than MAQ?
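
For comparing two Hadoop streaming logs stage by stage, one option is to read the elapsed time of each job from the `Running job` and `Job complete` timestamps. The sketch below assumes the `yy/mm/dd HH:MM:SS` timestamp layout seen in the logs in this thread and expects each log record on a single line; the function name `job_durations` is hypothetical, and the sample lines are copied from a log posted in this thread.

```python
import re
from datetime import datetime

# Matches the two StreamJob records that bracket a job in the logs in this thread.
LINE = re.compile(
    r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) INFO streaming\.StreamJob: "
    r"(Running job|Job complete): (job_\d+_\d+)"
)


def job_durations(log_text):
    """Map job id -> seconds between its 'Running job' and 'Job complete' lines."""
    started = {}
    durations = {}
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match is None:
            continue
        stamp, event, job_id = match.groups()
        when = datetime.strptime(stamp, "%y/%m/%d %H:%M:%S")
        if event == "Running job":
            started[job_id] = when
        elif job_id in started:
            durations[job_id] = (when - started[job_id]).total_seconds()
    return durations


SAMPLE = """\
11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
"""

for job_id, seconds in job_durations(SAMPLE).items():
    print(job_id, f"{seconds / 60:.1f} min")
```

Running the same function over two full logs gives one number per job, which makes it easier to see which stage accounts for a difference in total run time.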
pcs_murali replied

07-11-2011, 07:24 AM
Hi Yi Wei,

Thanks for your log file.

I am re-running Pathseq with the sample file provided with the package. This sample file contains 6 million unique reads. I will share my results with you, once it is done.

I have looked at the log file you posted. It shows no errors, so PathSeq appears to be running fine. As you know, we run 4 MAQ alignments, 2 Megablast alignments, and 2 BlastN alignments. These steps take time to finish, and up to a certain point that time is largely independent of the number of input reads. What I mean is as follows:

If you have 100,000 reads: the run may take about 5 hours to finish
If you have 1 million reads: the run may take more or less the same time as for 100,000 reads
If you have 40 million reads: the run may take about 16-18 hours to finish

I will post my latest results from the 6 million reads once I have them.

Meanwhile, please let me know your requirements:

1. How many reads do you have in your real sequencing file?
2. Are the reads from Illumina?
3. Are you using total RNA-Seq or WGS?

Thanks
Chandra
yiweiny replied

07-11-2011, 07:05 AM
PathSeq logs

Hi, Chandra,
Thanks for the advice. I don't know whether it is helpful to you or not, but here is part of the Hadoop log I captured from the master node in EC2:

rmr: cannot remove config: No such file or directory.
rmr: cannot remove s3config: No such file or directory.
rmr: cannot remove load: No such file or directory.
Master data_loader
11/07/08 20:27:30 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3242713623624356081/] [] /tmp/streamjob4418070527408526594.jar tmpDir=null
11/07/08 20:27:30 INFO mapred.FileInputFormat: Total input paths to process : 3
11/07/08 20:27:31 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
11/07/08 20:27:31 INFO streaming.StreamJob: To kill this job, run:
11/07/08 20:27:31 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0001
11/07/08 20:27:31 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0001
11/07/08 20:27:32 INFO streaming.StreamJob: map 0% reduce 0%
11/07/08 20:27:44 INFO streaming.StreamJob: map 33% reduce 0%
11/07/08 20:27:47 INFO streaming.StreamJob: map 67% reduce 0%
11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
11/07/08 20:47:03 INFO streaming.StreamJob: Output: load

real 19m33.828s
user 0m2.365s
sys 0m0.665s
Master loader completed
ERROR: Bucket 'ami-yiweijob6-stat' does not exist
Bucket 's3://ami-yiweijob6-stat/' removed
Bucket 's3://ami-yiweijob6-stat/' created
ERROR: Bucket 'ami-yiweijob6-output' does not exist
Bucket 's3://ami-yiweijob6-output/' removed
Bucket 's3://ami-yiweijob6-output/' created
File s3://reads-yiwei-regeneron/input1.local saved as '/usr/local/hadoop-0.19.0/input1.local' (75 bytes in 0.0 seconds, 4.17 kB/s)
File s3://reads-yiwei-regeneron/input10.local saved as '/usr/local/hadoop-0.19.0/input10.local' (76 bytes in 0.0 seconds, 3.08 kB/s)
File s3://reads-yiwei-regeneron/input11.local saved as '/usr/local/hadoop-0.19.0/input11.local' (76 bytes in 0.0 seconds, 2.79 kB/s)
File s3://reads-yiwei-regeneron/input12.local saved as '/usr/local/hadoop-0.19.0/input12.local' (76 bytes in 0.0 seconds, 3.16 kB/s)
File s3://reads-yiwei-regeneron/input2.local saved as '/usr/local/hadoop-0.19.0/input2.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
File s3://reads-yiwei-regeneron/input3.local saved as '/usr/local/hadoop-0.19.0/input3.local' (75 bytes in 0.0 seconds, 2.59 kB/s)
File s3://reads-yiwei-regeneron/input4.local saved as '/usr/local/hadoop-0.19.0/input4.local' (75 bytes in 0.0 seconds, 3.34 kB/s)
File s3://reads-yiwei-regeneron/input5.local saved as '/usr/local/hadoop-0.19.0/input5.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
File s3://reads-yiwei-regeneron/input6.local saved as '/usr/local/hadoop-0.19.0/input6.local' (75 bytes in 0.0 seconds, 3.43 kB/s)
File s3://reads-yiwei-regeneron/input7.local saved as '/usr/local/hadoop-0.19.0/input7.local' (75 bytes in 0.0 seconds, 3.18 kB/s)
File s3://reads-yiwei-regeneron/input8.local saved as '/usr/local/hadoop-0.19.0/input8.local' (75 bytes in 0.0 seconds, 3.25 kB/s)
File s3://reads-yiwei-regeneron/input9.local saved as '/usr/local/hadoop-0.19.0/input9.local' (75 bytes in 0.0 seconds, 3.46 kB/s)
rmr: cannot remove test: No such file or directory.
rmr: cannot remove maq: No such file or directory.
Maq alignments + Duplicate remover
11/07/08 20:47:10 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar7827192869392733442/] [] /tmp/streamjob8065656650743401458.jar tmpDir=null
11/07/08 20:47:10 INFO mapred.FileInputFormat: Total input paths to process : 12
11/07/08 20:47:11 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
11/07/08 20:47:11 INFO streaming.StreamJob: Running job: job_201107082024_0002
11/07/08 20:47:11 INFO streaming.StreamJob: To kill this job, run:
11/07/08 20:47:11 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0002
11/07/08 20:47:11 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0002
11/07/08 20:47:12 INFO streaming.StreamJob: map 0% reduce 0%
11/07/08 20:47:24 INFO streaming.StreamJob: map 8% reduce 0%
11/07/08 20:47:25 INFO streaming.StreamJob: map 17% reduce 0%
11/07/08 20:47:29 INFO streaming.StreamJob: map 33% reduce 0%
11/07/08 20:47:30 INFO streaming.StreamJob: map 42% reduce 0%
11/07/08 20:47:34 INFO streaming.StreamJob: map 58% reduce 0%
11/07/08 20:47:35 INFO streaming.StreamJob: map 67% reduce 0%
11/07/08 20:47:39 INFO streaming.StreamJob: map 75% reduce 0%

This is from running 100,000 reads on 3 instances in EC2. I had to shut it down after 2 hours, as the processing did not seem likely to finish in a reasonable amount of time. I hope this log is useful for your troubleshooting. And thanks again for your help!

Yi Wei
pcs_murali replied

07-09-2011, 07:03 AM
Hi Yi,

I will re-run it on the cloud and see how much time it will take.

In our hands, other samples with 40 million reads run in 1 to 1.2 days.

Meanwhile, please download the latest version from our website.

I will get back to you as soon as possible.

I greatly appreciate your comments.

Thanks
Chandra
yiweiny replied

07-08-2011, 11:23 AM
slow PathSeq

Hi, Chandra,
The version of PathSeq is 5.1. The data set is a sampling of 100,000 reads from the sample input files provided by the PathSeq web site. My problem is that there does not seem to be a difference whether I run it on 10 nodes or on 20 nodes. Both took a long time to run. I am concerned that the Hadoop cluster is not set up correctly.
Thanks for the prompt reply. I am looking forward to hearing from you.

Yi
pcs_murali replied

07-08-2011, 08:16 AM
Hi,

Could you tell me which version of PathSeq you are running?

Also, is your dataset RNA-based or DNA-based?

Thanks
Chandra
yiweiny replied

07-07-2011, 12:52 PM
PathSeq is very slow on EC2

I also set up PathSeq on EC2. I was able to run it, but it was very slow. I tested a data set with 100,000 reads and it took 10 hours running on 10 instances. I would appreciate any advice you can give me.
pcs_murali replied

07-02-2011, 01:44 PM
Hi,

The Amazon EC2 instances we used are Large instances. In our hands, total RNA sequencing data (30 to 50 million reads) from a GAIIx takes about 1 to 1.2 days to finish on 20 nodes running in parallel.
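
For budgeting, the figures above translate into node-hours with simple arithmetic. This is only a back-of-the-envelope sketch: the helper name `node_hours` is hypothetical, and no instance price is stated in this thread, so any cost has to be worked out by multiplying the result by the current hourly rate.

```python
def node_hours(nodes, wall_days):
    """Node-hours consumed when `nodes` instances run for `wall_days` days."""
    return nodes * wall_days * 24.0


low = node_hours(20, 1.0)   # 20 nodes for 1 day
high = node_hours(20, 1.2)  # 20 nodes for 1.2 days
print(f"{low:.0f} to {high:.0f} node-hours")
```

That comes to about 480 to 576 node-hours for a run of 30 to 50 million reads.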

We are working towards reducing these run times.

Please let me know if you need more information on Pathseq.

Thanks
Chandra
dbrazel started a topic Seeking advice on PathSeq

06-29-2011, 03:17 PM
Seeking advice on PathSeq

Hi all,

I'm interested in using the PathSeq software and I was wondering if anyone had some advice on what sort of Amazon EC2 instances result in reasonable run times for full Illumina GAIIx or HiSeq data sets.

Thanks in advance!
Tags: cloud computing
