
  • yiweiny
    replied
    Hi, Chandra,
    Thanks so much for the log! It is very helpful. I am looking forward to the new version of PathSeq. In the meantime, I will start running PathSeq on our data and keep you updated on our progress.

    Yi Wei


  • pcs_murali
    replied
    Hi Yi Wei and Pathseq users,

    Here is the log file from the Pathseq runs. I just removed some lines for clarity.

    The log is from a PathSeq run on 6 million unique reads (the sample file included with the PathSeq package).

    In summary:
    Total time (summed over all 20 nodes): 387.7 hours
    Wall-clock time: ~19 hours
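
The two summary figures are consistent with each other: dividing the total node time by the 20 nodes gives roughly the wall-clock time. A minimal sketch of that arithmetic, assuming all nodes are busy for the whole run (the variable names are illustrative and not part of PathSeq):

```python
# Figures quoted in the summary above.
total_node_hours = 387.7   # time summed over all nodes
nodes = 20

# If every node works for the whole run, wall time ~= node-hours / nodes.
wall_hours = total_node_hours / nodes
print(f"Estimated wall-clock time: {wall_hours:.1f} hours")  # prints 19.4 hours
```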

    The most important thing to highlight here is that 6 million reads taking 19 hours does not mean that 60 million reads will take 10 times longer. In our hands, a sequencing run of 40 million reads takes about the same 19 hours.

    We are currently working on a faster PathSeq. In preliminary runs, the newer version takes half the time of the current version. Once validation is done, we will make a public release.

    Please let me know if you have more questions or need help with PathSeq installation.

    Thanks
    Chandra


    Log file:
    ******
    Master data_loader
    **********************************
    11/07/19 14:33:53 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3823448146028608527/] [] /tmp/streamjob9062470838490462284.jar tmpDir=null
    11/07/19 14:33:54 INFO mapred.FileInputFormat: Total input paths to process : 20
    11/07/19 14:33:55 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/19 14:33:55 INFO streaming.StreamJob: Running job: job_201107191423_0001
    11/07/19 14:33:55 INFO streaming.StreamJob: To kill this job, run:
    11/07/19 14:33:55 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0001
    11/07/19 14:33:55 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0001
    11/07/19 14:33:56 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/19 14:34:09 INFO streaming.StreamJob: map 20% reduce 0%
    11/07/19 14:34:10 INFO streaming.StreamJob: map 40% reduce 0%
    11/07/19 14:34:11 INFO streaming.StreamJob: map 60% reduce 0%
    11/07/19 14:34:12 INFO streaming.StreamJob: map 80% reduce 0%
    11/07/19 14:34:13 INFO streaming.StreamJob: map 95% reduce 0%
    11/07/19 14:34:14 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/19 15:32:58 INFO streaming.StreamJob: Job complete: job_201107191423_0001
    11/07/19 15:32:58 INFO streaming.StreamJob: Output: load

    real 59m5.290s
    user 0m3.278s
    sys 0m1.108s
    Master loader completed



    Maq alignments + Duplicate remover
    **********************************
    11/07/19 15:33:07 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar2138415996895576783/] [] /tmp/streamjob4610713994932979234.jar tmpDir=null
    11/07/19 15:33:08 INFO mapred.FileInputFormat: Total input paths to process : 21
    11/07/19 15:33:08 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/19 15:33:08 INFO streaming.StreamJob: Running job: job_201107191423_0002
    11/07/19 15:33:08 INFO streaming.StreamJob: To kill this job, run:
    11/07/19 15:33:08 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0002
    11/07/19 15:33:08 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0002
    11/07/19 15:33:09 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/19 15:33:21 INFO streaming.StreamJob: map 24% reduce 0%
    11/07/19 15:33:22 INFO streaming.StreamJob: map 52% reduce 0%
    11/07/19 15:33:23 INFO streaming.StreamJob: map 76% reduce 0%
    11/07/19 15:33:24 INFO streaming.StreamJob: map 86% reduce 0%
    11/07/19 15:33:25 INFO streaming.StreamJob: map 95% reduce 0%
    11/07/19 15:33:26 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/20 01:14:01 INFO streaming.StreamJob: Job complete: job_201107191423_0002
    11/07/20 01:14:02 INFO streaming.StreamJob: Output: maq

    real 580m56.490s
    user 0m6.135s
    sys 0m12.924s
    Maq alignments + Duplicate remover completed

    Repeat masker loader
    ********************

    real 2m15.171s
    user 1m11.617s
    sys 0m12.480s
    Repeat masker loader completed

    Run repeat masker
    ********************
    11/07/20 01:16:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_repeatmasker.py, /mnt/hadoop/hadoop-unjar6903467901556213816/] [] /tmp/streamjob3668474814406944845.jar tmpDir=null
    11/07/20 01:16:24 INFO mapred.FileInputFormat: Total input paths to process : 60
    11/07/20 01:16:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/20 01:16:24 INFO streaming.StreamJob: Running job: job_201107191423_0003
    11/07/20 01:16:24 INFO streaming.StreamJob: To kill this job, run:
    11/07/20 01:16:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0003
    11/07/20 01:16:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0003
    11/07/20 01:16:25 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/20 01:16:36 INFO streaming.StreamJob: map 5% reduce 0%
    11/07/20 01:16:37 INFO streaming.StreamJob: map 10% reduce 0%
    11/07/20 01:16:38 INFO streaming.StreamJob: map 20% reduce 0%
    11/07/20 01:16:40 INFO streaming.StreamJob: map 32% reduce 0%
    11/07/20 01:16:41 INFO streaming.StreamJob: map 37% reduce 0%
    11/07/20 01:16:42 INFO streaming.StreamJob: map 43% reduce 0%
    11/07/20 01:16:43 INFO streaming.StreamJob: map 53% reduce 0%
    11/07/20 01:16:45 INFO streaming.StreamJob: map 65% reduce 0%
    11/07/20 01:16:46 INFO streaming.StreamJob: map 72% reduce 0%
    11/07/20 01:16:47 INFO streaming.StreamJob: map 77% reduce 0%
    11/07/20 01:16:48 INFO streaming.StreamJob: map 85% reduce 0%
    11/07/20 01:16:49 INFO streaming.StreamJob: map 88% reduce 0%
    11/07/20 01:16:51 INFO streaming.StreamJob: map 97% reduce 0%
    11/07/20 01:16:52 INFO streaming.StreamJob: map 98% reduce 0%
    11/07/20 01:16:54 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/20 03:59:21 INFO streaming.StreamJob: Job complete: job_201107191423_0003
    11/07/20 03:59:21 INFO streaming.StreamJob: Output: repeat

    real 162m59.218s
    user 0m4.786s
    sys 0m1.192s
    Repeat masker runs completed

    Deleted hdfs://ip-10-118-59-251.ec2.internal:50001/user/root/load
    Master data_loader for Post
    ********************
    11/07/20 03:59:23 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/mapper_data_postsub.py, /mnt/hadoop/hadoop-unjar4739713752539699730/] [] /tmp/streamjob4058317523841970356.jar tmpDir=null
    11/07/20 03:59:24 INFO mapred.FileInputFormat: Total input paths to process : 20
    11/07/20 03:59:24 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/20 03:59:24 INFO streaming.StreamJob: Running job: job_201107191423_0004
    11/07/20 03:59:24 INFO streaming.StreamJob: To kill this job, run:
    11/07/20 03:59:24 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0004
    11/07/20 03:59:24 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0004
    11/07/20 03:59:25 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/20 03:59:36 INFO streaming.StreamJob: map 20% reduce 0%
    11/07/20 03:59:37 INFO streaming.StreamJob: map 45% reduce 0%
    11/07/20 03:59:38 INFO streaming.StreamJob: map 60% reduce 0%
    11/07/20 03:59:39 INFO streaming.StreamJob: map 80% reduce 0%
    11/07/20 03:59:40 INFO streaming.StreamJob: map 95% reduce 0%
    11/07/20 03:59:41 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/20 04:18:27 INFO streaming.StreamJob: Job complete: job_201107191423_0004
    11/07/20 04:18:28 INFO streaming.StreamJob: Output: load

    real 19m5.252s
    user 0m2.360s
    sys 0m1.129s
    Master loader completed

    Postsubtraction loader
    ********************
    real 0m27.082s
    user 0m10.882s
    sys 0m1.435s

    Postsubstraction on the Unmapped reads
    ********************
    11/07/20 04:19:02 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/Fas2FQ1.java, /root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postunmapped.py, /mnt/hadoop/hadoop-unjar3151623736871864286/] [] /tmp/streamjob6660721182687954669.jar tmpDir=null
    11/07/20 04:19:03 INFO mapred.FileInputFormat: Total input paths to process : 40
    11/07/20 04:19:03 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/20 04:19:03 INFO streaming.StreamJob: Running job: job_201107191423_0005
    11/07/20 04:19:03 INFO streaming.StreamJob: To kill this job, run:
    11/07/20 04:19:03 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0005
    11/07/20 04:19:03 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0005
    11/07/20 04:19:04 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/20 04:19:15 INFO streaming.StreamJob: map 5% reduce 0%
    11/07/20 04:19:16 INFO streaming.StreamJob: map 15% reduce 0%
    11/07/20 04:19:17 INFO streaming.StreamJob: map 18% reduce 0%
    11/07/20 04:19:18 INFO streaming.StreamJob: map 28% reduce 0%
    11/07/20 04:19:19 INFO streaming.StreamJob: map 48% reduce 0%
    11/07/20 04:19:20 INFO streaming.StreamJob: map 55% reduce 0%
    11/07/20 04:19:21 INFO streaming.StreamJob: map 65% reduce 0%
    11/07/20 04:19:23 INFO streaming.StreamJob: map 72% reduce 0%
    11/07/20 04:19:24 INFO streaming.StreamJob: map 92% reduce 0%
    11/07/20 04:19:25 INFO streaming.StreamJob: map 95% reduce 0%
    11/07/20 04:19:26 INFO streaming.StreamJob: map 97% reduce 0%
    11/07/20 04:19:30 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/20 10:28:32 INFO streaming.StreamJob: Job complete: job_201107191423_0005
    11/07/20 10:28:32 INFO streaming.StreamJob: Output: postsub

    real 369m31.146s
    user 0m5.041s
    sys 0m1.540s

    Postsubstraction on the contigs
    ********************
    11/07/20 10:28:33 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/FQone2Fastq.java, /root/RepeatMaskerFormat.java, /root/ParsedBlastParser.cc, /root/blastxml.cc, /root/FQone2Fasta.java, /root/extractFullQuert4BHitTable.java, /root/extractUnmapped_latest.java, /root/BlastParser.java, /root/RepeatMaskerRead.java, /root/mapper_postvelvet.py, /mnt/hadoop/hadoop-unjar1426923625300485254/] [] /tmp/streamjob2059410816131312926.jar tmpDir=null
    11/07/20 10:28:34 INFO mapred.FileInputFormat: Total input paths to process : 18
    11/07/20 10:28:34 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/20 10:28:34 INFO streaming.StreamJob: Running job: job_201107191423_0006
    11/07/20 10:28:34 INFO streaming.StreamJob: To kill this job, run:
    11/07/20 10:28:34 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-118-59-251.ec2.internal:50002 -kill job_201107191423_0006
    11/07/20 10:28:34 INFO streaming.StreamJob: Tracking URL: http://ip-10-118-59-251.ec2.internal...107191423_0006
    11/07/20 10:28:35 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/20 10:28:47 INFO streaming.StreamJob: map 28% reduce 0%
    11/07/20 10:28:48 INFO streaming.StreamJob: map 56% reduce 0%
    11/07/20 10:28:49 INFO streaming.StreamJob: map 89% reduce 0%
    11/07/20 10:28:50 INFO streaming.StreamJob: map 94% reduce 0%
    11/07/20 10:28:51 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/20 10:31:56 INFO streaming.StreamJob: Job complete: job_201107191423_0006
    11/07/20 10:31:56 INFO streaming.StreamJob: Output: postsubvel

    real 3m24.070s
    user 0m1.577s
    sys 0m0.238s
    Postsubtraction completed
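
The per-stage `real` timings in the log above can be added up to see where the time goes. The sketch below copies those eight lines verbatim and sums them; the helper name `real_to_seconds` is made up for this example. The total comes to roughly 20 hours, in line with the wall-clock figure quoted in the summary, and the Maq stage alone accounts for almost half of it.

```python
import re

# "real" lines copied from the log above, in stage order.
real_lines = """
real 59m5.290s
real 580m56.490s
real 2m15.171s
real 162m59.218s
real 19m5.252s
real 0m27.082s
real 369m31.146s
real 3m24.070s
"""

def real_to_seconds(line):
    """Convert a `time`-style 'real XmY.YYYs' line to seconds."""
    match = re.fullmatch(r"real (\d+)m([\d.]+)s", line.strip())
    if match is None:
        raise ValueError(f"not a 'real' timing line: {line!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)

# Skip the blank lines at the start and end of the block.
stage_seconds = [real_to_seconds(l) for l in real_lines.split("\n") if l.strip()]
total_seconds = sum(stage_seconds)

print(f"Sum of stage times: {total_seconds / 3600:.1f} hours")
print(f"Longest stage: {max(stage_seconds) / 3600:.1f} hours")
```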

    File '/usr/local/hadoop-0.19.0/output/Output.tar' stored as 's3://ami-ami-QFnew-foutput/Output.tar' (106291200 bytes in 16.4 seconds, 6.19 MB/s) [1 of 1]

    Results Summary:
    *************
    Results summary:

    Substraction Pathseq_Cloud
    Total number of reads 6369435
    Total number of reads after duplicate remover 6369435
    Total number of unmapped reads after Maq 1 alignment (Database: MAQ1) 1829265
    Total number of unmapped reads after Maq 2 alignment (Database: MAQ2) 504427
    Total number of unmapped reads after Maq 3 alignment (Database: MAQ3) 488954
    Total number of unmapped reads after Maq 4 alignment (Database: MAQ4) 485479
    Total number of unmapped reads after repeat masker 365393
    Total number of unmapped reads after Megablast (Database: BLAST1) 70343
    Total number of unmapped reads after Megablast (Database: BLAST2) 33808
    Total number of unmapped reads after BlastN1 (Database: BLAST1) 33768
    Total number of unmapped reads after BlastN2 (Database: BLAST2) 33746
    Total number of unmapped reads 33746
    Reads after computational subtraction (Unmapped reads) unmappedreads.fq1
    Contigs from unmapped reads contigs.fq1
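
The results summary can be read as a subtraction funnel. The sketch below uses only the read counts listed above and prints how many reads each stage removes; the stage labels are shortened for display. The first Maq alignment removes about 71% of the input, and about 0.5% of the reads remain unmapped at the end.

```python
# Read counts copied from the results summary above.
stages = [
    ("input reads", 6369435),
    ("after duplicate remover", 6369435),
    ("after Maq 1", 1829265),
    ("after Maq 2", 504427),
    ("after Maq 3", 488954),
    ("after Maq 4", 485479),
    ("after repeat masker", 365393),
    ("after Megablast (BLAST1)", 70343),
    ("after Megablast (BLAST2)", 33808),
    ("after BlastN1 (BLAST1)", 33768),
    ("after BlastN2 (BLAST2)", 33746),
]

total = stages[0][1]
previous = total
for name, remaining in stages:
    removed = previous - remaining
    print(f"{name:26s} {remaining:>9,d} left "
          f"({removed:>9,d} removed, {100 * remaining / total:6.2f}% of input)")
    previous = remaining

# Fraction of the input that survives every subtraction step.
final_fraction = stages[-1][1] / total
```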


  • pcs_murali
    replied
    Hi Yi Wei,

    Yes, we are working towards getting BWA implemented in PathSeq. You are correct that BWA is much faster than Maq.

    We are also working on a version for Hadoop-based internal computing clusters.

    What kind of internal computer cluster do you have? Is it LSF?

    Thanks
    Chandra


  • yiweiny
    replied
    PathSeq questions

    Hi, Chandra,
    Thanks for the advice. It is very helpful. What we are trying to do is look for potential pathogen sequences in Illumina RNA-Seq data. We will probably get 40-80 million reads from each sample. Can you send me a copy of the Hadoop log from your run of the 6 million reads in the sample data file provided with the PathSeq package? I would like to run the same 6 million reads and compare the logs.
    Best Regards,

    Yi Wei

    P.S.
    1. Do you have plans to modify PathSeq so that it can be run on internal computer clusters instead of Amazon EC2?
    2. Are you considering using Bowtie or BWA for the initial filtering step, as they are much faster than Maq?


  • pcs_murali
    replied
    Hi Yi Wei,

    Thanks for your log file.

    I am re-running Pathseq with the sample file provided with the package. This sample file contains 6 million unique reads. I will share my results with you, once it is done.

    I have looked at the log file you posted. No errors were produced, so PathSeq seems to be running fine. As you know, we run 4 Maq alignments, 2 Megablast alignments and 2 BlastN alignments. These take time to finish, and up to a certain extent that time is independent of the number of reads going in. What I mean is as follows:

    If you have 100,000 reads ---- the run may take about 5 hours to finish
    If you have 1 million reads ----- the run may take more or less the same time as for 100,000 reads
    If you have 40 million reads ----- the run may take about 16-18 hours to finish
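
To make the sub-linear scaling concrete, the rough figures above can be turned into reads per hour. This is only a back-of-the-envelope sketch: 17 hours is taken as the midpoint of the quoted 16-18 hour range, and 5 hours is used for 1 million reads because it is described as taking about the same time as 100,000 reads.

```python
# Approximate figures quoted above: (number of reads, hours to finish).
# 17.0 is the midpoint of the quoted 16-18 hour range (an assumption
# made here only to get a single number).
runs = [
    (100_000, 5.0),
    (1_000_000, 5.0),
    (40_000_000, 17.0),
]

for reads, hours in runs:
    print(f"{reads:>11,d} reads in ~{hours:4.1f} h "
          f"-> ~{reads / hours:>10,.0f} reads/hour")

# Compare the smallest and largest runs: how much more input,
# and how much more time.
input_ratio = runs[-1][0] / runs[0][0]
time_ratio = runs[-1][1] / runs[0][1]
print(f"{input_ratio:.0f}x the reads for ~{time_ratio:.1f}x the run time")
```

On these numbers, 400 times more input costs only about 3.4 times more run time, which is why a small test set is a poor predictor of how long a full data set will take.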

    I will post my latest results from the 6 million reads as soon as I have them.

    Meanwhile, please let me know what your requirements are:

    1. How many reads do you have in your real sequencing file?
    2. Are the reads from Illumina?
    3. Are you using total RNA-Seq or WGS?

    Thanks
    Chandra


  • yiweiny
    replied
    pathseq logs

    Hi, Chandra,
    Thanks for the advice. I don't know whether it is helpful to you or not, but here is part of the Hadoop log I captured from the master node in Ec2:


    rmr: cannot remove config: No such file or directory.
    rmr: cannot remove s3config: No such file or directory.
    rmr: cannot remove load: No such file or directory.
    Master data_loader
    11/07/08 20:27:30 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/mapper_data_compsub.py, /mnt/hadoop/hadoop-unjar3242713623624356081/] [] /tmp/streamjob4418070527408526594.jar tmpDir=null
    11/07/08 20:27:30 INFO mapred.FileInputFormat: Total input paths to process : 3
    11/07/08 20:27:31 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
    11/07/08 20:27:31 INFO streaming.StreamJob: To kill this job, run:
    11/07/08 20:27:31 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0001
    11/07/08 20:27:31 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0001
    11/07/08 20:27:32 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/08 20:27:44 INFO streaming.StreamJob: map 33% reduce 0%
    11/07/08 20:27:47 INFO streaming.StreamJob: map 67% reduce 0%
    11/07/08 20:27:48 INFO streaming.StreamJob: map 100% reduce 0%
    11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
    11/07/08 20:47:03 INFO streaming.StreamJob: Output: load

    real 19m33.828s
    user 0m2.365s
    sys 0m0.665s
    Master loader completed
    ERROR: Bucket 'ami-yiweijob6-stat' does not exist
    Bucket 's3://ami-yiweijob6-stat/' removed
    Bucket 's3://ami-yiweijob6-stat/' created
    ERROR: Bucket 'ami-yiweijob6-output' does not exist
    Bucket 's3://ami-yiweijob6-output/' removed
    Bucket 's3://ami-yiweijob6-output/' created
    File s3://reads-yiwei-regeneron/input1.local saved as '/usr/local/hadoop-0.19.0/input1.local' (75 bytes in 0.0 seconds, 4.17 kB/s)
    File s3://reads-yiwei-regeneron/input10.local saved as '/usr/local/hadoop-0.19.0/input10.local' (76 bytes in 0.0 seconds, 3.08 kB/s)
    File s3://reads-yiwei-regeneron/input11.local saved as '/usr/local/hadoop-0.19.0/input11.local' (76 bytes in 0.0 seconds, 2.79 kB/s)
    File s3://reads-yiwei-regeneron/input12.local saved as '/usr/local/hadoop-0.19.0/input12.local' (76 bytes in 0.0 seconds, 3.16 kB/s)
    File s3://reads-yiwei-regeneron/input2.local saved as '/usr/local/hadoop-0.19.0/input2.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
    File s3://reads-yiwei-regeneron/input3.local saved as '/usr/local/hadoop-0.19.0/input3.local' (75 bytes in 0.0 seconds, 2.59 kB/s)
    File s3://reads-yiwei-regeneron/input4.local saved as '/usr/local/hadoop-0.19.0/input4.local' (75 bytes in 0.0 seconds, 3.34 kB/s)
    File s3://reads-yiwei-regeneron/input5.local saved as '/usr/local/hadoop-0.19.0/input5.local' (75 bytes in 0.0 seconds, 2.99 kB/s)
    File s3://reads-yiwei-regeneron/input6.local saved as '/usr/local/hadoop-0.19.0/input6.local' (75 bytes in 0.0 seconds, 3.43 kB/s)
    File s3://reads-yiwei-regeneron/input7.local saved as '/usr/local/hadoop-0.19.0/input7.local' (75 bytes in 0.0 seconds, 3.18 kB/s)
    File s3://reads-yiwei-regeneron/input8.local saved as '/usr/local/hadoop-0.19.0/input8.local' (75 bytes in 0.0 seconds, 3.25 kB/s)
    File s3://reads-yiwei-regeneron/input9.local saved as '/usr/local/hadoop-0.19.0/input9.local' (75 bytes in 0.0 seconds, 3.46 kB/s)
    rmr: cannot remove test: No such file or directory.
    rmr: cannot remove maq: No such file or directory.
    Maq alignments + Duplicate remover
    11/07/08 20:47:10 WARN streaming.StreamJob: -jobconf option is deprecated, please use -D instead.
    packageJobJar: [/root/mapper_maqalignment.py, /root/Sam2Fastq.java, /root/FQone2Fastq.java, /root/Fastq2FQone.java, /root/removeduplicates_new.java, /root/MAQunmapped2FQone.java, /root/MAQunmapped2fastq.java, /mnt/hadoop/hadoop-unjar7827192869392733442/] [] /tmp/streamjob8065656650743401458.jar tmpDir=null
    11/07/08 20:47:10 INFO mapred.FileInputFormat: Total input paths to process : 12
    11/07/08 20:47:11 INFO streaming.StreamJob: getLocalDirs(): [/mnt/hadoop/mapred/local]
    11/07/08 20:47:11 INFO streaming.StreamJob: Running job: job_201107082024_0002
    11/07/08 20:47:11 INFO streaming.StreamJob: To kill this job, run:
    11/07/08 20:47:11 INFO streaming.StreamJob: /usr/local/hadoop-0.19.0/bin/../bin/hadoop job -Dmapred.job.tracker=hdfs://ip-10-82-215-242.ec2.internal:50002 -kill job_201107082024_0002
    11/07/08 20:47:11 INFO streaming.StreamJob: Tracking URL: http://ip-10-82-215-242.ec2.internal:50030/jobdetails.jsp?jobid=job_201107082024_0002
    11/07/08 20:47:12 INFO streaming.StreamJob: map 0% reduce 0%
    11/07/08 20:47:24 INFO streaming.StreamJob: map 8% reduce 0%
    11/07/08 20:47:25 INFO streaming.StreamJob: map 17% reduce 0%
    11/07/08 20:47:29 INFO streaming.StreamJob: map 33% reduce 0%
    11/07/08 20:47:30 INFO streaming.StreamJob: map 42% reduce 0%
    11/07/08 20:47:34 INFO streaming.StreamJob: map 58% reduce 0%
    11/07/08 20:47:35 INFO streaming.StreamJob: map 67% reduce 0%
    11/07/08 20:47:39 INFO streaming.StreamJob: map 75% reduce 0%

    This is from running 100,000 reads on 3 instances in EC2. I had to shut it down after 2 hours, as the processing did not seem likely to finish in a reasonable amount of time. I hope this log is useful for your troubleshooting. Thanks again for your help!

    Yi Wei
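
When comparing logs like the one above, it can help to pull each job's elapsed time straight from the `Running job` and `Job complete` lines. The sketch below is one way to do that, assuming the timestamps are in `yy/MM/dd HH:mm:ss` order; the function name `job_durations` is made up for this example. A job with no `Job complete` line, such as `job_201107082024_0002` in the excerpt above, is simply left out of the result.

```python
from datetime import datetime
import re

# Two lines copied from the log above (job_201107082024_0001).
log = """\
11/07/08 20:27:31 INFO streaming.StreamJob: Running job: job_201107082024_0001
11/07/08 20:47:03 INFO streaming.StreamJob: Job complete: job_201107082024_0001
"""

pattern = re.compile(
    r"^(\d\d/\d\d/\d\d \d\d:\d\d:\d\d) INFO streaming\.StreamJob: "
    r"(Running job|Job complete): (job_\d+_\d+)$"
)

def job_durations(text):
    """Return {job_id: timedelta} for jobs with both a start and an end line."""
    started, durations = {}, {}
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if not match:
            continue
        # Assumption: the date is year/month/day with a two-digit year.
        stamp = datetime.strptime(match.group(1), "%y/%m/%d %H:%M:%S")
        event, job_id = match.group(2), match.group(3)
        if event == "Running job":
            started[job_id] = stamp
        elif job_id in started:
            durations[job_id] = stamp - started[job_id]
    return durations

durations = job_durations(log)
for job_id, elapsed in durations.items():
    print(job_id, elapsed)
```

For the data loader job this gives 19 minutes 32 seconds, which agrees with the `real 19m33.828s` line in the log.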


  • pcs_murali
    replied
    Hi Yi,

    I will re-run it on the cloud and see how much time it will take.

    In our hands, other samples with 40 million reads run in 1 to 1.2 days.

    Meanwhile, please download the latest version from our website.

    I will get back to you as soon as possible.

    I greatly appreciate your comments.

    Thanks
    Chandra


  • yiweiny
    replied
    slow PathSeq

    Hi, Chandra,
    The version of PathSeq is 5.1. The data set is a sampling of 100,000 reads from the sample input files provided on the PathSeq web site. My problem is that there does not seem to be a difference whether I run it on 10 nodes or on 20 nodes. Both took a long time to run. I am concerned that the Hadoop cluster is not set up correctly.
    Thanks for the prompt reply and I am looking forward to hearing from you.

    Yi


  • pcs_murali
    replied
    Hi,

    Could you send me the version of Pathseq you are running?

    Also, is your dataset RNA-based or DNA-based?


    Thanks
    Chandra


  • yiweiny
    replied
    PathSeq is very slow on Ec2

    I also set up PathSeq on Ec2. I was able to run it but it was very slow. I tested a data set with 100,000 reads and it took 10 hours running on 10 instances. I would appreciate any advice you can give me.


  • pcs_murali
    replied
    Hi,

    The Amazon EC2 instances we used are Large instances. In our hands, total RNA sequencing data (30 to 50 million reads) from a GAIIx takes about 1 to 1.2 days to finish on 20 nodes running in parallel.

    We are working towards reducing these run times.

    Please let me know if you need more information on Pathseq.

    Thanks
    Chandra
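
For budgeting, the figures in this reply translate into instance-hours as follows. This is a rough sketch that assumes all 20 instances stay up for the whole run; the function name `instance_hours` is made up for this example, and no price is included because hourly rates change over time.

```python
def instance_hours(nodes, days):
    """Total instance-hours used by `nodes` instances running for `days` days."""
    return nodes * days * 24

# 20 nodes for 1 to 1.2 days, as quoted above.
low = instance_hours(20, 1.0)
high = instance_hours(20, 1.2)
print(f"{low:.0f} to {high:.0f} instance-hours per sample")
```

Multiplying these totals by the current hourly price of the chosen instance type gives an approximate cost per sample.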


  • dbrazel
    started a topic Seeking advice on PathSeq

    Seeking advice on PathSeq

    Hi all,

    I'm interested in using the PathSeq software and I was wondering if anyone had some advice on what sort of Amazon EC2 instances result in reasonable run times for full Illumina GAIIx or HiSeq data sets.

    Thanks in advance!
