Once you have access to Hadoop on Windows Azure enabled, you can run any Mahout sample on the head node. Here I run an original Apache Mahout (https://mahout.apache.org/) sample, derived from the clustering example on Mahout's website (https://cwiki.apache.org/confluence/display/MAHOUT/Clustering+of+synthetic+control+data).
Step 1: RDP to your head node and open the Hadoop command-line window.
You can launch MAHOUT here to verify that it runs.
Step 2: Download the necessary data file from the Internet:
Download the synthetic control data from https://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data and place it at c:\apps\dist\mahout\examples\bin\work\synthetic_control.data
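Once the file is in place, a quick sanity check helps: the dataset is 600 rows of 60 space-separated floating-point values (6 classes of 100 series each). Below is a minimal sketch of such a check; the parser is my own illustration, not part of Mahout, and it runs on a two-row stand-in string so it is self-contained (the same `parse` call works on the downloaded file's contents):

```python
# Sanity-check the synthetic_control.data format: each row is one time series
# of space-separated floats; the full file has 600 rows of 60 values each.
def parse(text):
    """Parse whitespace-separated rows of floats into a list of lists."""
    return [[float(v) for v in line.split()] for line in text.strip().splitlines()]

# A tiny stand-in for the real file so this sketch is self-contained.
sample = """28.7812 34.4632 31.3381
24.8923 25.7410 27.5532"""

rows = parse(sample)
print(len(rows), len(rows[0]))  # → 2 3 (the real file gives 600 60)
```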
Step 3: Go to the folder c:\apps\dist\mahout\examples\bin, run the command "build-cluster-syntheticcontrol.cmd", and select the desired clustering algorithm from the driver script.
c:\Apps\dist\mahout\examples\bin>build-cluster-syntheticcontrol.cmd
"Please select a number to choose the corresponding clustering algorithm"
"1. canopy clustering"
"2. kmeans clustering"
"3. fuzzykmeans clustering"
"4. dirichlet clustering"
"5. meanshift clustering"
Enter your choice:1
"ok. You chose 1 and we'll use canopy Clustering"
"DFS is healthy... "
"Uploading Synthetic control data to HDFS"
rmr: cannot remove testdata: No such file or directory.
"Successfully Uploaded Synthetic control data to HDFS "
"Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver org.apache.mahout.clustering.syntheticcontrol.canopy.Job
12/03/06 00:50:10 WARN driver.MahoutDriver: No org.apache.mahout.clustering.syntheticcontrol.canopy.Job.props found on classpath, will use command-line arguments only
12/03/06 00:50:10 INFO canopy.Job: Running with default arguments
12/03/06 00:50:17 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:50:18 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:50:20 INFO mapred.JobClient: Running job: job_201203052259_0001
12/03/06 00:50:21 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:51:00 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:51:11 INFO mapred.JobClient: Job complete: job_201203052259_0001
12/03/06 00:51:11 INFO mapred.JobClient: Counters: 16
12/03/06 00:51:11 INFO mapred.JobClient: Job Counters
12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=33969
12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:51:11 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:51:11 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:51:11 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 00:51:11 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/06 00:51:11 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:51:11 INFO mapred.JobClient: Bytes Written=335470
12/03/06 00:51:11 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_READ=130
12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_READ=288508
12/03/06 00:51:11 INFO mapred.JobClient: FILE_BYTES_WRITTEN=21557
12/03/06 00:51:11 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=335470
12/03/06 00:51:11 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:51:11 INFO mapred.JobClient: Bytes Read=288374
12/03/06 00:51:11 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:51:11 INFO mapred.JobClient: Map input records=600
12/03/06 00:51:11 INFO mapred.JobClient: Spilled Records=0
12/03/06 00:51:11 INFO mapred.JobClient: Map output records=600
12/03/06 00:51:11 INFO mapred.JobClient: SPLIT_RAW_BYTES=134
12/03/06 00:51:11 INFO canopy.CanopyDriver: Build Clusters Input: output/data Out: output Measure: org.apache.mahout.common.distance.EuclideanDistanceMeasure@1997c1d8 t1: 80.0 t2: 55.0
12/03/06 00:51:11 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:51:12 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:51:13 INFO mapred.JobClient: Running job: job_201203052259_0002
12/03/06 00:51:14 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:51:58 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:52:16 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 00:52:27 INFO mapred.JobClient: Job complete: job_201203052259_0002
12/03/06 00:52:27 INFO mapred.JobClient: Counters: 25
12/03/06 00:52:27 INFO mapred.JobClient: Job Counters
12/03/06 00:52:27 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30345
12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:52:27 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:52:27 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: Data-local map tasks=1
12/03/06 00:52:27 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=15968
12/03/06 00:52:27 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:52:27 INFO mapred.JobClient: Bytes Written=6615
12/03/06 00:52:27 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_READ=14296
12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_READ=335597
12/03/06 00:52:27 INFO mapred.JobClient: FILE_BYTES_WRITTEN=73063
12/03/06 00:52:27 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=6615
12/03/06 00:52:27 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:52:27 INFO mapred.JobClient: Bytes Read=335470
12/03/06 00:52:27 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:52:27 INFO mapred.JobClient: Reduce input groups=1
12/03/06 00:52:27 INFO mapred.JobClient: Map output materialized bytes=13906
12/03/06 00:52:27 INFO mapred.JobClient: Combine output records=0
12/03/06 00:52:27 INFO mapred.JobClient: Map input records=600
12/03/06 00:52:27 INFO mapred.JobClient: Reduce shuffle bytes=0
12/03/06 00:52:27 INFO mapred.JobClient: Reduce output records=6
12/03/06 00:52:27 INFO mapred.JobClient: Spilled Records=50
12/03/06 00:52:27 INFO mapred.JobClient: Map output bytes=13800
12/03/06 00:52:27 INFO mapred.JobClient: Combine input records=0
12/03/06 00:52:27 INFO mapred.JobClient: Map output records=25
12/03/06 00:52:27 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
12/03/06 00:52:27 INFO mapred.JobClient: Reduce input records=25
12/03/06 00:52:27 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/03/06 00:52:27 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 00:52:28 INFO mapred.JobClient: Running job: job_201203052259_0003
12/03/06 00:52:29 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 00:53:46 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 00:58:20 INFO mapred.JobClient: Job complete: job_201203052259_0003
12/03/06 00:58:20 INFO mapred.JobClient: Counters: 16
12/03/06 00:58:20 INFO mapred.JobClient: Job Counters
12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=30407
12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 00:58:20 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 00:58:20 INFO mapred.JobClient: Rack-local map tasks=1
12/03/06 00:58:20 INFO mapred.JobClient: Launched map tasks=1
12/03/06 00:58:20 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=0
12/03/06 00:58:20 INFO mapred.JobClient: File Output Format Counters
12/03/06 00:58:20 INFO mapred.JobClient: Bytes Written=340891
12/03/06 00:58:20 INFO mapred.JobClient: FileSystemCounters
12/03/06 00:58:20 INFO mapred.JobClient: FILE_BYTES_READ=130
12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_READ=342212
12/03/06 00:58:21 INFO mapred.JobClient: FILE_BYTES_WRITTEN=22251
12/03/06 00:58:21 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=340891
12/03/06 00:58:21 INFO mapred.JobClient: File Input Format Counters
12/03/06 00:58:21 INFO mapred.JobClient: Bytes Read=335470
12/03/06 00:58:21 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 00:58:21 INFO mapred.JobClient: Map input records=600
12/03/06 00:58:21 INFO mapred.JobClient: Spilled Records=0
12/03/06 00:58:21 INFO mapred.JobClient: Map output records=600
12/03/06 00:58:21 INFO mapred.JobClient: SPLIT_RAW_BYTES=127
C-0{n=21 c=[29.552, 33.073, 35.876, 36.375, 35.118, 32.761, 29.566, 26.983, 25.272, 24.967, 25.691, 28.252, 30.994, 33.088, 34.015, 34.349, 32.826, 31.053, 29.116, 27.975, 27.879, 28.103, 28.775, 30.585, 31.049, 31.652, 31.956, 31.278, 30.719, 29.901, 29.545, 30.207, 30.672, 31.366, 31.032, 31.567, 30.610, 30.204, 29.266, 29.753, 29.296, 29.930, 31.207, 31.191, 31.474, 32.154, 31.746, 30.771, 30.250, 29.807, 29.543, 29.397, 29.838, 30.489, 30.705, 31.503, 31.360, 30.827, 30.426, 30.399] r=[0.979, 3.352, 5.334, 5.851, 4.868, 3.000, 3.376, 4.812, 5.159, 5.596, 4.940, 4.793, 5.415, 5.014, 5.155, 4.262, 4.891, 5.475, 6.626, 5.691, 5.240, 4.385, 5.767, 7.035, 6.238, 6.349, 5.587, 6.006, 6.282, 7.483, 6.872, 6.952, 7.374, 8.077, 8.676, 8.636, 8.697, 9.066, 9.835, 10.148, 10.091, 10.175, 9.929, 10.241, 9.824, 10.128, 10.595, 9.799, 10.306, 10.036, 10.069, 10.058, 10.008, 10.335, 10.160, 10.249, 10.222, 10.081, 10.274, 10.145]}
Weight: Point:
...
1.0: [27.414, 25.397, 26.460, 31.978, 26.125, 27.463, 30.489, 34.929, 27.558, 30.686, 27.511, 32.269, 32.834, 27.129, 24.991, 32.610, 25.387, 32.674, 34.607, 33.519, 29.012, 28.705, 32.116, 29.121, 26.424, 33.452, 33.623, 29.457, 35.025, 26.607, 34.442, 34.847, 28.897, 34.439, 32.011, 34.816, 27.773, 11.549, 20.219, 19.678, 14.715, 14.384, 15.556, 9.573, 10.636, 16.639, 17.236, 19.643, 18.317, 15.323, 19.106, 11.455, 16.888, 18.269, 11.583, 1
12/03/06 00:58:24 INFO driver.MahoutDriver: Program took 493470 ms
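Each C-0{n=… c=[…] r=[…]} record above is a canopy: n is the number of points it holds, c its centroid, and r its per-dimension radius; the "Weight: Point:" listing that follows shows the member points. The canopy algorithm itself is simple and driven by the two thresholds in the log (t1: 80.0, t2: 55.0): a point within t1 of a canopy center joins that canopy, and a point within the tighter t2 is removed from further consideration. Below is a minimal sketch of that rule on toy 2-D data with made-up thresholds; it is my own illustration of the technique, not Mahout's implementation:

```python
import math

def canopy(points, t1, t2, dist):
    """Canopy clustering sketch (t1 > t2): points within t1 of a center join
    its canopy; points within t2 are removed from further consideration."""
    assert t1 > t2
    remaining = list(points)
    canopies = []  # list of (center, members)
    while remaining:
        center = remaining.pop(0)  # pick an arbitrary remaining point as center
        members = [p for p in remaining if dist(center, p) < t1] + [center]
        canopies.append((center, members))
        remaining = [p for p in remaining if dist(center, p) >= t2]
    return canopies

# Plain Euclidean distance, matching the EuclideanDistanceMeasure in the log.
euclid = lambda a, b: math.dist(a, b)

# Toy 2-D points and toy thresholds (the real job used t1=80.0, t2=55.0 on 60-D rows).
pts = [(0, 0), (1, 1), (2, 0), (50, 50), (51, 49)]
for center, members in canopy(pts, t1=5.0, t2=3.0, dist=euclid):
    print(center, len(members))  # → (0, 0) 3  then  (50, 50) 2
```

Because the result depends on which point is picked as a center first, canopy clustering is usually used as a cheap first pass to seed a more expensive algorithm such as k-means.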
After the Mahout job completed, the output was stored as below:
js> #ls
Found 3 items
drwxr-xr-x - avkash supergroup 0 2012-03-06 01:05 /user/avkash/.oink
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:52 /user/avkash/output
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:49 /user/avkash/testdata
js> #ls /user/avkash/output
Found 3 items
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:53 /user/avkash/output/clusteredPoints
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:52 /user/avkash/output/clusters-0
drwxr-xr-x - avkash supergroup 0 2012-03-06 00:51 /user/avkash/output/data
Now let's analyze the Mahout cluster output using the clusterdump utility.
The clusterdump utility takes 3 parameters:
- --seqFileDir – the folder containing the clustering sequence files (in this case output/clusters-0)
- --pointsDir – the folder containing the clustered points (in this case output/clusteredPoints)
- --output – the path where the analysis result should be written.
- Note that --output writes the analysis result as a text file on the local machine, not on HDFS.
Running the command as below:
c:\Apps\dist\mahout\examples\bin>mahout clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
"Running on hadoop, using HADOOP_HOME=c:\Apps\dist"
c:\Apps\dist\bin\hadoop jar c:\Apps\dist\mahout\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver clusterdump --seqFileDir output\clusters-0 --pointsDir output\clusteredPoints --output clusteranalyze.txt
12/03/06 21:05:53 WARN driver.MahoutDriver: No clusterdump.props found on classpath, will use command-line arguments only
12/03/06 21:05:53 INFO common.AbstractJob: Command line arguments: {--dictionaryType=text, --endPhase=2147483647, --output=clusteranalyze.txt, --pointsDir=output\clusteredPoints, --seqFileDir=output\clusters-0, --startPhase=0, --tempDir=temp}
12/03/06 21:05:55 INFO driver.MahoutDriver: Program took 2031 ms
Now if you open the folder on your local machine, you will find "clusteranalyze.txt".
Opening clusteranalyze.txt shows the cluster records and their points, in the same format as the console output above.
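Because clusteranalyze.txt is plain text, it can be post-processed with a few lines of script. The sketch below pulls the point count n out of each cluster record; the C-n{n=…} record layout is assumed from the console output shown earlier, and the two-record sample string stands in for the real file:

```python
import re

def cluster_sizes(text):
    """Extract (cluster_id, point_count) pairs from clusterdump-style records."""
    return [(m.group(1), int(m.group(2)))
            for m in re.finditer(r"(C-\d+)\{n=(\d+)", text)]

# Stand-in for the contents of clusteranalyze.txt (layout assumed from the
# console output above; centroid/radius vectors abbreviated for brevity).
sample = "C-0{n=21 c=[29.552, ...] r=[0.979, ...]}\nC-1{n=16 c=[...] r=[...]}"

print(cluster_sizes(sample))  # → [('C-0', 21), ('C-1', 16)]
```

In practice you would read the real file with `open("clusteranalyze.txt").read()` and pass that to `cluster_sizes` instead of the sample string.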
Cluster Dumper Reference:
- https://cwiki.apache.org/confluence/display/MAHOUT/Cluster+Dumper
- https://cwiki.apache.org/MAHOUT/cluster-dumper.html
Comments
- Anonymous
November 05, 2012
I ran into the following when trying to run Mahout on my Azure environment. I don't have much experience with Windows shell scripting, so please forgive me if it's something obvious:
c:\apps\dist\mahout\bin>mahout
Running here: c:\apps\dist\hadoop-1.1.0-SNAPSHOT\bin\hadoop jar c:\apps\dist\mahout\bin\..\mahout-examples-0.5-job.jar org.apache.mahout.driver.MahoutDriver
Usage: java [-options] class [args...]
           (to execute a class)
   or  java [-options] -jar jarfile [args...]
           (to execute a jar file)
[... the remainder of the standard java usage text followed ...]