"Unrecognized option: -files" in hadoop streaming job
August 11, 2014 - I was recently working on an Elastic MapReduce Streaming setup that needed a few supporting Python files copied to the nodes in addition to the mapper/reducer.
After much trial & error, I ended up using the following .NET AWS SDK code to accomplish the file upload:
var mapReduce = new StreamingStep {
    Inputs = new List<string> { "s3://<bucket>/input.txt" },
    Output = "s3://<bucket>/output/",
    Mapper = "s3://<bucket>/mapper.py",
    Reducer = "s3://<bucket>/reducer.py",
}.ToHadoopJarStepConfig();

mapReduce.Args.Add("-files");
mapReduce.Args.Add("s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py");

var step = new StepConfig {
    Name = "python_mapreduce",
    ActionOnFailure = "TERMINATE_JOB_FLOW",
    HadoopJarStep = mapReduce
};

// Then build & submit the RunJobFlowRequest
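For reference, the submission looked something along these lines - a rough sketch only: the instance configuration, log bucket and client setup below are placeholder assumptions, not the values from the original job.

// Sketch only: needs the Amazon.ElasticMapReduce and Amazon.ElasticMapReduce.Model
// namespaces; instance types/counts and the log URI are placeholders.
var request = new RunJobFlowRequest
{
    Name = "python_mapreduce_flow",
    LogUri = "s3://<bucket>/logs/",
    Instances = new JobFlowInstancesConfig
    {
        MasterInstanceType = "m1.small",
        SlaveInstanceType = "m1.small",
        InstanceCount = 3,
        KeepJobFlowAliveWhenNoSteps = false
    },
    Steps = new List<StepConfig> { step }
};

var emr = new AmazonElasticMapReduceClient(); // credentials/region picked up from config
var response = emr.RunJobFlow(request);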
This generated the rather odd error:
ERROR org.apache.hadoop.streaming.StreamJob (main): Unrecognized option: -files
Odd, because -files most certainly is an option.
Prolonged googling later, I discovered that the -files option needs to come first: Hadoop treats -files as a generic option, and generic options have to appear before the streaming-specific ones. However, StreamingStep doesn't give me any way to change the order of the arguments - or does it?
I eventually realised I was being a bit dense. ToHadoopJarStepConfig() is a convenience method that just generates a regular HadoopJarStepConfig... which exposes the args as a List<string>. Change the code to this:
mapReduce.Args.Insert(0, "-files");
mapReduce.Args.Insert(1, "s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py");
and everything is awesome.
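If you want to sanity-check the ordering before submitting, the args list is easy to dump; the exact options after -files depend on what ToHadoopJarStepConfig() emits, so the comment below is only an illustration.

// Quick check that -files now leads the argument list.
Console.WriteLine(string.Join(" ", mapReduce.Args));
// Prints something like:
// -files s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py
//     -input s3://<bucket>/input.txt -output s3://<bucket>/output/
//     -mapper s3://<bucket>/mapper.py -reducer s3://<bucket>/reducer.py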