"Unrecognized option: -files" in hadoop streaming job

August 11, 2014 - hive-memory hadoop

I was recently working on an Elastic MapReduce Streaming setup that required copying a few Python files to the nodes in addition to the mapper and reducer.

After much trial & error, I ended up using the following .NET AWS SDK code to accomplish the file upload:

var mapReduce = new StreamingStep { 
  Inputs = new List<string> { "s3://<bucket>/input.txt" }, 
  Output = "s3://<bucket>/output/", 
  Mapper = "s3://<bucket>/mapper.py", 
  Reducer = "s3://<bucket>/reducer.py", 
}.ToHadoopJarStepConfig();

mapReduce.Args.Add("-files"); 
mapReduce.Args.Add("s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py");

var step = new StepConfig { Name = "python_mapreduce", ActionOnFailure = "TERMINATE_JOB_FLOW", HadoopJarStep = mapReduce };

// Then build & submit the RunJobFlowRequest

This generated the rather odd error:

ERROR org.apache.hadoop.streaming.StreamJob (main): Unrecognized option: -files

Odd, because -files most certainly is an option.

Prolonged googling later, I discovered the cause: -files is one of Hadoop's generic options, which must appear on the command line *before* the tool-specific options like -mapper and -reducer. Appending it after the streaming arguments means StreamJob's own parser sees it and rejects it. So -files needs to come first - but StreamingStep doesn't give me any way to change the order of the arguments. Or does it?
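For context, on a plain Hadoop CLI the working invocation would look roughly like this (the bucket paths are the same placeholders as above, and the streaming jar path varies by EMR AMI version, so treat both as assumptions):

```
hadoop jar /home/hadoop/contrib/streaming/hadoop-streaming.jar \
  -files s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py \
  -input s3://<bucket>/input.txt \
  -output s3://<bucket>/output/ \
  -mapper s3://<bucket>/mapper.py \
  -reducer s3://<bucket>/reducer.py
```

Move -files after the -mapper/-reducer options and you get exactly the "Unrecognized option" error above.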

I eventually realised I was being a bit dense. ToHadoopJarStepConfig() is a convenience method that just generates a regular HadoopJarStepConfig... which exposes the args as a List&lt;string&gt;. Change the code to this:

mapReduce.Args.Insert(0, "-files"); 
mapReduce.Args.Insert(1, "s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py");
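With the Insert calls, the argument list EMR hands to the streaming jar should end up ordered something like this (a sketch of the expected ordering based on the options StreamingStep generates, not captured SDK output):

```
-files s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py
-input s3://<bucket>/input.txt
-output s3://<bucket>/output/
-mapper s3://<bucket>/mapper.py
-reducer s3://<bucket>/reducer.py
```

The generic option now precedes the streaming-specific ones, which is what the parser requires.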

and everything is awesome.