“Unrecognized option: -files” in hadoop streaming job

I was recently working on an Elastic MapReduce Streaming setup, that required copying a few required Python files to the nodes in addition to the mapper/reducer.

After much trial & error, I ending up using the following .NET AWS SDK code to accomplish the file upload:

var mapReduce = new StreamingStep {
    Inputs = new List<string> { "s3://<bucket>/input.txt" },
    Output = "s3://<bucket>/output/",
    Mapper = "s3://<bucket>/mapper.py",
    Reducer = "s3://<bucket>/reducer.py",
}.ToHadoopJarStepConfig();

mapReduce.Args.Add("-files");
mapReduce.Args.Add("s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py");

var step = new StepConfig {
    Name = "python_mapreduce",
    ActionOnFailure = "TERMINATE_JOB_FLOW",
    HadoopJarStep = mapReduce
};

// Then build & submit the RunJobFlowRequest

This generated the rather odd error:

ERROR org.apache.hadoop.streaming.StreamJob (main): Unrecognized option: -files

Odd, because -files most certainly is an option.

Prolonged googling later, and I discovered that the -files option needs to come first. However, StreamingStep doesn’t give me any way to change the order of the arguments – or does it?

I eventually realised I was being a bit dense. ToHadoopJarStepConfig() is a convenience method that just generates a regular JarStep… which exposes the args as a List. Change the code to this:

mapReduce.Args.Insert(0, "-files");
mapReduce.Args.Insert(1, "s3://<bucket>/python_module_1.py,s3://<bucket>/python_module_2.py");

and everything is awesome.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s