Version 0.2.2 Released!
After a long silence, Neptune is back with new features and a minor version bump! This time around, we add support for batch jobs - specifically, when you call babel(), you can pass it either a hash (one job) or an array of hashes (many jobs). One use case we've found for this is to support MapReduce-style workflows. Consider the traditional WordCount example - with babel and our new batch support, it would look like the following:
NUM_MAPPERS = 100
def make_n_tasks(params)
tasks = []
NUM_MAPPERS.times { |i|
tasks << params
}
return tasks
end
common_params = {
:storage => "s3",
:is_remote => true,
:bucket_name => "neptune-testbin",
:run_local => true,
:engine => "executor-sqs",
:instance_type => "m2.4xlarge"
}
puts "Started at #{Time.now}"
puts "Starting Shakespeare Wordcount, with #{NUM_MAPPERS} map tasks and 1 reduce task"
STDOUT.flush
map_params = common_params.dup
map_params[:code] = "/neptune-testbin/babel/home/cgb/neptune/scripts/benchmarks/wordcount/wc.py"
map_params[:argv] = ["/neptune-testbin/babel/home/cgb/neptune/scripts/benchmarks/shakespeare.txt"]
map_tasks = babel(make_n_tasks(map_params))
puts "\n\nMap Numbers:\n"
puts "batch at #{Time.now}"
STDOUT.flush
outputs = []
map_tasks.each { |t|
puts "total execution time is #{t.total_execution_time}"
puts "total storage time is #{t.total_storage_time}"
puts "time to read from queue is #{t.queue_pop_time}"
puts "time to store inputs is #{t.time_to_store_inputs}"
puts "input storage time is #{t.input_storage_time}"
puts "output storage time is #{t.output_storage_time}"
puts "total task execution time is #{t.total_task_time}"
puts "fetch finished at #{Time.now}\n\n"
outputs << t.job_data['@output']
STDOUT.flush
}
puts "done with map, starting reduce"
STDOUT.flush
reduce_params = common_params.dup
reduce_params[:code] = "/neptune-testbin/babel/home/cgb/neptune/scripts/benchmarks/wordcount/reduce.py"
reduce_params[:argv] = outputs
reduce_task = babel(reduce_params)
puts "total execution time is #{reduce_task.total_execution_time}"
puts "total storage time is #{reduce_task.total_storage_time}"
puts "time to read from queue is #{reduce_task.queue_pop_time}"
puts "time to store inputs is #{reduce_task.time_to_store_inputs}"
puts "input storage time is #{reduce_task.input_storage_time}"
puts "output storage time is #{reduce_task.output_storage_time}"
puts "total task execution time is #{reduce_task.total_task_time}"
puts "fetch finished at #{Time.now}\n\n"
puts "MR finished at #{Time.now}"This example runs 100 mappers and a single reducer, where EC2 is used to do computation, S3 is used for storage, and SQS is used to hold the tasks themselves. It's all automatically managed for you, so you don't need to know anything about EC2/S3/SQS to use this system - just write the code above and you're good to go! A lot of the extra code above gets pretty in-depth on cloud profiling (since I love to know exactly where the time is being spent), but goes to show you exactly what babel can do. Right now, the performance of batch jobs isn't that great (read: it sucks) because each job results in multiple SOAP calls, which in aggregate is really slow. We're working on fixing this for 0.2.3, so stay tuned!
P.S. We've moved over to github - pull requests gladly considered!

