Version 0.2.2 Released!
After a long silence, Neptune is back with new features and a minor version bump! This time around, we add support for batch jobs - specifically, when you call babel(), you can pass it either a hash (one job) or an array of hashes (many jobs). One use case we've found for this is to support MapReduce-style workflows. Consider the traditional WordCount example - with babel and our new batch support, it would look like the following:
NUM_MAPPERS = 100

# Build the list of map tasks; every entry reuses the same parameter hash.
def make_n_tasks(params)
  tasks = []
  NUM_MAPPERS.times { |i|
    tasks << params
  }
  return tasks
end

# Parameters shared by the map and reduce jobs.
common_params = {
  :storage => "s3",
  :is_remote => true,
  :bucket_name => "neptune-testbin",
  :run_local => true,
  :engine => "executor-sqs",
  :instance_type => "m2.4xlarge"
}
puts "Started at #{Time.now}"
puts "Starting Shakespeare Wordcount, with #{NUM_MAPPERS} map tasks and 1 reduce task"
STDOUT.flush
map_params = common_params.dup
map_params[:code] = "/neptune-testbin/babel/home/cgb/neptune/scripts/benchmarks/wordcount/wc.py"
map_params[:argv] = ["/neptune-testbin/babel/home/cgb/neptune/scripts/benchmarks/shakespeare.txt"]
map_tasks = babel(make_n_tasks(map_params))
puts "\n\nMap Numbers:\n"
puts "batch at #{Time.now}"
STDOUT.flush
outputs = []
map_tasks.each { |t|
  puts "total execution time is #{t.total_execution_time}"
  puts "total storage time is #{t.total_storage_time}"
  puts "time to read from queue is #{t.queue_pop_time}"
  puts "time to store inputs is #{t.time_to_store_inputs}"
  puts "input storage time is #{t.input_storage_time}"
  puts "output storage time is #{t.output_storage_time}"
  puts "total task execution time is #{t.total_task_time}"
  puts "fetch finished at #{Time.now}\n\n"
  outputs << t.job_data['@output']
  STDOUT.flush
}
puts "done with map, starting reduce"
STDOUT.flush
reduce_params = common_params.dup
reduce_params[:code] = "/neptune-testbin/babel/home/cgb/neptune/scripts/benchmarks/wordcount/reduce.py"
reduce_params[:argv] = outputs
reduce_task = babel(reduce_params)
puts "total execution time is #{reduce_task.total_execution_time}"
puts "total storage time is #{reduce_task.total_storage_time}"
puts "time to read from queue is #{reduce_task.queue_pop_time}"
puts "time to store inputs is #{reduce_task.time_to_store_inputs}"
puts "input storage time is #{reduce_task.input_storage_time}"
puts "output storage time is #{reduce_task.output_storage_time}"
puts "total task execution time is #{reduce_task.total_task_time}"
puts "fetch finished at #{Time.now}\n\n"
puts "MR finished at #{Time.now}"
This example runs 100 mappers and a single reducer, where EC2 is used to do computation, S3 is used for storage, and SQS is used to hold the tasks themselves. It's all automatically managed for you, so you don't need to know anything about EC2/S3/SQS to use this system - just write the code above and you're good to go! A lot of the extra code above gets pretty in-depth on cloud profiling (since I love to know exactly where the time is being spent), but goes to show you exactly what babel can do. Right now, the performance of batch jobs isn't that great (read: it sucks) because each job results in multiple SOAP calls, which in aggregate is really slow. We're working on fixing this for 0.2.3, so stay tuned!
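One thing to note about the listing above: make_n_tasks pushes the same params hash NUM_MAPPERS times, so every mapper receives identical arguments. If you would rather give each mapper its own input, a minimal sketch could look like the following. The helper name make_map_tasks and the input paths are made up for illustration and are not part of Neptune.

```ruby
# Hypothetical helper (not part of Neptune): build one parameter hash per
# input, so each map task gets its own copy and its own argument list.
def make_map_tasks(common_params, code, inputs)
  inputs.map do |input|
    task = common_params.dup   # shallow copy, so tasks do not share one hash
    task[:code] = code
    task[:argv] = [input]
    task
  end
end

common = { :storage => "s3", :engine => "executor-sqs" }
# Placeholder paths, assumed for this example only.
inputs = ["/neptune-testbin/part-0.txt", "/neptune-testbin/part-1.txt"]
tasks = make_map_tasks(common, "/neptune-testbin/wc.py", inputs)

puts tasks.length
puts tasks[1][:argv].inspect
```

The resulting array could then be handed to babel() the same way the output of make_n_tasks is.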
P.S. We've moved over to github - pull requests gladly considered!
Version 0.2.1 Released!
Version 0.2.0 (released just yesterday!) added super-awesome support for babel function calls, in which you just say "run this code and get me the output" without having to write a Neptune job for each of those functions. But it only worked for jobs with :type => babel (a new job type I have yet to fully blog about), and I did promise that you could use this automation for other job types soon. And soon is apparently very soon, by which I mean now! Here's the nice short and sweet changelog for this version:
Adds support for the babel() method on non-Babel job types (e.g., MPI, MapReduce), and removes the is_remote parameter for babel() calls (since it can always be inferred from :storage). Adds more unit tests accordingly for babel() calls, and completes documentation per rdoc standards.
But while I think 93% code coverage on babel.rb and 100% documentation per rdoc standards is amazing, you as a user likely could not care less. So let's get back to what I enticed you with in the first place: automatically running MPI jobs "in the cloud" with a single call to babel(). And let's use an open source example so you can follow at home if you like. Today we'll be running an implementation of the Graph500 benchmark, written with the Knowledge Discovery Toolkit (KDT). KDT code is written in Python and automatically runs over MPI (within AppScale, since we have all the KDT prerequisites installed there), so our MPI job type will run it automatically. To run the Graph500 benchmark and print the results, we just download the benchmark code, included with KDT (in the examples directory), and run the following code:
puts babel(
  :type => "mpi",
  :code => "/home/boo/kdt/examples/Graph500.py",
  :executable => 'python',
  :procs_to_use => 1,
  :nodes_to_use => 1
)
The above code assumes that we downloaded KDT to our home directory (/home/boo) and uses the same MPI job parameters as a regular MPI Neptune job. This results in at least three Neptune jobs - one to upload the code, one to run it, and at least one to get the output (since it may not be done on the first try, we sleep and try again until it is done). If we already have the code uploaded somewhere, we can skip the first Neptune job with the following code:
puts babel(
  :type => "mpi",
  :code => "/bucketname/kdt/examples/Graph500.py",
  :executable => 'python',
  :procs_to_use => 1,
  :nodes_to_use => 1,
  :storage => "s3"
)
Both examples require some environment variables to be set. In the case where the code is stored locally (the first example), babel needs to know what bucket in the remote datastore to upload your code to (defaulting to AppScale's datastore), which is controlled by the :bucket_name parameter or BABEL_BUCKET_NAME environment variable. In the case where the code is stored remotely, if Amazon S3 is used, neptune needs to know your EC2 access key and secret key, and babel will funnel that info along in the usual fashion (just export them as environment variables or parameters in the job above). So now that you know how to make running your code easier than ever, upgrade to version 0.2.1 and get coding!
Version 0.2.0 Released!
Neptune does a lot of super-cool stuff already, and makes it really easy to run your code on other machines without you needing to know the specifics. But could it be made easier? That is, if I want to actually run some code and get the result, I have to:
1. Run a Neptune job to put the code in the datastore
2. (may be optional) Run a Neptune job to put any inputs my code needs in the datastore.
3. Run a Neptune job to actually run the code.
4. Run a Neptune job to get the output of the job, polling for its completion.
The fact that I don't have to worry about how to configure machines to run MPI, NFS, and so on is amazing, but as a programmer I'm super-duper lazy and was pretty sure we could get the four steps above down to one step. And with Neptune 0.2.0, it's now possible! We have a new keyword / library call, babel, that automates all the steps above (more on the name choice in a later post). It uses the wonderful promise RubyGem to use the nicest implementation of futures I've seen in a language, so that you (as a user) don't even know that what babel calls return is a future.
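Step 4 in the list above (polling for completion) is the part that is most tedious to write by hand. Here is a minimal sketch of such a loop in plain Ruby. The helper poll_until_done and the stand-in fetch lambda are hypothetical, and this is not how babel itself is implemented.

```ruby
# Hypothetical polling helper: call the block until it returns a non-nil
# result, sleeping between attempts. Raises if the job never finishes.
def poll_until_done(max_attempts, delay_seconds)
  max_attempts.times do
    result = yield
    return result unless result.nil?
    sleep(delay_seconds)
  end
  raise "job did not finish after #{max_attempts} attempts"
end

# Stand-in for "get the output of the job": nil twice, then the output.
attempts = 0
fetch_output = lambda do
  attempts += 1
  attempts < 3 ? nil : "job output"
end

output = poll_until_done(5, 0.01) { fetch_output.call }
puts "got #{output.inspect} after #{attempts} attempts"
```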
Let's illustrate this with an example. I've got some Python code from my friend who does biochemical simulations, and he wants to run one simulation (one instance of the code) somewhere not on his machine, which for the sake of argument, we will call "the cloud", and more specifically, AppScale. How do I run it with our new babel support? Just like this:
# Run our Python DFSP code
def run_sim()
  result = babel(:code => '/mybucket/gproteincycle.py',
                 :storage => 's3',
                 :executable => 'python',
                 :is_remote => true)
  # Print all the metadata it returns, which start with # signs
  result.split("\n").each { |line|
    # start_with? is also safe on empty lines, where line[0] would be nil
    if line.start_with?("#")
      puts line
    end
  }
end
run_sim()
The first half of the code (the Babel call) gets converted to Neptune jobs that will store the code and inputs in a remote datastore, run the code in AppScale, and get the output of the job back. In our example, uploading the code isn't necessary, since we already say it's in Amazon S3, but if we change :is_remote to false and point :code at a location on our local filesystem, then Neptune will upload the code for us in the usual fashion.
Like I said above, babel calls return a future, and the promise gem's implementation will automatically spawn up a new thread to store the inputs, run the job, and get the output back. If we attempt to read the value of the result (like we do right after the call), then it blocks until the job completes. Our simulation code returns information about the biochemical species involved (not too interesting in our case) as well as metadata on the job (e.g., info about how many reactions were performed). In our case, the output is a big long string, with the metadata stored on lines that start with a hash (#). So we just run the job and look through the output for lines that start with a hash, and print those out to the screen.
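If you want the metadata as data rather than printed straight to the screen, the same filter can be pulled into a small function. This is only a sketch: metadata_lines is a made-up name, and the sample string below is invented for illustration, not real simulation output.

```ruby
# Return only the metadata lines (those starting with '#') from job output.
def metadata_lines(output)
  output.split("\n").select { |line| line.start_with?("#") }
end

# Invented sample output, for illustration only.
sample = "# reactions: 1200\nspeciesA 0.25\n\n# runtime: 3.5\nspeciesB 0.75"
puts metadata_lines(sample)
```

Because the function takes a plain string, it can be exercised without running a job at all.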
So that's pretty cool, but what if I want to run 1000 simulations and print the same metadata? All I have to do is this:
1000.times { |i|
run_sim()
}
I don't have to worry about assigning unique output names like I normally have to do with Neptune jobs - the babel call will take care of that for us as well. I don't have to worry about scheduling the jobs, where they run, and so on - it just looks like code that runs locally, and that's all it should look like.
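One caveat worth sketching: since reading a future blocks until the job completes, a loop that reads each result right after starting the job would wait for one simulation before launching the next. Starting all the jobs first and reading afterwards lets them overlap. The sketch below uses plain Ruby threads as a stand-in for babel calls (Thread#value blocks much like reading a future); it does not use the promise gem, and start_sim is a hypothetical name.

```ruby
# Stand-in for a babel call: the thread does the "work", and Thread#value
# blocks until it is done, much like reading a future.
def start_sim(id)
  Thread.new do
    sleep(0.05)   # pretend the simulation takes a while
    "# sim #{id} reactions: #{id * 10}\nspecies data"
  end
end

futures = (1..4).map { |id| start_sim(id) }   # launch everything first
outputs = futures.map { |f| f.value }         # then block on each result

# Print the first (metadata) line of each output.
outputs.each { |out| puts out.lines.first }
```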
So in my humble opinion, this warrants a major version upgrade from what Neptune previously offered, so with that, here's Neptune 0.2.0! There are a number of other changes under-the-hood (e.g., test cases for most of the Babel support, refactoring of methods that were in Kernel to NeptuneHelper), but a discussion of that can wait for another day. Get coding - it's easier than ever before!
P.S. See Version 0.2.1 for info on how to use babel on other job types (e.g., MPI).
Version 0.1.3 Released!
Today's release adds support for two new types of computation: the Knowledge Discovery Toolkit (KDT) and Cicero. KDT lets you write Python code to analyze graphs, and that code is automatically converted to MPI, so you can write code quickly and have it run just as fast. We support the newest version available (0.2-preview), and as it requires ATLAS and LAPACK, the newest AppScale build on my research branch (lp:~cgb-cs/appscale/main-cgb-research) automatically installs it. It also installs numpy and scipy (as required by KDT), which have been long requested for Google App Engine apps (although I still have to verify that we don't erase the PYTHONPATH or otherwise prevent them from being used). Here's a sample script that utilizes the Graph500 benchmark from KDT (which assumes you already have the input in S3 in a bucket named 'neptune-testbin'):
puts neptune(
  :type => "kdt",
  :output => "/neptune-testbin/mpi-output13.txt",
  :code => "/neptune-testbin/kdt-test/Graph500.py",
  :procs_to_use => 16,
  :nodes_to_use => 16,
  :storage => "s3",
  :EC2_ACCESS_KEY => "your access key here",
  :EC2_SECRET_KEY => "your secret key here",
  :S3_URL => "https://s3.amazonaws.com"
).inspect
We also add support for Cicero in this release. Cicero is a framework that allows for automatic task execution over Google App Engine and AppScale, so you just write a function in Python, Java, or Go, use oration to make an App Engine app out of it, and then Cicero can execute it for you automatically. You just tell Cicero how many executions you want (so this assumes your app is embarrassingly parallel), use the cicero job type, and you are good to go! A paper with Cicero is in the works, check back for more details! Here's a sample script you can use for Cicero:
response = neptune :type => :cicero,
  :nodes_to_use => {"cloud1" => "http://myapp.appspot.com"},
  :tasks => 10000,
  :function => "dfsp",
  :output => "/output/dfsp/"
puts response.inspect
If you use the research branch specified above, the MPI support will now automatically download all the files in whatever directory you want to run code from, whereas before it only downloaded the one file you wanted to execute (making it impossible to include libraries and header files). So get coding and let us know what you think of it!
Version 0.1.2 Released!
After a bit of a delay we have a new version of Neptune out! This time around there aren't any new features, but we have added a unit test suite to get around the horribly long time that the integration tests were taking. We're down from about half an hour now to less than a second, and still at about the same code coverage levels. With that, update your Neptune gem and enjoy!
Version 0.1.1 Released!
And again we have a release! This time we have support for programs written in Go and R, so get coding! I've included sample scripts in samples/go and samples/r that run the usual "Hello world" programs, and changed the AppScale side of things not to spawn up a machine just to run these programs - most of the time they're small and fast enough not to impact the system's performance. Try it out and let me know what you think!
Also, this is the version that will be in AppScale 1.5 - pending any further releases of course :)
Version 0.1.0 Released!
And another version is out! This time we have a verbose flag - in the past, Neptune jobs would clutter up standard out with everything that was going on. Now, it only does this with the verbose flag. Set it to anything (e.g., :verbose => "blah") and you're good to go. We're packing this version as the version that will be in AppScale 1.5, so new AppScale users will get everything posted to this date and before. Enjoy!
Version 0.0.9 Released!
It's been a little while since our last release, but here it is! Version 0.0.9 adds support for the Stochastic Simulation Algorithm (SSA) via StochKit - just use the "ssa" job type. We'll do a post soon with the particulars of how to lay out your code and the like, so stay tuned!
Version 0.0.8 Released!
As promised, MapReduce support is once again working! Neptune 0.0.8 fixes this support, so when you use an input job to put your data into the underlying datastore, it will also put it into HDFS in case you want to use it for MapReduce later. The test suite includes test cases for regular Hadoop MapReduce via Java WordCount, and for Hadoop MapReduce Streaming via a Ruby implementation of the Embarrassingly Parallel NAS Benchmark.
Also, I forgot to mention back in the 0.0.7 release that Walrus support was fixed, so just like for Google Storage, you can run the following:
neptune(
  :type => "output",
  :storage => "walrus",
  :EC2_ACCESS_KEY => "your access key",
  :EC2_SECRET_KEY => "your secret key",
  :S3_URL => "http://ip of storage box/services/Walrus"
)
We also changed it so that for all the S3-like storage backends, you need to specify the URL starting with http, so keep that in mind when deploying jobs.
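If you want to catch a missing scheme before submitting a job, a quick local check is easy to write. This is a minimal sketch with a hypothetical helper name, not something the Neptune gem is known to provide.

```ruby
# Hypothetical pre-flight check: S3-like backends expect the :S3_URL value
# to start with an http or https scheme.
def valid_storage_url?(url)
  url.start_with?("http://", "https://")
end

puts valid_storage_url?("https://s3.amazonaws.com")
puts valid_storage_url?("commondatastorage.googleapis.com")
```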
Also, the test coverage is up to almost 87%, as we now cover many more failure conditions.
So update your Neptune gem and get coding!
Version 0.0.7 Released!
And we have a new version out! Neptune 0.0.7 adds quite a bit of stability compared to previous releases thanks to the use of automated testing via good old fashioned Test::Unit. We also run rcov to automatically see how much code we're covering in our tests and which code in particular we're missing. Right now we're at a little less than 65% coverage.
Our fancy new automated testing also revealed a number of tiny bugs to fix (a few around the auto-generation of makefiles) and a major one - when we added input jobs in 0.0.6, we wanted to use it to make job input / output chaining easier, but as a side-effect, it broke MapReduce jobs. These jobs need their input in HDFS when they start, and with all the different storage options we support, we weren't consistently putting the input in HDFS automatically. It's still something we're working out, but it's something we will fix for 0.0.8, so stay tuned for more updates from the world of Neptune!
Version 0.0.6 Released!
Yet another release is out! This time around we add support for "input" jobs. Previously, whenever we wanted to run a job, we had to copy the input over from our local machine or it had to already be in the underlying datastore. But if you just wanted to place a file in the datastore for later, it wasn't do-able. But now it is! Just run this:
result = neptune(
  :type => "input",
  :local => "get_mapreduce_output.rb",
  :remote => "/neptune-testbin/testscript.rb",
  :storage => "gstorage",
  :EC2_ACCESS_KEY => "your access key",
  :EC2_SECRET_KEY => "your secret key",
  :S3_URL => "commondatastorage.googleapis.com"
)
puts result
From our example above, we indicate where our local copy of the file is (here it's another piece of Neptune code) and where we should store it in the datastore (as these use the S3 naming convention, they should begin with a slash '/'). For the short-term, the bucket should already exist (this matters for Google Storage but not the others). This method call then returns a boolean value corresponding to whether or not the operation succeeded. So upgrade to Neptune 0.0.6 and check back soon for more updates!
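To make the naming convention concrete, here is a rough sketch of how a remote path such as the one above breaks down into a bucket and a key. The helper split_remote_path is hypothetical and only illustrates the S3-style convention; it is not taken from Neptune.

```ruby
# Hypothetical helper: split an S3-style remote path ("/bucket/key") into
# its bucket name and the key within that bucket.
def split_remote_path(remote)
  unless remote.start_with?("/")
    raise ArgumentError, "remote path must begin with '/'"
  end
  bucket, key = remote[1..-1].split("/", 2)
  [bucket, key]
end

bucket, key = split_remote_path("/neptune-testbin/testscript.rb")
puts "bucket=#{bucket} key=#{key}"
```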
Version 0.0.5 Released!
And we have a new release out! Neptune 0.0.5 adds support for alternative storage backends to be used when storing the results of Neptune jobs. Before, we always stored the output of Neptune jobs in the underlying database that AppScale uses (dubbed 'AppDB' in AppScale-speak). Now, you can store the results to Amazon S3, Eucalyptus Walrus, and Google Storage automatically!
Two different ways are available to make use of this support. If you like, you can manually specify your credentials when you run each Neptune job:
output = neptune(
  :type => "mpi",
  :output => "/neptune-testbin/mpi-output4.txt",
  :code => "cpi",
  :nodes_to_use => 1,
  :storage => "gstorage",
  :EC2_ACCESS_KEY => "your access key",
  :EC2_SECRET_KEY => "your secret key",
  :S3_URL => "commondatastorage.googleapis.com"
)
puts "job started? #{output[:result]}"
puts "message = #{output[:msg]}"
Alternatively, you can put your credentials in your environment (à la the Eucalyptus style) and Neptune will automatically pick them up:
output = neptune(
  :type => "mpi",
  :output => "/neptune-testbin/mpi-output4.txt",
  :code => "cpi",
  :nodes_to_use => 1,
  :storage => "s3"
)
puts "job started? #{output[:result]}"
puts "message = #{output[:msg]}"
For the moment, we don't have automated bucket creation when using Google Storage, so if you're using it, make sure to manually create your bucket ahead of time. We'll get it resolved soon!
The latest AppScale branch has the necessary support for Neptune 0.0.5, and when we release AppScale 1.5, it will have this support as well. Let us know if other storage backends would be preferable in your apps!
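For the curious, the environment-based option described above could work roughly like this. It is only a sketch: the helper name is made up, and it assumes the environment variables carry the same names as the job parameters (EC2_ACCESS_KEY, EC2_SECRET_KEY, S3_URL), which this post does not spell out.

```ruby
# Hypothetical sketch: fill in any credential the job did not set explicitly
# from the environment. Explicit parameters always win.
def with_env_credentials(params, env = ENV)
  merged = params.dup
  # Assumed variable names; they mirror the parameter names used above.
  [:EC2_ACCESS_KEY, :EC2_SECRET_KEY, :S3_URL].each do |key|
    value = env[key.to_s]
    merged[key] = value if merged[key].nil? && !value.nil?
  end
  merged
end

job = { :type => "mpi", :storage => "s3", :EC2_ACCESS_KEY => "explicit key" }
fake_env = { "EC2_ACCESS_KEY" => "env key", "EC2_SECRET_KEY" => "env secret" }
puts with_env_credentials(job, fake_env).inspect
```

Passing the environment in as an argument keeps the function easy to test with a plain hash.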
Version 0.0.4 Released!
A quick update once more! This time around, I fixed the syntax like we talked about back on the 0.0.3 release. Let's walk through the new syntax with an example. Let's suppose we want to compile some Unified Parallel C code and run it over its MPI backend. We begin by compiling the code:
result = neptune(
  :type => "compile",
  :code => "ring",
  :output => "/baz",
  :copy_to => "ring-compiled"
)
puts "out = #{result[:out]}"
puts "err = #{result[:err]}"
So here I've specified the type of job to run (a compilation job), where my code is located (in a folder named "ring"), and where to copy the compiled code to (a folder named "ring-compiled"). My "ring" folder has a Makefile in it that says:
all:
	/usr/local/berkeley_upc-2.12.1/upcc --network=mpi -o Ring Ring.c
The latest AppScale branch includes the UPC compiler, and the next release (1.5) will include it as well. So we compile our code and specify that the MPI backend should be used with the Neptune job / Makefile from above, and then can run our code over four nodes as follows:
output = neptune(
  :type => "mpi",
  :code => "ring-compiled/Ring",
  :nodes_to_use => 4,
  :procs_to_use => 8,
  :output => "/baz/output"
)
puts "job started? #{output[:result]}"
puts "message = #{output[:msg]}"
Since our UPC code is compiled to use the MPI backend, we specify MPI as the type of job to run, as well as the location of our compiled code and where the output should be placed. As usual, we also specify how many machines we want to run over, but a new feature in 0.0.4 (when paired with the latest and greatest AppScale) is the ability to specify how many processors are needed. Here we specify 8 processors over 4 nodes, so each machine will get two processors scheduled for computation.
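The arithmetic behind "8 processors over 4 nodes" is just an even split. The sketch below shows one way to compute it. The helper is hypothetical, and how AppScale actually places processes when the numbers don't divide evenly isn't covered here, so the remainder handling is an assumption.

```ruby
# Hypothetical sketch: spread procs_to_use as evenly as possible over
# nodes_to_use. Any remainder goes to the first nodes (an assumption).
def procs_per_node(procs_to_use, nodes_to_use)
  base, extra = procs_to_use.divmod(nodes_to_use)
  Array.new(nodes_to_use) { |i| i < extra ? base + 1 : base }
end

puts procs_per_node(8, 4).inspect   # [2, 2, 2, 2]
```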
That should give you just enough to get going on Neptune 0.0.4. Happy coding!
Version 0.0.3 Released!
So it turned out that while you could use the Neptune executable (e.g., "neptune file.rb"), you couldn't do a require 'neptune' from your own code and call Neptune jobs from there. Thus after a quick fix, Neptune 0.0.3 is out to fix this problem!
However, I still think it looks a bit weird to have code that says:
job "mpi" do @something = baz end
When in reality this may be cleaner:
neptune "mpi" do :something => baz end
An upcoming release may fix this, so stay tuned!
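To show how the two styles relate, here is a rough sketch of turning a block that sets instance variables into a plain parameter hash. The helper job_params is made up for this illustration and is not Neptune's implementation.

```ruby
# Hypothetical sketch: run a block against a scratch object and collect the
# instance variables it sets into a parameter hash.
def job_params(type, &block)
  scratch = Object.new
  scratch.instance_eval(&block)
  params = { :type => type }
  scratch.instance_variables.each do |ivar|
    name = ivar.to_s.sub("@", "").to_sym   # :@code -> :code
    params[name] = scratch.instance_variable_get(ivar)
  end
  params
end

params = job_params("mpi") do
  @code = "ring-compiled/Ring"
  @nodes_to_use = 4
end
puts params.inspect
```

The hash form makes the parameters explicit at the call site, which is the main argument for moving away from the block syntax.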
Version 0.0.2 Released!
In the spirit of "release early, release often," Neptune version 0.0.2 has been released! This release adds compilation support, so now you can specify a directory containing a Makefile, and Neptune will copy that directory over to an AppScale node, compile it, and copy it back to a specified folder on your computer. Here's a code snippet:
result = job "compile" do
  @code = "ring"
  @output = "/baz"
  @copy_to = "ring-compiled"
end
puts "out = #{result[:out]}"
puts "err = #{result[:err]}"
So here we have our code in a folder named "ring", would like it to be compiled, and have the result placed in a folder named "ring-compiled". The output parameter doesn't do anything right now, but an upcoming release will optionally store it in the underlying database in AppScale so that you won't be required to keep a copy of it on your computer at all times. Check it out and let us know what you think!
Version 0.0.1 Released!
Our first version of Neptune is out and free to use! It's in the RubyGems repo, so you can install it as easily as running
gem install neptune