I work in a company where we run many batch jobs processing millions of records of data each day, and I’ve been thinking recently about all the machines that sit around every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.
In part 3 we created our virtual processing machine and set up Windows machines to become idle-time workers.
Running the latest code
Inevitably, after you create your workers, business logic will change, bugs will be found, and faster, more efficient code will be written, leaving your workers sat around processing data with old, smelly code. How, then, do we ensure that we’re always using the latest and greatest version of our processing scripts?
There are a few simple ways we could do this; the trick, however, is to keep processing overhead and network traffic down while doing so. Let’s start with the simplest solution and improve it over a couple of iterations.
The first method would be to simply connect to our job control server (via Samba, FTP, or similar) and pull down the latest version of the code. Not very efficient, but it will do the job. Let’s improve on that somewhat: how about creating an rsync script and using that each time instead? Alternatively, what about putting our latest processing script into Subversion, checking out the code initially, and then just updating it on each run (svn update)?
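For the rsync route, a minimal sketch might look like the following. The host name jobcontrol, the server path /srv/grid/scripts/ and the variable names CODE_SOURCE and CODE_DEST are placeholders of mine rather than anything from the demonstration setup, and it assumes the worker can reach the job control server over SSH.

```shell
#!/bin/sh
# Hypothetical locations: point these at your own job control server and worker paths.
CODE_SOURCE=${CODE_SOURCE:-jobcontrol:/srv/grid/scripts/}
CODE_DEST=${CODE_DEST:-/path/to/working/copy/}

# Mirror the server's script directory onto the worker.
# -a keeps permissions and timestamps, -z compresses data in transit,
# --delete removes local files that no longer exist on the server.
# The trailing slash on the source copies the directory contents, not the directory itself.
sync_code() {
    rsync -az --delete "$CODE_SOURCE" "$CODE_DEST"
}
```

Calling sync_code at the start of each run would bring the worker in line with the server while only sending the files that have changed.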
In the end we could end up with a bash script (called by cron every 10 minutes) which looks as simple as this:
#!/bin/sh
if ps ax | grep -v grep | grep php > /dev/null
then
    echo "Job is currently processing, exit"
else
    echo "Job is not running, start now"
    cd /path/to/working/copy
    svn update
    php yourJobProcessingScript.php
fi
Now we can be sure that with each run we’re definitely running the latest code. We’re ensuring this by updating our code base each and every time we perform a run and reducing network traffic by only transferring the file differences across our network.
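One caveat with the ps and grep check is that it treats any PHP process on the worker as a running job. If that ever becomes a problem, a lock directory is one possible alternative. This is only a sketch: the run_exclusively function and the LOCK_DIR path are names I have made up for illustration, and a crashed job would leave the lock behind until someone removes it.

```shell
#!/bin/sh
# Hypothetical lock location; any path writable by the cron user would do.
LOCK_DIR=${LOCK_DIR:-/tmp/grid-worker.lock}

# Run the given command only if no other run currently holds the lock.
# mkdir is atomic, so two cron invocations cannot both create the directory.
run_exclusively() {
    if mkdir "$LOCK_DIR" 2>/dev/null; then
        "$@"
        status=$?
        rmdir "$LOCK_DIR"      # release the lock once the command has finished
        return "$status"
    fi
    echo "Job is currently processing, exit"
    return 1
}
```

The cron entry could then call something like run_exclusively sh -c 'cd /path/to/working/copy && svn update && php yourJobProcessingScript.php'.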
In my demonstration setup, I did exactly as above. Subversion was installed on my job processing server and I simply pulled the latest code from a ‘worker’ branch using ‘svn update’. I also added a version number tag to my processing script which was returned to the database as part of the results return. This way I could see that my code was being updated each time I copied my trunk into the worker branch i.e. that I was definitely running the latest processing script.
Using the latest data
If your job processing makes use of data sources, then at some point these are going to be updated too. Unless you call your data sources very infrequently, you’re going to flood your network with traffic as soon as your workers start running, bringing everything to a standstill. For my solution, I decided that I’d like to move my data sources around with my VMs.
Hold your horses there! What if my data sources are HUGE? Well, this really is a case of how much data we are talking about. It may be more cost effective to install an additional, larger hard drive into each machine than to purchase an additional processing server. This is a question of budget and is up to the business to decide. It may be that your data sources are so large that it’s just unfeasible to keep that amount of data on your worker machines. In that case, what would you do? Well, we could look at calling a local data server, but this might cause issues with the network, and a grid system such as this may then become unrealistic to include in your office environment. It may also be that you can look into alternative running strategies, for example only calling your workers between 8pm and 6am each night and/or throttling data source requests.
Moving on, let’s say our data sources amount to 100GB of data. Well, yes, that’s quite a bit of data to move around the network on an update. How would we ensure that we have the latest copy of the data in this case? Rsync is a possibility, but personally I think running your latest data source on your job processing server and setting it up as a master in replication (with a nice long bin log) might be the way to go.
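The overnight running strategy could be enforced in the worker script itself. Here is a rough sketch, where the in_work_window function and the START_HOUR and END_HOUR variables are illustrative names of my own and the 20 and 6 simply mirror the 8pm to 6am example.

```shell
#!/bin/sh
# Illustrative window boundaries (24-hour clock): start at 8pm, stop at 6am.
START_HOUR=20
END_HOUR=6

# Succeed when the given hour (default: the current hour) falls inside the window.
# The window wraps past midnight, so the hour must be late enough OR early enough.
in_work_window() {
    hour=${1:-$(date +%H)}
    hour=${hour#0}          # strip a leading zero so "08" is read as 8
    hour=${hour:-0}         # a bare "0" was stripped to nothing, so restore it
    [ "$hour" -ge "$START_HOUR" ] || [ "$hour" -lt "$END_HOUR" ]
}
```

A worker could then begin with in_work_window || exit 0 so that daytime cron runs quietly do nothing.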
With each of your workers set up as a slave to the job control server, updates to your data sources will trickle down nicely to your workers without a huge increase in network activity (that is, unless you perform a huge data update and all your workers kick in at once). This has an advantage over rsync in that you don’t get a long pause before each job: as the database updates, the MySQL daemon on your worker continually updates its data while processing continues.
This is how I set up my demonstration server. To set up replication I followed the guide on the MySQL site (Setting up replication) and within 20 minutes I had my initial worker replicating the job control server’s dataset. For each additional worker, the replication settings carried over and worked each time the VM was copied.
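If you want a worker to confirm that it has caught up before starting a job, one option is to inspect the output of SHOW SLAVE STATUS. The sketch below is not part of my demonstration setup: the replica_is_current function and its default 60-second threshold are assumptions, and newer MySQL releases rename both the statement and its fields, so adjust the field names to match your version.

```shell
#!/bin/sh
# Reads the text of "SHOW SLAVE STATUS\G" on stdin.
# Succeeds only if both replication threads are running and the reported lag
# is no more than the allowed number of seconds (first argument, default 60).
replica_is_current() {
    max_lag=${1:-60}
    status=$(cat)
    io=$(printf '%s\n' "$status" | awk -F': ' '/Slave_IO_Running:/ {print $2}')
    sql=$(printf '%s\n' "$status" | awk -F': ' '/Slave_SQL_Running:/ {print $2}')
    lag=$(printf '%s\n' "$status" | awk -F': ' '/Seconds_Behind_Master:/ {print $2}')
    [ "$io" = "Yes" ] && [ "$sql" = "Yes" ] || return 1
    # A NULL or missing lag value means replication is not actually running.
    case $lag in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$lag" -le "$max_lag" ]
}
```

On a worker this might be used as mysql -e 'SHOW SLAVE STATUS\G' | replica_is_current 120, skipping the run if the check fails.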
In this section of the article we have looked at how easy and painless it is to keep your processing code up to date by using rsync or Subversion (SVN) to do the work and reduce network traffic at the same time. We also discussed how to keep your data source information up to date by allowing it to trickle down to each of your workers. Thus we are ensuring that we keep up with business logic and information in our office grid system. There will obviously be countless alternatives for performing these tasks, but these were two simple examples to show how easy a solution is to come by.
In the final part of this series, aptly named Part 5, we’ll discuss deploying this system for. I’ll summarise what has been learned and what I managed to create.