I work in a company where we run many batch jobs, processing millions of records each day, and I’ve been thinking recently about all the machines that sit around every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.
As a PHP developer I’m going to use the tools that I use each day, namely Linux, MySQL, PHP, VirtualBox and Subversion (SVN). However, I hope this guide will adapt to other languages and technologies just as well.
The solution I provide will be very loosely based on the type of processing we’d need to achieve. This may not hold through the entire series, though, as I’ll change things for simplicity or to produce more interesting usage scenarios.
These virtualised environments will run on Windows machines, since this is what the majority of offices run. The processing that the office machines do should not interfere with staff using those machines, should require no maintenance at the machine, and should be easily deployable to new machines as they become available. Also, new virtual machines should not require any additional configuration, as that would greatly reduce the scalability of the grid and the ease with which it can be extended.
Why Deploy an Office Computing Grid?
Firstly, you may be thinking: why not just use a cloud computing resource such as Amazon’s EC2 platform? Well, there could be several reasons, for example:
- You won’t entrust certain data to a cloud computing environment
- You can’t put certain data into a cloud computing environment for legal reasons (e.g. data leaving the country, or sensitive data such as NHS records)
- You want to keep your processing units close and have full control over the hardware too
- You don’t have the project funds to run cloud instances
- Your office doesn’t have a connection to the internet and therefore it’s not possible to use a cloud resource
- You don’t like rain, clouds suggest rain, therefore you keep well away
I’m sure the list could continue, but I think that’s enough for now.
Advantages of an Office Computing Grid
Well, let’s do some maths (and in true physics style let’s make some sweeping assumptions). Imagine you have a big beefy processing server running 100 jobs per day. In your office you have 50 machines which are idle 16 hours a day, and each of these machines is 10% as powerful as your beefy processing server. (All results here are rounded down, to underestimate the performance increase.)
So, 1 machine * 10% power * 2/3 of the day = 0.067 of the server’s capacity, i.e. 1 desktop processing only in its idle time could process 6 full jobs per day (6.67, rounded down).
If you now scale this up it takes 15 idle desktops to process as many jobs per day as your main processing server does.
So in our pretend office of 50 machines the desktops add 3 full servers’ worth of capacity (50 / 15 = 3.3, rounded down). We could increase our processing power from 1 server up to the equivalent of 4 full processing servers, or in other words we could be processing 400 jobs per day instead of 100.
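The sums above can be checked with a few lines of code. This is only a sketch in Python (picked for brevity, it isn’t one of the tools used in these articles). The function name `grid_estimate` and its parameters are made up for illustration, and the inputs are just the sweeping assumptions from the text. Exact fractions are used so that every rounding-down is deliberate rather than a floating point accident.

```python
from fractions import Fraction
from math import ceil, floor


def grid_estimate(server_jobs_per_day, desktops, relative_power, idle_hours):
    """Estimate grid capacity, rounding down to underestimate the gain.

    relative_power is a desktop's power as a fraction of the server's
    (an assumption, not a measurement).
    """
    # Share of one server's daily output a single desktop can deliver
    per_desktop = relative_power * Fraction(idle_hours, 24)

    jobs_per_desktop = floor(server_jobs_per_day * per_desktop)
    desktops_per_server = ceil(1 / per_desktop)
    extra_servers = floor(desktops * per_desktop)
    total_servers = 1 + extra_servers  # the original server plus the grid

    return {
        "jobs_per_desktop": jobs_per_desktop,
        "desktops_per_server": desktops_per_server,
        "total_servers": total_servers,
        "total_jobs_per_day": total_servers * server_jobs_per_day,
    }


estimate = grid_estimate(
    server_jobs_per_day=100,
    desktops=50,
    relative_power=Fraction(1, 10),  # each desktop is 10% of the server
    idle_hours=16,
)
print(estimate["jobs_per_desktop"])     # 6
print(estimate["desktops_per_server"])  # 15
print(estimate["total_servers"])        # 4
print(estimate["total_jobs_per_day"])   # 400
```

Changing the inputs (say, 8 idle hours, or desktops at 25% of the server’s power) gives a quick feel for whether a grid is worth the effort in a particular office.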
Notice that, for no investment in new hardware, your company has just increased its batch processing capacity 4 times! Your power usage may go up, but in most offices I’ve been to the machines are left on overnight anyway, so you could even see this as a green initiative.
Other advantages are that investment in new (or updated) processing servers can be delayed if your office machines are sufficient, and that as you upgrade your office machines your office grid becomes more powerful automatically.
What do you need? (Or, more correctly, what did I use?)
- Idle office machines (in my case a spare old Windows XP laptop)
- VirtualBox (or other virtualisation software)
- A virtual machine with PHP and MySQL running on a cut-down OS; I’m calling these my LiMP servers :)
- Jobs to run
- Job server (can be another virtual machine somewhere)
The type of job that this system is designed to run is as follows:
- System receives a list of data upon which we need to match and return results
- Matching involves checking/searching several (fairly static) data sources
- Results from data sources may require further validation, merging, checking of additional data sources in response to results
- Data is returned with matching records, fully validated and processed
- Each record within a job is independent of the rest
So basically we’re looking at running jobs which require a mixture of database lookups and some number crunching, a fairly typical scenario in a business environment.
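Because each record within a job is independent of the rest, a job can be cut into chunks and the chunks handed to different workers in any order. Here is a minimal sketch of that idea in Python (chosen only to keep the example short; it is not the job system built in these articles). Everything in it is hypothetical: the `REFERENCE_DATA` lookup table stands in for a fairly static data source, `match_record` stands in for the real matching and validation, and local threads stand in for the grid machines.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical, fairly static data source keyed by postcode
REFERENCE_DATA = {"AB1 2CD": "Aberdeen", "M1 1AA": "Manchester"}


def match_record(record):
    """Match one record; it needs nothing from any other record."""
    town = REFERENCE_DATA.get(record["postcode"])
    return {**record, "town": town, "matched": town is not None}


def split_job(records, chunk_size):
    """Cut a job into independent chunks of at most chunk_size records."""
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


def process_chunk(chunk):
    """The work a single grid machine would do for one chunk."""
    return [match_record(record) for record in chunk]


def run_job(records, chunk_size=2, workers=4):
    """Process all chunks in parallel and merge the results."""
    chunks = split_job(records, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks
        chunk_results = list(pool.map(process_chunk, chunks))
    return [result for chunk in chunk_results for result in chunk]


job = [
    {"id": 1, "postcode": "AB1 2CD"},
    {"id": 2, "postcode": "ZZ9 9ZZ"},
    {"id": 3, "postcode": "M1 1AA"},
]
for row in run_job(job):
    print(row["id"], row["matched"])
```

In a real grid the chunks would travel over the network to the virtual machines rather than to threads, but the shape stays the same: split, process independently, merge.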
Grid solutions are not only advantageous for processing jobs of this type. Basically, any process which can be split into independent units can be run in parallel. See the Wikipedia article on Grid Computing for examples and more information; a couple of famous examples are SETI@home and BOINC. There are frameworks for running computing grids, and these are well worth looking into.
What will we achieve?
By the end of these articles I hope to show that deploying an office grid need not be hugely expensive or time consuming. I’m going to discuss:
- Setting up the job control system, job configuration
- Creating an appropriate processing virtual machine
- How to set up the system on a Windows machine
- Ensuring you are using the latest code and data
- Deployment and benchmarking
- Looking ahead
I’ll be building (OK, I built, then wrote this) an example application to test the concepts on a local machine using Windows XP and my ‘GridMachine’ virtual machine. My job control server will be my main machine, which runs Fedora 11.
This is in no way meant to be a fully working, robust system; it’s meant more as a demonstration and discussion, showing that these things can be achieved in a reasonably short space of time and at little cost. Please feel free to send me any comments, corrections, or improvements and I’ll do my best to keep this article updated to match.
In part 2 I will start by looking at the job control system, and look into how jobs should be configured in order to achieve the greatest amount of processing whilst ensuring that each job is processed without fail.