Posts tagged: svn

Office Grid Computing using Virtual environments – Part 4

By , Friday 4th December 2009 11:59 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day and I’ve been thinking recently about all the machines that sit around each and every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

In part 3 we created our virtual processing machine and set up Windows machines to become idle-time workers.

Running the latest code

Inevitably, after creating your workers, business logic will change, bugs will be found, and faster, more efficient code will be produced, leaving your workers sat around processing data with old, smelly code. How, then, do we ensure that we're always using the latest and greatest version of our processing scripts?

There are a few very simple ways we could do this; the trick, however, is to keep the processing power and network traffic spent on it to a minimum. Let's start with the simplest of solutions and improve it over a couple of iterations.

The first method would be simply to connect to our job control server (via Samba, FTP, or similar) and pull down the latest version of the code. Not very efficient, but it will do the job. Let's improve on that somewhat: how about creating an rsync script and running that each time instead? Alternatively, what about putting our processing script into Subversion, checking the code out initially and then just updating it on each run (svn update)?
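As a rough illustration of the rsync option, something along these lines would transfer only the files that have changed since the last run (the host name and both paths are made-up examples):

# Pull only the changed files down from the job control server.
# 'jobserver' and the paths are placeholders for your own setup.
rsync -az --delete jobserver:/path/to/latest/code/ /path/to/working/copy/

For the rest of this part, though, I'll stick with the Subversion approach.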

In the end we could end up with a bash script (called by cron every 10 minutes) which looks as simple as this:

#!/bin/sh
# Only start a new job if one isn't already running, then make sure the
# working copy is up to date before kicking off the processing script.
# Note: this grep matches any process containing "php", so tighten the
# pattern if other PHP processes run on the worker.
if ps ax | grep -v grep | grep php > /dev/null
then
    echo "Job is currently processing, exit"
else
    echo "Job is not running, start now"
    cd /path/to/working/copy
    svn update
    php yourJobProcessingScript.php
fi
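For completeness, the matching crontab entry on the worker could be as simple as the following (the wrapper script name and log path are just examples):

# Run the wrapper script every 10 minutes and append its output to a log.
*/10 * * * * /path/to/runJob.sh >> /var/log/gridworker.log 2>&1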

Now we can be sure that with each run we’re definitely running the latest code. We’re ensuring this by updating our code base each and every time we perform a run and reducing network traffic by only transferring the file differences across our network.

In my demonstration setup, I did exactly as above. Subversion was installed on my job processing server and I simply pulled the latest code from a ‘worker’ branch using ‘svn update’. I also added a version number tag to my processing script, which was returned to the database as part of the results. This way I could see that my code was being updated each time I copied my trunk into the worker branch, i.e. that I was definitely running the latest processing script.
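I won't reproduce the PHP here, but as a rough sketch of a similar idea, the wrapper script could instead grab the working-copy revision with svnversion and hand it to the job script to report back alongside its results (the --code-version argument is purely hypothetical):

# Record which revision of the code this run used.
REVISION=$(svnversion /path/to/working/copy)
php yourJobProcessingScript.php --code-version "$REVISION"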

Using the latest data

If your job processing makes use of data sources then at some point these are going to be updated too. Unless you call your data sources on a very infrequent basis, you're going to flood your network with traffic as soon as your workers start running, bringing everything to a standstill. For my solution I decided that I'd like to move my data sources around with my VMs.

Hold your horses there! What if my data sources are HUGE? Well, this really comes down to how much data we're talking about. It may be more cost effective to install an additional, larger hard drive in each machine than to purchase an additional processing server; that's a question of budget and is up to the business to decide. It may be that your data sources are so large that it's just unfeasible to keep that amount of data on your worker machines. In that case, what would you do? We could look at calling a local data server, but this might cause issues with the network, and a grid system such as this may become unrealistic to include in your office environment. You could also look into alternative running strategies, for example only running your workers between 8pm and 6am each night and/or throttling data source requests.

Moving on, let's say our data sources amount to 100GB of data. Yes, that's quite a bit of data to move around the network on an update. How would we ensure that we have the latest copy of the data in this case? Rsync is a possibility, but personally I think running your latest data source on your job processing server and setting it up as a replication master (with a nice long bin log) might be the way to go:

By setting each of your workers up as a slave to the job control server, updates to your data sources will trickle down nicely to your workers without a huge increase in network activity (unless, that is, you perform a huge data update and all your workers kick in at once). This has an advantage over rsync in that you wouldn't get a long pause before each job; as the database updates, the MySQL daemon on each worker will continually pull in the changes while the processing continues.

This is how I set up my demonstration server. To set up replication I followed the guide on the MySQL site (Setting up replication) and within 20 minutes I had my initial worker replicating the job control server's dataset. For each additional worker, the replication settings and process worked every time the VM was copied.
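The exact statements depend on your MySQL version, but as a rough sketch (the host name, user, password, log file and position below are all placeholders; the real file and position come from SHOW MASTER STATUS on the job control server), the setup boils down to this:

# On the job control server (master), my.cnf needs binary logging enabled:
#   server-id = 1
#   log-bin   = mysql-bin
# On each worker (slave), set a unique server-id in its my.cnf, then point it at the master:
mysql -u root -p -e "CHANGE MASTER TO MASTER_HOST='jobserver', MASTER_USER='repl', MASTER_PASSWORD='secret', MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=98; START SLAVE;"

The replication user ('repl' here) also needs the REPLICATION SLAVE privilege granting on the master.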

Summary

In this section of the article we have looked at how easy and painless it is to keep your processing code up to date by letting rsync or Subversion (SVN) do the work, reducing network traffic at the same time. We also discussed how to keep your data sources up to date by allowing updates to trickle down to each of your workers, ensuring that our office grid system keeps up with changes to business logic and information. There will obviously be countless alternatives for performing these tasks, but these were two simple examples to show how easy a solution is to come by.

Next time

In the final part of this series, aptly named Part 5, we'll discuss deploying this system, and I'll summarise what has been learned and what I managed to create.

Office Grid Computing using Virtual environments – Part 1

By , Friday 4th December 2009 11:23 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day and I’ve been thinking recently about all the machines that sit around each and every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

As a PHP developer I'm going to use the tools that I use each day, namely Linux, MySQL, PHP, VirtualBox and Subversion (SVN). However, I hope this guide will adapt to other languages and technologies just as well.

The solution I provide will be very loosely based on the type of processing we'd need to achieve; however, this may not hold through the entire article, as I'll change things for simplicity or to produce more interesting usage scenarios.

These virtualised environments will run on Windows machines, since this is what the majority of offices run. The processing that the office machines do should not interfere with staff using those machines, should require no maintenance at the machine itself, and should be easily deployable to new machines as they become available. Also, new virtual machines should not require any additional configuration, as per-machine configuration greatly reduces the scalability and the ease with which the grid system can be extended.

Why Deploy an Office Computing Grid?

Firstly, you may be thinking: why not just use a cloud computing resource such as Amazon's EC2 platform? Well, the reasons could be several, for example:

  • You won’t entrust certain data to a cloud computing environment
  • You can't put certain data into a cloud computing environment for legal reasons (e.g. data that must not leave the country, such as NHS records)
  • You want to keep your processing units close and have full control over the hardware too
  • You don’t have the project funds to run cloud instances
  • Your office doesn't have a connection to the internet, and therefore it's not possible to use a cloud resource
  • You don’t like rain, clouds suggest rain, therefore you keep well away

I’m sure the list could continue, but I think that’s enough for now.

Advantages of an Office Computing Grid

Well, let's do some maths (and, in true physics style, let's make some sweeping assumptions). Imagine you have a big beefy processing server running 100 jobs per day. In your office you have 50 machines which are idle 16 hours a day, and each of these machines is 10% as powerful as your beefy processing server. (All results here are rounded down to underestimate the performance increase.)

So, 1 machine × 10% power × 2/3 of the day = 0.067 of the server's capacity, i.e. one desktop processing in its idle time could get through about 6 of those 100 jobs per day.

If you now scale this up, it takes roughly 15 idle desktops (15 × 0.067 ≈ 1) to process as many jobs per day as your main processing server does.

So in our pretend office of 50 machines (50 × 0.067 ≈ 3.3 extra servers' worth) we could increase our processing power from 1 server to roughly 4 full processing servers; put another way, we could be processing around 400 jobs per day instead of 100.

Notice that for no investment in new hardware your company has just quadrupled its batch processing capacity! Potentially you're going to increase your power usage, but in most office environments I've been to, machines are generally left on overnight anyway, so you could even see this as a green initiative.

There are other advantages too: investment in new (or upgraded) processing servers can be delayed if your office machines are sufficient, and as you improve the power of your office machines your office grid automatically becomes more powerful.

Technologies

What do you need? (Or, more correctly, what did I use?):

  • Idle office machines (in my case a spare old Windows XP laptop)
  • VirtualBox (or another virtualisation client software)
  • A virtual machine running a cut-down OS with PHP and MySQL; I'm calling these my LiMP servers :)
  • Jobs to run
  • Job server (can be another virtual machine somewhere)

Typical Jobs

The types of jobs that this system is designed to run are as follows:

  • System receives a list of data upon which we need to match and return results
  • Matching involves checking/searching several (fairly static) data sources
  • Results from data sources may require further validation, merging, checking of additional data sources in response to results
  • Data is returned with matching records, fully validated and processed
  • Each record within a job is independent of the rest

So basically we’re looking at running jobs which require a mixture of database lookups and some number crunching, a fairly typical scenario in a business environment.

Grid solutions are not only advantageous for processing jobs of this type; basically, any process which can be split into independent units can be run in parallel. See the Wikipedia article on Grid Computing for examples and more information; a couple of famous examples are SETI@home and BOINC. There are also frameworks for running computing grids, and these are well worth looking into.

What will we achieve?

By the end of these articles I hope to show that deploying an office grid need not be hugely expensive or time consuming. I’m going to discuss:

  • Setting up the job control system, job configuration
  • Creating an appropriate processing virtual machine
  • How to setup the system on a windows machine
  • Ensuring you are using the latest code and data
  • Deployment and benchmarking
  • Looking ahead

I'll be building (ok, I built, then wrote this) an example application to test the concepts on a local machine using Windows XP and my ‘GridMachine’ virtual machine. My job control server will be my main machine, which runs Fedora 11.

This is in no way meant to demonstrate a fully working, robust system; it's meant more as a demonstration and discussion showing that these things can be achieved in a reasonably short space of time and at little cost. Please feel free to send me any comments, corrections, or improvements and I'll do my best to keep this article updated to match.

Next time

In part 2 I will start by looking at the job control system, and look into how jobs should be configured in order to achieve the greatest amount of processing whilst ensuring that each job is processed without fail.
