
Office Grid Computing using Virtual environments – Part 4

Friday 4th December 2009, 11:59 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day, and I’ve been thinking recently about all the machines that sit around doing nothing for several hours each and every day. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

In Part 3 we created our virtual processing machine and set up Windows machines to become idle-time workers.

Running the latest code

Inevitably, after creating your workers, business logic will change, bugs will be found, and faster, more efficient code will be produced, leaving your workers sat around processing data using old, smelly code. How then do we ensure that we’re always using the latest and greatest version of our processing scripts?

There are a few very easy ways we could do this; the trick, however, is to minimise the processing power and network traffic needed to achieve it. Let’s start with the simplest of solutions and improve it slowly over a couple of iterations.

The first method would be to simply connect to our job control server (via Samba, FTP, or similar) and pull down the latest version of the code. Not very efficient, but it will do the job. Let’s improve on that somewhat: how about creating an rsync script and running that each time instead? Alternatively, what about putting our latest processing script into Subversion, checking out the code initially, and then just updating it on each run (svn update)?
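As a quick illustration of the rsync option, something along these lines would do the trick (the host name and paths are placeholders for your own job control server and working copy):

# Pull the latest processing code from the job control server.
# Only files that have changed cross the network, keeping the per-run cost low.
rsync -az --delete jobserver:/srv/processing-code/ /path/to/working/copy/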

In the end we could end up with a shell script (called by cron every 10 minutes) which looks as simple as this:

#!/bin/sh
# Called by cron every 10 minutes: pull the latest code, then start a
# processing run, but only if the previous run has finished.
if ps ax | grep -v grep | grep yourJobProcessingScript.php > /dev/null
then
    echo "Job is currently processing, exit"
else
    echo "Job is not running, start now"
    cd /path/to/working/copy
    svn update
    php yourJobProcessingScript.php
fi

Now we can be sure that with each run we’re definitely running the latest code. We’re ensuring this by updating our code base before every run, while keeping network traffic down because only the file differences are transferred across our network.

In my demonstration setup, I did exactly as above. Subversion was installed on my job processing server and I simply pulled the latest code from a ‘worker’ branch using ‘svn update’. I also added a version number tag to my processing script which was returned to the database as part of the results. This way I could see that my code was being updated each time I copied my trunk into the worker branch, i.e. that I was definitely running the latest processing script.

Using the latest data

If your job processing makes use of data sources then at some point these are going to be updated too. Unless you call your data sources very infrequently, you’re going to flood your network with traffic as soon as your workers start running, bringing everything to a standstill. For my solution I decided that I’d like to move my data sources around with my VMs.

Hold your horses! What if my data sources are HUGE? Well, that really comes down to how much data we’re talking about. It may be more cost effective to install an additional, larger hard drive in each machine than to purchase an additional processing server; that’s a question of budget and is up to the business to decide. It may be that your data sources are so large that it’s simply unfeasible to keep that amount of data on your worker machines. In that case, what would you do? We could look at calling a local data server, but this might cause issues with the network, and a grid system such as this may become unrealistic to include in your office environment. You could also look into alternative running strategies, for example only calling your workers between 8pm and 6am each night and/or throttling data source requests.

Moving on, let’s say our data sources amount to 100 GB of data. That’s quite a bit of data to move around the network on an update, so how would we ensure that we have the latest copy in this case? Rsync is a possibility, but personally I think running your latest data source on your job processing server and setting it up as a replication master (with a nice long binary log) might be the way to go:
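As a rough sketch of the master side (the server-id, log retention, and credentials below are illustrative values only, not a recommendation):

# On the job control server: enable binary logging with a generous retention
# so that workers which have been offline for a while can still catch up.
cat >> /etc/mysql/my.cnf <<'EOF'
[mysqld]
server-id        = 1
log-bin          = mysql-bin
expire_logs_days = 14
EOF
# Restart mysqld, then create an account the workers can replicate with:
mysql -u root -p -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'%' IDENTIFIED BY 'slavepass';"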

By setting each of your workers up as a slave to the job control server, updates to your data sources will trickle down nicely to your workers without a huge increase in network activity (unless you perform a huge data update and all your workers kick in at once). This has an advantage over rsync in that you wouldn’t get a long pause before each job: as the database updates, the MySQL daemon on your worker continually updates its data while processing continues.

This is how I set up my demonstration server. To set up replication I followed the guide on the MySQL site (Setting up replication) and within 20 minutes I had my initial worker replicating the job control server’s dataset. For each additional worker, the replication settings carried over and worked each time the VM was copied.
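For reference, the slave side boils down to something like the commands below, run once on each worker VM. The host, credentials, and log coordinates are placeholders; the real log file and position come from SHOW MASTER STATUS on the job control server, and each worker also needs its own unique server-id in my.cnf.

# Point this worker at the job control server and start replicating.
mysql -u root -p <<'EOF'
CHANGE MASTER TO
  MASTER_HOST='jobserver',
  MASTER_USER='repl',
  MASTER_PASSWORD='slavepass',
  MASTER_LOG_FILE='mysql-bin.000001',
  MASTER_LOG_POS=4;
START SLAVE;
SHOW SLAVE STATUS\G
EOF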

Summary

In this section of the article we have looked at how easy and painless it is to keep your processing code up to date by using rsync or Subversion (SVN) to do the work, reducing network traffic at the same time. We also discussed how to keep your data source information up to date by allowing it to trickle down to each of your workers, ensuring that our office grid system keeps up with the latest business logic and information. There are obviously countless alternatives for performing these tasks, but these were two simple examples to show how easy a solution is to come by.

Next time

In the final part of this series, aptly named Part 5, we’ll discuss deploying this system. I’ll summarise what has been learned and what I managed to create.

Office Grid Computing using Virtual environments – Part 5

Friday 4th December 2009, 11:03 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day, and I’ve been thinking recently about all the machines that sit around doing nothing for several hours each and every day. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

In Part 4 we looked at using tools to ensure that we’re running the latest version of the code and data sources so that obtained results are always up-to-date with the latest business information and logic.

Pre-Deployment

Before deploying your grid system, if there’s one thing you do and one thing alone, it’s to benchmark your current system! No matter what you tell colleagues about how much extra work your system is going to do, unless you have numbers to back it up your guarantees are worth nothing. So:

  • How many records can you process currently? Per day? Per hour?
  • How long does it typically take to turn around a job?
  • How much more capacity do you have?

There are also additional questions:

  • If your processing server (or one of your processing servers) goes down, how will this affect your capabilities? Will you be crippled?
  • What advantages do you hope/expect to get from a grid system?
  • Are your office machines capable of running the jobs?
  • Can your jobs work (or be converted to work) in this style of running?

The last major point is to take your time on any major change like this. Update your processing code to work using the new methodology, then benchmark again. Possibly set up your processing server to run a virtual machine; after all, your processing server will just be another worker (a relatively powerful one). Allow the new process to settle.
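As a hedged example of answering the records-per-hour question above, a simple query against your results table does the job; the jobcontrol database and the table and column names (results, processed_at) are stand-ins for whatever your own schema uses:

# How many records were processed in each hour of the past week?
mysql -u root -p jobcontrol -e "
  SELECT DATE_FORMAT(processed_at, '%Y-%m-%d %H:00') AS hour,
         COUNT(*) AS records_processed
  FROM results
  WHERE processed_at > NOW() - INTERVAL 7 DAY
  GROUP BY hour
  ORDER BY hour;"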

Deployment

My suggestion would be to pop into the office one weekend and perform all the installations and setup. Do this just before a fortnight’s holiday and leave some other poor chap to deal with the consequences… maybe not…

Deployment for a system like this needs to be slow. Despite being relatively simple to set up, this system will affect your entire office infrastructure (well, the digital one). Firstly, roll out to a couple of machines at a time, and monitor network traffic and how the worker hosts perform on a day-to-day basis. You may need to alter your job configuration in response to your findings.

Once the system has settled with a few machines (let’s say 10% of all office machines, i.e. 5), keep monitoring network traffic and host machine performance. Next, benchmark again: you should now be processing 33% more jobs than in your first benchmarks. Check this is so, or that you’re at least in the ballpark; if not, investigate what is going on before moving on. Repeat this cycle until you happily have all office machines running without killing individual machine performance or grinding your network to a standstill.

At all times keep benchmarking, even after all deployments are made. Check how new code updates affect the speed of your system, and check that all workers are reporting in and processing jobs. Slowly (very slowly) tune your job configuration to get the best from your workers and network.
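A simple way to check that all workers are reporting in is to group recent results by worker; the jobs table and its worker_ip and completed_at columns here are hypothetical, not the schema used in this series:

# Which workers have checked jobs off in the last day, and when were they last seen?
mysql -u root -p jobcontrol -e "
  SELECT worker_ip, COUNT(*) AS jobs_done, MAX(completed_at) AS last_seen
  FROM jobs
  WHERE completed_at > NOW() - INTERVAL 1 DAY
  GROUP BY worker_ip;"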

Stop!

What if you want to stop your workers from running at some point? They are all out there running, regenerating, and trying their best to process data like hungry insects. The answer may seem obvious, but it’s worth adding just in case it’s overlooked: simply edit your processing script with an exit(0), die(), or some other statement to kill your processing job. This is an important reason why we always update to the latest processing script before any run!
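Putting that into practice might look something like the sketch below: commit an early exit into the worker branch on the job control server, and every worker disables itself at its next scheduled svn update. The repository URL and script name are placeholders.

# Check out the worker branch, add a kill switch, and commit it.
svn checkout http://jobserver/svn/branches/worker /tmp/worker-stop
cd /tmp/worker-stop
# Insert exit(0); near the top of the processing script (assumes <?php is on line 1):
sed -i '2i exit(0); // emergency stop' yourJobProcessingScript.php
svn commit -m "Emergency stop: halt all worker processing"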

Demonstration System

In order to write this set of short articles I created a very small grid to demonstrate the technologies and methodologies. I read lots of articles and tutorials, and used various tools to set up and monitor what was going on. By no means have I gone out and saturated a whole office with traffic, nor have I had access to a regular staff member’s PC to see how host performance was affected.

My demonstration system was very humble indeed. I used my regular desktop set up as a job control server. On this I had MySQL server installed and set up as a replication master, along with PHP and SVN served through Apache (for access from the worker VMs).

I then created a CentOS worker machine in VirtualBox on a six-year-old Windows XP laptop. After copying the VM onto the machine, I set up the scheduled tasks as specified earlier in the series and let it go.

The virtual machine was set up with PHP, Subversion, and MySQL. I checked out a branch named ‘worker’ from my job control server’s repository and made sure it could be updated using ‘svn update’. Next I set up MySQL as a slave and checked that data was replicating from MySQL on the job control server down to the worker VM. After all this I set up the shell script and the cron job.
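The one-off setup on the VM amounted to little more than this; the repository URL, paths, and wrapper script name (runJob.sh) are placeholders rather than my exact layout:

# Initial checkout of the worker branch onto the VM:
svn checkout http://jobserver/svn/branches/worker /path/to/working/copy
# Schedule the wrapper script from Part 4 to run every 10 minutes:
( crontab -l 2>/dev/null; echo "*/10 * * * * /path/to/working/copy/runJob.sh" ) | crontab -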

My processing script basically went along the lines of this (very simple stuff; a sketch of the sort of query involved follows the list):

  • Read in the name field
  • Counted the number of similar names in a table from the data source held on the VM
  • Counted the number of names as above but splitting the name by spaces (i.e. forename, middle, surname)
  • Repeated this process 1,000 times
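A query along these lines gives a flavour of that kind of name-count lookup; the names table, full_name column, database name, and credentials are stand-ins rather than the actual schema used:

# Count rows in the replicated data source with a similar name.
NAME="Joe Bloggs"
mysql -u worker -p'secret' datasource -e \
  "SELECT COUNT(*) FROM names WHERE full_name LIKE '%${NAME}%';"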

Each job took approximately 20 minutes to run. At one point I opened several copies of the worker VM on the Windows laptop and watched the jobs being checked off against each of the worker IP addresses. At this point I also confirmed that replication automatically restarted.

Leaving the laptop to idle resulted in a worker starting to process jobs from the job control server. When resuming laptop usage there was a delay of about 30-60 seconds; this is a fair amount of time, and staff would need to be made aware that their machine may pause for a short while when they return to it. Newer machines may not pause for this long. The amount of processing performed by these machines during idle periods would more than outweigh staff members having to wait a short period (say a minute) on arriving at their machines in the morning (I frequently wait longer than this for a Windows Defender update to take place), provided they were made aware of it (a useful time to grab a morning coffee!).

Overall I feel confident that I have demonstrated the technologies that could be used to create such a system. I have shown that such a system does work on a (very) small scale and, with some more experimenting, could be scaled up to utilise the resources of an office’s machines. If I don’t get to the point of doing this I would be very interested to know/see when someone else does.

Conclusions / Evaluation

The next obvious step would be to get a real-world example and start to deploy a system such as this within an office environment and see what happens. Asking a business to commit to this without a trail-blazing company to prove the technology and its effectiveness may be a little difficult. Grid/distributed computing is very popular in some circles and has some large applications (BOINC, SETI@home, Folding@home, etc.). I did not, however, find a smaller-scale, simple system like this in my searches that could be rolled out within an office environment.

I created a basically free system using mostly open source software and tools available in almost any office. The technologies were demonstrated and shown to perform and work as expected. Hopefully I have shown that with not much work and a very simple setup you can deploy an office grid computing system that is powerful, cheap, and scalable all at the same time.

Once a system is up and running there is almost no end to the amount of customisation and improvement you can make. For example, statistics and benchmarking can easily be added, showing the worth of such a system every day. New machines can be added quickly and easily as and when they arrive, with upgrades to existing hardware bolstering your processing power.

I hope you’ve enjoyed reading this series of articles and it’s given you food for thought on running an office grid system. The solution presented here won’t necessarily work in all situations but should be adaptable to allow you to get your data processing done using your own solution.

Please feel free to send me any comments, corrections, or improvements and I’ll do my best to keep this article updated to match.
