Posts tagged: virtualbox

Office Grid Computing using Virtual environments – Part 3

By , Friday 4th December 2009 11:37 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day and I’ve been thinking recently about all the machines that sit around each and every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

In part 2 we looked at the jobs a server will run, and how jobs should be configured in order to achieve greatest amount of processing whilst ensuring that each job is processed without fail.

Setting up your worker – or LiMP server

The next step in the process is to set up your virtual workers. For this I’m going to use an installation of centOS using VirtualBox. I’m going to install mySQL and PHP on the server, also known as a LiMP (Linux, mySQL, PHP) Server  (I may have made that name up).

  • Install VirtualBox on your windows machine (follow link)
  • Download and install centOS (current version 5.3) within a created virtual machine

There’s no point me going to this there’s probably 1,000’s of great tutorials out there (ok, here’s one: Creating and Managing  centOS virtual machine under virtualbox). The important point to note I suppose is that I called my virtual machine GridMachine.

As far as my choices of virtualisation client and operating system go there is no big compelling reason for each choice. VirtualBox is something I use on my home machine and is supported by the three major operating systems. I chose centOS as its a good stable OS and I use it on my own web server. I am a great believer in the right tools for the job (although I’m applying ‘use the quickest and easiest for you’ mentality here), so if operating system X runs your code quicker and more efficiently use that instead :)

Importantly make sure that your VM uses DHCP, otherwise for each new virtual machine would need to be configured separately which is something we don’t want.By using DHCP we don’t need to configure network settings individually for worker machines, DHCP will hand out IPs for you. Therefore you can copy your virtual machine about the office without worrying about setting each one up (this improves scalability and reduces worker administration).

The process you should aim to achieve would be to obtain a new physical machine, install VirtualBox, and then pretty much deploy the virtual image without much else. It might be wise to setup all your workers on a different subnet so that you can at least see how many machines are running. You’ll also need to set up your machines on a long lease or unlimited lease DHCP.

How to run Jobs on the worker

This is an interesting area and there are several valid methods for processing jobs on the worker. Here I’ll just discuss the two most obvious:

  • Perpetually running script: A script, be it a shell script, or a PHP script is executed once on the worker and runs as part of an infinite loop. I’ve discounted this method as one crash of the script and potentially your workers will cease to run without some sort of intervention.
  • Cron based script execution: Every X minutes the cron daemon kicks off a call to your script to get things going. Without some checking this could lead to many many copies of your worker script running.

My decision was to go with cron which kicks off a shell script every 10 minutes.  My shell script performs the following tasks:

  1. Get a process list and grep this for ‘php’. If not found then continue.
  2. Call your job code, in my case this would be something PHP based
  3. Worker script completes its run
  4. Ready to go again on the next appropriate call

My bash script looks something like the following:

#!/bin/sh
if ps ax | grep -v grep | grep php > /dev/null
then
    echo "Job is currently processing, exit"
else
    echo "Job is not running, start now"
    php yourJobProcessingScript.php
fi

Note: the echo’s are almost completely pointless, but may help the next person who comes along to try and edit them.

That concludes the set up of the worker virtual machine, quick, simple, and easy to copy to each new piece of hardware that is received. The ‘cleverness’ of the grid system really isn’t in the visualised OS, its all to do with the code created to process jobs, the job configuration, and in making sure that the job runs when appropriate (i.e. when the host is idle).

Setting up Windows to Initialise Workers

The first task is to work out the command required to run the virtual machine from the windows command line. If you’ve installed virtualBox in the default location and you’ve named your worker GridMachine then the command required to load up your worker is:

"C:\Program Files\Sun\VirtualBox\VBoxManage.exe" startvm GridMachine

However to run the script in a ‘headless’ state we need to use:

"C:\Program Files\Sun\VirtualBox\VBoxHeadless.exe" -startvm GridMachine --vrdp=off

This will start the virtual machine without the GUI and allow it to save state gracefully. The second argument turns off RDP so it doesn’t conflict with windows RDP, or give you a message about listening on port 3389. The virtual machine name is cAsE sEnSiTiVe!

Next, we’ll need to set windows up to kick off our worker VM once the machine has been idle. To do this (on Windows XP) you’ll need to go Start -> All Programs -> Accessories -> System Tools -> Scheduled Tasks as below:

scheduled tasks

Next click on ‘Add Scheduled Task’ followed by browse to add a custom program. Navigate to your VBoxManage script and click ok. Schedule your task for any of the options (we’ll change this in a minute) and continue. After skipping the next screen windows will ask you who you want to run this task, I’d suggest either ‘Administrator’ or creating a new privileged user. Remember we don’t want to interfere with the standard staff account on the machine at any point. Click next and check show advanced options for this task.

To the end of the run textbox add our ‘startvm GridMachine‘ string and ensure that run only when logged in is left unticked. Visit the schedule task next and change the schedule drop down to the option ‘when idle’, choose the amount of time you’d like the machine to be idle before moving on to the next tab.

Finally untick the option which states stop the task if it has been running X amount of time, but do tick the option to stop the task if the machine is no longer idle.

schedule

That’s it then for the windows host setup!

Summary

In this part we have set up a virtual machine to act as a worker, as well as the way in which we call and execute our job processing scripts (for myself a PHP script). From here we look at how to set up our copies of windows to start up the virtual machine in headless mode when the computer becomes idle, and save its state when the user resumes usage of the machine. Hopefully at this point you’re seeing how simple it is to set up such a system and are itching to get some experiments going yourself!

Next time

In Part 4 we’ll be looking at using tools to ensure that you’re running the latest version of the code and data sources so that obtained results are always up-to-date with the latest business information and logic.

Office Grid Computing using Virtual environments – Part 1

By , Friday 4th December 2009 11:23 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day and I’ve been thinking recently about all the machines that sit around each and every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

As a PHP developer I’m going to use tools that I use each day namely, Linux, mySQL, PHP, VirtualBox and subversion (SVN). However I hope this guide will adapt to other languages and technologies just as well.

The solution I provide will be very loosely based on the type of processing we’d need to achieve however this may not be true through the entire article as I’ll change things for simplicity, or to produce more interesting usage scenarios.

These virtualised environments will run on windows machines since this is what the majority of offices run. The processing that the office machines do should not interfere with staff using those machines, should require no maintenance at the machine, and be easily deployable to new machines as they become available. Also, new virtual machines should not require any additional configuration as this greatly reduces the scalability and ease at which the grid system can be extended.

Why Deploy an Office Computing Grid?

Firstly you may be thinking,why not just use a cloud computing resource such as Amazon’s EC2 platform? Well the reasons could be several, for example:

  • You won’t entrust certain data to a cloud computing environment
  • You can’t put certain data into a cloud computing environment for legal reasons (e.g. data leaving the country), potentially for legal reasons, e.g. NHS records.
  • You want to keep your processing units close and have full control over the hardware too
  • You don’t have the project funds to run cloud instances
  • Your office doesn’t have a connection to the internet and therefore its not possible to use a cloud resource
  • You don’t like rain, clouds suggest rain, therefore you keep well away

I’m sure the list could continue, but I think that’s enough for now.

Advantages of an Office Computing Grid

Well, lets do some maths (and in true physics style lets make some sweeping assumptions). Imagine you have big beefy processing server running 100 jobs per day. In your office you have 50 machines which are idle 16 hours a day, each of these machines is 10% as powerful as your beefy processing sever. (All results here are rounded to underestimate performance increase).

So, 1 machine * 10% power * 2/3 time = 0.067 i.e. 1 desktop processing in idle time could process 6 full jobs per day.

If you now scale this up it takes 15 idle desktops to process as many jobs per day as your main processing server does.

So in our pretend office of 50 machines we could increase our processing power from 1 server up to 4 full processing servers, or we could be processing 400 jobs per day instead of 100.

Notice, for no investment in new hardware your company has just increased its batch processing capacity 4 times! Potentially you’re going to increase your power usage but from most office environments I’ve been to machines are generally left on overnight anyway, so you could see this as a green initiative.

Other advantages also mean that investment in new (or updated) processing servers can be delayed if your office machines are sufficient and that as you improve the power of your office machines your office grid becomes more powerful automatically.

Technologies

What you need? (or more correctly what did I use):

  • Idle office machines (in my case a spare old windows XP laptop)
  • VirtualBox (or another virtualisation client software)
  • A virtual machine with PHP, mySQL running  running a cut down OS, I’m calling these my LiMP servers :)
  • Jobs to run
  • Job server (can be another virtual machine somewhere)

Typical Jobs

The types of jobs that this system is designed to run is as follows:

  • System receives a list of data upon which we need to match and return results
  • Matching involves checking/searching several (fairly static) data sources
  • Results from data sources may require further validation, merging, checking of additional data sources in response to results
  • Data is returned with matching records, fully validated and processed
  • Each record within a job is independent of the rest

So basically we’re looking at running jobs which require a mixture of database lookups and some number crunching, a fairly typical scenario in a business environment.

Grid solutions are not only advantageous for processing jobs of this type. Basically, any process which can be split into independent units can be run in parallel. See this wikipedia for examples and more information: Grid Computing, but a couple of famous examples are Seti@Home and BIONC. There are frameworks for running computing grids, and these are well worth looking into.

What will we achieve?

By the end of these articles I hope to show that deploying an office grid need not be hugely expensive or time consuming. I’m going to discuss:

  • Setting up the job control system, job configuration
  • Creating an appropriate processing virtual machine
  • How to setup the system on a windows machine
  • Ensuring you are using the latest code and data
  • Deployment and benchmarking
  • Looking ahead

I’ll be building (ok I built, then wrote this) an example application to test the concepts on a local machine using windows XP and my ‘GridMachine’ virtual machine. My job control server will be my main machine which runs Fedora 11.

This is in no way meant to demonstrate a fully working robust system, its meant more of a demonstration and discussing showing that these things can be achieved in a reasonably short space of time and at little cost. Please feel free to send me any comments, corrections, or improvements and I’ll do my best to keep this article updated to match.

Next time

In part 2 I will start by looking at the job control system, and look into how jobs should be configured in order to achieve greatest amount of processing whilst ensuring that each job is processed without fail.

Office Grid Computing using Virtual environments – Part 2

By , Friday 4th December 2009 11:23 pm

Introduction

I work in a company where we run many batch jobs processing millions of records of data each day and I’ve been thinking recently about all the machines that sit around each and every day doing nothing for several hours. Wouldn’t it be good if we could use those machines to bolster the processing power of our systems? In this set of articles I’m going to look at the potential benefits of employing an office grid using virtualised environments.

In Part 1 I gave an overview of the system and technologies I will be using as well as discussed some of the potential reasons why you would want to create an office grid.

Job Control

If you’re going to be running jobs then you’re going to need some way to manage them. Your job control system (on your job server) needs to be really well thought out before even attempting to run an office grid. So firstly, what are the tasks for a job control system:

  • Hand out jobs upon request from workers
  • Tell workers what type of jobs to run
  • Track jobs
  • Ensure that jobs are only run once
  • Provide job data to workers, or at least tell them where to get it

The system also needs to be extensible, a solution that works for now in a single case may be extended to run several types of jobs as the business sees the worth in a grid solution. For example, jobs may gain priorities, more than one job type may exist (i.e. several code bases), eventually you may even run several different worker machines that are optimised for each type of job (although that does move away from the ‘generic worker’ idea). Always try to think about the future when developing systems, a short term vision can lead to longer term frustration and increased development time.

Job Server

We’re going to need somewhere to control our jobs from, this should be the only system in your grid that has a fixed resource locator, be that an IP address, host name, URL (using internal DNS), etc. This is because the workers need to know where to look for jobs, workers need to find the job control system (not the job control system find the workers).

The job server itself doesn’t really have a complicated task (in a basic system anyhow), it needs to store a list of jobs, hand out jobs, receive results, and subsequently store them for later retrieval. How these parts (such as ‘hand out jobs’) are defined can be very basic. Later on we can extend the system to include an administration interface to add, edit, delete, suspend jobs but this is beyond this exercise.

There is no reason whatsoever then that your job server could not be a virtual machine running within your main processing server provided it doesn’t drain too many resources from it. The job server however does need high availability, if it goes down on a Friday evening you’re going to lose a whole weekend of processing, potentially costing you a couple of weeks worth of processing time (when compared to your main processing server alone). You may want to consider putting your job server on a load balanced environment for high availability.

Basic Setup

The basic setup for our job server will consist of what I’m calling one of my LiMP servers (that is Linux, mySql, PHP). The code running on the  workers will actually work out what jobs it can run by interacting with with job control system databases. Later on we could create a web service and actually hand out jobs rather than having the workers do the hard work themselves, but for now we’ll continue using the KISS principle (Keep it Simple, Stupid!).

So, lets create three mySQL tables to deal with jobs. These will be `jobs`, `jobRecords`, and `jobResults`.

jobs table Here I’m using SQL Buddy a great little alternative to phpMyAdmin just because its easier to install on centOS (for others see: 10 Great alternatives to phpMyAdmin)

This table consists of 5 simple fields,

  • id: Uniquely identify the job
  • name: Could be a client reference, or any number of other identifiers
  • Status: You need to know where the job is at, e.g.
    • 0: Not started
    • 1: Picked up
    • 2: Completed
  • started_by: Who’s started doing the job? This isn’t entirely required but is a nice to have. I’d suggest tracking workers by their IP address on your network
  • started_at: When did the worker start the job? By tracking jobs that have not completed within X amount of time we know we need to pick up the job once again and start processing by another worker. Workers could stop processing/go offline for any number of reasons, power failure, crash, network loss, etc.

It is easy how this table could be extended with a few additional fields to allow for statistics tracking, a finish time column to see how long the job took, a counter to see how many workers picked up the job (obviously this needs to tend to 1), job priority, the list can go on and on. In more complex job scenarios it would be possible to specify how much memory the worker would need access to (and therefore only use suitable workers), or even what type of worker would be required.

Lets add a few example jobs:

example jobs

The next table again is quite simple to understand, these are our job records. They are linked to the main jobs table by a column `jobs_id`. The make up of this table very much depends on the data that you need to supply to your workers, lets make a very simple example where we have four columns:

  • id: ID of the record
  • name: Person’s name
  • address: Person’s address
  • jobs_id: The job ID that this record is linked to

The third and final table consists of a results table, it has much the same make up as our records table, and with the addition of some columns could be part of the records table:

  • job_record_id: Link the result to the job table
  • result: The result data

…and that’s all you need for job control! (albeit at a very basic level) In my case I’m pointed to another table where my data to process was located, but this could just as easily been a file, parameters to run simulation code, you name it.

Selecting a job

As stated previously, the workers will do our job management for us for now, so all we need to really do is find a job that needs processing and get the information. How would we do this? Well pick our job selection criteria and look for jobs, in SQL I did the following:

  1. Take any jobs that are not marked as complete but from our worker and reset them (substitute __ME__ with an identifier, easiest would be IP address):
    UPDATE `jobs` SET `status` = 0 WHERE `status` = 1 AND `started_by` = __ME__;
  2. Using our job selection criteria, select a job and tell the control system that this worker is dealing with it:
    UPDATE `jobs` SET `status` = 1, `started_by` = __ME__, `started_at` = NOW() WHERE `status` = 0 OR
    (`status` = 1 AND `started_at` > DATE_SUB(NOW(), INTERVAL X HOUR)) ORDER BY `id` ASC;

    By grabbing jobs that haven’t returned results in X amount of time we ensure that all jobs are run in the event of a worker crashing or going AWOL.

  3. Next grab the jobs details followed by the records themselves:
    SELECT * FROM `jobs` WHERE `started_by` = __ME__  LIMIT 1;
    SELECT * FROM `job_records` WHERE `id` = __JOBID__;

Upon completion of the job we insert our result records and mark the job as complete. Remember as jobs can suspend/resume at any time allow for some robustness in your script. It might be that the task suspends half way through updating the job control system, so checking the number of records in a job and the number of results saved back to the job control system would be a wise move.

In addition, whilst this demonstrates how jobs can be selected and managed from an SQL-query frame you should really be abstracting your job control so that if you decide to switch to using a web service, a file based system, XML, or any other number of systems it will not affect the code above it.

Job Configuration

The next aspect to consider is job size and configuration. By playing with job configuration we can strike an excellent balance between speed, process replication, and reliability. Take a couple of  scenarios:

  1. Jobs take 1 day each to run: This means that your workers need 15 days to process each job (remember 10% of the power for 2/3rds of the time). This is clearly not a wise configuration, your job size is way too big! It would take at least double the time to get a job processed should the initial worker go AWOL (time to pick up that it hasn’t returned a result plus reprocessing time). In an ideal you’d have at least one full job easily cleared by the end of each long idle period, that way you keep the jobs ticking over and at worst case a job would take two days to process should the first go missing.
  2. Jobs take 1 minute to run: This means that your workers take about 15 minutes to run each job. Whilst this may initially seem ideal, you gain additional work processing during lunch time, coffee breaks, meetings, etc this scenario puts strain on other areas of your system and introduces its own problems. For example, firstly your setup/processing time ratio is going to go right down, therefore losing system efficiency. Your network is going to be constantly streaming job information to the various workers frustrating staff who are dong their day to day work. You’re also going to put more strain on your job processing server as it has to dish out lots and lots of small pieces of work on a regular basis. Lastly, in this situation if your job server goes down you’re going to create a huge back log of uncompleted work whereas bigger jobs could of continued processing blissfully unaware that the job server was experiencing difficulties.

In reality there will be no one ideal configuration for your grid setup, much depends on the available resources, types of job, job turnaround time requirements, network capability, and so on. However some guidelines would be:

  • Size jobs so that each worker can get through at least 3-4 jobs in a period of 15 hours (the longest likely idle time period)
  • Play with the job size so that setup time becomes fairly insignificant compared to the processing time (bearing in mind the above point).
  • If a job doesn’t complete in double the amount of time (maybe less) you expect it to complete it assume that its gone AWOL and start processing it with another worker. This means you may have to wait up to three times the normal length of a job for it to complete (possibly longer if the subsequent job fails). You may want to reduce this time, but be careful not to reduce it too much as you may start duplicating processing tasks on a regular basis.
  • Jobs should be independent of outside requirements as much as possible. The job server, for example, should only be contacted at the start and end of every job.
  • Don’t saturate your network, this will have two negative effects, your daytime staff will find using the network frustrating and problems may be experienced with connections timing out a problem that will only get worse as you scale your grid.
  • Ensure jobs can run on your workers. If jobs become too memory intensive or disk space intensive jobs will start aborting and the only thing you’ll notice is a drop in number of jobs processed with no real reason why.

Submitting Results of a Job

When submitting the results of a job it is important to check that results have not been submitted by another worker, especially if the current worker has been dormant for some time.

When results are submitted ensure that the number of results matches the number of records within the job.

As stated previously, and can not be over emphasised, build fault tolerance into job retrieval and results submission. The workers can (and most likely will) go into suspend mode at the most inconvenient of times and this needs to be catered for. Also once again abstracting away your results submission will help cater for future changes to your job control system much easier to deal with.

Summary

In this section  we have looked at what a job control server needs to do and how to get a very basic system set up. We discussed how to retrieve a job from the control system and how best to configure jobs to get the most our of your office grid system. To finish, a paragraph or two on submitting results back to the job control server was presented.

  • A job control server manages jobs and ensures that all work units are completed
  • By abstracting your job select/results submission we can change the technology of the control server without much problems
  • Configure your jobs to ensure that they are run quickly and efficiently without putting too much pressure on your network infrastructure, and without duplicating processing tasks on a regular basis.
  • Ensure that you build fault tolerance and error checking  into your routines, workers can suspend and resume and the most inconvenient of times. Remember to check if results have already been submitted by another worker.

Next time

In part 3 we’ll create our virtual processing machine and set up our windows machines to become idle-time workers.

Panorama Theme by Themocracy

5 visitors online now
2 guests, 3 bots, 0 members
Max visitors today: 8 at 04:59 am UTC
This month: 31 at 09-06-2017 03:33 pm UTC
This year: 45 at 02-01-2017 10:28 pm UTC
All time: 130 at 28-03-2011 10:40 pm UTC