Introduction
I work at a company where we run many batch jobs processing millions of data records every day, and I have recently been thinking about all the machines that sit around every single day doing nothing for hours. Wouldn't it be good if we could use those machines to boost the computing power of our systems? In this series of articles I look at the potential benefits of deploying an office grid using virtualised environments.
In part 1 I gave an overview of the system and the technologies I will use, and discussed some of the potential reasons why you might want to set up an office grid.
Job Control
If you want to run jobs, you will need some way to control them. Your job control system (on your job server) needs to be really well thought out before you even attempt to run an office grid. So first, what are the tasks of a job control system?
- Hand out jobs on request from workers
- Tell workers what type of job to run
- Track jobs
- Ensure that jobs are only run once
- Give job data to workers, or at least tell them where to get it
The system also needs to be extensible: a solution that works now for a single case may later be extended to run several types of job because the business sees value in a grid solution. For example, jobs could be given priorities, more than one job type could exist (i.e. multiple code bases), and ultimately you could even run several different worker machines, each optimised for one type of work (although that moves away from the 'generic worker' idea). Always try to think of the future when developing systems; a short-term vision can lead to longer-term frustration and increased development time.
Job Server
We will need somewhere to control our jobs from. This should be the only system in your grid that has a fixed resource locator, be that an IP address, hostname, URL (using internal DNS), etc. This is because the workers need to know where to look for jobs: the workers find the job control system, not the other way round.
The job server itself doesn't have a complicated task (in a basic system, anyhow): it needs to store a list of jobs, hand out jobs, receive results, and store them for later retrieval. How these parts (such as 'hand out jobs') are defined can be very basic. Later on we can extend the system with an administration interface to add, edit, delete and suspend jobs, but that is beyond this exercise.
There is no reason, then, why your job server could not be a virtual machine running within your main processing server, provided it doesn't drain too many resources from it. The job server does, however, need high availability: if it goes down on a Friday evening you're going to lose a whole weekend of processing, potentially costing you a couple of weeks' worth of processing time (when compared to your main processing server alone). You may want to consider putting your job server in a load-balanced environment for high availability.
Basic setup
The basic setup for our job server will consist of what I'm calling one of my LiMP servers (that is, Linux, MySQL, PHP). The code running on the workers will actually work out which jobs it can run by interacting with the job control system's database. Later on we could create a web service and actually hand out jobs rather than having the workers do the hard work themselves, but for now we'll continue using the KISS principle (Keep It Simple, Stupid!).
So, let's create three MySQL tables to deal with jobs. These will be `jobs`, `job_records`, and `job_results`.
Here I'm using SQL Buddy, a great little alternative to phpMyAdmin, simply because it's easier to install on CentOS (for others, see "10 Great alternatives to phpMyAdmin").
The `jobs` table consists of five simple fields:
- id: Uniquely identify the job
- name: Could be a client reference, or any number of other identifiers
- status: You need to know where the job is at, e.g.
  - 0: Not started
  - 1: Picked up
  - 2: Completed
- started_by: Who started the job? This isn't strictly required but is nice to have. I'd suggest tracking workers by their IP address on your network.
- started_at: When did the worker start the job? By tracking jobs that have not completed within X amount of time, we know when a job needs to be picked up again and processed by another worker. Workers could stop processing or go offline for any number of reasons: power failure, crash, network loss, etc.
It is easy to see how this table could be extended with a few additional fields to allow for statistics tracking: a finish time column to see how long the job took, a counter to see how many workers picked up the job (ideally this should stay close to 1), job priority, and so on. In more complex job scenarios it would be possible to specify how much memory the worker needs (and therefore only use suitable workers), or even what type of worker is required.
Let's add a few example jobs:
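As a rough sketch of the `jobs` table and a few example rows, the snippet below builds the five fields described above and inserts three jobs. It uses Python's built-in `sqlite3` module with an in-memory database purely as a runnable stand-in for the MySQL server; the column types and the job names are assumptions made for illustration.

```python
import sqlite3

# In-memory sqlite3 stands in for the article's MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        id         INTEGER PRIMARY KEY,         -- uniquely identifies the job
        name       TEXT NOT NULL,               -- e.g. a client reference
        status     INTEGER NOT NULL DEFAULT 0,  -- 0 not started, 1 picked up, 2 completed
        started_by TEXT,                        -- worker identifier, e.g. its IP address
        started_at TEXT                         -- when the worker picked the job up
    )
    """
)

# Hypothetical job names, for illustration only.
example_jobs = [("client-a-batch-1",), ("client-a-batch-2",), ("client-b-batch-1",)]
conn.executemany("INSERT INTO jobs (name) VALUES (?)", example_jobs)
conn.commit()

rows = conn.execute("SELECT id, name, status FROM jobs ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Every new job starts with status 0 and empty `started_by` and `started_at` fields, which is what the selection queries later in the article rely on.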
The next table is again quite simple to understand: these are our job records. They are linked to the main jobs table by a column `jobs_id`. The makeup of this table very much depends on the data that you need to supply to your workers. Let's take a very simple example with four columns:
- id: ID of the record
- name: Person's name
- address: Person's address
- jobs_id: The job ID that this record is linked to
The third and final table is a results table. It has much the same makeup as our records table and, with the addition of some columns, could even be part of the records table:
- job_record_id: Links the result to its record in the job records table
- result: The result data
…and that's all you need for job control (albeit at a very basic level)! In my case I pointed to another table where my data to process was located, but this could just as easily have been a file, parameters to run simulation code, you name it.
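To make the relationship between the three tables concrete, here is a minimal sketch. Again `sqlite3` in memory stands in for MySQL. The table names `job_records` and `job_results` follow the spelling used in the article's queries; the column types, the sample rows, and the choice to make `job_record_id` a primary key (one result per record) are assumptions.

```python
import sqlite3

# In-memory sqlite3 stands in for the article's MySQL server (an assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (
        id INTEGER PRIMARY KEY, name TEXT, status INTEGER DEFAULT 0,
        started_by TEXT, started_at TEXT
    );
    CREATE TABLE job_records (
        id      INTEGER PRIMARY KEY,
        name    TEXT,                                 -- person's name
        address TEXT,                                 -- person's address
        jobs_id INTEGER NOT NULL REFERENCES jobs(id)  -- owning job
    );
    CREATE TABLE job_results (
        -- Assumed: at most one result per record.
        job_record_id INTEGER PRIMARY KEY REFERENCES job_records(id),
        result        TEXT
    );
    -- Made-up rows for illustration.
    INSERT INTO jobs (name) VALUES ('client-a-batch-1');
    INSERT INTO job_records (name, address, jobs_id) VALUES
        ('Ann Example', '1 High Street', 1),
        ('Bob Example', '2 Low Road', 1);
    INSERT INTO job_results (job_record_id, result) VALUES (1, 'processed');
    """
)

# Per job: how many records it has and how many results have been saved so far.
summary = conn.execute(
    """
    SELECT j.id, COUNT(r.id), COUNT(res.job_record_id)
    FROM jobs AS j
    LEFT JOIN job_records AS r ON r.jobs_id = j.id
    LEFT JOIN job_results AS res ON res.job_record_id = r.id
    GROUP BY j.id
    """
).fetchall()
print(summary)
```

Comparing the two counts per job is the same check the article recommends before a job is marked as complete.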
Selecting a job
As stated previously, the workers will do our job management for us for now, so all we really need to do is find a job that needs processing and get its information. How would we do this? We pick our job selection criteria and look for jobs. In SQL I did the following:
- Take any jobs that this worker picked up but never marked as complete, and reset them (substitute __ME__ with an identifier; the easiest would be the IP address):
UPDATE `jobs` SET `status` = 0 WHERE `status` = 1 AND `started_by` = __ME__;
- Using our job selection criteria, select a job and tell the control system that this worker is dealing with it:
UPDATE `jobs` SET `status` = 1, `started_by` = __ME__, `started_at` = NOW() WHERE `status` = 0 OR
(`status` = 1 AND `started_at` < DATE_SUB(NOW(), INTERVAL X HOUR)) ORDER BY `id` ASC LIMIT 1;
By grabbing jobs that haven't returned results in X amount of time we ensure that all jobs are run in the event of a worker crashing or going AWOL.
- Next, grab the job's details, followed by the records themselves:
SELECT * FROM `jobs` WHERE `status` = 1 AND `started_by` = __ME__ LIMIT 1;
SELECT * FROM `job_records` WHERE `jobs_id` = __JOBID__;
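The three steps above could be wrapped in a single function along the following lines. This is a sketch under several assumptions: `sqlite3` stands in for MySQL, `started_at` is stored as epoch seconds, the two-hour value for "X" is arbitrary, and the helper name `claim_job` is hypothetical. Because a stock SQLite build may not accept `UPDATE ... ORDER BY ... LIMIT`, the claim step selects the job id in a subquery instead.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 2 * 3600  # the article's "X hours"; the value 2 is an assumption


def claim_job(conn, worker_id, now=None):
    """Reset this worker's unfinished jobs, claim one job, return (job, records)."""
    now = int(time.time()) if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Step 1: anything still "picked up" by this worker was interrupted.
        conn.execute(
            "UPDATE jobs SET status = 0 WHERE status = 1 AND started_by = ?",
            (worker_id,),
        )
        # Step 2: claim the lowest-id job that is unstarted or has gone stale.
        conn.execute(
            """
            UPDATE jobs SET status = 1, started_by = ?, started_at = ?
            WHERE id = (
                SELECT id FROM jobs
                WHERE status = 0 OR (status = 1 AND started_at < ?)
                ORDER BY id ASC LIMIT 1
            )
            """,
            (worker_id, now, now - STALE_AFTER_SECONDS),
        )
        # Step 3: fetch the claimed job and the records that belong to it.
        job = conn.execute(
            "SELECT id, name FROM jobs WHERE status = 1 AND started_by = ? LIMIT 1",
            (worker_id,),
        ).fetchone()
        if job is None:
            return None, []
        records = conn.execute(
            "SELECT id, name, address FROM job_records WHERE jobs_id = ? ORDER BY id",
            (job[0],),
        ).fetchall()
    return job, records


# Made-up data for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, status INTEGER DEFAULT 0,
                       started_by TEXT, started_at INTEGER);
    CREATE TABLE job_records (id INTEGER PRIMARY KEY, name TEXT, address TEXT,
                              jobs_id INTEGER);
    INSERT INTO jobs (name) VALUES ('batch-1'), ('batch-2');
    INSERT INTO job_records (name, address, jobs_id) VALUES
        ('Ann', '1 High St', 1), ('Bob', '2 Low Rd', 1), ('Cy', '3 Mid Ln', 2);
    """
)
job, records = claim_job(conn, "10.0.0.5", now=1_000_000)
print(job, records)
```

Passing `now` explicitly keeps the sketch deterministic; a real worker would simply let it default to the current time.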
Upon completion of the job we insert our result records and mark the job as complete. Remember that workers can suspend and resume at any time, so allow for some robustness in your script. The task might suspend halfway through updating the job control system, so checking the number of records in a job against the number of results saved back to the job control system would be a wise move.
In addition, whilst this demonstrates how jobs can be selected and managed with SQL queries, you should really be abstracting your job control so that if you decide to switch to a web service, a file-based system, XML, or any number of other systems, it will not affect the code above it.
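One way to make that check useful after a suspend is to ask which records still lack a result, so the worker can resume rather than start again. The sketch below assumes the table layout described earlier, uses `sqlite3` as a stand-in for MySQL, and the helper name `records_missing_results` is hypothetical.

```python
import sqlite3


def records_missing_results(conn, job_id):
    """Return the ids of the job's records that have no saved result yet."""
    rows = conn.execute(
        """
        SELECT r.id FROM job_records AS r
        LEFT JOIN job_results AS res ON res.job_record_id = r.id
        WHERE r.jobs_id = ? AND res.job_record_id IS NULL
        ORDER BY r.id
        """,
        (job_id,),
    ).fetchall()
    return [row[0] for row in rows]


# Made-up data: three records, results saved for two of them before a "suspend".
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE job_records (id INTEGER PRIMARY KEY, name TEXT, address TEXT,
                              jobs_id INTEGER);
    CREATE TABLE job_results (job_record_id INTEGER PRIMARY KEY, result TEXT);
    INSERT INTO job_records (name, address, jobs_id) VALUES
        ('Ann', '1 High St', 1), ('Bob', '2 Low Rd', 1), ('Cy', '3 Mid Ln', 1);
    INSERT INTO job_results (job_record_id, result) VALUES (1, 'done'), (3, 'done');
    """
)
missing = records_missing_results(conn, 1)
print(missing)
```

An empty list means every record has a result and the job can safely be marked as complete.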
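A minimal sketch of what such an abstraction might look like is shown below. The interface `JobControl`, its two methods, and the toy `InMemoryJobControl` backend are all hypothetical names invented for illustration; the point is only that the worker loop talks to the interface and never to SQL directly.

```python
from abc import ABC, abstractmethod


class JobControl(ABC):
    """What a worker needs from job control, whatever technology sits behind it."""

    @abstractmethod
    def claim_job(self, worker_id):
        """Return (job_id, records) for a newly claimed job, or None if none is left."""

    @abstractmethod
    def submit_results(self, job_id, worker_id, results):
        """Store the results and mark the job as complete."""


class InMemoryJobControl(JobControl):
    """Toy backend; a SQL, web service or file backend would offer the same methods."""

    def __init__(self, jobs):
        self._pending = dict(jobs)  # job_id -> list of records
        self._claimed = {}          # job_id -> (worker_id, records)
        self.results = {}           # job_id -> list of results

    def claim_job(self, worker_id):
        if not self._pending:
            return None
        job_id = min(self._pending)
        records = self._pending.pop(job_id)
        self._claimed[job_id] = (worker_id, records)
        return job_id, records

    def submit_results(self, job_id, worker_id, results):
        _owner, records = self._claimed[job_id]
        if len(results) != len(records):
            raise ValueError("number of results does not match number of records")
        del self._claimed[job_id]
        self.results[job_id] = list(results)


def run_one_job(control, worker_id, process):
    """Worker-side code: depends only on the JobControl interface."""
    claimed = control.claim_job(worker_id)
    if claimed is None:
        return False
    job_id, records = claimed
    control.submit_results(job_id, worker_id, [process(r) for r in records])
    return True


control = InMemoryJobControl({1: ["ann", "bob"], 2: ["cy"]})
while run_one_job(control, "10.0.0.5", str.upper):
    pass
print(control.results)
```

Swapping the backend then means writing another `JobControl` subclass, while `run_one_job` stays untouched.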
Job Configuration
The next aspect to consider is job size and configuration. By playing with job configuration we can strike an excellent balance between speed, process replication, and reliability. Take a couple of scenarios:
- Jobs take 1 day each to run: This means that your workers need 15 days to process each job (remember: 10% of the power for two-thirds of the time). This is clearly not a wise configuration; your job size is way too big! It would take at least double the time to get a job processed should the initial worker go AWOL (time to notice that it hasn't returned a result, plus reprocessing time). In an ideal world you'd have at least one full job easily cleared by the end of each long idle period; that way you keep the jobs ticking over, and in the worst case a job would take two days to process should the first worker go missing.
- Jobs take 1 minute to run: This means that your workers take about 15 minutes to run each job. Whilst this may initially seem ideal (you gain additional processing during lunch time, coffee breaks, meetings, etc.), this scenario puts strain on other areas of your system and introduces its own problems. Firstly, setup time becomes a much larger share of each job, so you lose system efficiency. Your network is going to be constantly streaming job information to the various workers, frustrating staff who are doing their day-to-day work. You're also going to put more strain on your job server, as it has to dish out lots and lots of small pieces of work on a regular basis. Lastly, if your job server goes down you're going to create a huge backlog of uncompleted work, whereas bigger jobs could have continued processing, blissfully unaware that the job server was experiencing difficulties.
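The factor of 15 in both scenarios follows directly from the figures quoted above: a worker with 10% of the power, available two-thirds of the time, delivers 1/15 of the main server's throughput. A small sketch of that arithmetic (the function name is hypothetical):

```python
def worker_wall_clock(job_time, power_fraction=0.10, idle_fraction=2 / 3):
    """Elapsed time on a worker for a job that takes job_time on the main server."""
    return job_time / (power_fraction * idle_fraction)


for label, job_time, unit in [("large job", 1.0, "days"), ("small job", 1.0, "minutes")]:
    print(f"{label}: about {worker_wall_clock(job_time):.0f} {unit} on a worker")
```

Plugging in your own power and availability fractions gives a quick first estimate of how long a given job size will really occupy a worker.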
In reality there will be no single ideal configuration for your grid setup; much depends on the available resources, types of job, job turnaround requirements, network capacity, and so on. But some guidelines would be:
- Size jobs so that each worker can get through at least 3-4 jobs in a period of 15 hours (the longest likely idle period).
- Play with the job size so that setup time becomes fairly insignificant compared with processing time (taking the point above into account).
- If a job has not completed in double the time (perhaps less) you expect it to take, assume it has gone AWOL and start processing it with another worker. This means you may have to wait up to three times the normal length of a job for it to complete (possibly longer if the subsequent attempt also fails). You may want to reduce this time, but be careful not to reduce it too much, or you may start to overlap processing tasks on a regular basis.
- Jobs should be independent of external requirements as much as possible. The job server, for example, should only be contacted at the start and end of each job.
- Do not saturate your network. This will have two negative effects: your daytime staff will find using the network frustrating, and you may experience problems with connections timing out, an issue that will only get worse as you scale your grid.
- Ensure jobs can run on your workers. If jobs become too memory intensive or disk space intensive, jobs will start to give up, and the only thing you will notice is a drop in the number of jobs processed, with no obvious reason why.
Submitting the results of a job
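The numeric guidelines above (several jobs per 15 hour idle period, a small setup share, re-issuing after double the expected time) can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function name, the dictionary keys and the example figures are assumptions, and the times are meant as elapsed hours on a worker.

```python
def check_job_size(run_hours, setup_hours, idle_hours=15.0, min_jobs=3,
                   timeout_factor=2.0):
    """Compare one job size against the sizing guidelines."""
    total = run_hours + setup_hours
    jobs_per_idle = idle_hours / total
    return {
        "jobs_per_idle_period": jobs_per_idle,
        "fits_guideline": jobs_per_idle >= min_jobs,
        "setup_share": setup_hours / total,
        # Re-issue the job if no result arrives within this many hours.
        "reissue_after_hours": timeout_factor * total,
        # Wait for the timeout, then a second worker runs the job once more.
        "worst_case_hours": timeout_factor * total + total,
    }


# Hypothetical job: 4 hours of processing plus 15 minutes of setup on a worker.
report = check_job_size(run_hours=4.0, setup_hours=0.25)
for key, value in report.items():
    print(key, value)
```

With a timeout factor of 2, the worst case works out at three times the normal job length, which matches the guideline above.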
When submitting the results of a job it is important to check that results have not been submitted by another worker, especially if the current worker has been dormant for some time.
When results are submitted ensure that the number of results matches the number of records within the job.
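Both checks could be combined in one submission routine, roughly as sketched here. As before, `sqlite3` stands in for MySQL, the table layout is the one described earlier, and the helper name `submit_results` and the sample data are hypothetical.

```python
import sqlite3


def submit_results(conn, job_id, worker_id, results):
    """Save {record_id: result} for a job; return False if the job is no longer ours."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT status, started_by FROM jobs WHERE id = ?", (job_id,)
        ).fetchone()
        # Another worker may have finished or re-claimed the job while we slept.
        if row is None or row[0] != 1 or row[1] != worker_id:
            return False
        expected = conn.execute(
            "SELECT COUNT(*) FROM job_records WHERE jobs_id = ?", (job_id,)
        ).fetchone()[0]
        if len(results) != expected:
            raise ValueError(f"expected {expected} results, got {len(results)}")
        conn.executemany(
            "INSERT INTO job_results (job_record_id, result) VALUES (?, ?)",
            list(results.items()),
        )
        conn.execute("UPDATE jobs SET status = 2 WHERE id = ?", (job_id,))
    return True


# Made-up data: one job with two records, already picked up by worker 10.0.0.5.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, status INTEGER DEFAULT 0,
                       started_by TEXT, started_at TEXT);
    CREATE TABLE job_records (id INTEGER PRIMARY KEY, name TEXT, address TEXT,
                              jobs_id INTEGER);
    CREATE TABLE job_results (job_record_id INTEGER PRIMARY KEY, result TEXT);
    INSERT INTO jobs (name, status, started_by) VALUES ('batch-1', 1, '10.0.0.5');
    INSERT INTO job_records (name, address, jobs_id) VALUES
        ('Ann', '1 High St', 1), ('Bob', '2 Low Rd', 1);
    """
)
first = submit_results(conn, 1, "10.0.0.5", {1: "ok", 2: "ok"})
late = submit_results(conn, 1, "10.0.0.6", {1: "ok", 2: "ok"})
print(first, late)
```

On a real multi-worker database the ownership check and the final update would also need to be made atomic with respect to other workers; this sketch does not attempt that.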
As stated previously, and it cannot be overemphasised, build fault tolerance into job retrieval and results submission. The workers can (and most likely will) go into suspend mode at the most inconvenient of times, and this needs to be catered for. Once again, abstracting away your results submission will make future changes to your job control system much easier to deal with.
Summary
In this section we have looked at what a job control server needs to do and how to get a very basic system set up. We discussed how to retrieve a job from the control system and how best to configure jobs to get the most out of your office grid system. Finally, we looked briefly at submitting results back to the job control server.
- A job control server manages jobs and ensures that all work units are completed
- By abstracting your job selection and results submission, you can change the technology of the control server without much trouble
- Configure your jobs to ensure that they are run quickly and efficiently without putting too much pressure on your network infrastructure, and without duplicating processing tasks on a regular basis.
- Ensure that you build fault tolerance and error checking into your routines; workers can suspend and resume at the most inconvenient of times. Remember to check whether results have already been submitted by another worker.
Next time
In part 3 we'll create our virtual processing machine and set up our Windows machines to become idle-time workers.