Introduction
I work in a company where we run a lot of batch processing jobs over millions of data records every day, and I've been thinking lately about all the machines that sit around every day doing nothing for several hours. Wouldn't it be good if we could use those machines to boost the processing power of our systems? In this series of articles I look at the potential benefits of employing an office grid using virtualised environments.
In part 1, I gave an overview of the system and the technologies to be used, and discussed some of the possible reasons why you might want to create an office grid.
Job Control
If you are going to run jobs, then you are going to need some way to manage them. Your job control system (your job server) needs to be very well thought out before you even attempt to start up your office grid. So, first off, what are the tasks of a job control system?
- Hand out jobs to workers on request
- Tell workers what type of job to run
- Keep track of jobs
- Make sure that jobs are only run once
- Provide job data to workers, or at least tell them where to get it
The system also needs to be extensible. A solution that works for now in a single case may later be extended to run several types of jobs as the business sees the worth in a grid solution. For example, jobs may gain priorities, more than one job type may exist (i.e. several code bases), and eventually you may even run several different worker machines, each optimised for one type of job (although that does move away from the 'generic worker' idea). Always try to think about the future when developing systems; a short-term vision can lead to longer-term frustration and increased development time.
Job Server
We're going to need somewhere to control our jobs from. This should be the only system in your grid that has a fixed resource locator, be that an IP address, host name, URL (using internal DNS), etc. This is because the workers need to know where to look for jobs: the workers find the job control system, not the other way round.
The job server itself doesn't really have a complicated task (in a basic system anyhow). It needs to store a list of jobs, hand out jobs, receive results, and store them for later retrieval. How these parts (such as 'hand out jobs') are defined can be very basic. Later on we can extend the system to include an administration interface to add, edit, delete, and suspend jobs, but this is beyond this exercise.
There is no reason, then, why your job server could not be a virtual machine running within your main processing server, provided it doesn't drain too many resources from it. The job server does, however, need high availability: if it goes down on a Friday evening you're going to lose a whole weekend of processing, potentially costing you a couple of weeks' worth of processing time (when compared to your main processing server alone). You may want to consider putting your job server in a load-balanced environment for high availability.
Basic Setup
The basic setup for our job server will consist of what I'm calling one of my LiMP servers (that is Linux, MySQL, PHP). The code running on the workers will actually work out what jobs it can run by interacting with the job control system's database. Later on we could create a web service and actually hand out jobs rather than having the workers do the hard work themselves, but for now we'll continue using the KISS principle (Keep it Simple, Stupid!).
So, let's create three MySQL tables to deal with jobs. These will be `jobs`, `jobRecords`, and `jobResults`.
Here I'm using SQL Buddy, a great little alternative to phpMyAdmin, just because it's easier to install on CentOS (for others see: 10 Great alternatives to phpMyAdmin).
The `jobs` table consists of 5 simple fields:
- id: Uniquely identify the job
- name: Could be a client reference, or any number of other identifiers
- status: You need to know where the job is at, e.g.
  - 0: Not started
  - 1: Picked up
  - 2: Completed
- started_by: Who's started doing the job? This isn't entirely required but is a nice to have. I'd suggest tracking workers by their IP address on your network
- started_at: When did the worker start the job? By tracking jobs that have not completed within X amount of time we know we need to pick up the job once again and start processing by another worker. Workers could stop processing/go offline for any number of reasons, power failure, crash, network loss, etc.
It is easy to see how this table could be extended with a few additional fields to allow for statistics tracking: a finish time column to see how long the job took, a counter to see how many workers picked up the job (obviously this needs to tend to 1), job priority; the list goes on and on. In more complex job scenarios it would be possible to specify how much memory the worker would need access to (and therefore only use suitable workers), or even what type of worker would be required.
Let's add a few example jobs:
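As a rough sketch of the `jobs` table with a few example rows, the snippet below uses Python's built-in `sqlite3` module as a stand-in for MySQL so that it runs without a server. Only the five field names and the status codes come from the description above; the column types, the helper name `create_jobs_table`, and the job names are assumptions made up for illustration.

```python
import sqlite3

# Status codes from the field list above.
NOT_STARTED, PICKED_UP, COMPLETED = 0, 1, 2


def create_jobs_table(conn: sqlite3.Connection) -> None:
    """Create the five-field jobs table (column types are assumed)."""
    conn.execute(
        """
        CREATE TABLE jobs (
            id         INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            status     INTEGER NOT NULL DEFAULT 0,
            started_by TEXT,
            started_at TEXT
        )
        """
    )


conn = sqlite3.connect(":memory:")
create_jobs_table(conn)

# Hypothetical example jobs; the names are invented.
conn.executemany(
    "INSERT INTO jobs (name) VALUES (?)",
    [("client-a-batch-1",), ("client-a-batch-2",), ("client-b-batch-1",)],
)
conn.commit()

for row in conn.execute("SELECT id, name, status FROM jobs ORDER BY id"):
    print(row)
```

Every new job starts with status 0 and empty `started_by` / `started_at` values, which is what the workers look for when they pick up work.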
The next table is again quite simple to understand: these are our job records. They are linked to the main jobs table by a column `jobs_id`. The makeup of this table very much depends on the data that you need to supply to your workers. Let's take a very simple example where we have four columns:
- id: ID of the record
- name: Person's name
- address: Person's address
- jobs_id: The job ID that this record is linked to
The third and final table is the results table. It has much the same makeup as our records table and, with the addition of some columns, could be part of the records table:
- job_record_id: Links the result to its record in the records table
- result: The result data
…and that's all you need for job control (albeit at a very basic level)! In my case I pointed to another table where my data to process was located, but this could just as easily have been a file, parameters to run simulation code, you name it.
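The records and results tables described above could be sketched as follows, again with Python's `sqlite3` standing in for MySQL. The column types, the choice to make `job_record_id` a primary key (so each record gets at most one result), and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobRecords (
        id      INTEGER PRIMARY KEY,
        name    TEXT,
        address TEXT,
        jobs_id INTEGER NOT NULL            -- job this record belongs to
    );
    CREATE TABLE jobResults (
        job_record_id INTEGER PRIMARY KEY,  -- assumed: one result per record
        result        TEXT
    );
    """
)

# Two invented records for a hypothetical job 1, and a result for the first.
conn.executemany(
    "INSERT INTO jobRecords (name, address, jobs_id) VALUES (?, ?, ?)",
    [("Ann Example", "1 Sample Street", 1), ("Bob Example", "2 Sample Street", 1)],
)
conn.execute("INSERT INTO jobResults (job_record_id, result) VALUES (1, 'ok')")

# Records of job 1 that still have no result.
pending = conn.execute(
    """
    SELECT r.id FROM jobRecords r
    LEFT JOIN jobResults res ON res.job_record_id = r.id
    WHERE r.jobs_id = 1 AND res.job_record_id IS NULL
    """
).fetchall()
print(pending)
```

Keeping results in their own table makes it cheap to ask which records of a job are still outstanding, which becomes useful once workers can stop part-way through a job.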
Selecting a job
As previously stated, the workers are going to do all of the job management for us for now, so all we really need to do is find a job that needs processing and get its information. How do we do this? Well, take our job selection criteria and look for a job. In SQL I did the following:
- Pick up any jobs that have not been marked as completed by our worker and reset them (replace __ME__ with an identifier, the easiest being the IP address):
UPDATE `jobs` SET `status` = 0 WHERE `status` = 1 AND `started_by` = __ME__;
- Using our job selection criteria, select a job and tell the control system that this worker is dealing with it:
UPDATE `jobs` SET `status` = 1, `started_by` = __ME__, `started_at` = NOW() WHERE `status` = 0 OR
(`status` = 1 AND `started_at` < DATE_SUB(NOW(), INTERVAL X HOUR)) ORDER BY `id` ASC LIMIT 1;
By picking up jobs that have not returned results within X amount of time, we ensure that all jobs are run in the event of a worker crashing or going AWOL.
- Next get the job details, followed by the records themselves:
SELECT * FROM `jobs` WHERE `started_by` = __ME__ AND `status` = 1 LIMIT 1;
SELECT * FROM `jobRecords` WHERE `jobs_id` = __JOBID__;
After completing the job we submit our result records and mark the job as complete. Remember that jobs can suspend and resume at any time, so allow for some robustness in your script. It could be that the worker suspends midway through updating the job control system, so checking the number of records in the job against the number of results written back to the job control system would be a wise move.
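The three selection steps above can be sketched in one function. This is a minimal illustration rather than the code I run: it uses Python's `sqlite3` in place of MySQL, stores `started_at` as a Unix timestamp, and picks the job through a subquery because a stock SQLite build does not accept `ORDER BY ... LIMIT` on `UPDATE`. The name `claim_job`, the two-hour `STALE_AFTER_SECONDS` value, and the sample data are assumptions.

```python
import sqlite3
import time

STALE_AFTER_SECONDS = 2 * 60 * 60  # assumed value for "X hours"


def claim_job(conn, me, now=None):
    """Reset this worker's unfinished jobs, then claim one job.

    Returns (job_id, records), or None when there is nothing to do.
    """
    now = time.time() if now is None else now
    with conn:  # one transaction: commit on success, roll back on error
        # Step 1: release jobs this worker picked up but never completed.
        conn.execute(
            "UPDATE jobs SET status = 0 WHERE status = 1 AND started_by = ?",
            (me,),
        )
        # Step 2: claim the lowest-id job that is unstarted or has gone stale.
        cur = conn.execute(
            """
            UPDATE jobs SET status = 1, started_by = ?, started_at = ?
            WHERE id = (
                SELECT id FROM jobs
                WHERE status = 0 OR (status = 1 AND started_at < ?)
                ORDER BY id LIMIT 1
            )
            """,
            (me, now, now - STALE_AFTER_SECONDS),
        )
        if cur.rowcount == 0:
            return None
        # Step 3: look up the job we now own and load its records.
        job_id = conn.execute(
            "SELECT id FROM jobs WHERE status = 1 AND started_by = ? LIMIT 1",
            (me,),
        ).fetchone()[0]
        records = conn.execute(
            "SELECT id, name, address FROM jobRecords WHERE jobs_id = ?",
            (job_id,),
        ).fetchall()
    return job_id, records


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT,
                       status INTEGER DEFAULT 0, started_by TEXT, started_at REAL);
    CREATE TABLE jobRecords (id INTEGER PRIMARY KEY, name TEXT,
                             address TEXT, jobs_id INTEGER);
    INSERT INTO jobs (name) VALUES ('job-1'), ('job-2');
    INSERT INTO jobRecords (name, address, jobs_id)
        VALUES ('Ann Example', '1 Sample Street', 1);
    """
)
print(claim_job(conn, "10.0.0.5", now=1000.0))
```

Passing `now` in explicitly is only there to make the timeout behaviour easy to try out; a real worker would leave it at the default.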
In addition, whilst this shows how jobs can be selected and managed using raw SQL queries, you should really be abstracting your job control so that if you decide to move to web services, a file-based system, XML, or any other set of systems, it will not affect the code above it.
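One possible shape for that abstraction is sketched below in Python. The interface name `JobControl`, its two methods, the toy `InMemoryJobControl` backend, and `run_worker` are all hypothetical; the point is only that the worker loop never sees SQL, so a database, web service, or file-based backend could be swapped in behind the same two calls.

```python
from abc import ABC, abstractmethod


class JobControl(ABC):
    """What a worker needs from job control, whatever the backend is."""

    @abstractmethod
    def claim_job(self, worker_id):
        """Return (job_id, records), or None if no job is available."""

    @abstractmethod
    def submit_results(self, job_id, results):
        """Store the results for a job and mark it as completed."""


class InMemoryJobControl(JobControl):
    """Toy backend for illustration only."""

    def __init__(self, jobs):
        self._pending = dict(jobs)  # job_id -> list of records
        self._claimed = {}
        self.results = {}           # job_id -> list of results

    def claim_job(self, worker_id):
        if not self._pending:
            return None
        job_id = min(self._pending)
        self._claimed[job_id] = self._pending.pop(job_id)
        return job_id, self._claimed[job_id]

    def submit_results(self, job_id, results):
        if len(results) != len(self._claimed[job_id]):
            raise ValueError("result count does not match record count")
        self.results[job_id] = list(results)


def run_worker(control, worker_id, process):
    """Worker loop that only talks to the JobControl interface."""
    done = 0
    while (claimed := control.claim_job(worker_id)) is not None:
        job_id, records = claimed
        control.submit_results(job_id, [process(r) for r in records])
        done += 1
    return done


control = InMemoryJobControl({1: ["ann", "bob"], 2: ["cat"]})
print(run_worker(control, "10.0.0.5", str.upper), control.results)
```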
Job Configuration
The next aspect to think about is job size and configuration. By playing with job configuration you can strike an excellent balance between speed, process replication, and reliability. Take a couple of scenarios:
- Jobs take 1 day each to run: This means that workers need 15 days to process each job (remember: 10% of the power for 2/3rds of the time). This is obviously not a wise configuration; your job size is too big! It will take at least twice as long to get a job processed should the initial worker go AWOL (the time to notice that it hasn't returned a result plus the reprocessing time). Ideally you want at least one full job comfortably completed by the end of each extended idle period. That way you keep jobs ticking over, and in the worst case a job takes two days to process if the first worker went missing.
- Jobs take 1 minute to run: This means that workers take around 15 minutes to run each job. Whilst this may initially seem ideal, since you get extra processing done during lunch, coffee breaks, meetings, etc., this scenario puts pressure on other areas of your system and introduces its own problems. For one, the ratio of setup time to processing time will go right up, so the system loses efficiency. Your network is going to be constantly streaming job information to the various workers, frustrating staff who are doing their day-to-day work. You're also going to put more strain on your job server as it has to dish out lots and lots of small pieces of work on a regular basis. Lastly, in this situation, if your job server goes down you're going to create a huge backlog of uncompleted work, whereas bigger jobs could have continued processing blissfully unaware that the job server was experiencing difficulties.
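The factor of 15 in both scenarios comes straight from the "10% of the power for 2/3rds of the time" figure: a worker delivers 0.10 × 2/3 ≈ 0.067 of a dedicated machine, so elapsed time is the processing time divided by that fraction. A small sketch of the arithmetic (the function name and parameter names are my own):

```python
def wall_clock_time(cpu_time, power_fraction=0.10, idle_fraction=2 / 3):
    """Elapsed time for a job on a worker that offers `power_fraction` of a
    dedicated machine's power and is available `idle_fraction` of the time.
    The result is in the same unit as `cpu_time`."""
    return cpu_time / (power_fraction * idle_fraction)


# 1 day of processing -> about 15 days elapsed; 1 minute -> about 15 minutes.
print(round(wall_clock_time(1), 6))
```

Plugging in your own measured figures for the two fractions gives a quick feel for how large a job can sensibly be on your workers.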
In reality there will be no one ideal configuration for your grid setup, much depends on the available resources, types of job, job turnaround time requirements, network capability, and so on. However some guidelines would be:
- Size jobs so that each worker can get through at least 3-4 jobs in a period of 15 hours (the longest likely idle time period)
- Play with the job size so that setup time becomes fairly insignificant compared to the processing time (bearing in mind the above point).
- If a job doesn't complete in double the amount of time you expect it to take (maybe less), assume that it has gone AWOL and start processing it with another worker. This means you may have to wait up to three times the normal length of a job for it to complete (possibly longer if the subsequent job fails). You may want to reduce this time, but be careful not to reduce it too much as you may start duplicating processing tasks on a regular basis.
- Jobs should be independent of outside requirements as much as possible. The job server, for example, should only be contacted at the start and end of every job.
- Don't saturate your network. This will have two negative effects: your daytime staff will find using the network frustrating, and connections may start timing out, a problem that will only get worse as you scale your grid.
- Ensure jobs can run on your workers. If jobs become too memory intensive or disk space intensive, they will start aborting, and the only thing you'll notice is a drop in the number of jobs processed with no obvious reason why.
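The first and third guidelines can be turned into a couple of small helpers. This is only a sketch under simple assumptions: setup time is ignored, `seconds_per_record` is measured on the worker itself (so it already includes the slowdown from running at low priority), and the function names and defaults are hypothetical.

```python
def records_per_job(seconds_per_record, idle_hours=15.0, jobs_per_idle_period=4):
    """Largest job size that lets a worker finish the target number of jobs
    within one idle period."""
    job_seconds = idle_hours * 3600 / jobs_per_idle_period
    return max(1, int(job_seconds // seconds_per_record))


def reassign_after(expected_job_seconds, factor=2.0):
    """Seconds to wait before assuming a job has gone AWOL."""
    return expected_job_seconds * factor


def worst_case_completion(expected_job_seconds, factor=2.0):
    """The first worker vanishes; a second worker starts at the timeout
    and then runs the whole job."""
    return reassign_after(expected_job_seconds, factor) + expected_job_seconds


expected = 15 * 3600 / 4  # seconds per job at 4 jobs per 15-hour idle period
print(records_per_job(2.5), reassign_after(expected), worst_case_completion(expected))
```

With the default factor of 2, `worst_case_completion` comes out at three times the expected job length, which is where the "up to three times" figure in the guidelines comes from.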
Submitting Results of a Job
When submitting the results of a job it is important to check that results have not been submitted by another worker, especially if the current worker has been dormant for some time.
When results are submitted ensure that the number of results matches the number of records within the job.
As stated previously, and it cannot be overemphasised, build fault tolerance into job retrieval and results submission. The workers can (and most likely will) go into suspend mode at the most inconvenient of times and this needs to be catered for. Once again, abstracting away your results submission will make future changes to your job control system much easier to deal with.
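A minimal sketch of these checks, with Python's `sqlite3` again standing in for MySQL: the submission runs in one transaction, is rejected if the job is no longer in the 'picked up' state for this worker, and fails loudly if the result count does not match the record count. The function name `submit_results`, the trimmed-down schema, and the sample data are assumptions. On MySQL you would probably also want to lock the job row (for example with `SELECT ... FOR UPDATE` on InnoDB) so two workers cannot pass the ownership check at the same moment.

```python
import sqlite3


def submit_results(conn, job_id, me, results):
    """Store results for a job and mark it completed, all or nothing.

    `results` maps record id -> result value. Returns False when the
    submission is rejected (job already finished or taken over).
    """
    with conn:  # single transaction: a failure part-way rolls everything back
        owner = conn.execute(
            "SELECT started_by FROM jobs WHERE id = ? AND status = 1", (job_id,)
        ).fetchone()
        if owner is None or owner[0] != me:
            return False  # completed already, or reassigned while we slept
        expected = conn.execute(
            "SELECT COUNT(*) FROM jobRecords WHERE jobs_id = ?", (job_id,)
        ).fetchone()[0]
        if len(results) != expected:
            raise ValueError(f"expected {expected} results, got {len(results)}")
        conn.executemany(
            "INSERT INTO jobResults (job_record_id, result) VALUES (?, ?)",
            list(results.items()),
        )
        conn.execute("UPDATE jobs SET status = 2 WHERE id = ?", (job_id,))
    return True


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, status INTEGER, started_by TEXT);
    CREATE TABLE jobRecords (id INTEGER PRIMARY KEY, jobs_id INTEGER);
    CREATE TABLE jobResults (job_record_id INTEGER PRIMARY KEY, result TEXT);
    INSERT INTO jobs VALUES (1, 1, '10.0.0.5');
    INSERT INTO jobRecords VALUES (1, 1), (2, 1);
    """
)
print(submit_results(conn, 1, "10.0.0.5", {1: "ok", 2: "ok"}))  # owner submits
print(submit_results(conn, 1, "10.0.0.9", {1: "ok", 2: "ok"}))  # late duplicate
```

A worker that wakes from suspend and gets `False` back simply discards its results and asks for a new job.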
Summary
In this section we have looked at what a job control server needs to do and how to get a very basic system set up. We discussed how to retrieve a job from the control system and how best to configure jobs to get the most out of your office grid system. To finish, we covered submitting results back to the job control server.
- A job control server manages jobs and ensures that all work units are completed
- By abstracting your job selection and results submission we can change the technology of the control server without many problems
- Configure your jobs to ensure that they are run quickly and efficiently without putting too much pressure on your network infrastructure, and without duplicating processing tasks on a regular basis.
- Ensure that you build fault tolerance and error checking into your routines; workers can suspend and resume at the most inconvenient of times. Remember to check whether results have already been submitted by another worker.
Next time
In part 3 we'll create our virtual processing machine and set up our Windows machines to become idle-time workers.