Introduction
I work at a company where we batch-process jobs containing millions of data records every day, and I've been thinking lately about all the machines that sit around each and every day doing nothing for several hours. Wouldn't it be nice if we could use those machines to boost the processing power of our systems? In this set of articles I'm going to look at the potential benefits of running an office grid using virtualised environments.
In part 1 I gave an overview of the system and the technologies it will use, and discussed some of the possible reasons why you might want to create an office grid.
Job Control
If you are going to be running jobs, then you are going to need some way to manage them. Your job control system (on the job server) needs to be very well thought out before you even attempt to run an office grid. So first of all, what are the tasks of a job control system?
- Hand out jobs at the request of workers
- Tell workers what type of jobs to run
- Keep track of jobs
- Ensure that jobs are only run once
- Provide the job data to the workers, or at least tell them where to get it
The system should also be extensible: a solution that works for now in a single case can be extended to run several types of jobs as the business sees the value in a grid solution. For example, jobs could gain priorities, more than one type of job could exist (i.e. multiple code bases), and eventually you could even run several different worker machines that are optimised for each job type (although that moves away from the "generic workers" idea). I always try to think ahead when developing systems; a short-term view can lead to long-term frustration and increased development time.
Job Server
We're going to need somewhere to control our jobs from. This should be the only system on your network that has a fixed resource locator, be it an IP address, a host name, a URL (using internal DNS), etc. This is because the workers need to know where to look for jobs: the workers need to find the job control system (not the job control system find the workers).
The job server itself doesn't really have a complicated task (in a basic system anyway): it needs to store a list of jobs, hand out jobs, receive results, and store them for later retrieval. How these parts (such as "hand out jobs") are defined can be very basic. Later on you could extend the system to include an administration interface to add, edit, delete, and suspend jobs, but that is beyond this exercise.
There's no reason then that your job server couldn't be a virtual machine running on your main processing server, as long as it doesn't drain too many resources from it. The job server does however need high availability: if it goes down on a Friday night you're going to lose a weekend of processing, which could cost you a couple of weeks of processing time (compared with the main processing server alone). You may want to consider placing your job server in a high-availability, load-balanced environment.
Basic Setup
The basic setup of our job server will consist of what I'm calling one of my Bizkit servers (that is Linux, MySQL, PHP). The code running on the workers will actually work out what jobs it can run by interacting with the job control system's database. Later on we could create a web service and actually hand out jobs rather than having the workers do the hard work themselves, but for now we'll continue using the KISS principle (Keep It Simple, Stupid!).
So, let's create three MySQL tables to deal with jobs. These will be `jobs`, `job_records`, and `job_results`.
Here I'm using SQL Buddy, a great little alternative to phpMyAdmin, just because it's easier to install on CentOS (for others see: 10 Great alternatives to phpMyAdmin).
The `jobs` table consists of 5 simple fields:
- id: Uniquely identifies the job
- name: Could be a client reference, or any number of other identifiers
- status: You need to know where the job is at, e.g.
  - 0: Not started
  - 1: Picked up
  - 2: Completed
- started_by: Who started the job? This isn't entirely necessary, but it's a nice-to-have. I suggest tracking workers by their IP address on the network.
- started_at: When did the worker start the job? By tracking jobs that haven't completed within X amount of time, we know that we need to pick the job up again and start processing it with another worker. Workers could stop processing or go offline for any number of reasons: power failure, crash, loss of network, etc.
It's easy to see how this table could be extended with a few extra fields to allow statistics tracking: a completion-time column to see how long the job took, a counter to see how many workers picked up the job (obviously this should tend to 1), job priority; the list can go on and on. In more complex job scenarios it would be possible to specify how much memory the worker would need access to (and therefore only use suitable workers), or even what type of worker would be required.
Let's add a few example jobs:
The next table is again quite simple to understand: these are our job records. They are linked to the main jobs table by a column `jobs_id`. The makeup of this table very much depends on the data that you need to supply to your workers; let's take a very simple example where we have four columns:
- id: ID of the record
- name: Person's name
- address: Person's address
- jobs_id: The job ID that this record is linked to
The third and final table is a results table. It has much the same makeup as our records table and, with the addition of some columns, could be part of the records table:
- job_record_id: Links the result to its record in the job records table
- result: The result data
…and that's all you need for job control (albeit at a very basic level)! In my case I pointed to another table where my data to process was located, but this could just as easily have been a file, parameters to run simulation code, you name it.
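The three tables described above could be sketched roughly as follows. This is only an illustration: it uses Python's built-in `sqlite3` module with an in-memory database so that it runs without a server, whereas the article's setup uses MySQL. The column types, the `add_job` helper, and the sample data are assumptions; the table names follow the `job_records` spelling used in the queries later in this article.

```python
import sqlite3

# Assumed column types; the article only names the columns.
SCHEMA = """
CREATE TABLE jobs (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    status     INTEGER NOT NULL DEFAULT 0,  -- 0 not started, 1 picked up, 2 completed
    started_by TEXT,                        -- e.g. the worker's IP address
    started_at TEXT
);
CREATE TABLE job_records (
    id      INTEGER PRIMARY KEY,
    name    TEXT,
    address TEXT,
    jobs_id INTEGER NOT NULL REFERENCES jobs(id)
);
CREATE TABLE job_results (
    job_record_id INTEGER NOT NULL REFERENCES job_records(id),
    result        TEXT
);
"""


def create_job_control_db():
    """Create an empty in-memory job control database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def add_job(conn, name, records):
    """Insert one job plus its records; `records` is a list of (name, address)."""
    cur = conn.execute("INSERT INTO jobs (name) VALUES (?)", (name,))
    job_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO job_records (name, address, jobs_id) VALUES (?, ?, ?)",
        [(rec_name, address, job_id) for rec_name, address in records],
    )
    conn.commit()
    return job_id
```

A new job starts with `status` 0 and empty `started_by` / `started_at`, which is the state the workers look for when selecting work.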
Selecting a job
As stated previously, the workers will do our job management for us for now, so all we really need to do is find a job that needs processing and get its information. How would we do this? We pick our job selection criteria and look for jobs. In SQL I did the following:
- Take any jobs that belong to our worker but are not marked as complete, and reset them (substitute __ME__ with an identifier; the easiest would be the IP address):
UPDATE `jobs` SET `status` = 0 WHERE `status` = 1 AND `started_by` = __ME__;
- Using our job selection criteria, select a job and tell the control system that this worker is dealing with it:
UPDATE `jobs` SET `status` = 1, `started_by` = __ME__, `started_at` = NOW()
WHERE `status` = 0 OR (`status` = 1 AND `started_at` < DATE_SUB(NOW(), INTERVAL X HOUR))
ORDER BY `id` ASC LIMIT 1;
By grabbing jobs that haven't returned results in X amount of time we ensure that all jobs are run in the event of a worker crashing or going AWOL.
- Next grab the job's details, followed by the records themselves:
SELECT * FROM `jobs` WHERE `status` = 1 AND `started_by` = __ME__ LIMIT 1;
SELECT * FROM `job_records` WHERE `jobs_id` = __JOBID__;
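The selection steps above can be sketched as a single worker-side function. This is a minimal illustration under stated assumptions, not the article's actual worker code: it uses Python's `sqlite3` instead of MySQL, the names `claim_job` and `JOBS_DDL` are made up for the example, and timestamps are stored as text. Because SQLite builds do not always support `UPDATE ... ORDER BY ... LIMIT`, the claim is split into a SELECT and an UPDATE inside one transaction.

```python
import sqlite3
from datetime import datetime, timedelta

# Assumed definition of the jobs table (SQLite types, not MySQL).
JOBS_DDL = """CREATE TABLE jobs (
    id INTEGER PRIMARY KEY, name TEXT,
    status INTEGER NOT NULL DEFAULT 0, started_by TEXT, started_at TEXT)"""

STAMP = "%Y-%m-%d %H:%M:%S"  # fixed-width format, so text comparison orders by time


def claim_job(conn, worker_id, timeout_hours, now=None):
    """Reset this worker's unfinished jobs, then claim one. Returns a job id or None."""
    now = now or datetime.now()
    cutoff = (now - timedelta(hours=timeout_hours)).strftime(STAMP)
    with conn:  # commits on success, rolls back if an exception is raised
        # Step 1: release anything this worker picked up but never finished.
        conn.execute(
            "UPDATE jobs SET status = 0 WHERE status = 1 AND started_by = ?",
            (worker_id,),
        )
        # Step 2: lowest-id job that is either new or picked up too long ago.
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 0 "
            "OR (status = 1 AND started_at < ?) ORDER BY id LIMIT 1",
            (cutoff,),
        ).fetchone()
        if row is None:
            return None
        # Step 3: tell the control system this worker is dealing with it.
        conn.execute(
            "UPDATE jobs SET status = 1, started_by = ?, started_at = ? WHERE id = ?",
            (worker_id, now.strftime(STAMP), row[0]),
        )
    return row[0]
```

The optional `now` argument exists only so the timeout behaviour can be exercised without waiting; a real worker would leave it unset.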
Upon completion of the job we insert our result records and mark the job as complete. Remember that, as jobs can suspend and resume at any time, you should allow for some robustness in your script. It might be that the task suspends halfway through updating the job control system, so checking the number of records in a job against the number of results saved back to the job control system would be a wise move.
In addition, whilst this demonstrates how jobs can be selected and managed using SQL queries, you should really be abstracting your job control so that if you decide to switch to a web service, a file-based system, XML, or any number of other systems, it will not affect the code above it.
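One way to picture that abstraction is a small interface that the worker code depends on, with the storage technology hidden behind it. The sketch below is hypothetical: the class and method names (`JobControl`, `InMemoryJobControl`, `run_one_job`) are invented for illustration, and the in-memory backend is a toy stand-in for a MySQL, web-service, or file-based implementation.

```python
from abc import ABC, abstractmethod


class JobControl(ABC):
    """What a worker needs from job control, whatever the backend is."""

    @abstractmethod
    def claim_job(self, worker_id):
        """Return the id of a job for this worker, or None if there is none."""

    @abstractmethod
    def fetch_records(self, job_id):
        """Return the list of records belonging to the job."""

    @abstractmethod
    def submit_results(self, job_id, results):
        """Store the results and mark the job as complete."""


class InMemoryJobControl(JobControl):
    """Toy backend; a database or web-service backend would expose the same methods."""

    def __init__(self, jobs):
        self._records = dict(jobs)  # job id -> list of records
        self._status = {job_id: 0 for job_id in self._records}
        self.results = {}

    def claim_job(self, worker_id):
        for job_id in sorted(self._status):
            if self._status[job_id] == 0:
                self._status[job_id] = 1
                return job_id
        return None

    def fetch_records(self, job_id):
        return list(self._records[job_id])

    def submit_results(self, job_id, results):
        self.results[job_id] = list(results)
        self._status[job_id] = 2


def run_one_job(control, worker_id, process):
    """Worker logic that only talks to the JobControl interface."""
    job_id = control.claim_job(worker_id)
    if job_id is None:
        return None
    results = [process(record) for record in control.fetch_records(job_id)]
    control.submit_results(job_id, results)
    return job_id
```

With this shape, `run_one_job` would not need to change if the backend were swapped for a different implementation of the same three methods.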
Job Configuration
The next aspect to consider is job size and configuration. By playing with job configuration we can strike an excellent balance between speed, process replication, and reliability. Take a couple of scenarios:
- Jobs take 1 day each to run: This means that your workers need 15 days to process each job (remember, 10% of the power for 2/3rds of the time). This is clearly not a wise configuration; your job size is way too big! It would take at least double the time to get a job processed should the initial worker go AWOL (time to notice that it hasn't returned a result, plus reprocessing time). In an ideal world you'd have at least one full job easily cleared by the end of each long idle period; that way you keep the jobs ticking over, and at worst a job would take two days to process should the first worker go missing.
- Jobs take 1 minute to run: This means that your workers take about 15 minutes to run each job. Whilst this may initially seem ideal (you gain additional processing during lunch time, coffee breaks, meetings, etc.), this scenario puts strain on other areas of your system and introduces its own problems. Firstly, setup time becomes a much larger share of each job's run time, so you lose system efficiency. Your network is going to be constantly streaming job information to the various workers, frustrating staff who are doing their day-to-day work. You're also going to put more strain on your job server, as it has to dish out lots and lots of small pieces of work on a regular basis. Lastly, if your job server goes down you're going to create a huge backlog of uncompleted work, whereas bigger jobs could have continued processing, blissfully unaware that the job server was experiencing difficulties.
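The factor of 15 in both scenarios comes from the figures quoted above: a worker with 10% of the main server's power that is only available 2/3rds of the time delivers 1/10 × 2/3 = 1/15 of the server's throughput. A small sketch of that arithmetic follows; the function names are made up for illustration, and the defaults are simply the article's figures.

```python
from fractions import Fraction


def worker_slowdown(relative_power, idle_fraction):
    """How many times longer a job takes on an idle-time worker than on the main server."""
    return 1 / (Fraction(relative_power) * Fraction(idle_fraction))


def wall_clock_time(server_time, relative_power=Fraction(1, 10), idle_fraction=Fraction(2, 3)):
    """Elapsed time on a worker for a job that takes `server_time` on the main server."""
    return server_time * worker_slowdown(relative_power, idle_fraction)
```

So a job that takes 1 day on the main server needs `wall_clock_time(1)` = 15 days on a worker, and a 1-minute job needs 15 minutes.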
In reality there will be no one ideal configuration for your grid setup; much depends on the available resources, types of job, job turnaround time requirements, network capability, and so on. However, some guidelines would be:
- Size jobs so that each worker can get through at least 3-4 jobs in a period of 15 hours (the longest likely idle time period)
- Play with the job size so that setup time becomes fairly insignificant compared to the processing time (bearing in mind the above point).
- If a job doesn't complete in double the amount of time (maybe less) that you expect it to take, assume that it's gone AWOL and start processing it with another worker. This means you may have to wait up to three times the normal length of a job for it to complete (possibly longer if the second attempt fails too). You may want to reduce this time, but be careful not to reduce it too much, as you may start duplicating processing tasks on a regular basis.
- Jobs should be independent of outside requirements as much as possible. The job server, for example, should only be contacted at the start and end of every job.
- Don't saturate your network. This will have two negative effects: your daytime staff will find using the network frustrating, and you may experience problems with connections timing out, a problem that will only get worse as you scale your grid.
- Ensure jobs can run on your workers. If jobs become too memory-intensive or disk-space-intensive they will start aborting, and the only thing you'll notice is a drop in the number of jobs processed, with no obvious reason why.
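The sizing and timeout guidelines above reduce to a little arithmetic, sketched here. The function names are hypothetical; the defaults (a 15-hour idle window, 3 jobs per window, a timeout of twice the expected run time) are taken from the guidelines, and all times are worker hours.

```python
def max_job_hours(idle_hours=15, jobs_per_window=3):
    """Longest a job may run on a worker if several must fit in one idle window."""
    return idle_hours / jobs_per_window


def is_awol(elapsed_hours, expected_hours, factor=2.0):
    """True once a job has run longer than `factor` times its expected duration."""
    return elapsed_hours > factor * expected_hours


def worst_case_hours(expected_hours, factor=2.0):
    """Wait for the timeout, then one full rerun on another worker."""
    return factor * expected_hours + expected_hours
```

For example, fitting 3 to 4 jobs into 15 hours means each job should take between 3.75 and 5 hours on a worker, and a 5-hour job that goes AWOL could take up to 15 hours to come back if the rerun succeeds first time.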
Submitting Results of a Job
When submitting the results of a job it is important to check that results have not been submitted by another worker, especially if the current worker has been dormant for some time.
When results are submitted ensure that the number of results matches the number of records within the job.
As stated previously, and it cannot be overemphasised: build fault tolerance into job retrieval and results submission. The workers can (and most likely will) go into suspend mode at the most inconvenient of times, and this needs to be catered for. Once again, abstracting away your results submission will make future changes to your job control system much easier to deal with.
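The checks described in this section could be sketched as below. As with the other examples this is an assumption-laden illustration rather than the article's code: it uses Python's `sqlite3` in place of MySQL, and `submit_results`, `SCHEMA`, and the column types are invented for the example. All writes happen in one transaction, so a failure partway through is rolled back rather than leaving half the results saved.

```python
import sqlite3

# Assumed minimal schema (SQLite types); only the columns the check needs.
SCHEMA = """
CREATE TABLE jobs (id INTEGER PRIMARY KEY, name TEXT, status INTEGER NOT NULL DEFAULT 0);
CREATE TABLE job_records (id INTEGER PRIMARY KEY, name TEXT, jobs_id INTEGER NOT NULL);
CREATE TABLE job_results (job_record_id INTEGER NOT NULL, result TEXT);
"""


def submit_results(conn, job_id, results):
    """Store `results`, a list of (job_record_id, result), and mark the job complete.

    Returns False without writing anything if the job is already completed,
    for example by another worker while this one was suspended.
    """
    with conn:  # commits on success, rolls back if an exception is raised
        row = conn.execute("SELECT status FROM jobs WHERE id = ?", (job_id,)).fetchone()
        if row is None:
            raise ValueError(f"unknown job {job_id}")
        if row[0] == 2:
            return False
        expected = conn.execute(
            "SELECT COUNT(*) FROM job_records WHERE jobs_id = ?", (job_id,)
        ).fetchone()[0]
        if len(results) != expected:
            raise ValueError(f"expected {expected} results, got {len(results)}")
        conn.executemany(
            "INSERT INTO job_results (job_record_id, result) VALUES (?, ?)", results
        )
        conn.execute("UPDATE jobs SET status = 2 WHERE id = ?", (job_id,))
    return True
```

A worker that gets `False` back can simply discard its results and move on to the next job.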
Summary
In this section we have looked at what a job control server needs to do and how to get a very basic system set up. We discussed how to retrieve a job from the control system and how best to configure jobs to get the most out of your office grid system. To finish, a paragraph or two on submitting results back to the job control server was presented.
- A job control server manages jobs and ensures that all work units are completed
- By abstracting your job selection and results submission you can change the technology of the control server without many problems
- Configure your jobs to ensure that they are run quickly and efficiently without putting too much pressure on your network infrastructure, and without duplicating processing tasks on a regular basis.
- Ensure that you build fault tolerance and error checking into your routines; workers can suspend and resume at the most inconvenient of times. Remember to check if results have already been submitted by another worker.
Next Time
In part 3 we'll create our virtual processing machine and set up our Windows machines to become idle-time workers.