Teebone Ding Technical Blog

Django, Python, Javascript, Pig, and Hadoop.

A Survey of Job Completion Time Prediction Methodologies

Introduction

Job completion time prediction is not a new problem. How to schedule a job so that it finishes before its deadline has long been a hot topic in systems research. To allocate sufficient resources to meet a job's service-level objectives (SLOs) on latency, we need to predict its completion time accurately given the resource allocation.

I have surveyed several related works published in the last 5 years that focus on scheduling data parallel jobs under SLOs. They take different approaches to constructing performance models and predicting job execution time. After explaining these methods, I would like to map the solutions onto my own research problem. The last 2 sections describe open questions for discussion and lessons learned from this survey of job completion time prediction works.

Learning from history

Simulation (Jockey)

Jockey [2] is a scheduler that provides latency SLOs for data parallel jobs written in SCOPE [5]. Jockey is implemented on Microsoft's Cosmos clusters, where jobs run in production mode, which means they are recurring jobs. Jockey can therefore predict the completion time of a newly arriving recurring production job from history logs. The architecture is shown in Fig. 2 of [2].

The offline phase includes a job profiler and a simulator. What information can we learn from history? Jobs running on Cosmos have Dryad-like structures [4]: several processing stages with dependent and parallel parts (see Fig. 3 in [2]). One can visualize a job's structure (execution plan) as a partial-order diagram produced by the SCOPE compiler and optimizer. From history logs we can obtain per-stage runtime information, job initialization latencies, task failure probabilities, the stage and task dependency list for each job, job completion times, etc.

The next step is to run simulations over the historical performance data under different resource allocations to compute the job completion time distribution. Taking the extracted performance statistics as input, the simulator replays the events in a job's execution: 1) allocating tasks to machines, 2) restarting failed tasks, and 3) scheduling tasks. It does not consider effects such as input size variance or duplicate job scheduling.
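
To make the simulation idea concrete, here is a minimal sketch (not Jockey's actual simulator, which also models machine allocation and failures) of a Monte Carlo estimate of the completion time distribution from per-stage runtime samples and a dependency list. The stage DAG, task counts, and runtime samples below are made up for illustration.

    import random

    # Hypothetical per-stage runtime samples (seconds) extracted from history
    # logs, plus a dependency list: a stage starts after all parents finish.
    stage_runtimes = {"extract": [30, 35, 41], "partition": [60, 72, 55], "aggregate": [20, 25, 22]}
    stage_parents = {"extract": [], "partition": ["extract"], "aggregate": ["partition"]}
    stage_tasks = {"extract": 100, "partition": 200, "aggregate": 50}

    def simulate_once(tokens):
        """One simulated run: stages finish in dependency order; a stage's
        duration is (waves of tasks) x (a sampled per-task runtime)."""
        finish = {}
        for stage in ("extract", "partition", "aggregate"):  # topological order
            start = max((finish[p] for p in stage_parents[stage]), default=0.0)
            waves = -(-stage_tasks[stage] // tokens)  # ceil: `tokens` tasks run at a time
            finish[stage] = start + waves * random.choice(stage_runtimes[stage])
        return finish["aggregate"]

    def completion_distribution(tokens, trials=10_000):
        return sorted(simulate_once(tokens) for _ in range(trials))

    for a in (10, 20, 40):
        dist = completion_distribution(a)
        print(f"tokens={a}: median={dist[len(dist)//2]:.0f}s  p95={dist[int(0.95*len(dist))]:.0f}s")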

For each SLO job, the authors estimate C(p, a), a random variable denoting the remaining completion time given the current job progress p and resource allocation a. The simulator simulates a job's remaining completion time under different resource allocations, i.e., varying the token count a in C(p, a) to compute the remaining time at the current progress. Jockey measures resources in a single unit (a token representation). After this pre-computation we can estimate job completion time from the random variable. A remaining issue is how to indicate the real progress of a running job; [2] implements several progress indicators to solve this problem, and I refer interested readers to the paper for the details.
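
Here is a sketch of how a precomputed C(p, a) table might be consulted at runtime to pick the smallest allocation that still meets the deadline. The table values and the tokens_for_deadline helper are hypothetical, not Jockey's API.

    # Hypothetical precomputed table: C[(progress, tokens)] -> estimated
    # remaining seconds (e.g. a high percentile of the simulated distribution).
    C = {
        (0.25, 10): 900, (0.25, 20): 480, (0.25, 40): 260,
        (0.50, 10): 600, (0.50, 20): 320, (0.50, 40): 175,
    }

    def tokens_for_deadline(progress, seconds_to_deadline):
        """Smallest allocation whose predicted remaining time meets the deadline."""
        feasible = [a for (p, a), remaining in C.items()
                    if p == progress and remaining <= seconds_to_deadline]
        return min(feasible) if feasible else None

    print(tokens_for_deadline(0.50, 400))  # -> 20: the 20-token estimate (320s) fits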

SVM regression model (AROMA)

AROMA [1] provides a machine-learned performance model to predict job completion time. The challenge addressed in this paper is allocating sufficient resources while minimizing cost in a heterogeneous cloud Hadoop environment. For offline model training, the authors apply support vector machine (SVM) regression to estimate MapReduce job completion time. Table 2 in [1] describes the feature selection for the SORT job; the t-statistics and p-values are determined by a stepwise regression technique. The model features include Hadoop configuration parameters, input data size, VM type, and the number of VMs (which represents the amount of resources). The output of this machine learning model is the estimated job completion time.
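
A minimal scikit-learn sketch of this kind of SVR model follows; the feature columns and training rows are illustrative placeholders, not AROMA's actual data or feature set.

    import numpy as np
    from sklearn.svm import SVR
    from sklearn.preprocessing import StandardScaler
    from sklearn.pipeline import make_pipeline

    # Illustrative training rows: [input size (GB), number of VMs, VM type code,
    # a Hadoop config knob such as io.sort.mb] -> observed completion time (s).
    X = np.array([[10, 4, 0, 100], [20, 4, 0, 100], [20, 8, 0, 100],
                  [40, 8, 1, 200], [40, 16, 1, 200], [80, 16, 1, 200]], dtype=float)
    y = np.array([310, 590, 330, 420, 250, 470], dtype=float)

    model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=100.0, epsilon=5.0))
    model.fit(X, y)

    # Predict completion time for a new resource configuration.
    print(model.predict([[40, 8, 0, 100]]))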

Before applying SVM regression, jobs are grouped into clusters by the k-medoids algorithm, using CPU, network, and disk usage as clustering features, and a separate SVM regression model is trained for each cluster. For a newly arriving job, AROMA stages the job on a small VM with a fraction of its input data to capture its CPU, network, and disk usage, then finds the matching SVM regression model to estimate the job completion time. Under the assumption that jobs with similar bottlenecks exhibit similar performance behavior, AROMA can predict ad-hoc job completion times by matching the online staging signature to an SVM model.
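
Here is a sketch of that online matching step as I read it: compare the staged run's resource signature against each cluster's medoid and use the nearest cluster's model. The medoid values and cluster names are invented for illustration.

    import numpy as np

    # Hypothetical cluster medoids: (CPU %, network MB/s, disk MB/s) signatures.
    # In a full system each name would map to the SVR model trained for that cluster.
    medoids = {"cpu_bound": np.array([90.0, 5.0, 10.0]),
               "io_bound":  np.array([30.0, 40.0, 120.0])}

    def match_cluster(signature):
        """Pick the cluster whose medoid is nearest to the staged run's signature."""
        return min(medoids, key=lambda name: np.linalg.norm(signature - medoids[name]))

    # Signature captured by staging the new job on one small VM with sampled input.
    print(match_cluster(np.array([85.0, 8.0, 15.0])))  # -> "cpu_bound"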

Analytical functions (ARIA and Jockey's modified Amdahl's Law)

ARIA [3] and Jockey [2] both offer analytical ways to predict job completion time. ARIA [3] solves the same problem as AROMA [1]: predicting MapReduce job completion time and allocating sufficient resources to meet latency SLOs. The difference is that ARIA focuses on production (recurring) jobs running on self-hosted Hadoop clusters. The authors of [3] derive a function of the assigned resources to predict job completion time. The architecture of ARIA is described in Fig. 6 of [3]; it is implemented as a Hadoop scheduler.

ARIA learns from a job profile that extracts information from the job master logs, including map/reduce task durations, average input size per task, and the ratio of map/reduce task output to input size. Based on the job profile and the performance bounds on the completion time of the different job phases, the authors design a MapReduce performance model. The mathematical derivation is quite lengthy, so I omit it here to save space. The completion time estimation equation looks like this:

T = A * N(J,M) / S(J,M) + B * N(J,R) / S(J,R) + C

A, B, and C are constants obtained from a job's profile. N(J,M) and N(J,R) are the total numbers of map and reduce tasks for job J, which are also constants known before execution. T is the execution time we want to estimate. S(J,M) and S(J,R) are the two variables of this function: the map and reduce slots (resources) the system can provide for job J. One can then simplify equation (10) into equation (11) in [3] and solve it with Lagrange multipliers to obtain the smallest numbers of map/reduce slots that let the job finish before its deadline.
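
A sketch of how the model could be used: given hypothetical profile constants, predict T and find the smallest slot pair that meets a deadline. I brute-force the search here for clarity; the paper instead solves the constrained minimization in closed form via Lagrange multipliers.

    # Hypothetical profile constants for one recurring job, in the model's form.
    A, B, C = 2.0, 3.0, 40.0       # seconds, fitted from the job profile
    N_M, N_R = 400, 100            # total map and reduce tasks
    deadline = 300.0               # SLO deadline in seconds

    def predicted_time(s_m, s_r):
        return A * N_M / s_m + B * N_R / s_r + C

    # Smallest total slot count (s_m + s_r) whose prediction meets the deadline.
    best = min(((s_m, s_r) for s_m in range(1, 200) for s_r in range(1, 200)
                if predicted_time(s_m, s_r) <= deadline),
               key=lambda pair: pair[0] + pair[1], default=None)
    print(best, predicted_time(*best) if best else None)  # -> (5, 3) 300.0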

Jockey [2] also has an alternative way to predict Cosmos production job completion time. The idea is borrowed from Amdahl's Law: T = S + P/N.
T: completion time
S: serial part of a program (critical path)
P: parallel part of a program
N: number of processors

We can substitute the number of resource tokens a for N to predict job completion time. The accuracy results in [2] show that the simulation method outperforms the modified Amdahl's Law because it accounts for outliers and failures in the history logs, which yields a more conservative initial resource allocation.
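
Fitting the modified Amdahl's Law from history is straightforward: since T = S + P/a is linear in 1/a, ordinary least squares over past (tokens, completion time) pairs recovers S and P. The history below is made up for illustration.

    import numpy as np

    # Hypothetical history: (tokens a, observed completion time in seconds)
    # for past runs of the same recurring job.
    runs = [(5, 215.0), (10, 118.0), (20, 66.0), (40, 44.0)]

    a = np.array([r[0] for r in runs], dtype=float)
    t = np.array([r[1] for r in runs], dtype=float)

    # Regress t on [1, 1/a]; the coefficients are the serial part S and
    # parallel part P of the modified Amdahl's Law.
    X = np.column_stack([np.ones_like(a), 1.0 / a])
    (S, P), *_ = np.linalg.lstsq(X, t, rcond=None)

    print(f"S={S:.1f}s  P={P:.1f}s")
    print("predicted T at a=30:", S + P / 30)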

Facing my problems

In my college department (CSIE at NTU), we have around 20 Linux workstations provided for students and faculty. We found the performance and load per machine quite imbalanced. We would like to build a dispatcher/scheduler that decides whether a newly arriving job should be executed locally, moved to another machine, or offloaded to the cloud, in order to get better performance and load balance. The history log we have is dumped from the command:

ps aux

run on each machine to capture system status every 5 minutes. For a newly arriving job, we can obtain its full command line arguments and the present system load from commands like dstat or top. The question is: how do we predict a newly arriving job's behavior and completion time from its command line arguments and the current system load? According to [1, 2, 3], which solution can be mapped onto our scenario?
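
As a starting point, here is a sketch of turning one ps aux snapshot into per-process features. The field layout follows the standard ps aux header (USER PID %CPU %MEM ... COMMAND); which features matter for prediction is still an open assumption on our side.

    import subprocess

    def snapshot():
        """Parse one `ps aux` snapshot into per-process feature dicts."""
        out = subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout
        rows = []
        for line in out.splitlines()[1:]:        # skip the header line
            parts = line.split(None, 10)         # COMMAND may contain spaces
            rows.append({"user": parts[0], "pid": int(parts[1]),
                         "cpu": float(parts[2]), "mem": float(parts[3]),
                         "command": parts[10]})
        return rows

    # A crude machine-load feature: total CPU% across processes in one snapshot.
    procs = snapshot()
    print(sum(p["cpu"] for p in procs), len(procs))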

Discussions

[1, 2, 3] focus on data parallel jobs/frameworks; these are batch jobs running on production machines or cloud environments. Our workstations also execute user-interactive jobs like editors or scripting language prompts. The ultimate goal of our project is to offload jobs to the cloud when the cluster cannot execute programs in time, so user-interactive jobs are not our focus. The first step before dispatching is therefore to classify whether a newly arriving job is a computation-intensive batch task or not. We are most interested in computation-intensive jobs, which are a small portion of the workload but can consume almost the whole capacity of a machine.
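
One possible (untested) heuristic for that first classification step: treat a process whose CPU usage stays high across consecutive 5-minute snapshots as a computation-intensive batch job, since editors and interactive prompts idle between keystrokes. The threshold below is an arbitrary guess.

    def looks_compute_intensive(cpu_samples, cpu_threshold=80.0):
        """Toy heuristic: CPU% stays above the threshold across at least two
        consecutive snapshots -> likely a computation-intensive batch job."""
        return len(cpu_samples) >= 2 and min(cpu_samples) >= cpu_threshold

    print(looks_compute_intensive([95.0, 99.0, 97.0]))  # long-running solver -> True
    print(looks_compute_intensive([3.0, 0.5, 1.2]))     # text editor -> False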

Conclusions

Job completion time prediction remains a hot topic in systems research. This survey of AROMA, ARIA, and Jockey shows different methodologies for constructing prediction models and estimating job completion time across frameworks and environments: simulation, machine learning models, and analytical functions, each suited to a different scenario. The next step is to derive a solution for our project: predicting job behavior and completion time on Linux workstations. Cloud computing offers another avenue for load balancing: how do we offload jobs and use cloud computation resources to meet SLOs? To offload or not to offload? That is the question.

References

[1] AROMA: Automated Resource Allocation and Configuration of MapReduce Environment in the Cloud. ICAC 2012.
[2] Jockey: Guaranteed Job Latency in Data Parallel Clusters. EuroSys 2012.
[3] ARIA: Automatic Resource Inference and Allocation for MapReduce. ICAC 2011.
[4] Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. EuroSys 2007.
[5] SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. Proceedings of the VLDB Endowment, 2008.