Life Hacks


      

ILP PROCESSORS



INTRODUCTION: Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following example:

1. e=a+b

2. f=c+d

3. g=e*f

Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.

Micro-architectural techniques that are used to exploit ILP include:

§ Instruction pipelining where the execution of multiple instructions can be partially overlapped.

§ Superscalar execution in which multiple execution units are used to execute multiple instructions in parallel. In typical superscalar processors, the instructions executing simultaneously are adjacent in the original program order.

§ Out-of-order execution where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar.

§ Register renaming which refers to a technique used to avoid unnecessary serialization of program operations imposed by the reuse of registers by those operations, used to enable out-of-order execution.

§ Speculative execution which allow the execution of complete instructions or parts of instructions before being certain whether this execution should take place. A commonly used form of speculative execution is control flow speculation where instructions past a control flow instruction (e.g., a branch) are executed before the target of the control flow instruction is determined. Several other forms of speculative execution have been proposed and are in use including speculative execution driven by value prediction, memory dependence prediction and cache latency prediction.

§ Branch prediction which is used to avoid stalling for control dependencies to be resolved. Branch prediction is used with speculative execution.

Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs. An alternative is to extract this parallelism at compile time and somehow convey this information to the hardware. Due to the complexity of scaling the out-of-order execution technique, the industry has re-examined sets which explicitly encode multiple independent operations per instruction. These instruction set types include:

§ VLIW and the closely related Explicitly Parallel Instruction Computing concepts

Dataflow architectures are another class of architectures where ILP is explicitly specified, but they have not been actively researched since the 1980s.

EVOLUTION OF ILP: In its brief lifetime of just about 30 years, the microprocessor has achieved a total

Performance growth of about 9,000X due to both technology improvements and

Micro architecture innovations. Both transistor count and clock frequency have increased by

an order of magnitude in each of the last three decades. During the 1980’s the average number

of Instructions per Cycle (IPC) also increased by almost an order of magnitude, which is not

the case in the last decade. While the contribution to performance by

technology improvements seems to be accelerating, the contribution from micro architecture

Innovations seems to be diminishing. It is clear that the anticipated technology can support the

implementation of much wider and deeper machines. The key question is how to achieve

additional IPC proportional to the increase of machine resources.

Evolution of Microprocessors:

Current microprocessor architectures assume sequential programs as an input and a

parallel execution model. Their efficiency is highly dependent on the Instruction-Level

Parallelism (ILP) the programs exhibit, as a measure of the potential number of instructions

that can be executed simultaneously. Basically, there are two fundamental approaches to

executing multiple instructions in parallel: the superscalar processor and the VLIW (Very

Long Instruction Word) processor. The superscalar approach extracts the parallelism from a

sequential program at run time (dynamically), by employing sophisticated hardware

mechanisms. On the other hand, the performance of VLIW architectures is dependent on the

capability of the compiler to detect and exploit instruction-level parallelism during instruction

scheduling using advanced (static) techniques.

Regardless of the approach used, instructions cannot always be eligible for parallel

execution due to three classes of constraints: control dependences, true data dependences and

name (false) dependences . Control dependences appear when the execution of

an instruction depends on the outcome of a branch (either conditional or unconditional). The

resolution of the outcome involves deciding whether the branch is taken or not (in the case of

conditional branch) and computing its target address. As long as control dependences are not

manipulated, control dependent instructions cannot even begin their execution because the

branch outcome should determine the next sequence of instructions to be executed. True data

dependences occur when an instruction consumes a value that is generated by a preceding

instruction, and therefore these instructions cannot be executed in parallel. Name dependences

occur when instructions use the same register or memory location (name), but there is no flow

of data between them as in true data dependences. One can distinguish between two kinds of

name dependences: output dependences – when instructions attempt to write simultaneously

to the same name, and anti-dependences – when a later instruction attempts to write to a name

before it is read by an earlier instruction. As long as name dependences are not handled, they

can avoid instructions from being executed in parallel since the program correctness would be

violated.

Several techniques eliminate the effect of dependences or reduce their consequences.

For example, data forwarding and implementation of shelving reduce the effect of true data

dependences, while renaming eliminates the name dependences. However, the potential of

upcoming processors with hundreds of processing units requires more parallelism than the

one obtained by these techniques. One alternative is the introduction of advanced elimination

techniques, and the other is the introduction of various speculation techniques based

on prediction.

Many ILP processors speculatively execute control dependent instructions before

resolving the branch outcome. They rely upon branch prediction in order to tolerate the effect

of control dependences. A branch predictor uses the current fetch address to predict whether a

branch will be fetched in the current cycle, whether that branch will be taken or not, and what

the target address of the branch is. The predictor uses this information to decide where to

fetch from in the next cycle. Since the branch execution penalty is only seen if the branch was

mispredicted, a highly accurate branch predictor is a very important mechanism for reducing

the branch penalty in a high performance ILP processor.

APPLICATIONS OF ILP:

In recent years, ILP techniques have been used to provide performance improvements in spite of the growing disparity between processor operating frequencies and memory access times (early ILP designs such as the IBM 360 used ILP techniques to overcome the limitations imposed by a relatively small register file). Presently, a cache miss penalty to main memory costs several hundreds of CPU cycles. While in principle it is possible to use ILP to tolerate even such memory latencies the associated resource and power dissipation costs are disproportionate. Moreover, the complexity and often the latency of the underlying hardware structures results in reduced operating frequency further reduce any benefits. Hence, the aforementioned techniques prove inadequate to keep the CPU from stalling for the off-chip data. Instead, the industry is heading towards exploiting higher levels of parallelism that can be exploited through techniques such as multiprocessing and multithreading

IMPROVING PERFORMANCE OF ILP:

A performance measurement and evaluation system for ILP (Instruction Level Parallelism) processors which issue multiple instructions and execute them in parallel is developed. The system consists of a C compiler and a simulator. The compiler takes C source programs as an input and generates 3-address style intermediate code. Then the simulator accepts the intermediate code and simulates it. The results of simulation are the contents of memory before and after simulation, the number of executed clocks, the trace and the dynamic count of executed instructions, the prediction hit ratio and profiling information for each branch instruction. To verify and understand the behavior of the system, the performance of predicated execution and one of branch schemes is measured and its results are analyzed.

As ILP processors have been developed, the importance for optimizing compiler is increased. Among various optimization methods, register allocation and code scheduling are essential to improve the performance of ILP processors. But it is required to consider carefully to apply these schemes to them because the results of one conflict with another. In this paper, we presented an extended register allocation algorithm which allocates registers to improve the effects of code scheduling. The conventional register allocation algorithm has possible to leave redundant registers after register allocation. The proposed register allocation algorithm allocates registers without redundant registers and decreases the data dependence relation to achieve a high code scheduling opportunity. Experimental results shown that the proposed register allocation algorithm was reduced the 7% execution clock cycles and the 73% complexity than the conventional scheme.

Recently, many parallel processing technologies were developed, ILP (Instruction Level Parallelism) processor’s performance have been increased very rapidly. Especially, EPIC (Explicitly Parallel Instruction Computing) architectures attempt to enhance the performance in the predicated execution and speculative execution with the hardware. In this paper, to improve the code scheduling possibility by applying to the characteristics of EPIC architectures, a new register allocation algorithm is proposed. And we prove that proposed register allocation algorithm is more efficient scheme than the conventional scheme when predicated execution is applied to our scheme by experiments. In experimental results, it shows much more performance enhancement, about 19% in proposed scheme than the conventional scheme. So, our scheme is verified that it is an effective register allocation method.

PREDICTION AND SPECULATION IN ILP PROCESSORS:

Modeling based analysis is applied to assess the efficiency of predictions and speculative

executions in boosting the performance of ILP processors that fetch, issue, execute and commit a large

number of instructions per cycle. The primary objective of the analysis is to better understand branch and value prediction techniques and their performance potential under different scenarios, with varying parameters of both the micro architecture, as well as the operational environment. So far, the estimations of the influence that these techniques have on ILP processor performance, more or less, rely upon the use of micro architectural simulators with trace-driven or execution-driven simulation – only a few simple deterministic models are known. As opposed to this common trait, a stochastic model of the dynamic behavior of an ILP processor that aims to minimize the impact of control and true data dependences and employs speculative execution based on branch and value prediction, is introduced for the first time in this dissertation. At the same time, the characteristics of the operational environment are formalized and their influence on ILP processor performance is considered, as well.

The proposed dynamic behavior model is built using Fluid Stochastic Petri Nets (FSPN) and the

stochastic process underlying the Petri Net is described by a system of first-order hyperbolic partial differential equations with appropriately chosen initial and boundary conditions. The attempt to capture the dynamic behavior of an ILP processor is a rare example that demonstrates the usage of this recently introduced formalism in modeling actual systems. Moreover, it is the first example to take into consideration numerical transient analysis and present numerical solution of a Fluid Stochastic Petri Net with more than three fluid places. Both the application of finite-difference approximations for the partial derivatives, as well as the discrete-event simulation of the proposed FSPN model, allow for the evaluation of a number of performance measures and lead to numerous conclusions regarding the performance impact of predictions and speculative execution with varying parameters of both the micro architecture and the operational environment. The numerical solution makes possible the

probabilistic analysis of the dynamic behavior, whereas the advantage of the discrete-event simulation is the much faster generation of performance evaluation results, even when compared to the broadly utilized micro architectural simulators. The proposed novel approach is essentially implementation independent and reveals considerable potential for further research in this area using numerical transient analysis and/or discrete-event simulation of FSPN models.

In summary:

The benefits of branch and value prediction are higher when control and true data

dependences have a much higher influence on the total execution time of a program

14

Value prediction is an effective approach that might enable higher levels of

parallelism without the need to increase machine width and/or instruction window

size;

There is a correlation between the value prediction efficiency and the branch

prediction efficiency;

The wider the machine, the more significant performance limitation the branch

mispredictions become.

ILP processor’s performance is strongly influenced by the characteristics of the operational environment.

Leave a Reply