ILP PROCESSORS
INTRODUCTION: Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following example:
1. e=a+b
2. f=c+d
3. g=e*f
Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
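The schedule length can be computed mechanically as the depth of the dependence graph. The following Python sketch (the operation list and helper names are invented purely for this illustration) reproduces the two-cycle schedule and the resulting ILP of 3/2:

from functools import lru_cache

# Sketch: compute the minimum schedule length (critical-path depth) of a
# dependence graph, assuming every operation takes one unit of time.
ops = {
    "e = a + b": (),
    "f = c + d": (),
    "g = e * f": ("e = a + b", "f = c + d"),
}

@lru_cache(maxsize=None)
def earliest_cycle(op):
    """Earliest cycle (1-based) in which op can execute, given its dependences."""
    return 1 + max((earliest_cycle(d) for d in ops[op]), default=0)

schedule_length = max(earliest_cycle(op) for op in ops)  # 2 cycles
ilp = len(ops) / schedule_length                         # 3 ops / 2 cycles = 1.5
print(f"{len(ops)} operations in {schedule_length} cycles -> ILP = {ilp}")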
Micro-architectural techniques that are used to exploit ILP include:
§ Instruction pipelining where the execution of multiple instructions can be partially overlapped.
§ Superscalar execution in which multiple execution units are used to execute multiple instructions in parallel. In typical superscalar processors, the instructions executing simultaneously are adjacent in the original program order.
§ Out-of-order execution, where instructions execute in any order that does not violate data dependencies. Note that this technique is independent of both pipelining and superscalar execution.
§ Register renaming, a technique that avoids the unnecessary serialization of program operations caused by the reuse of registers, thereby enabling out-of-order execution (a brief renaming sketch follows this list).
§ Speculative execution, which allows complete instructions or parts of instructions to be executed before it is certain whether this execution should take place. A commonly used form of speculative execution is control flow speculation, where instructions past a control flow instruction (e.g., a branch) are executed before the target of the control flow instruction is determined. Several other forms of speculative execution have been proposed and are in use, including speculative execution driven by value prediction, memory dependence prediction and cache latency prediction.
§ Branch prediction, which is used to avoid stalling while control dependences are resolved. Branch prediction is used together with speculative execution.
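To make register renaming from the list above concrete, the following Python sketch maps each architectural destination register onto a fresh physical register drawn from a larger pool; the three-operand instruction format and the register names are invented for the illustration. After renaming, the output (write-after-write) and anti (write-after-read) dependences on r1 disappear, leaving only the true data dependences.

# Sketch: classic register renaming.  Each architectural destination gets a
# fresh physical register, so only true (read-after-write) dependences remain.
instructions = [                 # (dest, src1, src2) -- toy three-register format
    ("r1", "r2", "r3"),
    ("r4", "r1", "r5"),
    ("r1", "r6", "r7"),          # reuses r1: WAW with insn 0, WAR with insn 1
    ("r8", "r1", "r9"),
]

rename_map = {}                  # architectural register -> current physical register
next_phys = 0

def phys(arch):
    """Physical register currently holding the value of arch (allocate if new)."""
    global next_phys
    if arch not in rename_map:
        rename_map[arch] = f"p{next_phys}"
        next_phys += 1
    return rename_map[arch]

renamed = []
for dest, src1, src2 in instructions:
    s1, s2 = phys(src1), phys(src2)          # read sources under the current mapping
    rename_map[dest] = f"p{next_phys}"       # fresh physical register for the dest
    next_phys += 1
    renamed.append((rename_map[dest], s1, s2))

for (d, s1, s2), (ad, as1, as2) in zip(renamed, instructions):
    print(f"{ad} = {as1} + {as2}   ->   {d} = {s1} + {s2}")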
Current implementations of out-of-order execution dynamically (i.e., while the program is executing and without any help from the compiler) extract ILP from ordinary programs. An alternative is to extract this parallelism at compile time and somehow convey this information to the hardware. Due to the complexity of scaling the out-of-order execution technique, the industry has re-examined instruction sets which explicitly encode multiple independent operations per instruction. These instruction set types include:
§ VLIW and the closely related Explicitly Parallel Instruction Computing (EPIC) concepts
Dataflow architectures are another class of architectures where ILP is explicitly specified, but they have not been actively researched since the 1980s.
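As a rough illustration of how VLIW/EPIC-style instruction sets expose parallelism explicitly (the bundle width and the greedy packing heuristic below are arbitrary choices for the example, not a description of any particular VLIW or EPIC encoding), the following Python sketch packs the dependence-free operations of the earlier example into fixed-width issue bundles:

# Sketch: greedy packing of independent operations into fixed-width bundles,
# the way a (highly simplified) VLIW/EPIC compiler might group them.
WIDTH = 2                                    # arbitrary bundle width for the example

ops = {
    "e = a + b": set(),
    "f = c + d": set(),
    "g = e * f": {"e = a + b", "f = c + d"},
}

bundles, done = [], set()
while len(done) < len(ops):
    ready = [op for op, deps in ops.items() if op not in done and deps <= done]
    bundle = ready[:WIDTH]                   # issue at most WIDTH ready ops together
    bundles.append(bundle)
    done.update(bundle)

for i, bundle in enumerate(bundles, 1):
    print(f"bundle {i}: {bundle}")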
EVOLUTION OF ILP: In its brief lifetime of just about 30 years, the microprocessor has achieved a total performance growth of about 9,000X, due to both technology improvements and microarchitecture innovations. Both transistor count and clock frequency have increased by an order of magnitude in each of the last three decades. During the 1980s the average number of Instructions Per Cycle (IPC) also increased by almost an order of magnitude, which has not been the case in the last decade. While the contribution to performance from technology improvements seems to be accelerating, the contribution from microarchitecture innovations seems to be diminishing. It is clear that the anticipated technology can support the implementation of much wider and deeper machines. The key question is how to achieve additional IPC proportional to the increase in machine resources.
Evolution of Microprocessors:
Current microprocessor architectures accept sequential programs as input and execute them under a parallel execution model. Their efficiency is highly dependent on the Instruction-Level Parallelism (ILP) that the programs exhibit, i.e., the potential number of instructions that can be executed simultaneously. Basically, there are two fundamental approaches to
executing multiple instructions in parallel: the superscalar processor and the VLIW (Very
Long Instruction Word) processor. The superscalar approach extracts the parallelism from a
sequential program at run time (dynamically), by employing sophisticated hardware
mechanisms. On the other hand, the performance of VLIW architectures is dependent on the
capability of the compiler to detect and exploit instruction-level parallelism during instruction
scheduling using advanced (static) techniques.
Regardless of the approach used, instructions are not always eligible for parallel execution, due to three classes of constraints: control dependences, true data dependences and name (false) dependences. Control dependences appear when the execution of an instruction depends on the outcome of a branch (either conditional or unconditional). The resolution of the outcome involves deciding whether the branch is taken or not (in the case of a conditional branch) and computing its target address. Until control dependences are resolved, control-dependent instructions cannot even begin their execution, because the branch outcome determines the next sequence of instructions to be executed. True data
dependences occur when an instruction consumes a value that is generated by a preceding
instruction, and therefore these instructions cannot be executed in parallel. Name dependences
occur when instructions use the same register or memory location (name), but there is no flow
of data between them as in true data dependences. One can distinguish between two kinds of
name dependences: output dependences, when two instructions write to the same name, and anti-dependences, when a later instruction attempts to write to a name before it is read by an earlier instruction. Unless name dependences are handled, they can prevent instructions from being executed in parallel, since reordering them would violate program correctness.
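The three kinds of dependences can be identified mechanically by comparing the registers each instruction reads and writes. The following Python sketch (with an invented instruction encoding) reports true (read-after-write), anti (write-after-read) and output (write-after-write) dependences between instruction pairs:

# Sketch: classify dependences between pairs of instructions by comparing
# the registers they read and write.
instructions = [                       # (writes, reads) per instruction
    ({"r1"}, {"r2", "r3"}),            # 0: r1 = r2 + r3
    ({"r4"}, {"r1", "r5"}),            # 1: r4 = r1 + r5   (true dep on 0 via r1)
    ({"r1"}, {"r6"}),                  # 2: r1 = r6        (output dep with 0, anti dep with 1)
]

def dependences(i, j):
    """Dependences that force instruction j (later) to follow instruction i (earlier)."""
    wi, ri = instructions[i]
    wj, rj = instructions[j]
    found = []
    if wi & rj: found.append(f"true (RAW) on {wi & rj}")
    if ri & wj: found.append(f"anti (WAR) on {ri & wj}")
    if wi & wj: found.append(f"output (WAW) on {wi & wj}")
    return found

for i in range(len(instructions)):
    for j in range(i + 1, len(instructions)):
        for dep in dependences(i, j):
            print(f"insn {i} -> insn {j}: {dep}")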
Several techniques eliminate the effect of dependences or reduce their consequences.
For example, data forwarding and shelving reduce the effect of true data dependences, while register renaming eliminates name dependences. However, the potential of upcoming processors with hundreds of processing units requires more parallelism than that obtained by these techniques. One alternative is the introduction of advanced elimination
techniques, and the other is the introduction of various speculation techniques based
on prediction.
Many ILP processors speculatively execute control dependent instructions before
resolving the branch outcome. They rely upon branch prediction in order to tolerate the effect
of control dependences. A branch predictor uses the current fetch address to predict whether a
branch will be fetched in the current cycle, whether that branch will be taken or not, and what
the target address of the branch is. The predictor uses this information to decide where to
fetch from in the next cycle. Since the branch execution penalty is only seen if the branch was
mispredicted, a highly accurate branch predictor is a very important mechanism for reducing
the branch penalty in a high performance ILP processor.
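A minimal sketch of one widely used direction predictor, a table of two-bit saturating counters indexed by the branch address, is shown below; the table size and the branch trace are arbitrary choices made only for the illustration.

# Sketch: branch direction prediction with a table of 2-bit saturating counters,
# indexed by (a few low-order bits of) the branch address.
TABLE_BITS = 4
counters = [1] * (1 << TABLE_BITS)        # 0,1 = predict not-taken; 2,3 = predict taken

def predict(pc):
    return counters[pc % len(counters)] >= 2

def update(pc, taken):
    i = pc % len(counters)
    if taken:
        counters[i] = min(counters[i] + 1, 3)
    else:
        counters[i] = max(counters[i] - 1, 0)

# A loop-closing branch at address 0x40 that is taken 9 times, then falls through.
trace = [(0x40, True)] * 9 + [(0x40, False)]
hits = 0
for pc, taken in trace:
    hits += predict(pc) == taken
    update(pc, taken)

print(f"prediction hit ratio: {hits}/{len(trace)}")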
APPLICATIONS OF ILP:
In recent years, ILP techniques have been used to provide performance improvements in spite of the growing disparity between processor operating frequencies and memory access times (early ILP designs such as the IBM System/360 Model 91 used ILP techniques to overcome the limitations imposed by a relatively small register file). Presently, a cache miss to main memory costs several hundred CPU cycles. While in principle it is possible to use ILP to tolerate even such memory latencies, the associated resource and power dissipation costs are disproportionate. Moreover, the complexity and often the latency of the underlying hardware structures result in a reduced operating frequency, which further erodes any benefits. Hence, the aforementioned techniques prove inadequate to keep the CPU from stalling for off-chip data. Instead, the industry is heading towards exploiting higher levels of parallelism through techniques such as multiprocessing and multithreading.
IMPROVING PERFORMANCE OF ILP:
A performance measurement and evaluation system for ILP (Instruction Level Parallelism) processors, which issue multiple instructions and execute them in parallel, has been developed. The system consists of a C compiler and a simulator. The compiler takes C source programs as input and generates 3-address-style intermediate code, which the simulator then accepts and executes. The results of simulation are the contents of memory before and after simulation, the number of executed clock cycles, the trace and the dynamic count of executed instructions, the branch prediction hit ratio, and profiling information for each branch instruction. To verify and understand the behavior of the system, the performance of predicated execution and of one of the branch schemes is measured and the results are analyzed.
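The following toy Python sketch suggests what the simulator side of such a system might look like; the three-address opcode set and the sample program are invented here and do not reproduce the actual tool described above. It interprets a small intermediate-code program and reports the dynamic instruction count and a per-branch profile.

from collections import Counter

# Sketch: a tiny interpreter for 3-address-style intermediate code that
# collects a dynamic instruction count and per-branch profiling information.
# Program: sum = 0; for i in 0..4: sum += i
program = [
    ("li",   "sum", 0),            # 0: sum = 0
    ("li",   "i",   0),            # 1: i = 0
    ("add",  "sum", "sum", "i"),   # 2: sum = sum + i
    ("addi", "i",   "i",   1),     # 3: i = i + 1
    ("blt",  "i",   5,     2),     # 4: if i < 5 goto 2
]

regs, pc, executed = {}, 0, 0
branch_profile = Counter()         # (branch pc, taken?) -> dynamic count

while pc < len(program):
    op, *args = program[pc]
    executed += 1
    if op == "li":
        regs[args[0]] = args[1]; pc += 1
    elif op == "add":
        regs[args[0]] = regs[args[1]] + regs[args[2]]; pc += 1
    elif op == "addi":
        regs[args[0]] = regs[args[1]] + args[2]; pc += 1
    elif op == "blt":
        taken = regs[args[0]] < args[1]
        branch_profile[(pc, taken)] += 1
        pc = args[2] if taken else pc + 1

print("sum =", regs["sum"], "| dynamic instructions =", executed)
print("branch profile:", dict(branch_profile))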
As ILP processors have developed, the importance of optimizing compilers has increased. Among the various optimization methods, register allocation and code scheduling are essential for improving the performance of ILP processors, but they must be applied carefully because the results of one can conflict with the other. In this paper, we present an extended register allocation algorithm which allocates registers so as to improve the effect of code scheduling. The conventional register allocation algorithm can leave redundant register reuse after allocation. The proposed algorithm allocates registers without this redundancy and decreases the data dependence relations, giving greater opportunity for code scheduling. Experimental results show that the proposed register allocation algorithm reduced execution clock cycles by 7% and complexity by 73% compared with the conventional scheme.
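A hedged sketch of the underlying trade-off (not the algorithm from the work cited above): reusing one architectural register for two independent values introduces name dependences that constrain scheduling, while allocating distinct registers leaves the computations independent.

# Sketch: the register-allocation / scheduling trade-off.  The same two
# independent computations are shown with an allocation that reuses r1 and
# with one that uses distinct registers; false dependences are then counted.
def false_deps(code):
    """Count anti (WAR) and output (WAW) dependences between instruction pairs."""
    count = 0
    for i, (wi, ri) in enumerate(code):
        for wj, rj in code[i + 1:]:
            count += len(ri & wj) + len(wi & wj)      # WAR + WAW
    return count

# x = a + b; store x; y = c + d; store y   (x and y are independent values)
reuse_r1 = [({"r1"}, {"a", "b"}), (set(), {"r1"}),
            ({"r1"}, {"c", "d"}), (set(), {"r1"})]
distinct = [({"r1"}, {"a", "b"}), (set(), {"r1"}),
            ({"r2"}, {"c", "d"}), (set(), {"r2"})]

print("false dependences when r1 is reused:   ", false_deps(reuse_r1))  # 2
print("false dependences with distinct regs:  ", false_deps(distinct))  # 0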
Recently, as many parallel processing technologies have been developed, the performance of ILP (Instruction Level Parallelism) processors has increased very rapidly. In particular, EPIC (Explicitly Parallel Instruction Computing) architectures attempt to enhance performance through hardware support for predicated execution and speculative execution. In this paper, a new register allocation algorithm is proposed that improves code scheduling opportunities by exploiting the characteristics of EPIC architectures. Experiments show that the proposed register allocation algorithm is more efficient than the conventional scheme when predicated execution is applied, achieving a performance improvement of about 19% over the conventional scheme. The proposed scheme is thus verified to be an effective register allocation method.
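For background, a brief illustration of predicated execution itself (not of the proposed allocation algorithm): if-conversion replaces a forward branch with operations guarded by a predicate, so both paths can be scheduled together without a control dependence. The Python sketch below only mimics the idea with guarded assignments.

# Sketch: if-conversion.  The branchy sequence
#     if a > b: m = a
#     else:     m = b
# becomes a compare-to-predicate followed by two guarded (predicated) operations
# that can both be scheduled without waiting for a branch to resolve.
def predicated_max(a, b):
    p = a > b            # cmp.gt  p = a, b        (set the predicate)
    t_true = a           # (p)     m = a           (both guarded ops are issued;
    t_false = b          # (!p)    m = b            only the true guard takes effect)
    return t_true if p else t_false

assert predicated_max(3, 7) == 7
assert predicated_max(9, 2) == 9
print(predicated_max(3, 7), predicated_max(9, 2))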
PREDICTION AND SPECULATION IN ILP PROCESSORS:
Modeling-based analysis is applied to assess the efficiency of prediction and speculative execution in boosting the performance of ILP processors that fetch, issue, execute and commit a large number of instructions per cycle. The primary objective of the analysis is to better understand branch and value prediction techniques and their performance potential under different scenarios, with varying parameters of both the microarchitecture and the operational environment. So far, estimates of the influence that these techniques have on ILP processor performance have relied, more or less, on microarchitectural simulators with trace-driven or execution-driven simulation; only a few simple deterministic models are known. In contrast to this common practice, a stochastic model of the dynamic behavior of an ILP processor that aims to minimize the impact of control and true data dependences and employs speculative execution based on branch and value prediction is introduced for the first time in this dissertation. At the same time, the characteristics of the operational environment are formalized and their influence on ILP processor performance is considered as well.
The proposed dynamic behavior model is built using Fluid Stochastic Petri Nets (FSPN) and the
stochastic process underlying the Petri net is described by a system of first-order hyperbolic partial differential equations with appropriately chosen initial and boundary conditions. The attempt to capture the dynamic behavior of an ILP processor is a rare example demonstrating the use of this recently introduced formalism in modeling actual systems. Moreover, it is the first example to carry out numerical transient analysis and present a numerical solution of a Fluid Stochastic Petri Net with more than three fluid places. Both the application of finite-difference approximations for the partial derivatives and the discrete-event simulation of the proposed FSPN model allow for the evaluation of a number of performance measures, and lead to numerous conclusions regarding the performance impact of prediction and speculative execution under varying parameters of both the microarchitecture and the operational environment. The numerical solution makes possible the probabilistic analysis of the dynamic behavior, whereas the advantage of the discrete-event simulation is the much faster generation of performance evaluation results, even when compared to the broadly utilized microarchitectural simulators. The proposed approach is essentially implementation independent and reveals considerable potential for further research in this area using numerical transient analysis and/or discrete-event simulation of FSPN models.
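For orientation only, the general form of such a system for a single fluid place can be sketched as follows; this is the textbook form for fluid stochastic Petri nets, written in terms of fluid distribution functions F_j(t,x) for each discrete marking j, fluid rates r_j, and the generator matrix Q = [q_ij] of the discrete part. The actual equations, initial and boundary conditions in the dissertation depend on the concrete model and generalize to several fluid places.

\frac{\partial F_j(t,x)}{\partial t} + r_j \, \frac{\partial F_j(t,x)}{\partial x} = \sum_{i} q_{ij} \, F_i(t,x), \qquad j = 1, \dots, n,

with F_j(0,x) given as the initial condition and boundary conditions imposed at x = 0 (empty fluid place) and at the fluid place capacity.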
In summary:
§ The benefits of branch and value prediction are higher when control and true data dependences have a much higher influence on the total execution time of a program;
§ Value prediction is an effective approach that might enable higher levels of parallelism without the need to increase the machine width and/or instruction window size;
§ There is a correlation between the value prediction efficiency and the branch prediction efficiency;
§ The wider the machine, the more significant a performance limitation branch mispredictions become;
§ ILP processor performance is strongly influenced by the characteristics of the operational environment.
SUPERSCALAR PENTIUM PROCESSOR: