Friday, May 21, 2010

ACA 2010 MEng

This examination is partly based on the Intel Nehalem processor architecture, as
described in the article “Nehalem: Intel’s Future Processor and System”, by David
Kanter (Real World Technologies, 04-02-2008), which you should have available to
you in the examination. Where the article is incomplete, you are invited to speculate
using your understanding of the underlying architectural principles.


1a
Which parts of Nehalem’s branch predictor need to be replicated for each thread?

Another improved branch target prediction mechanism in Nehalem is the return stack buffer (RSB). When a function is called, the RSB records the return address, so that when the function returns, execution picks up where it left off instead of ending up at the wrong address. An RSB can overflow if too many functions are called recursively, and it can also get corrupted and produce bad return addresses if the branch predictor speculates down a wrong path. Nehalem actually renames the RSB, which avoids return stack overflows and ensures that most misspeculation does not corrupt the RSB. There is a dedicated RSB for each thread to avoid any cross-contamination.

In short: the return stack buffer must be replicated per thread.
b
Nehalem’s branch prediction may lead to branch execution time ranging from
zero cycles to many. What is the branch execution time when a level-1 BTB hit
occurs?

When we have a BTB hit, the BTB predicts the target PC address, which means branch prediction can be performed in the fetch stage.

If we predict branches in the decode stage, there is one cycle of wasted fetch after every branch instruction, because we don't know what we're supposed to be fetching until the branch finishes decode. If we predict branches in the fetch stage, there are no cycles of wasted fetch, so on a level-1 BTB hit (with a correct prediction) the effective branch execution time is zero cycles.


c
What is the worst case?

A BTB hit but a misprediction: the processor fetches and executes down the wrong path, and when the branch finally resolves, the speculative work must be flushed and fetch restarted at the correct target, costing many cycles.


d
What intermediate cases might occur?

A BTB miss on a branch that turns out to be taken: the target is not available at fetch, so a few cycles of fetch are wasted until the target is known later in the pipeline. This is worse than a correct BTB hit but far cheaper than a full misprediction.

2a
What happens when Nehalem’s ROB is full?

When the ROB is full, the processor stops issuing new instructions until an entry is freed by commit.

The 128 entry ROB is statically partitioned between both threads, which allows each thread to speculate equally far through the instruction stream.


b
What two things happen in the Nehalem microarchitecture when an instruction is
committed?

Once an instruction commits, its entry in the ROB is reclaimed and the register or memory destination is updated.

If speculation was wrong, the ROB is flushed and execution is restarted at the correct successor of the branch.


c
Nehalem’s level-1 data cache has an access latency of four cycles. If this were
increased, for example to six, its capacity could be much larger. What would be
the disadvantages of doing this?

An increase in cache latency lengthens the execution time of every load and store, so we trade access latency for capacity. The load-to-use latency sits on the critical path of dependent instructions, and out-of-order execution can hide only some of it. With a small number of threads running, competition for the cache is relatively low and the hit rate is already high, so the extra capacity buys little while the added latency slows every access and becomes the bottleneck.

3
Consider the following code fragment:
float A[N], B[N], C[N];
S1: for (i = 0; i < N; i++) {
S2:     if (A[i] > 0) {
S3:         C[i] = A[i] + B[i];
        }
    }
a
Explain (in outline terms) how this loop could be executed using SSE
instructions.

SSE originally added eight new 128-bit registers known as XMM0 through XMM7.

Pseudocode for SSE (four floats per 128-bit XMM register; the conditional must be handled with a per-lane mask, since different lanes may take different paths):

for i = 0 : N/4 - 1
    xmm0 <- A[4i : 4i+3]            ; movaps
    xmm1 <- B[4i : 4i+3]            ; movaps
    xmm2 <- (xmm0 > 0)              ; cmpps, per-lane mask
    xmm1 <- xmm0 + xmm1             ; addps
    merge xmm1 into C[4i : 4i+3] where xmm2 is set
end

-----------------------------

Instructions used: movaps, addps, cmpps (to build the per-lane mask), and andps/andnps/orps (to merge the result under the mask).


b
It has been proposed that the SSE-like instructions in some future designs might
operate on much longer registers.
(i) What performance advantage might arise, in suitable applications?
(ii) What problem arises with the example above? How might it be minimised,
in suitable applications?

(i) The future AVX (Advanced Vector Extensions) increases the SIMD vector registers from 128 bits to 256 bits. This doubles the amount of data parallelism available per vector instruction, which suits floating-point-intensive calculations in multimedia, scientific, and financial applications.

(ii) The conditional in the example must be evaluated per lane: with a wider register, more lanes are compared at once, so it becomes more likely that some lanes take the branch and others do not, forcing a mask-and-merge on every iteration. In suitable applications, the data can be preprocessed so that the conditional execution is eliminated.

loop stream detector

The loop stream detector is located inside the IDQ to improve power consumption
and front end efficiency for loops with a short sequence of instructions.
The instruction decoder supports micro-fusion to improve front end throughput,
increase the effective size of queues in the scheduler and re-order buffer (ROB). The
rules for micro-fusion are similar to those of Intel Core microarchitecture.
The instruction queue also supports macro-fusion, combining adjacent instructions into one micro-op where possible. In previous generations of the Intel Core microarchitecture, macro-fusion support for the CMP/Jcc sequence was limited to the CF and ZF flags, and macro-fusion was not supported in 64-bit mode.
