
Thursday, December 2, 2010

parallel computing and memory networks

when implementing an algorithm, we often face a decision: to match the fork-join flow shape of the algorithm, we can use either one single long thread, or many short parallel threads that run the parallelizable parts of the computation, with varying amounts of inter-thread communication.
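to make the fork-join shape concrete, here is a minimal sketch in C with pthreads; the toy reduction, NTHREADS, and the even slice split are all invented for illustration:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1024

static double data[N];
static double partial[NTHREADS];

/* worker: each forked thread sums its own slice of `data` */
static void *worker(void *arg) {
    int id = *(int *)arg;
    int chunk = N / NTHREADS;
    int lo = id * chunk;
    double s = 0.0;
    for (int i = lo; i < lo + chunk; i++) s += data[i];
    partial[id] = s;          /* the join phase reduces these */
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    int ids[NTHREADS];
    for (int i = 0; i < N; i++) data[i] = 1.0;
    for (int i = 0; i < NTHREADS; i++) {   /* fork */
        ids[i] = i;
        pthread_create(&t[i], NULL, worker, &ids[i]);
    }
    double total = 0.0;
    for (int i = 0; i < NTHREADS; i++) {   /* join */
        pthread_join(t[i], NULL);
        total += partial[i];
    }
    printf("total = %f\n", total);
    return 0;
}

the single-long-thread alternative is just the loop in main without the fork/join; which one wins depends on how much the slices need to talk to each other.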

obviously this can be modelled as an optimization problem: depending on the nature of the algorithm (or application) and the problem size, we end up with different code for execution, code that minimizes execution time while possibly satisfying certain constraints.

it sounds as if we almost have the solution, but if we think about it carefully we still face a big challenge: memory bandwidth. although most of our computation resources (CPUs, GPUs, accelerators) have their own memory systems for caching data, we still face the challenge of delivering data and instructions to those devices in time.

for example, suppose we have 128 threads running on 32 cores. when the threads switch, they are likely to cause cache misses and require main-memory accesses. if one core does this, it should be fine, but if all 32 cores access the same memory at once we get network congestion, and therefore reduced parallel performance.
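a back-of-envelope example, with made-up numbers: if each context switch refills around 32 cache lines of 64 B, one core demands about 2 KB from main memory; if all 32 cores switch at once, that is roughly 64 KB of simultaneous demand, and on a 10 GB/s memory interface the burst alone takes about 6.5 microseconds, far longer than a single ~100 ns miss, purely because the requests queue up at the shared interface.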

if we think about how the neurons in our brain communicate, it is a very different architecture. first, we have a dynamic physical communication network, and the dynamic connections evolve through some degree of competition and cooperation; one example is that the ion channels at the synapses vary according to how they are used.

but the real difference is possibly how memory is structured in our brain. a good example is performing a calculation on an abacus versus in our head: surely we can do it much faster the abacus way than purely mentally (unless there is some clever algorithm for large problems). the real point is that we don't have much memory (RAM-like memory, possibly) for tedious calculation, whereas our brain is far more capable at visual information analysis and muscle control signal generation; at the same time, very deep in the brain, there is something like a lookup table for conditional branches, and I guess that may just be a configuration of our neuron connections.

so where is the memory, you may ask? well, I think most of our memory is just a neural network pattern, which is ROM-like (read-only, and slow to write), but the difference is that reading its information happens by using it, which is more like an FPGA LUT network.

so from an architecture point of view, the neural network in our brain should not be very good at executing repetitive simple instructions, since we don't have the right memory structure (RAM) for them; yet we seem to do much better at vision tasks than the computer, which has very few computation units and a very large amount of RAM. what could be the issue here? again, the honest answer is that a computer can do certain specific vision tasks better than the human brain, but in the general case (large data sets) the human brain outperforms the computer. one answer could be that the algorithms in our brain have been optimized by long-term evolution, whereas the computer just executes what we think might be happening in our brain, expressed as numerical calculation.

but how does this relate to the memory architecture problem? we can see the trend of adding more computing resources on a single chip, but should we try to move towards a brain-like structure, dynamically routing the connections between different resources and having millions of them? that could perhaps work if we gave up the digital format for computation, losing the machine-like robustness properties, but do we really want that? I guess it would defeat the purpose of building machines: we want a high degree of certainty about what happens at each step. machines are complementary to humans in exactly this way, and that is why we need them to be like that.

so if we have decided to go the machine way, what are the problems we need to solve? the network? the memory hierarchy? or load balancing and scheduling on computation resources? I think all of these issues can be addressed by a good communication protocol. with a good protocol we can reduce global communication and hence main-memory traffic; we can also make good use of the memory hierarchy and automatically solve the resource-sharing problem. this is more or less like how humans communicate with each other: we have good protocols that let us communicate in small groups, in a large lecture theatre, and one to one.

so what is the secret of this protocol then? although I am not fully confident in my answer, I think it has a strong link with model predictive control (MPC for short), because in order to optimize our behaviour we must know as much as possible about the objects we communicate with and build a model of them; then a dynamic optimization process runs and we find the current best action towards the goal of better resource utilization. obviously this is not a simple scenario when many nodes in the network are doing MPC, but with more in-depth research I hope we can have a more robust framework for this future intelligent communication protocol.
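since this is speculation anyway, here is a toy receding-horizon sketch in C with a made-up queue model and cost function, just to show the shape of the idea: model the thing you communicate with, simulate a few steps ahead for each candidate action, apply the best first action, then re-plan:

#include <stdio.h>
#include <math.h>

#define H 5   /* prediction horizon (assumed) */

/* made-up cost: penalize message backlog and link usage */
static double cost(double x, double u) {
    return x * x + 0.1 * u * u;
}

/* simulate H steps of a simple queue model: backlog grows by a
   known arrival rate, shrinks by the send rate u (held constant
   over the horizon, for brevity) */
static double rollout(double x, double u, double arrivals) {
    double c = 0.0;
    for (int k = 0; k < H; k++) {
        x = x + arrivals - u;
        if (x < 0) x = 0;
        c += cost(x, u);
    }
    return c;
}

int main(void) {
    double x = 10.0, arrivals = 2.0;    /* invented numbers */
    for (int t = 0; t < 8; t++) {
        double best_u = 0.0, best_c = INFINITY;
        for (double u = 0.0; u <= 5.0; u += 1.0) {  /* candidate actions */
            double c = rollout(x, u, arrivals);
            if (c < best_c) { best_c = c; best_u = u; }
        }
        x = x + arrivals - best_u;      /* apply only the first action */
        if (x < 0) x = 0;
        printf("t=%d send=%.0f backlog=%.1f\n", t, best_u, x);
    }
    return 0;
}

the hard part the post worries about, many nodes doing this simultaneously against each other, is exactly what this toy ignores.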

resources

http://www.brains-minds-media.org/current

Friday, May 21, 2010

COD notes

why a limited number of registers

  1. a large number of regs may increase the clock cycle time, since the addressing circuitry (mux/demux) delay grows with the register count.
  2. registers communicate with each other only through operators (the ALU); they communicate with main memory via load/store instructions. the register file acts as a small cache, but without hardware hit/miss management.
  3. optimizing register usage is key to both performance and energy efficiency.
  4. the ALU can also take operand data from the instruction itself, for computation with constants (immediates); see the example below.
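for example, in MIPS (the ISA used in COD), addi $t0, $t0, 4 carries the constant 4 inside the instruction word itself, so the third operand needs neither a register read nor a memory access.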

instruction format

  1. in order to use the bit fields efficiently, the meaning of a later bit field is determined by an earlier field; normally the op field indicates how the rest of the fields are segmented (see the decode sketch below).
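a small sketch of this in C, using the MIPS-style encoding from COD; the decode() helper and the sample instruction word (add $t0,$s1,$s2) are just for illustration:

#include <stdint.h>
#include <stdio.h>

/* the op field decides how the remaining 26 bits are segmented
   (R-type vs I-type shown here) */
typedef struct { uint32_t op, rs, rt, rd, shamt, funct, imm; } fields_t;

static fields_t decode(uint32_t insn) {
    fields_t f = {0};
    f.op = insn >> 26;                 /* top 6 bits: always the opcode */
    f.rs = (insn >> 21) & 0x1f;
    f.rt = (insn >> 16) & 0x1f;
    if (f.op == 0) {                   /* R-type: rd / shamt / funct */
        f.rd    = (insn >> 11) & 0x1f;
        f.shamt = (insn >> 6)  & 0x1f;
        f.funct = insn & 0x3f;
    } else {                           /* I-type: 16-bit immediate */
        f.imm = insn & 0xffff;
    }
    return f;
}

int main(void) {
    fields_t f = decode(0x02324020);   /* add $t0,$s1,$s2 (R-type) */
    printf("op=%u rd=%u funct=%u\n", f.op, f.rd, f.funct);
    return 0;
}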

conditional branches and jumps

  1. to implement the switch and loop structures of an algorithm, in other words to reuse hardware resources in a time-multiplexed fashion, we use jump instructions to reset the program counter.
  2. procedures and functions, as a black-box model for structuring programs, also need jumps, but the program counter must return to its original address when the procedure/function is done; the interface is implemented by sharing registers.
  3. procedures/functions can be run in a pipeline model (each core running one procedure/function) with shared memory between the cores (possibly sending a done message between cores for better efficiency). if there is explicit parallelism, all the procedures/functions can run concurrently and synchronize by done messages again. note that when a core finishes early, instead of waiting for the done message from the previous core, it can switch to another thread to keep itself busy; see the sketch after this list.
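a minimal sketch in C of the done message idea from point 3, with two pthreads as two pipeline stages and a condition variable carrying the done signal; the stage bodies are placeholders:

#include <pthread.h>
#include <stdio.h>

#define N 8

static int slot;        /* shared data between the two stages          */
static int full = 0;    /* 1 = stage 1 has produced, not yet consumed  */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void *stage1(void *arg) {
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        while (full) pthread_cond_wait(&cv, &m);
        slot = i * i;                  /* placeholder work for stage 1 */
        full = 1;
        pthread_cond_signal(&cv);      /* the done message */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

static void *stage2(void *arg) {
    for (int i = 0; i < N; i++) {
        pthread_mutex_lock(&m);
        while (!full) pthread_cond_wait(&cv, &m);
        printf("%d\n", slot);          /* placeholder work for stage 2 */
        full = 0;
        pthread_cond_signal(&cv);      /* done consuming, stage 1 may refill */
        pthread_mutex_unlock(&m);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, stage1, NULL);
    pthread_create(&t2, NULL, stage2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}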

SDA Graph


Antisymmetry means that when there are edges in both directions between two nodes a and b (a two-node loop), a must equal b; otherwise the relation violates the antisymmetry definition.

 


a trail is a sequence of linked edges with no repeated edge: [e1, e2, …, en] in E, where ei != ej for i != j, and consecutive edges are linked (ei.node2 = e(i+1).node1).

a path is a sequence of linked edges that does not touch itself, i.e. it repeats no vertex.

a cycle is a path whose two end vertices coincide.


from an adjacency matrix (g) point of view

a complete graph is ~diag(ones(1,n)), equivalently 1-diag(ones(1,n))

the complement graph is cg = ~g (for a simple graph, also clear the diagonal, e.g. cg = ~g & ~eye(n))
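for example, with n = 3, ~diag(ones(1,3)) gives

0 1 1
1 0 1
1 1 0

which is the complete graph K3; complementing any simple graph g on the same vertices flips exactly these off-diagonal entries.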


ACA 2010 MEng

This examination is partly based on the Intel Nehalem processor architecture, as
described in the article “Nehalem: Intel’s Future Processor and System”, by David
Kanter (Real World Technologies, 04-02-2008), which you should have available to
you in the examination. Where the article is incomplete, you are invited to speculate
using your understanding of the underlying architectural principles.


1a
Which parts of Nehalem’s branch predictor need to be replicated for each thread?

Another improved branch target prediction mechanism in Nehalem is the return stack buffer (RSB). When a function is called, the RSB records the address, so that when the function returns, execution picks up where it left off instead of ending up at the wrong address. An RSB can overflow if too many functions are called recursively, and it can also get corrupted and produce bad return addresses if the branch predictor speculates down a wrong path. Nehalem actually renames the RSB, which avoids return stack overflows and ensures that most misspeculation does not corrupt the RSB. There is a dedicated RSB for each thread to avoid any cross-contamination.

answer: the return stack buffer (there is a dedicated RSB per thread).
b
Nehalem’s branch prediction may lead to branch execution time ranging from
zero cycles to many. What is the branch execution time when a level-1 BTB hit
occurs?

when we have a BTB hit, the BTB predicts the target PC address in the fetch stage, so the next fetch is redirected immediately and the branch costs zero cycles (assuming the prediction is correct).

if instead we predicted branches in the decode stage, there would be one cycle of wasted fetch after every branch instruction, because we don't know what to fetch until the branch finishes decode; predicting in the fetch stage removes those wasted cycles.
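a rough worked example, with assumed numbers: if about 20% of instructions are branches and each branch predicted at decode wastes one fetch cycle, the front end loses roughly 0.2 cycles per instruction; an L1 BTB hit moves the prediction into the fetch stage and recovers all of that, which is why the best case costs zero cycles.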


c
What is the worst case?

a mispredicted branch (a BTB hit with the wrong prediction): the wrong-path instructions must be flushed and fetch restarted, costing many cycles.


d
What intermediate cases might occur?

a BTB miss on a branch that turns out to be taken: the target only becomes known later in the pipeline, so a few fetch cycles are wasted, but there is no full misprediction flush.

2a
What happens when Nehalem’s ROB is full?

when the ROB is full, we stop issuing instructions until an entry is freed.

The 128 entry ROB is statically partitioned between both threads, which allows each thread to speculate equally far through the instruction stream.


b
What two things happen in the Nehalem microarchitecture when an instruction is
committed?

once an instruction commits, its ROB entry is reclaimed and the register or memory destination is updated.

if speculation was wrong, the ROB is flushed and execution restarts at the correct successor of the branch.


c
Nehalem’s level-1 data cache has an access latency of four cycles. If this were
increased, for example to six, its capacity could be much larger. What would be
the disadvantages of doing this?

increasing the cache latency makes every load/store take longer, so we are trading access speed for capacity. with a small number of threads running, competition for the cache is relatively low, so the extra capacity buys little while the slower level-1 access becomes the performance bottleneck.
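a rough worked example, with assumed numbers: if 25% of instructions are loads and the out-of-order engine fails to hide the extra two cycles for half of them, CPI grows by 0.25 * 0.5 * 2 = 0.25; that is a slowdown paid on every program in exchange for a larger but slower level-1 cache.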

3
Consider the following code fragment:
float A[N], B[N], C[N];

S1: for (i = 0; i < N; i++) {
S2:     if (A[i] > 0) {
S3:         C[i] = A[i] + B[i];
        }
    }
a
Explain (in outline terms) how this loop could be executed using SSE
instructions.

SSE originally added eight new 128-bit registers known as XMM0 through XMM7.

sketch using SSE intrinsics (assuming N is a multiple of 4 and 16-byte aligned arrays):

#include <xmmintrin.h>

void add_positive(float *A, float *B, float *C, int N) {
    for (int i = 0; i < N; i += 4) {
        __m128 a = _mm_load_ps(&A[i]);                    /* movaps */
        __m128 b = _mm_load_ps(&B[i]);
        __m128 c = _mm_load_ps(&C[i]);
        __m128 mask = _mm_cmpgt_ps(a, _mm_setzero_ps());  /* lanes where A[i] > 0 */
        __m128 sum = _mm_add_ps(a, b);                    /* addps */
        /* keep sum in lanes where the condition holds, old C elsewhere */
        c = _mm_or_ps(_mm_and_ps(mask, sum), _mm_andnot_ps(mask, c));
        _mm_store_ps(&C[i], c);
    }
}

-----------------------------

instructions used

movaps and addps, plus cmpps and the logical ops (andps/andnps/orps) to apply the conditional as a mask


b
It has been proposed that the SSE-like instructions in some future designs might
operate on much longer registers.
(i) What performance advantage might arise, in suitable applications?
(ii) What problem arises with the example above? How might it be minimised,
in suitable applications?

(i) the future AVX (Advanced Vector Extensions) increases the SIMD vector registers from 128 bits to 256 bits. this doubles the amount of data parallelism available for vector computation, which suits floating-point-intensive calculations in multimedia, scientific, and financial applications.

(ii) the conditional in the example becomes more costly: with more lanes per register, it is more likely that a register holds a mix of elements that do and do not satisfy the condition, so every lane pays for the masked add even when few lanes need it. in some cases we can preprocess the data (for example, partition the elements so the positive values are stored contiguously) so that the conditional execution can be eliminated.

loop stream detector

The loop stream detector is located inside the IDQ to improve power consumption
and front-end efficiency for loops with a short sequence of instructions.
The instruction decoder supports micro-fusion to improve front-end throughput and
increase the effective size of queues in the scheduler and re-order buffer (ROB). The
rules for micro-fusion are similar to those of the Intel Core microarchitecture.
The instruction queue also supports macro-fusion to combine adjacent instructions
into one micro-op where possible. In previous generations of the Intel Core microarchitecture,
macro-fusion support for the CMP/Jcc sequence is limited to the CF and ZF flags,
and macro-fusion is not supported in 64-bit mode.