
Thursday, December 2, 2010

parallel computing and memory networks

when we implement an algorithm, we often face a decision: to match the fork-join flow of the algorithm, we can use either one single long thread or many short parallel threads that carry out the parallelisable computation on the available resources, with varying amounts of inter-thread communication.
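to make the trade-off concrete, here is a minimal fork-join sketch using POSIX threads; the thread count, the array size and the toy reduction are illustrative assumptions only, not a recommendation.

/* fork-join sketch: the same reduction run with a configurable number of
 * worker threads.  NUM_THREADS = 1 is the "one long thread" extreme;
 * raising it trades sequential work for inter-thread coordination.
 * (all sizes and counts are arbitrary, illustrative values) */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)
#define NUM_THREADS 4            /* tune this: 1 .. number of cores */

static double data[N];

struct chunk { size_t lo, hi; double partial; };

static void *worker(void *arg)
{
    struct chunk *c = arg;
    double s = 0.0;
    for (size_t i = c->lo; i < c->hi; i++)   /* the parallelisable part */
        s += data[i];
    c->partial = s;
    return NULL;
}

int main(void)
{
    for (size_t i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[NUM_THREADS];
    struct chunk c[NUM_THREADS];
    size_t step = N / NUM_THREADS;

    for (int t = 0; t < NUM_THREADS; t++) {      /* fork */
        c[t].lo = t * step;
        c[t].hi = (t == NUM_THREADS - 1) ? N : (t + 1) * step;
        pthread_create(&tid[t], NULL, worker, &c[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NUM_THREADS; t++) {      /* join */
        pthread_join(tid[t], NULL);
        total += c[t].partial;
    }
    printf("sum = %f\n", total);
    return 0;
}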

obviously this can be modelled as an optimization problem, and depending on the nature of our algorithm (or application) and the problem size, we end up with different code for execution that minimizes the execution time while satisfying certain constraints.

it sounds as if we almost have the solution, but if we think about it carefully we still face a big challenge: memory bandwidth. although most of our computation resources (CPUs, GPUs, accelerators) have their own memory system for caching data, we still have to deliver data and instructions to those devices in time.

for example, say we have 128 threads running on 32 cores. when the threads are switched they are likely to cause cache misses and require main memory accesses. if one core does this it should be fine, but if all 32 cores access the same memory at once we get network congestion, and the parallel performance drops.
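a rough back-of-envelope sketch of why this congestion happens; every number here (clock rate, references per cycle, miss rate, line size, bus bandwidth) is an assumed, illustrative value rather than a measurement.

/* back-of-envelope: aggregate DRAM traffic when many cores miss at once.
 * all parameters below are illustrative assumptions, not measurements. */
#include <stdio.h>

int main(void)
{
    double cores              = 32;
    double clock_hz           = 2.0e9;   /* 2 GHz                          */
    double mem_refs_per_cycle = 0.3;     /* loads+stores per core per cycle */
    double miss_rate          = 0.05;    /* 5% of references miss           */
    double line_bytes         = 64;      /* cache line size                 */
    double bus_bw             = 25.6e9;  /* available DRAM bandwidth, B/s   */

    double demand = cores * clock_hz * mem_refs_per_cycle
                  * miss_rate * line_bytes;        /* bytes/s requested */

    printf("demand = %.1f GB/s\n", demand / 1e9);
    printf("supply = %.1f GB/s\n", bus_bw / 1e9);
    printf("demand/supply = %.1fx\n", demand / bus_bw);
    return 0;
}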

if we think about how the neurons in our brain communicate, it is a very different architecture. first, we have a dynamic physical communication network, and the connections evolve through some degree of competition and cooperation; one example is that the ion channels at the synapses vary with how they are used.

but the real difference is probably how memory is structured in our brain. a good example is performing a calculation on an abacus versus in our head: we can do it much faster on the abacus than in our mind, unless there is some quick algorithm for large problems. the real point is that we don't have much memory (RAM-like memory) for tedious calculation, whereas our brain is much more capable at visual information analysis and muscle-control signal generation; at the same time, very deep in our brain, there is something like a lookup table for conditional branches, and I guess that may just be a configuration of our neuron connections.

so where is the memory, you may ask? well, I think most of our memory is just a neural network pattern, which is ROM-like (read-only and slow to write), except that reading its information means using it, which is more like an FPGA LUT network.

so from an architecture point of view, the neural network in our brain should not be very good at executing repetitive simple instructions, since we don't have the right memory structure (RAM) for them. yet we seem to do much better at vision tasks than a computer, which has very few computation units and a very large amount of RAM. what is the issue here? again, the honest answer is that a computer can do certain specific vision tasks better than the human brain, but in the general case (large data sets) the human brain outperforms the computer. one explanation could be that the algorithms in our brain have been optimised by a long evolution, whereas the computer just executes what we think might be happening in our brain in terms of numerical calculation.

but how does this relate to the memory architecture problem? we can see the trend of adding more computing resources on a single chip, but should we try to move towards a brain-like structure, dynamically routing the connections between different resources and having millions of them? that might work if we gave up the digital format for computation and lost the machine-like robustness, but do we really want to do that? I guess that would defeat the purpose of building machines: we want a high degree of certainty about what happens at each step. this is complementary to human behaviour, and that is exactly why we need machines to be like that.

so if we have decided to go the machine way, what problems do we need to solve? the network? the memory hierarchy? or load balancing and scheduling on the computation resources? I think all of these issues can be addressed by a good communication protocol. with a good protocol we can reduce global communication and hence main memory traffic, make good use of the memory hierarchy, and solve the resource sharing problem almost automatically. this is more or less how humans communicate with each other: we have good protocols that let us communicate in small groups, in a large lecture theatre, and one to one.

so what is the secret of this protocol then? although I am not fully confident in my answer, I think it has a strong link with model predictive control, or MPC for short: in order to optimize our behaviour, we must know as much as possible about the objects we communicate with and build a model of them; then a dynamic optimization process runs and we find the current best action towards the goal of better resource utilization. obviously this is not a simple scenario when many nodes in the network are doing MPC, but with more in-depth research I hope we can have a more robust framework for this future intelligent communication protocol.
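as a toy illustration of the MPC idea (predict over a short horizon, pick the best action, apply only the first step, then re-plan), here is a sketch of a single node choosing its send rate on a shared link; the model, the costs and all the constants are made up purely for illustration, not a real protocol.

/* toy receding-horizon (MPC-style) rate controller for a shared link.
 * the "model": queue_next = queue + my_rate + other_rate - capacity.   */
#include <stdio.h>

#define HORIZON 5

static double capacity   = 10.0;   /* link capacity per step           */
static double other_rate = 6.0;    /* predicted load from other nodes  */

/* predicted cost of sending at `rate` for HORIZON steps from `queue` */
static double predict_cost(double queue, double rate)
{
    double cost = 0.0;
    for (int t = 0; t < HORIZON; t++) {
        queue += rate + other_rate - capacity;
        if (queue < 0.0) queue = 0.0;
        cost += queue;                     /* penalise queuing delay    */
        cost += (capacity - rate) * 0.5;   /* penalise wasted capacity  */
    }
    return cost;
}

int main(void)
{
    double queue = 0.0;
    for (int step = 0; step < 10; step++) {
        /* search candidate rates, keep the one with the best predicted cost */
        double best_rate = 0.0, best_cost = 1e30;
        for (double r = 0.0; r <= capacity; r += 0.5) {
            double c = predict_cost(queue, r);
            if (c < best_cost) { best_cost = c; best_rate = r; }
        }
        /* apply only the first action, then re-plan next step */
        queue += best_rate + other_rate - capacity;
        if (queue < 0.0) queue = 0.0;
        printf("step %d: send %.1f, queue %.1f\n", step, best_rate, queue);
    }
    return 0;
}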

resources

http://www.brains-minds-media.org/current

how the genome builds the brain

Then there's the mystery of the developing brain. How does something so complex manage to build itself? The Allen Institute is also measuring genetic expression in the mouse brain, from embryo to adult, to explore how the orchestra of genes is switched on and off in different areas during development. Which snippets of DNA transform the hippocampus into a center of long-term memory? Which make the amygdala a warehouse of fear and anxiety? "One of the things I've come to appreciate about the brain is the importance of location," Allen says. "It's not just a set of interchangeable parts that you can swap in and out. These brain areas are all so distinct, and for reasons we can finally begin to understand."

One unexpected—even disheartening—aspect of the Allen Institute's effort is that although its scientists have barely begun their work, early data sets have already demonstrated that the flesh in our head is far more complicated than anyone previously imagined.

The brain might look homogenous to the naked eye, but it's actually filled with an array of cell types, each of which expresses a distinct set of genes depending on its precise location. Consider the neocortex, the so-called CPU of the brain: Scientists assumed for decades that most cortical circuits were essentially the same—the brain was supposed to rely on a standard set of microchips, like a typical supercomputer. But the atlas has revealed a startling genetic diversity; different slabs of cortex are defined by entirely different sets of genes. The supercomputer analogy needs to be permanently retired.

Read More http://www.wired.com/medtech/health/magazine/17-04/ff_brainatlas?currentPage=5#ixzz10RZv3R5E

 

also, if you are interested in the memory system in the brain (for example the difference between remembering and recording, the hierarchy of information caching and abstraction, or spatial and temporal pattern exploration and learning),

read more at http://www.numenta.com/ ; they have been doing a lot of interesting research in this area.

compiler optimization and algorithm synthesis

this is a big topic; the synthesis part in particular is a very hard problem. a compiler is really a translation tool with some optimization based on context information from both the software and hardware sides.

algorithm synthesis, on the other hand, is like a pattern generation process where the correctness of the result is critical.

here I will just list some ideas that have recently been going around in my head.

both pattern translation and generation require a way to learn the patterns (from an existing code base or from the human coding process). here patterns in both the time domain and the graph domain are important.

once we have those patterns, we then do the following:

analysis -> model extraction -> optimization (convex, genetic algorithms, annealing) -> validation

the patterns can be static hierarchies or data/control flow patterns, in other words relational or communication patterns.
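as a sketch of the optimization step in the flow above, here is a minimal simulated-annealing loop; the cost function is a toy stand-in for whatever model the extraction step would produce, and all the constants are illustrative assumptions.

/* minimal simulated annealing for the "optimization" stage.
 * toy model: pick a tile size with a sweet spot near 64 that is a
 * multiple of 8.  all constants are illustrative assumptions. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double cost(int tile)                 /* toy stand-in for the model */
{
    return fabs((double)tile - 64.0) + 10.0 * (tile % 8 != 0);
}

int main(void)
{
    srand(1);
    int cur = 4;                             /* initial candidate          */
    double T = 100.0;                        /* temperature                */

    while (T > 0.01) {
        int next = cur + (rand() % 17) - 8;  /* random neighbour           */
        if (next < 1) next = 1;
        double d = cost(next) - cost(cur);
        /* accept improvements always, worse moves with probability e^(-d/T) */
        if (d < 0 || exp(-d / T) > (double)rand() / RAND_MAX)
            cur = next;
        T *= 0.95;                           /* cool down                  */
    }
    printf("chosen tile size: %d (cost %.1f)\n", cur, cost(cur));
    return 0;
}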

    

communication patterns

explicit and implicit communication both require a distributed memory system with a shared memory space. this means a special high-speed cache-to-cache network is needed for resolving conflicts.

cache system design

a cache is a transaction filtering system with locally self-managed memory. internally, a cache has a data structure (e.g. 4-way set associative) and a read/write algorithm (hit/miss test, replacement policy, and external transaction scheduling such as write-back).

the aim of the cache system is to learn the transaction patterns of its clients and avoid transaction collisions and cache misses.

statistical measurement and learning may be needed for better replacement algorithms; things like a victim cache are a good example.

in some cases a double-miss + random replacement policy may be preferable to a plain single-miss + least-recently-used policy.
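a minimal sketch of the hit/miss test and least-recently-used replacement mentioned above, for a 4-way set-associative cache; the geometry constants are assumptions chosen just for illustration.

/* minimal 4-way set-associative cache model: hit/miss test plus
 * least-recently-used replacement.  geometry below is illustrative. */
#include <stdint.h>
#include <stdio.h>

#define WAYS        4
#define SETS        64            /* 64 sets x 4 ways x 64 B = 16 KiB */
#define LINE_BYTES  64

struct line { uint64_t tag; int valid; unsigned last_used; };

static struct line cache[SETS][WAYS];
static unsigned clock_ticks;

/* returns 1 on hit, 0 on miss (and installs the line on a miss) */
static int access(uint64_t addr)
{
    uint64_t block = addr / LINE_BYTES;
    unsigned set   = block % SETS;
    uint64_t tag   = block / SETS;
    clock_ticks++;

    for (int w = 0; w < WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            cache[set][w].last_used = clock_ticks;      /* hit */
            return 1;
        }

    /* miss: evict the least recently used way in this set */
    int victim = 0;
    for (int w = 1; w < WAYS; w++)
        if (cache[set][w].last_used < cache[set][victim].last_used)
            victim = w;
    cache[set][victim] = (struct line){ tag, 1, clock_ticks };
    return 0;
}

int main(void)
{
    unsigned hits = 0, refs = 0;
    for (int pass = 0; pass < 2; pass++)        /* second pass should mostly hit */
        for (uint64_t a = 0; a < 8192; a += 8, refs++)
            hits += access(a);
    printf("hit rate: %.2f\n", (double)hits / refs);
    return 0;
}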

the cache service protocol is also an important part of the design, for example optional caching of memory accesses (non-cacheable addresses, an always-miss region, or a special delivery channel for the non-cacheable memory space).

transaction model

the transaction model should be aware of the memory system and the related protocol for better performance, for example a cache-geometry-aware malloc algorithm.
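one concrete reading of a cache-geometry-aware allocator is to align and pad per-thread data to the cache line size so that two cores never write to the same line (avoiding false sharing); the 64-byte line size and the use of C11 aligned_alloc below are assumptions for the sketch.

/* sketch of cache-geometry-aware allocation: pad and align per-thread
 * counters to the cache line size so each counter sits on its own line. */
#include <stdio.h>
#include <stdlib.h>

#define LINE_BYTES 64              /* assumed cache line size */

struct padded_counter {
    long value;
    char pad[LINE_BYTES - sizeof(long)];   /* fill out the rest of the line */
};

int main(void)
{
    int nthreads = 8;

    /* C11 aligned_alloc: the size must be a multiple of the alignment */
    struct padded_counter *c =
        aligned_alloc(LINE_BYTES, nthreads * sizeof *c);
    if (!c) return 1;

    for (int i = 0; i < nthreads; i++) c[i].value = 0;
    printf("each counter occupies %zu bytes, line-aligned\n", sizeof *c);

    free(c);
    return 0;
}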

 

hardware/software partition

partition by lifetime, context-switching overhead and communication explicitness.

hardware function cache / fpga bitstream cached in the fabric

the cache system supports both high-speed local transactions and global shared memory space communication, plus a local scratch memory that clients manage themselves.

clients can be cpu cores or fpga fabric.

Friday, May 21, 2010

ACA memory hierarchy

the principle of locality states that programs access a relatively small portion of their address space at any instant of time.

  • Temporal locality (locality in time)
  • spatial locality (locality in space)

if the thread is viewed in space (2D) and time, temporal locality is like the inertia of the thread, and spatial locality is like a small energy disturbance (an additive disturbance).
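a small illustration of the two kinds of locality: the row-order loop walks memory sequentially and reuses each fetched line, while the column-order loop strides a whole row between accesses; the matrix size is an arbitrary assumption.

/* spatial vs temporal locality: row-major order walks memory
 * sequentially (good spatial locality), column order jumps by a whole
 * row each step.  the matrix size is an arbitrary illustration. */
#include <stdio.h>

#define N 1024

static double a[N][N];

int main(void)
{
    double sum = 0.0;

    /* good: consecutive elements of a row share a cache line */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];

    /* poor: each access lands N*8 bytes away from the previous one */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];

    printf("%f\n", sum);
    return 0;
}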

due to this nature of programs, and to the different speed/size trade-offs of memory technologies, we can mix different types of memory into a hierarchy to form a more efficient, non-uniform access time memory model for our programs.

[figure: the memory hierarchy]

the concept of a cache is that we speed up certain regions of our memory map; the minimum granularity of the region we accelerate is often called a block (or line). since the blocks are sparsely distributed over the memory map, when a block is not present in the cache we have a cache miss, and the miss rate is the fraction of memory accesses not found in a level of the memory hierarchy.

the hit time is the time required to access a level of the memory hierarchy (including the time to determine whether the access is a hit or a miss); it is normally much shorter than the access time of the lower levels.

the miss penalty is the time required to fetch a block into a level of the memory hierarchy from the lower level, including the time to access the block, transmit it from one level to another, insert it in the level that experienced the miss, and pass the block to the requestor. the details of the data transfer are hidden from the requestor, which only experiences the extra latency and does not need to worry about data integrity.
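these three quantities combine into the usual average memory access time figure; the numbers in the second line are purely illustrative.

\mathrm{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}

\mathrm{AMAT} = 1 + 0.05 \times 100 = 6 \text{ cycles}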

direct mapped cache

a cache structure in which each memory location is mapped to exactly one location in the cache.

[figure: direct-mapped cache organisation]

from the diagram we can see that the cache simply partitions the address into two parts, the tag and the index: the index is used to select a block, and the tag is the information used to decide a hit or a miss.

this direct-mapped cache lowers memory latency in an evenly distributed manner, but at run time the regions that actually get the lower latency are scattered.
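a small sketch of how the index and tag are extracted from an address for such a cache; the cache geometry (64-byte blocks, 1024 blocks) is an assumption for illustration.

/* direct-mapped lookup: split an address into block offset, index, tag.
 * 64-byte blocks and 1024 blocks are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define BLOCK_BYTES 64           /* block (line) size             */
#define NUM_BLOCKS  1024         /* number of blocks in the cache */

int main(void)
{
    uint32_t addr = 0x12345678;

    uint32_t offset = addr % BLOCK_BYTES;                 /* byte within block    */
    uint32_t index  = (addr / BLOCK_BYTES) % NUM_BLOCKS;  /* selects the block    */
    uint32_t tag    = addr / (BLOCK_BYTES * NUM_BLOCKS);  /* compared for hit/miss */

    printf("addr 0x%08x -> tag 0x%x, index %u, offset %u\n",
           addr, tag, index, offset);
    return 0;
}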

[figure: memory regions mapped onto one cache block]

the diagram shows the memory partition for a direct-mapped cache: low access time is possible if the location has a matching tag in a block of the cache. note that only one region at a time can be cached in each block.

about the block size

[figure: effect of block size on cache performance]

here granularity and sparsity are the keys to the performance of the cache. the block size is the granularity, and a bigger size introduces extra time for block replacement. some improvements can be used (depending on the access pattern of the program, e.g. instruction cache versus data cache), such as the early restart and requested-word-first schemes.

there is also a formula for calculating the cache resource (storage) consumption.
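assuming the usual textbook setup (32-bit addresses, a direct-mapped cache with 2^n blocks of 2^m 32-bit words each, and one valid bit per block), the calculation is roughly:

\text{total bits} = 2^{n} \times \bigl( 2^{m} \times 32 + (32 - n - m - 2) + 1 \bigr)

here 2^m x 32 is the data in each block, 32 - n - m - 2 is the tag width and the final 1 is the valid bit. for instance, 16 KiB of data in 4-word blocks gives n = 10 and m = 2, so 2^{10} x (128 + 18 + 1) = 147 Kibit of storage for 128 Kibit of data.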


cache miss handling

[figure: cache miss handling]

the use of a stall rather than an interrupt when a miss occurs is to balance the cost of handling the miss: the fetch is handled entirely in hardware and takes only tens of cycles, so saving the full processor state for an interrupt or context switch would cost more than simply stalling the pipeline and waiting for the block.

Wednesday, February 10, 2010

brain, memory, learning and creativity

understanding the architecture

  • sensors
  • filters
  • processors

 

hierarchy of information

  • vision
  • voice
  • emotions
  • language
  • semantics

the higher the information level, the faster the memory for it, but the smaller its quantity in our brain. visual memory has huge capacity but is quite slow to erase and update. so we can use our permanent memory to generate visual information and store it into our visual memory, which can hold a great deal of information, with the drawback that it cannot be erased and updated very quickly. but don't worry, these are just buffers, so eventually you can still reuse them.

a good strategy is to convert information (compact and structured) into a visual form and then take a look at your picture; then you will be able to remember it for a much longer time and access it on demand.

so the key is the picture or voice generation. how do we generate them, and are there any guidelines?

imagine you have two pictures to look at: the more interesting a picture is, the more likely you are to remember it for longer. the same goes for voices. so how exactly do we find things interesting? one factor might be the relation between the new information and our knowledge, say how novel it is and how well it fits our knowledge and rules.

this process of producing a picture shows some aspects of information synthesis and creativity, which I will talk about in a later part.

algorithm of learning

the idea of learning is the opposite of synthesis: we extract level by level from the raw information and finally discover the gold. it is much like a filtering process, but with more sophistication and recursion.

so when we have learned something, say a language, we can use it to learn further knowledge about something else, and keep going. so inside our brain we have a hierarchical knowledge base that we keep building up, kind of like a graph which models the environment we live in, but not firmware: we can change part of it if it does not model things correctly.

 

information synthesis and creativity

once we have a model of the world in our brain, we can run simulations inside it and generate outputs which, after some refinement or optimization, we can speak out or write down. that is the essence of information synthesis. but how creatively that information is generated involves much more.

think about generating a joke: that obviously requires much more creativity than just small talk.

jokes are a great example, and they follow some formulas, which makes them good material for studying creative information synthesis. of course there are many different kinds of jokes; some are so-called stupid jokes and some are called intelligent jokes. here we look at the intelligent ones: a dramatic change that is still logically reasonable, dirty but beautiful, push and pull.

so it seems to be all about a good balance, or symmetry, or many other principles such as recursion, reflection, etc., which we perceive as fundamental knowledge, which again is the part of our model that agrees most with nature!