Thursday, December 2, 2010

interleaving two sine waves

when I was doing some experiments on genome design in genetic algorithms, I had an idea: test the effect of the cross-over operator on FFTs.

suppose you have two genomes of sine waves (all code below is written in Python, in Spyder 2, assuming a pylab-style environment, i.e. from pylab import *)

x1 = array(range(2**10)) * 2*pi / 2**6   (green curve below)

x2 = array(range(2**10)) * 2*pi / 2**5   (red curve below)

 

you can do a cross-over operation between them to make a new genome.

x = zeros_like(x1)
x[::2], x[1::2] = x1[::2], x2[1::2]   (blue curve below)

and when you feed this new genome to your FFT algorithm,

you get the following

[image: FFT magnitude spectrum of the interleaved genome]

instead of two spikes (which is what you would get if you simply added the two waves), each peak is now mirrored into the other end of the FFT spectrum.

quite interesting. I wonder what effect more complex genome operations would have on the FFT.
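there is a neat way to see why this happens: interleaving is the sum of an average term and a difference term modulated by (-1)^n, and that alternating modulation shifts the difference spectrum by half the sampling rate, which mirrors each tone. a minimal numpy check of this identity, using the same phases as above:

--------------------

import numpy as np

n = np.arange(2**10)
s1 = np.sin(n * 2*np.pi / 2**6)
s2 = np.sin(n * 2*np.pi / 2**5)

# interleave: even samples from s1, odd samples from s2
x = np.where(n % 2 == 0, s1, s2)

# identity: x[n] = (s1+s2)/2 + (-1)^n * (s1-s2)/2
alt = (s1 + s2) / 2 + (-1.0)**n * (s1 - s2) / 2
assert np.allclose(x, alt)   # the (-1)^n term is what mirrors the peaks

--------------------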

in case you want the plotting code to try it yourself, here it is.

--------------------

import matplotlib.pyplot as plt
from numpy import sin
from numpy.fft import fft

plt.subplot(211)
plt.plot(sin(x))
plt.plot(sin(x1))
plt.plot(sin(x2))

plt.subplot(212)
plt.plot(abs(fft(sin(x))))
plt.show()

MATLAB and ARM

 


MATLAB products, long dominant in simulations, were extending their usefulness into the evolving electronic design automation (EDA) market. A number of traditional EDA vendors were creating links between their products and MATLAB, allowing engineers to use the same language for simulation and implementation.

pico projector technology review

“Interesting Times”
By Mark Harward, Syndiant CEO
September 9, 2010

The ancient Chinese curse “May you live in interesting times” is certainly applicable to the companies working in the fast-emerging pico projector business. One can look at virtually any emerging technology market to understand the chaos that occurs while the market sorts out winners and losers. Unfortunately, not every technology can be a winner. The early personal computer market was littered with now-forgotten “winners” like the TRS-80 and the Commodore 64. But it was products like the Apple II and later the IBM PC that emerged from the pack. The pico projector market is about to undergo a similar transition, from a low-volume niche market to one that will excite consumers and explode into a mass-market product category.

Currently, a technology called Color Filter (CF) LCOS enjoys the highest sales volumes, largely because it enables a low-cost solution, albeit with low resolution and very poor color quality. Texas Instruments’ DLP™ technology holds the #2 position today. Pico industry analysts now have enough insight into the market to forecast that Field Sequential Color (FSC) LCOS will move from 3rd to 1st. DisplaySearch: “For the long-term forecast, we think LCOS field-sequential will be number one, DLP will be number two, MEMS scanning will be number three, and LCOS with color filters will drop to be the lowest.”

I generally agree with this prediction, and our team has enthusiastically taken on the challenge to become #1 with our leading family of VueG8™ FSC LCOS microdisplays. There are good reasons for the analysts’ prediction that FSC LCOS will take the lead:

  • CF LCOS was earliest in the market and requires a simpler optical design. But color filter LCOS requires three (3) color filter (red, green, blue) sub-pixels, which inherently makes it bigger than FSC LCOS. The colors also tend to bleed together, which causes poor color saturation with CF LCOS. But perhaps more important, requiring three sub-pixels fundamentally limits the ability to scale this technology, particularly as resolution is expected to increase. We expect CF LCOS to stay strong in applications such as toys where low resolution and poor color images are acceptable.
  • DLP® has found its way into a number of low-volume “test the market” products, including some cell phones. DLP was a “safe” choice for early market products, as it was an established technology from a large company. But as the analysts realize, and as many customers have indicated to us, DLP is fundamentally more expensive than LCOS, and this problem grows worse as resolution increases. Syndiant is already making LCOS pixels that are half the size of the smallest DLP mirror, which means we can fit 2X the pixels in the same space. DLP is also inherently much higher in power-per-pixel, which is a big drawback for mobile applications. We expect DLP to be strongest in the niche market of A/C-powered (non-mobile) applications whose main requirement is high brightness.
  • We do disagree with DisplaySearch on MEMS scanning, as we find it difficult to believe that MEMS laser beam scanning will meet the cost, size, resolution, power consumption, brightness and image quality (particularly speckle-free) requirements needed to move beyond being a high-cost, very-low-volume niche product. In our six-way comparative technology demonstration at SID 2010, we demonstrated that laser scanning is presently uncompetitive on almost all of the critical market requirements. It will be tough to chase the rapidly falling LED cost structure and output efficiency improvements, which are driven by the huge volumes of the general-purpose LED lighting market. And when lasers eventually become cost-effective, we believe that FSC LCOS using lasers will be able to provide focus-free operation with higher resolution, lower cost, less speckle, and lower power.

Syndiant’s product roadmap positions us very well to play in the largest market segments:

  • Mobile phone: This is the single largest market (~1.4B units/year). The embedded module cost needs to be < $30 and move to < $20.
  • Digital still camera and digital video embedded projectors (>100M units/year)
  • Multi-media stand-alone projectors – think of an iPod™ with a camera, projector and wireless internet access
  • Accessory projectors that are mobile (think wireless video transmission from your laptop, phone etc. – cables will eventually be eliminated)
  • High-definition projectors for gaming – think of a 720p projector that can run on batteries or be plugged in for a high-brightness mode. This is an attractive market, with gamers spending about US$20B per year.
  • 3D – Now imagine a low-cost gaming projector that can run in either 2D or 3D mode that your teenager can use in their room. Beyond cool. Of course, your kid will need to hide it or it will be swiped to watch 3D movies in the family room.

Our Syndiant team is working very hard to enable our customers to deliver truly amazing products that use Syndiant’s VueG8™ high-resolution technology, which far surpasses what has been available until now. The next few years will certainly be “interesting times” as the market sorts out winners and losers. One prediction we can make with confidence is that consumers will be big winners.

I’d like to hear from you. Perhaps you have ideas for cool applications for pico projectors or just want to tell me what you think. Emails to info@syndiant.com will reach me.

Cheers,
Mark Harward
CEO, Syndiant, Inc.

 

==========================================

Personally, I would like to see whether there is a possibility of combining the MEMS scanning mirror and LCOS together.

The Learning Principles of the Human Brain [repost]

Modern neuroscience and cognitive science hold that almost no skill is innate. Even recognizing simple objects or grabbing things, seemingly simple actions, are the result of learning after birth. A person cannot possibly foresee at birth what skills they will need later in life, and genes cannot pre-program every skill through heredity; that would waste too much of the brain's storage space. The best approach is to preset no skills at all and provide only the ability to learn any skill, and this is the ingenuity of the human brain. What genes do is preset some basic responses and sensations to stimuli, for example feeling hungry at the sight of tasty food. These basic responses involve relatively few neurons. Higher-level skills, such as playing music, require the coordination of many neurons and must therefore be learned after birth.

Any human skill is a combination of electrical impulses transmitted along a series of nerve fibers in the brain. Anatomy shows that people with different skills have very different neural structures; for example, the direction-finding region in a taxi driver's brain is especially well developed. In other words, unlike a computer, a human masters skills at the brain's hardware level.

Recently one school of scientists has argued that myelin is the key to skill training. Like the rubber insulation around a wire, it wraps nerve fibers and, by preventing the electrical impulses from leaking out, makes signals stronger, faster and more accurate. Whatever we practice, we are really training the myelin in our brain, as if tidying a tangle of loose wires into neat cables. Myelin's role was only discovered after 2000, when new technology allowed scientists to observe myelin directly in living brains, and it was not described in an academic journal until 2006. Scientists regard myelin as the brain's highway system: it raises signal transmission speed, can cut delay times by a factor of 30 (for an overall speedup of 3,000 times), and can even regulate the speed, slowing signals down when needed.

Large amounts of "free" myelin are distributed throughout the brain, monitoring the firing and combination of signals in the nerve fibers. The more a nerve fiber is used, the more myelin moves in to wrap that stretch of the circuit, making signal transmission along it faster and forming a highway. This is why practice is so important.

The myelin theory can explain many things. Why do children often make mistakes? Their nervous systems are all in place and they know right from wrong; they simply need time to build up the myelin highway network. Why are habits so hard to change once formed? Because a "habit" is actually "grown" into the brain in the form of bundled nerve-fiber cables; once myelin wraps a nerve it does not spontaneously unwrap, so the only way to change a habit is to form a new one. Why do young people learn faster? Because although myelin grows throughout life, it grows fastest in the young. The most radical theory holds that the most significant difference between humans and monkeys lies not in the number of neurons but in humans having 20% more myelin! Anatomy shows that Einstein's brain had an average number of neurons, but he had more of the cells that produce myelin.

parallel computing and memory networks

when implementing an algorithm, we often face a decision: to match the fork-join flow shape of our algorithm, we can have either one single long thread or many short parallel threads that perform the parallelizable computation on the available resources, with varying amounts of inter-thread communication.

obviously this can be modelled as an optimization problem; depending on the nature of our algorithm (or application) and the problem size, we will arrive at different executable code that minimizes the execution time while satisfying certain constraints.

it sounds as if we almost have the solution, but if we think about it carefully, we still face a big challenge: memory bandwidth. although most of our computation resources (CPUs, GPUs, accelerators) have their own memory systems for caching data, we still face the challenge of delivering data and instructions to those devices in time.

for example, suppose we have 128 threads running on 32 cores. when the threads are switched, they are likely to cause cache misses that require main memory accesses. if one core does this it should be fine, but if all 32 cores access the same memory at once, we get network congestion and therefore reduced parallel performance.
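a quick back-of-envelope model makes the congestion concrete (all numbers below are illustrative assumptions, not measurements):

--------------------

# rough aggregate memory-bandwidth demand when all cores miss together
cores = 32
miss_rate = 0.02        # cache misses per instruction (assumed)
ips_per_core = 1e9      # instructions per second per core (assumed)
line_size = 64          # bytes fetched per miss (assumed)

demand_gb_s = cores * miss_rate * ips_per_core * line_size / 1e9
print(demand_gb_s, "GB/s")   # ~41 GB/s, more than a typical DDR3 channel (~10-17 GB/s)

--------------------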

if we think about how the neurons in our brain communicate, we see a very different architecture. first, we have a dynamic physical communication network, whose connections evolve through some degree of competition and cooperation; one example is that the ion channels at the synapses vary according to how they are used.

but the real difference is probably how memory is structured in our brain. a good example is performing a calculation on an abacus versus in our head: surely we can do it much faster the abacus way than mentally (unless there is some quick algorithm for large problems). the real point is that we do not have much memory (RAM-like memory, perhaps) for tedious calculation, whereas our brain is much more capable at visual information analysis and muscle-control signal generation, while at the same time, very deep in our brain, there is a lookup table for conditional branches, and I guess that may just be a configuration of our neural connections.

so where is the memory, you may ask? well, I think most of our memory is just a neural network pattern, something ROM-like (read-only and slow to write), except that reading its information means using it, which is more like an FPGA LUT network.

so from an architectural point of view, the neural network in our brain would not be very good at executing repetitive simple instructions, since we do not have the right memory structure (RAM) for them. yet we seem to do much better at vision tasks than the computer, which has very few computation units and a very large amount of RAM; what could be the issue here? again, the real answer should be that a computer can do certain specific vision tasks better than the human brain, but in the general case (large data sets) the human brain will outperform the computer. one explanation could be that the algorithms in our brain have been optimized by long-term evolution, whereas the computer just executes what we think might be happening in our brain in terms of numerical calculation.

but how does this relate to the memory architecture problem? we can see the trend of adding more computing resources onto a single chip, but should we try to move towards a brain-like structure, dynamically routing the connections of different resources and having millions of them? that might work if we gave up the digital format for computation, but we would lose the machine-like robustness. do we really want that? I think it would defeat the purpose of building machines: we want a high degree of certainty about what we do at each step. this is behaviour complementary to humans, and that is why we need machines to be like that.

so if we have decided to go the machine way, which problem do we need to solve: the network, the memory hierarchy, or load balancing and scheduling on the computation resources? I think all of these issues can be addressed by a good communication protocol. with a good protocol, we can reduce global communication and main-memory traffic, make good use of the memory hierarchy, and automatically solve the resource-sharing problem. this is more or less how humans communicate with each other: we have good protocols that allow us to communicate in small groups, in a large lecture theatre, and one to one.

so what’s the secrete of this protocol then, although I am not fully confident with my answer, but I think it’s has a strong link with model predictive control or MPC for short, because in order to optimize our behaviour, we much know as much information of our communication objects as  possible and build a model of it, then a dynamic optimization process goes on and we find the current best action for a goal of better resource utilization. obviously this is not a simple scenario when many node in the network is doing MPC, but with more in depth  research, I hope we can have more robust framework for this future intelligent communication protocol.                   

resources

http://www.brains-minds-media.org/current

Beautiful Carbon

carbon can make graphene and diamonds.

 

it’s the future of electronics!

why?

  • it makes light

Raman Laser and diamond

http://web.science.mq.edu.au/groups/mqphotonics/research/group_raman/

http://en.wikipedia.org/wiki/C._V._Raman

  • it transmits light

diamond fibers 

http://en.wikipedia.org/wiki/Refractive_index#Nonlinearity

  • it conducts electrons superbly

wiki graphene

  • it’s printable

circuits, antennas, RFIDs, etc.

  • it’s lithograph-able

diamond wafer, CVD diamond doping.

  • it makes tubes

CNT MEMS, Sensors

and so on…

btw, we are mostly made of carbon too, how beautiful!

Embedded DSP Processor design Notes

DSP has turned out to be one of the most important technologies in the development of communications systems and other electronics. After Marconi and Armstrong’s invention of basic radio communication and transceivers, users were soon unsatisfied with the poor communication quality. The noise was high and the bandwidth utility was inefficient. Parameters changed due to variations in temperature, and the signals could not be reproduced exactly. In order to improve the quality, analogue radio communication had to give way to digital radio communication.

Coding for communications can be further divided into reliable coding (for error detection and error correction) and efficient coding (compression).

Without special hardware acceleration, the performance of Huffman coding and decoding is usually much lower than one bit per instruction. With special hardware acceleration, two bits per instruction can be achieved. Huffman coding and decoding are typical applications for an FSM (Finite State Machine).
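to illustrate the FSM view, here is a minimal sketch of a table-driven Huffman decoder (the three-symbol code is made up for the example, not taken from the book): each state is a partial code prefix, each input bit is a transition, and completing a codeword emits a symbol and returns to the root.

--------------------

def build_fsm(codes):
    inv = {bits: sym for sym, bits in codes.items()}
    # states are the proper prefixes of the codewords; '' is the root
    states = {''} | {b[:i] for b in inv for i in range(1, len(b))}
    trans = {}
    for s in states:
        for bit in '01':
            nxt = s + bit
            if nxt in inv:
                trans[(s, bit)] = ('', inv[nxt])   # emit a symbol, back to root
            else:
                trans[(s, bit)] = (nxt, None)      # move to a deeper prefix state
    return trans

def decode(bits, trans):
    out, state = [], ''
    for bit in bits:
        state, sym = trans[(state, bit)]
        if sym is not None:
            out.append(sym)
    return ''.join(out)

trans = build_fsm({'a': '0', 'b': '10', 'c': '11'})
print(decode('01011', trans))   # -> 'abc'

--------------------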

Voice (speech) compression has been thoroughly investigated, and advanced voice compression today is based on voice synthesis techniques.

---------------------------------

it’s quite a good book in DSP processor design, and here is some of my thoughts.

Most current DSP algorithms are predictable, so streaming computing is currently sufficient. However, more intelligent DSP applications are emerging, such as language recognition. Searching and sorting algorithms are used by intelligent signal processing. We therefore plan to analyze the behaviour of the most-used sequential algorithms in intelligent DSP and implement them in an ASIP.

Systolic array

  • MIMD-like engine
  • reduces Load/Store instructions by evolving data on the array (see the sketch after this list)
  • controlled by MIMD-like instructions with loop-able syntax
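as a minimal software model of the load/store-saving idea, here is a 1-D systolic FIR filter: each sample is loaded once and then shifted from processing element (PE) to PE, so it is never re-fetched for every tap. this is an illustrative sketch, not a model of any particular systolic machine.

--------------------

import numpy as np

def systolic_fir(x, w):
    # one data register per PE; data shifts through the array each cycle
    regs = [0.0] * len(w)
    out = []
    for sample in list(x) + [0.0] * (len(w) - 1):   # extra zeros flush the pipeline
        regs = [sample] + regs[:-1]
        out.append(sum(r * c for r, c in zip(regs, w)))   # each PE applies its fixed weight
    return np.array(out)

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.5, 0.25])
print(systolic_fir(x, w))       # same result as np.convolve(x, w)
print(np.convolve(x, w))

--------------------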

Analogue computation

when the computation result need not be reproducible to an exact value, only to within a range, analogue computation is a good candidate for power, performance and area improvements.

  • pros

      fast, low power, smaller size, addition is almost free, just merge wires

  • cons

      noisy, not good for multiplication or differentiation, needs extra ADC/DAC interfaces.

 

read more at

http://hrl.harvard.edu/analog/

http://hrl.harvard.edu/

 

sometimes adding electrons together is more elegant than using them as fuel for the ALUs… but only sometimes!

 

------------------

some old thoughts about systolic arrays; FYI, I merged them into the DSP area.

systolic arrays were a hot research topic in the 80's, but just like all the other parallel computing research, they were eclipsed by the exponential performance growth of sequential machines offered by Moore's law.

but things will not always keep getting faster; sequential machines have certainly hit a wall in terms of both power and speed.

so this makes us start rethinking parallelism, especially how software behaves and how hardware implementations can match that behaviour.

a few concepts are very helpful:

  • computer architecture design is an optimization process that finds the best solution to match software and hardware behaviour given the constraints of communication and computation cost.
  • as transistor sizes shrink and clock frequencies increase, the cost of communication grows larger than the cost of computation.
  • many techniques, such as caches, parallel memory banks, data pre-allocation, traffic data compression, and even dynamic traffic optimization, have been used to balance the imbalance between communication and computation.
  • if software = algorithm + data structure, then both are adaptable to the hardware environment; but if that adaptation reduces the performance of the software too much, the hardware will also adapt to the software to better match its behaviour, as with GPUs and OpenGL.
  • model-based design is very useful for capturing software behaviours.
  • systolic and cellular arrays provide MIMD-like behaviour with looping capabilities.

how the genome builds the brain

Then there's the mystery of the developing brain. How does something so complex manage to build itself? The Allen Institute is also measuring genetic expression in the mouse brain, from embryo to adult, to explore how the orchestra of genes is switched on and off in different areas during development. Which snippets of DNA transform the hippocampus into a center of long-term memory? Which make the amygdala a warehouse of fear and anxiety? "One of the things I've come to appreciate about the brain is the importance of location," Allen says. "It's not just a set of interchangeable parts that you can swap in and out. These brain areas are all so distinct, and for reasons we can finally begin to understand."

One unexpected—even disheartening—aspect of the Allen Institute's effort is that although its scientists have barely begun their work, early data sets have already demonstrated that the flesh in our head is far more complicated than anyone previously imagined.

The brain might look homogenous to the naked eye, but it's actually filled with an array of cell types, each of which expresses a distinct set of genes depending on its precise location. Consider the neocortex, the so-called CPU of the brain: Scientists assumed for decades that most cortical circuits were essentially the same—the brain was supposed to rely on a standard set of microchips, like a typical supercomputer. But the atlas has revealed a startling genetic diversity; different slabs of cortex are defined by entirely different sets of genes. The supercomputer analogy needs to be permanently retired.

Read More http://www.wired.com/medtech/health/magazine/17-04/ff_brainatlas?currentPage=5#ixzz10RZv3R5E

 

also, if you are interested in the memory system of the brain, for example the difference between remembering and recording, the hierarchy of information caching and information abstraction, and spatial and temporal pattern exploration and learning,

read more at http://www.numenta.com/ , they have been doing a lot of interesting research in these areas.

compiler optimization and algorithm synthesis

this is a big topic; the synthesis part especially is a very hard problem. a compiler is really a translation tool with some optimization based on the context information on the software and hardware sides.

and algorithm synthesis is like a pattern generation process where the correctness of the result is critical.

here I will just list some ideas that have recently been going around in my head.

both pattern translation and pattern generation require a way to learn the patterns (whether from an existing code base or from the human coding process). here, patterns in both the time domain and the graph domain are very important.

once we have those patterns, we then do the following:

analysis -> model extraction -> optimization (convex, genetic algorithms, annealing) -> validation

the patterns can be static hierarchies or data/control-flow patterns; in other words, relational or communication patterns.

    

communication patterns

explicit and implicit communication require a distributed memory system with a shared memory space. this means a special high-speed cache-to-cache network is needed for resolving conflicts.

cache system design

a cache is a transaction-filtering system with a locally self-managed memory. internally, a cache has a data structure (e.g. 4-way set-associative) and a read/write algorithm (hit/miss test, replacement policy, and external transaction scheduling such as write-back).
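to make that split concrete, here is a minimal sketch of a set-associative cache model with a hit/miss test and an LRU replacement policy (the geometry and the policy are illustrative assumptions):

--------------------

from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, n_sets=64, ways=4, line=64):
        self.n_sets, self.ways, self.line = n_sets, ways, line
        self.sets = [OrderedDict() for _ in range(n_sets)]   # per-set tags in LRU order

    def access(self, addr):
        line_addr = addr // self.line
        index, tag = line_addr % self.n_sets, line_addr // self.n_sets
        s = self.sets[index]
        if tag in s:                  # hit/miss test
            s.move_to_end(tag)        # refresh the LRU position
            return 'hit'
        if len(s) >= self.ways:       # set full: apply the replacement policy
            s.popitem(last=False)     # evict the least recently used line
        s[tag] = None                 # fill the line (write-back is not modelled)
        return 'miss'

cache = SetAssociativeCache()
print([cache.access(a) for a in (0, 64, 0, 4096, 0)])
# -> ['miss', 'miss', 'hit', 'miss', 'hit']

--------------------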

the aim of the cache system is to learn the transaction patterns of its clients and avoid transaction collisions and cache misses.

statistical measurement and learning may be needed for better replacement algorithms; things like the victim cache are a good example.

in some cases, a double-miss test with a random replacement policy may beat a single-miss test with a least-recently-used policy.

the cache service protocol is also an important part of the design, for example optional caching of memory accesses (non-cacheable addresses, always-miss regions, or a special delivery channel for non-cacheable memory space).

transaction model

the transaction model should be aware of the memory system and its protocol for better performance, for example a cache-geometry-aware malloc algorithm.
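for example, a cache-geometry-aware allocator might pad the row stride of a matrix so that successive rows do not all map to the same cache sets. a sketch under an assumed geometry (the constants are illustrative):

--------------------

import numpy as np

LINE = 64      # cache line size in bytes (assumed)
N_SETS = 64    # number of cache sets (assumed); the set-mapping period is LINE * N_SETS

def alloc_padded(rows, cols, dtype=np.float32):
    itemsize = np.dtype(dtype).itemsize
    stride = cols * itemsize
    if stride % (LINE * N_SETS) == 0:
        stride += LINE               # pad one line so rows spread across the sets
    buf = np.zeros((rows, stride // itemsize), dtype=dtype)
    return buf[:, :cols]             # logical shape on top of padded physical rows

a = alloc_padded(8, 1024)            # 4096-byte rows would all map to the same sets
print(a.shape, a.strides)            # (8, 1024), row stride 4160 bytes

--------------------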

 

hardware/software partition

partition by lifetime, context-switching overhead and communication explicitness.

hardware function cache / FPGA bitstream cached in the fabric.

the cache system supports both high-speed local transactions and global shared-memory-space communication, plus a local scratch memory that clients manage themselves.

clients can be CPU cores or FPGA fabric.

python coding style and performance

Method of measuring run time

------------------------

import time

start = time.clock()
# ... code to be timed goes here ...
elapsed = time.clock() - start
print(elapsed)

 

-------------------------

results

#----------------fibonacci-----------------
# run time 1.5e-05
#a, b = 0, 1
#v=[]
#while b < 4000000:
#    a, b = b, a+b
#    v.append(a)

# run time 4.2e-05
#v = [1,0];
#while v[0] < 1000000:
#    v=sum([v], [v[0]+v[1]])

# run time 3e-05
#v = [0,1];
#while v[-1] < 1000000:
#    v=v+ [v[-1]+v[-2]]

# run time 2.3e-05
#v=[0,1];
#while v[-1] < 1000000:
#    v[-1:-1],v[-1]=[v[-1]],v[-1]+v[-2]

# run time 1.5e-05
#v=[0,1];
#while v[-1] < 1000000:
#    v.append(v[-1]+v[-2])

the 5th is the best style: it ties for the fastest run time and is the most elegant.
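as a side note, a single time.clock() pair is quite noisy at these microsecond scales; the standard-library timeit module repeats the snippet for you and is more robust. a minimal sketch timing the 5th style:

--------------------

import timeit

def fib5():
    v = [0, 1]
    while v[-1] < 1000000:
        v.append(v[-1] + v[-2])
    return v

# best of 5 runs of 1000 executions each, reported per execution
print(min(timeit.repeat(fib5, number=1000, repeat=5)) / 1000)

--------------------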