Thursday, December 2, 2010

interleaving two sine waves

While doing some experiments on genome design in genetic algorithms, I had an idea: test the effect of the cross-over operator on FFTs.

Suppose you have two genomes of sine waves (all code below is written in Python, using Spyder 2):

x1=array(range(2**10))*2*pi/2**6  (green curve below)

x2=array(range(2**10))*2*pi/2**5  (red curve below )

 

you can do a cross-over operation between them to make a new genome.

x = x1.copy()          # allocate the child genome first
x[:-1:2], x[1:-1:2] = x1[:-1:2], x2[1:-1:2]  (blue curve below)

Feed this new genome to your FFT algorithm and you get the following:

image

Instead of the two spikes you would get by simply adding the two waves, the two peaks are now also mirrored into the other end of the FFT spectrum (the interleaving effectively samples each wave at half the rate, so the extra peaks are aliases).

Quite interesting. I am wondering what effect more complex genome operations would have on the FFT spectrum.

in case you want the plotting code to try it yourself, here it is.

--------------------

from numpy import array, pi, sin
from numpy.fft import fft
import matplotlib.pyplot as plt

plt.subplot(211)
plt.plot(sin(x)); plt.plot(sin(x1)); plt.plot(sin(x2))
plt.subplot(212)
plt.plot(abs(fft(sin(x))))
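For contrast, here is a self-contained sketch (assuming NumPy and Matplotlib, with the same phase vectors as above) that plots the spectrum of the plain sum next to the spectrum of the interleaved genome, so the mirrored peaks are easy to see:

--------------------

import numpy as np
import matplotlib.pyplot as plt

n = 2**10
x1 = np.arange(n) * 2 * np.pi / 2**6           # green curve
x2 = np.arange(n) * 2 * np.pi / 2**5           # red curve

x = x1.copy()                                  # cross-over: even samples from x1, odd from x2
x[:-1:2], x[1:-1:2] = x1[:-1:2], x2[1:-1:2]

plt.subplot(211)
plt.plot(np.abs(np.fft.fft(np.sin(x1) + np.sin(x2))))   # plain sum: two spikes
plt.subplot(212)
plt.plot(np.abs(np.fft.fft(np.sin(x))))                  # interleaved: mirrored spikes appear
plt.show()

--------------------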

Matlab and ARM

 


MatLab products, long dominant in simulations, were extending their usefulness into the evolving electronic design automation (EDA) market. A number of traditional EDA vendors were creating links between their products and MatLab, allowing engineers to use the same language for simulation and implementation.

pico projector technology review

“Interesting Times”
By Mark Harward, Syndiant CEO
September 9, 2010

The ancient Chinese curse “May you live in interesting times,” is certainly applicable to the companies working in the fast emerging pico projector business. One can look at virtually any emerging technology market to understand the market chaos which occurs while the market sorts out winners and losers. Unfortunately, not every technology can be a winner. The early personal computer market was littered with now forgotten “winners” like the TRS-80 and the Commodore 64. But it was products like the Apple II and later the IBM PC that emerged from the pack. The pico projector market is about to undergo a similar transition, from a low volume niche market to one that will excite consumers and explode into a mass-market product category.

Currently, a technology called Color Filter (CF) LCOS enjoys the highest sales volumes, largely because it enables a low-cost solution, albeit with low resolution and very poor color quality. Texas Instrument’s DLPTM technology holds the #2 position today. Pico industry analysts now have enough insight into the market to forecast that Field Sequential Color (FSC) LCOS will move from 3rd to 1st. DisplaySearch: “For the long-term forecast, we think LCOS field-sequential will be number one, DLP will be number two, MEMS scanning will be number three, and LCOS with color filters will drop to be the lowest.”

I generally agree with this prediction and our team has enthusiastically taken on the challenge to become #1 with our leading family of VueG8TM FSC LCOS microdisplays. There are good reasons for the analyst’s prediction that FSC LCOS will take the lead:

  • CF LCOS was earliest in the market and requires a simpler optical design. But color filter LCOS requires three (3) color filter (red, green, blue) sub-pixels, which inherently makes it bigger than FSC LCOS. Already the colors tend to bleed together, which causes poor color saturation with CF LCOS. But perhaps more importantly, requiring three sub-pixels fundamentally limits the ability to scale this technology, particularly as resolution is expected to increase. We expect CF LCOS to stay strong in applications such as toys where low resolution and poor color images are acceptable.
  • DLP® has found its way into a number of low volume “test the market” products including some cell phones. DLP was a “safe” choice for early market products as it was an established technology from a large company. But as the analysts realize and many customers have indicated to us, DLP is fundamentally more expensive than LCOS and this problem grows worse as resolution increases. Syndiant is already making LCOS pixels that are half the size of the smallest DLP mirror, which means we can fit 2X the pixels in the same space. DLP is also inherently much higher in power-per-pixel, which is a big drawback for mobile applications. We expect DLP to be strongest in a niche market of A/C powered (non-mobile) applications whose main requirement is high brightness.
  • We do disagree with DisplaySearch on MEMS scanning as we find it difficult to believe that MEMS laser beam scanning will meet the cost, size, resolution, power consumption, brightness and image quality (particularly speckle-free) that will get it out of being a high cost, very low volume, niche product. In our 6 way comparative technology demonstration at SID 2010, we demonstrated that laser scanning is presently uncompetitive on almost all of the critical market requirements. It will be tough to chase the rapidly lowering LED cost structure and output efficiency improvements which are driven by the huge volumes of the general purpose LED lighting market. And when lasers eventually become cost effective, we believe that FSC LCOS using lasers will be able to provide focus-free operation with higher resolution, lower cost, less speckle, and lower power.

Syndiant’s product roadmap positions us very well to play in the largest market segments:

  • Mobile phone: This is the single largest market (~1.4B unit/year). The embedded module cost needs to be < $30 and move to < $20.
  • Digital still camera and Digital Video embedded projectors (>100M units/year)
  • Multi-media stand-alone projectors - think iPodTM with a camera, projector and wireless internet access
  • Accessory projectors that are mobile (think wireless video transmission from your laptop, phone etc. – cables will eventually be eliminated)
  • High Definition projectors for Gaming – Think of a 720P projector that can run on batteries or be plugged in for a high brightness mode. This is an attractive market with gamers spending about USD $20B per year.
  • 3D – Now imagine a low cost gaming projector that can run in either 2D or 3D mode that your teenager can use in their room. Beyond cool. Of course, your kid will need to hide it or it will be swiped to watch 3D movies in the family room.

Our Syndiant team is working very hard so that our customers can deliver truly amazing products to you that use Syndiant’s VueG8TM high resolution technology, which far surpasses what has been available until now. The next few years will certainly be “interesting times” as the market sorts out winners and losers. One prediction we can make with confidence is that consumers will be big winners.

I’d like to hear from you. Perhaps you have ideas for cool applications for pico projectors or just want to tell me what you think. Emails to info@syndiant.com will reach me.

Cheers,
Mark Harward
CEO, Syndiant, Inc.

 

==========================================

Personally, I would like to see whether it is possible to combine the MEMS scanning mirror and LCOS together.

How the human brain learns [repost]

Modern neuroscience and cognitive science hold that almost no skill is innate. Even recognizing simple objects or grasping things are abilities an infant learns after birth. A person cannot possibly foresee at birth which skills they will need, and genes cannot pre-program every skill by heredity; that would waste too much of the brain's storage. The best approach is to preset no skills at all and provide only the capacity to learn any skill, and that is the ingenuity of the human brain. What genes do preset are some basic responses and sensations to stimuli, for example feeling hungry at the sight of food; these basic responses recruit relatively few neurons. Higher-level skills, such as playing music, require the coordination of many neurons and must be learned.

Every human skill is a combination of electrical pulses travelling along a series of nerve fibres in the brain. Anatomy shows that people with different skills have very different neural structures; for instance, the brain region taxi drivers use for navigation is unusually well developed. In other words, unlike in a computer, human skill is implemented at the hardware level of the brain.

Recently one school of scientists has argued that myelin is the key to skill training. Like the rubber insulation around a wire, it wraps the nerve fibres and, by preventing the electrical pulses from leaking, makes the signals stronger, faster and more accurate. Whatever we practise, we are really training the myelin in our brain, as if a tangle of loose wires were being bundled into tidy cables. Only after new technology allowed scientists to observe myelin in living brains around 2000 was its role discovered, and it was first described in an academic journal only in 2006. Scientists regard myelin as the brain's highway: it raises transmission speed, can cut delay by a factor of 30 for an overall speed-up of about 3000x, and can even regulate the speed, slowing signals down when needed.

A large amount of "free" myelin is distributed throughout the brain. It monitors which nerve fibres fire and in which combinations; the more a fibre is used, the more myelin wraps that stretch of the circuit, making its signals travel faster and turning it into a highway. This is why practice matters so much.

The myelin theory explains many things. Why do children keep making mistakes? Their nervous system is all there and they know right from wrong; they simply need time to build up the myelin highway network. Why are habits so hard to change once formed? Because a "habit" is literally grown into the brain as a bundle of myelinated fibres, and once myelin wraps a nerve it does not unwrap by itself; the only way to change a habit is to form a new one. Why do the young learn faster? Myelin grows throughout life, but it grows fastest in the young. The most radical theory holds that the most significant difference between humans and monkeys is not the number of neurons but that humans have about 20% more myelin. Dissection showed that Einstein's brain had an average number of neurons, but more of the cells that produce myelin.

parallel computing and memory networks

When implementing an algorithm we often face a decision: to match the fork-join flow of the algorithm, we can use either one single long thread or many short parallel threads that run the parallelisable parts on the available computation resources, with varying amounts of inter-thread communication.

Obviously this can be modelled as an optimization problem: depending on the nature of the algorithm (or application) and the problem size, we end up with different executable code that minimizes execution time while possibly satisfying certain constraints.

It sounds as if we almost have the solution, but thinking about it carefully we still face a big challenge: memory bandwidth. Although most computation resources (CPUs, GPUs, accelerators) have their own memory systems for caching data, delivering data and instructions to those devices in time remains hard.

For example, suppose 128 threads run on 32 cores. Whenever threads are switched they are likely to cause cache misses and require main-memory accesses; if one core does this it is fine, but if all 32 cores access the same memory we get network congestion and, as a result, reduced parallel performance.

If we think about how neurons in the brain communicate, it is a very different architecture. First, there is a dynamic physical communication network whose connections evolve through some degree of competition and cooperation; one example is that the ion gates at synapses vary according to how each synapse is used.

But the real difference is probably how memory is structured in the brain. A good example is performing a calculation on an abacus versus in our head: we can do it far faster on the abacus (unless there is some shortcut algorithm for large problems). The real point is that we have very little memory (RAM-like memory, presumably) for tedious calculation, while the brain is far more capable at visual analysis and at generating muscle-control signals, with, somewhere deep inside, something like a look-up table for conditional branches; I guess that may simply be a configuration of our neural connections.

So where is the memory, you may ask? I think most of our memory is simply a neural-network pattern, something ROM-like (read-only and slow to write), except that you read its information by using it, rather like a network of FPGA LUTs.

From an architectural point of view, the neural network in the brain is not good at executing repetitive simple instructions, since it lacks the right memory structure (RAM) for them; yet we do much better at vision tasks than a computer, which has few computation units and a very large amount of RAM. Why? The honest answer is that a computer can beat the human brain at certain specific vision tasks, but on the general case (large, varied data sets) the brain still outperforms it. One explanation is that the algorithms in the brain have been optimised by long-term evolution, whereas the computer merely executes, as numerical calculation, what we think might be happening in the brain.

But how does this relate to the memory-architecture problem? The trend is to put more computing resources on a single chip, but should we move towards a brain-like structure with millions of resources whose connections are routed dynamically? That might work if we gave up the digital format for computation, but we would lose the machine-like robustness, and do we really want that? I think it would defeat the purpose of building machines: we want a high degree of certainty about what happens at each step. Machines are complementary to humans, and that is why we need them to behave that way.

So if we decide to go the machine way, what problems do we need to solve? The network? The memory hierarchy? Load balancing and scheduling across computation resources? I think all of these can be addressed by a good communication protocol. With a good protocol we can reduce global communication and hence main-memory traffic, make good use of the memory hierarchy, and solve the resource-sharing problem almost automatically. This is more or less how humans communicate: we have protocols that let us talk in small groups, in a large lecture theatre, and one to one.

So what is the secret of this protocol? Although I am not fully confident in my answer, I think it has a strong link with model predictive control (MPC): to optimize our behaviour we must gather as much information as possible about the parties we communicate with and build a model of them; a dynamic optimization process then finds the current best action towards better resource utilization. Obviously the scenario is not simple when many nodes in the network run MPC at once, but with more in-depth research I hope we can build a more robust framework for this future intelligent communication protocol.

resources

http://www.brains-minds-media.org/current

Beautiful Carbon

carbon can make graphene and diamonds.

 

it’s the future of electronics!

why?

  • it makes light

Raman Laser and diamond

http://web.science.mq.edu.au/groups/mqphotonics/research/group_raman/

http://en.wikipedia.org/wiki/C._V._Raman

  • it transmits light

diamond fibers 

http://en.wikipedia.org/wiki/Refractive_index#Nonlinearity

  • it super conducts electrons

wiki graphene

  • it’s printable

circuits, antenna, RFIDs, etc.

  • it’s lithograph-able

diamond wafer, CVD diamond doping.

  • it makes tubes

CNT MEMS, Sensors

and so on…

btw, we are mostly made of carbon too, how beautiful!

Embedded DSP Processor design Notes

DSP has turned out to be one of the most important technologies in the development of communications systems and other electronics. After Marconi and Armstrong’s invention of basic radio communication and transceivers, users were soon unsatisfied with the poor communication quality. The noise was high and the bandwidth utility was inefficient. Parameters changed due to variations in temperature, and the signals could not be reproduced exactly. In order to improve the quality, analogue radio communication had to give way to digital radio communication.

Coding for communications can be further divided into reliable coding (for error detection and error correction) and efficient coding (compression).

Without special hardware acceleration, the performance of Huffman coding and decoding is usually much lower than one bit per instruction. With special hardware acceleration, two bits per instruction can be achieved. Huffman coding and decoding are typical applications for a FSM (Finite State Machine).

Voice (speech) compression has been thoroughly investigated, and advanced voice compression today is based on voice synthesis techniques.

---------------------------------

It is quite a good book on DSP processor design; here are some of my thoughts.

Most current DSP algorithms are predictable, so streaming computation is currently sufficient. However, more intelligent DSP applications such as speech recognition are emerging, and intelligent signal processing uses searching and sorting algorithms. We therefore plan to analyze the behaviour of the most-used sequential algorithms in intelligent DSP and implement them in an ASIP.
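Returning to the Huffman/FSM point quoted above, here is a toy Python sketch (the prefix code and the bit string are made-up examples, not from the book): each FSM state is a node of the code tree and every input bit triggers exactly one state transition, which is why throughput without acceleration is on the order of one bit per step.

---------------------------------

# Hypothetical prefix code: a = 0, b = 10, c = 11
# states: 0 = at the root, 1 = after seeing a leading '1'
# transition[state][bit] -> (next_state, symbol emitted or None)
transition = {
    0: {0: (0, 'a'), 1: (1, None)},
    1: {0: (0, 'b'), 1: (0, 'c')},
}

def decode(bits):
    state, out = 0, []
    for bit in bits:                      # one FSM transition per input bit
        state, symbol = transition[state][bit]
        if symbol is not None:
            out.append(symbol)
    return ''.join(out)

print(decode([1, 0, 0, 1, 1]))            # -> 'bac'

---------------------------------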

Systolic array

  • MIMD like engine
  • reduce Load/Store instructions by evolving data on the array
  • controlled by MIMD like instructions with loop-able syntax

Analogue computation

When the computation result needs to be reproducible only within a range rather than to an exact value, analogue computation is a good candidate for power, performance and area improvements.

  • pros

      fast, low power, smaller size, addition is almost free, just merge wires

  • cons

      noisy, not good for multiplication or differentiation, need extra ADC, DAC interface.

 

read more at

http://hrl.harvard.edu/analog/

http://hrl.harvard.edu/

 

Sometimes adding electrons together is more elegant than using them as fuel for ALUs… but only sometimes!

 

------------------

Some old thoughts about systolic arrays; FYI, I have merged them into the DSP area.

Systolic arrays were a hot research topic in the 80s, but like most other parallel-computing research they were suppressed by the exponential performance growth of sequential machines offered by Moore's law.

But things will not keep getting faster: sequential machines have certainly hit a wall in terms of both power and speed.

So we have started rethinking parallelism, especially how software behaves and how hardware implementations can be made to match that behaviour.

A few concepts are very helpful:

  • computer architecture design is an optimization process that finds the best match between software and hardware behaviour given the constraints of communication and computation cost.
  • as transistor sizes shrink and clock frequencies increase, communication cost grows to exceed computation cost.
  • many techniques, such as caches, parallel memory banks, data pre-allocation, traffic compression, or even dynamic traffic optimization, are used to redress the imbalance between communication and computation.
  • if software = algorithm + data structure, then both can adapt to the hardware environment; but if that adaptation reduces software performance too much, hardware in turn adapts to software to better match its behaviour, as with GPUs and OpenGL.
  • model based design is very useful to capture software behaviours.
  • systolic and cellular arrays provide MIMD-like behaviour with looping capabilities.

how the genome builds the brain

Then there's the mystery of the developing brain. How does something so complex manage to build itself? The Allen Institute is also measuring genetic expression in the mouse brain, from embryo to adult, to explore how the orchestra of genes is switched on and off in different areas during development. Which snippets of DNA transform the hippocampus into a center of long-term memory? Which make the amygdala a warehouse of fear and anxiety? "One of the things I've come to appreciate about the brain is the importance of location," Allen says. "It's not just a set of interchangeable parts that you can swap in and out. These brain areas are all so distinct, and for reasons we can finally begin to understand."

One unexpected—even disheartening—aspect of the Allen Institute's effort is that although its scientists have barely begun their work, early data sets have already demonstrated that the flesh in our head is far more complicated than anyone previously imagined.

The brain might look homogenous to the naked eye, but it's actually filled with an array of cell types, each of which expresses a distinct set of genes depending on its precise location. Consider the neocortex, the so-called CPU of the brain: Scientists assumed for decades that most cortical circuits were essentially the same—the brain was supposed to rely on a standard set of microchips, like a typical supercomputer. But the atlas has revealed a startling genetic diversity; different slabs of cortex are defined by entirely different sets of genes. The supercomputer analogy needs to be permanently retired.

Read More http://www.wired.com/medtech/health/magazine/17-04/ff_brainatlas?currentPage=5#ixzz10RZv3R5E

 

Also, if you are interested in the memory system of the brain, for example the difference between remembering and recording, the hierarchy of information caching and abstraction, or spatial and temporal pattern exploration and learning,

read more at http://www.numenta.com/; they have been doing much interesting research in this area.

compiler optimization and algorithm synthesis

This is a big topic, and the synthesis part in particular is a very hard problem. A compiler is really a translation tool with some optimization based on contextual information from both the software and the hardware side.

Algorithm synthesis, by contrast, is like a pattern-generation process in which the correctness of the result is critical.

Here I will just list some ideas I have been thinking about recently.

Both pattern translation and pattern generation require a way to learn the patterns (from an existing code base or from the human coding process). Patterns in both the time domain and the graph domain are important here.

once we have those patterns, we then do the following,

analysis –> model extraction –> optimization ( convex, genetic algorithm, annealing) –> validation

The patterns can be static hierarchies or data/control-flow patterns; in other words, relational or communication patterns.

    

communication patterns

Explicit and implicit communication require a distributed memory system with a shared memory space. This means a special high-speed cache-to-cache network is needed to resolve conflicts.

cache system design

A cache is a transaction-filtering system with locally self-managed memory. Internally, a cache has a data structure (e.g. 4-way set associative) and a read/write algorithm (hit/miss test, replacement policy, and external transaction scheduling such as write-back).

The aim of the cache system is to learn the transaction patterns of its clients and avoid transaction collisions and cache misses.

Statistical measurement and learning may be needed for better replacement algorithms; the victim cache is a good example.

In some cases, a double-miss test plus a random replacement policy may work better than just a single-miss test plus a least-recently-used policy.

The cache service protocol is also an important part of the design, for example optionally cached memory accesses (non-cacheable addresses, always-miss regions, or a special delivery channel for the non-cacheable memory space).

transaction model

The transaction model should be aware of the memory system and the related protocol for better performance, for example a cache-geometry-aware malloc algorithm.

 

hardware/ software partition

Partition by lifetime, context-switching overhead and communication explicitness.

Hardware function cache / FPGA bitstream cached in the fabric.

The cache system supports both high-speed local transactions and communication through the global shared memory space, plus a local scratch memory that clients manage themselves.

Clients can be CPU cores or FPGA fabric.

python coding style and performance

Method of time counting

------------------------

import time

start = time.clock()
# ... code to be timed goes here ...
elapsed = time.clock() - start

print elapsed

 

-------------------------

results

#----------------fibonacci-----------------
# run time 1.5e-05
#a, b = 0, 1
#v=[]
#while b < 4000000:
#    a, b = b, a+b
#    v.append(a)

# run time 4.2e-05
#v = [1,0];
#while v[0] < 1000000:
#    v=sum([v], [v[0]+v[1]])

# run time 3e-05
#v = [0,1];
#while v[-1] < 1000000:
#    v=v+ [v[-1]+v[-2]]

# run time 2.3e-05
#v=[0,1];
#while v[-1] < 1000000:
#    v[-1:-1],v[-1]=[v[-1]],v[-1]+v[-2]

# run time 1.5e-05
#v=[0,1];
#while v[-1] < 1000000:
#    v.append(v[-1]+v[-2])

the 5th is the best style, both fast and elegant.
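For micro-benchmarks this small, the timeit module usually gives steadier numbers than a single time.clock() delta; here is a minimal sketch (the statement string is just the 5th variant above, and the repeat/number values are arbitrary choices):

-------------------------

import timeit

stmt = """
v = [0, 1]
while v[-1] < 1000000:
    v.append(v[-1] + v[-2])
"""

# best of 3 runs of 1000 executions each, reported per execution
print(min(timeit.repeat(stmt, repeat=3, number=1000)) / 1000)

-------------------------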

Tuesday, September 28, 2010

C++ 0x and boost in Visual studio 2010

first download and install the compiled binary from http://www.boostpro.com/download/

so you can skip the compiling the boost by yourself.

open VC++, get a new console application

image

then set the additional include directories to the path for boost

path_setting

paste the following code to the main cpp file

// boost_acc.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <iostream>
#include <vector>
#include <algorithm>
#include <boost/accumulators/accumulators.hpp>
#include <boost/accumulators/statistics/stats.hpp>
#include <boost/accumulators/statistics/mean.hpp>
#include <boost/accumulators/statistics/moment.hpp>

using namespace std;
using namespace boost::accumulators;

void showPtr(void*& ptr){
cout << "Input pointer address is: " << ptr <<endl;
}

int main(){

int x = 1; auto y = x;
void* p=&x;showPtr(p);
p=&y;showPtr(p);
typedef decltype(y) Tv;
// Create a vector with 10 value-initialized (zero) elements.
vector<Tv> v(10);
// Define an accumulator set for calculating the mean and the 2nd moment ...
accumulator_set<Tv, stats<tag::mean, tag::moment<2>>> acc;

// Assign each element in the vector to the sum of the previous two elements.
generate_n(v.begin() + 2, v.size() - 2, [=,&acc]() mutable throw() -> Tv {
// Generate current value.Update previous two values.
auto n = x + y; x = y; y = n;
acc(n);
return n;
});

for_each(v.begin(), v.end(), [](const Tv& n) { cout << &n << " "<<endl; });

// Display the results ...
std::cout << "Mean: " << mean(acc) << std::endl;
std::cout << "Moment: " << moment<2>(acc) << std::endl;

return 0;
}


Press Ctrl+F5 and you will see the code compiled and executed.



read more about c++ 0x at http://en.wikipedia.org/wiki/C%2B%2B0x

Monday, September 20, 2010

Mastering Vim in 24 hours

As the classic editor on Linux, vim really made me struggle when I first tried to learn it, especially when I wanted to type and the screen simply did not respond. I was so disappointed that I swore never to touch it again.

For several years I used notepad++ as my main code editor. A good workman needs good tools, and as code craftsmen, programmers need an editor with strong browsing, searching and editing capabilities. notepad++ has kept innovating in these areas over the years and is loved by many programmers.

But notepad++ is, after all, a Windows tool (I suspect MS paid them off), so on Linux I urgently needed something that felt right in my hands. There are not many choices: either a command-based editor such as emacs or vim, or a tool with a common editor user interface (such as kate under KDE). I realised that if I did not pick one of the first two I would probably be looked down on by my colleagues, so I did a small comparison of vim and emacs. vim's commands are complicated, but at least I would not have to deal with Lisp, whereas emacs would require learning a language few people use, a higher investment. So I picked vim as my learning target.

Hours 0-1

This was really just learning vim's history; Wikipedia is a decent source with fairly complete information. It is of little practical use, but at least you learn roughly how vim works and its design philosophy.

Hours 1-3

On Google I found the article "Efficient Editing With vim" at http://jmcpherson.org/editing.html. After reading it my understanding deepened, and I practised following the ideas in the article. Things started to make sense, but I kept forgetting the commands, which was frustrating; at that level I was still inefficient and could not really work smoothly.

Hours 3-6

I remembered from the wiki that vim can be customized. Since I was already at home in notepad++, why not map the features I wanted onto vim? So began an entertaining round of customization.

First came syntax highlighting; the author of "Efficient Editing With vim" already gives some settings in other posts on his blog, and the rest turned up with a bit of googling.

Next I added Google search, code folding, auto-completion and line numbers; the basics were all sorted, a small cause for celebration. But on reflection notepad++ still had many features I had not ported, such as tabbed browsing, double-click to search and highlight, and one very useful one: Ctrl+S to save the file.

In fact, a quick google showed that gvim, the GUI version of vim, supports tabbed browsing, and there are ready-made key-mapping scripts for double-click search and highlight; copy, save, start gvim, and there it all was.

Hours 6-10

What I had so far was already enough for work, but a few small details still bothered me: insert mode needs 'i', and I prefer the space bar.

Tabs are nice, but every mainstream browser uses Ctrl+T for a new tab and Ctrl+W to close one. I also found that gvim has split views, meaning you can split the window in a tab horizontally or vertically to show different files. notepad++'s "go to another view" is a vertical split, so I used Shift+T to do the same thing, and added Ctrl+Shift+Left/Right to switch tabs quickly, which is reasonably intuitive.

When scrolling in split views you sometimes want synchronized scrolling; notepad++ does not seem to have a shortcut for this, so I mapped Shift+middle-mouse to toggle it, and now scrolling in sync or not is entirely up to me.

Finally, column selection: notepad++ uses Alt+left-mouse, so of course I mapped that in vim too.

Hours 10-12

After that stretch of practice plus customization, vim now feels natural to use. It is as powerful for me as notepad++ was; what remains is to keep learning and master new features. In that respect notepad++ depends on its author, whereas with vim I can DIY whenever I like.

Perhaps that is why the experts all love vim.

Hours 12-24

It has been a tiring day and time for bed. Conquering vim in a day; I should be content.

 

=======================================

.vimrc contents (my system has no Alt key bindings set up, so there are no Alt-based shortcuts; modify the relevant parts to suit your own needs)

The vim wiki has lots of configuration snippets; searching there may be faster than Google: http://vim.wikia.com/wiki/Vim_Tips_Wiki

" set editor colors
:colors torte
:syntax enable

"setting from Jonathan's tutorial
:set nocompatible

:set autoindent
:set smartindent
:set showmatch
:set ruler
:set incsearch
:set tabstop=4
:set shiftwidth=4
":set vb t_vb=

" code folding
:set foldmethod=syntax
:set nofoldenable

" code typing
:set showmatch
:set matchtime=8
" show line number
:set number
:nmap <C-N><C-N> :set invnumber<CR>
:set numberwidth=3
":set cpoptions+=n
:highlight LineNr term=bold cterm=NONE ctermfg=Green ctermbg=NONE gui=NONE guifg=Green guibg=NONE
:set wrap!

" tab navigation
map <C-S-Left>  :tabp<CR>
map <C-S-Right>  :tabn<CR>
map <C-T>  :tabnew<CR>
map <C-W>  :close<CR>
map <S-T>  :vsp<CR>

" page navigation

set scb!
map <S-MiddleMouse>  :set scb!<CR>   

map <S-Left>  F
map <S-Right> f
map <C-Up>    (
map <C-Down>  )
map <A-Left>  0
map <A-Right> $

" files
" Use CTRL-S for saving, also in Insert mode

:imap <c-s> <Esc><c-s>
if has("gui_running")
  " If the current buffer has never been saved, it will have no name,
  " call the file browser to save it, otherwise just save it.
  :map <silent> <C-S> :if expand("%") == ""<CR>:browse confirm w<CR>:else<CR>:confirm w<CR>:endif<CR>
endif

"search
"map <2-LeftMouse> :normal *<CR>, this version jump cursor to next search result
map <2-LeftMouse> :let @/ = expand("<cword>")<CR>:set hlsearch<CR>
:hi Search guifg=black guibg=white
map <RightMouse> `.

"Editing

map <Space> i
map <C-LeftMouse> <C-V>

Friday, May 21, 2010

Pink noise beat analysis -part2

 

windowing effect in the spectrum

peak detection method

  1. method for an AM sine wave
  2. http://www.mathworks.com/products/demos/shipping/dspblks/dspenvdet.html#7
  3. but for a drum beat it fails, because the carrier wave is not a narrow-band signal
  4. need to use the neighborhood method, details here http://billauer.co.il/peakdet.html
  5. noise in the signal can also affect the neighborhood method, so some thresholding may be needed.
  6. peaks can be sharp or flat, so

COD notes

why limited number of registers

  1. a large number of regs may increase the clock cycle time, since the addressing circuit (mux/demux) delay will increase.
  2. only registers communicate with each other through operators (the ALU); they also communicate with main memory via load/store instructions. The register file acts as a small cache, but without hardware hit/miss management.
  3. optimizing register usage is key to both performance and energy efficiency.
  4. the ALU can also take operand data from instructions, for computation with constants.

instruction format

  1. in order to use the bit fields efficiently, the meaning of later bit fields is determined by earlier fields; normally the op field indicates how the remaining fields are segmented.

conditional branches and jumps

  1. to implement the switch and loop structures of an algorithm, in other words to reuse hardware resources in a time-multiplexed fashion, we use jump instructions to reset the program counter.
  2. for procedures and functions, a black-box model for structuring programs, we also need jumps, but the program counter must return to its original address when the procedure/function is done; the interface is implemented by sharing registers.
  3. procedures/functions can be run in a pipeline model (each core running one procedure/function) with shared memory between the cores (a "done" message between cores probably helps efficiency); if there is explicit parallelism, all the procedures/functions can run concurrently and again synchronize by "done" messages. Note that when a core finishes early, instead of waiting for the done message from the previous core it can switch to another thread to keep itself busy.

最新晶体管制造工艺技术前瞻

HJCBUG 《微型计算机》 2010年3月上期

制程的历史与演进

        CPU制程技术发展到今天,其尺寸已经从1971年Intel发布的4004 CPU时的10μm进化到了今天的32nm级别,Intel公司最新推出的新款处理器—代号Westmere的32nm制程处理器(见图1)就是目前顶尖制程工艺的代表。台积电也计划于年内推出其28nm制程工艺,另外一家主要的芯片制造厂商GlobalFoundries公司则计划于年内推出基于SOI的32nm制程工艺和基于体硅的28nm制程工艺。


图1:45nm制程和32nm制程

        不过,从早期的Intel 486时代发展到目前的Westmere,各家厂商制造CPU的制程基本都是基于传统的平面型晶体管结构,熟悉 MOSFET结构的爱好者都知道,所谓的平面型晶体管,指的是MOSFET的漏极、源极、栅极、沟道以及基体结构的横断面位于同一平面上的晶体管结构,如图2所示:


图2

        需要说明的是,即便是传统的平面型晶体管技术,业界也存在两种不同的流派,上图中左侧的称为传统的体硅技术(Bulk SI),而右侧的则是相对较新的绝缘层覆硅(SOI)技术,两者的区别在于后者在硅基体顶部增加了一层埋入式氧化物(BOX)层,而BOX上则覆有一层相对较薄的硅层。Intel是体硅技术的坚定支持者,而IBM/AMD则是SOI技术的绝对守护者。

        尽管历经了数十年风雨的平面型晶体管制造技术发展至今已经相当的成熟,对各家厂商而言也是最经济的制造技术,但随着晶体管关键尺寸的不断缩小,平面型晶体管技术的瓶颈现象越来越严重。那么是在现有的部分耗尽型平面晶体管(为了行文方便,下文如不作特别说明均用传统平面型晶体管表示部分耗尽型平面晶体管)技术上进行新技术研究还是抛弃现在的传统平面型晶体管以求创新呢?下面我们就以这两个方向作分别阐述。

延续现有晶体管架构
应变硅与HKMG——延续传统平面型晶体管的希望

        在过去的几十年中,为了延续传统平面型晶体管制造技术的寿命,弥补关键尺寸缩小给传统平面型晶体管带来的负面效应,以Intel、台积电、AMD(也就是现在的GlobalFoundries)为代表的制造厂商已经开发出了很多能够改善传统平面型晶体管性能的技术,这些技术中,已经投入商用的技术尤以面向改善沟道性能的应变硅技术和改善栅极性能的HKMG(High-K栅氧化物层+金属栅极,此后简称HKMG)技术为代表,自从Intel在90nm制程的Pentium 4处理器上首次启用应变硅技术之后,这两种主要的辅助技术便成了各家厂商开发制程技术的两大热点,各家厂商均先后在自家制程工艺中加入了类似的技术,Intel和AMD包括台积电都在90~32nm的演进过程中采用了应变硅技术和HKMG技术,尽管他们的具体实现手法不同。为了读者能够更好的理解本文,下面我们就对这两种技术进行简单的介绍。

a.应变硅技术

        注意图3中的“启用eSiGe(嵌入式硅锗)材料”,指的便是专门用于改善传统平面型晶体管管沟道性能的应变硅技术中的一种,应变硅技术的实质是改善沟道中空穴/电子流动的速度。


图3

        eSiGe技术主要面向PMOS管,其原理是在PMOS管的漏源区外延生长一层晶格常数(即晶格原子之间的距离)比PMOS沟道中硅材料的晶格常数更大的SiGe层,以此来生成对PMOS管沟道压缩应力的技术,其原理如图3所示。根据研究,当向PMOS管沟道施加纵向(即栅极宽度方向)的压缩应力时,可以大大改善沟道的载流子移动性,提升效率。

        PMOS是指N型衬底、P沟道,靠空穴的流动运送电流的MOS管,全称为P-channel Metal Oxide Semiconductor FET;NMOS是指P型衬底,N沟道,依靠电子的流动来运行电流的MOS管。全称为N-channel Metal Oxide Semiconductor FET。

b.HKMG技术

        HKMG是以High-K绝缘层替代传统的SiO2氧化层,并以金属材料栅极替换旧有的硅材料栅极的一项技术,这项技术主要有助于晶体管开关速度的提升,并可减小栅极的漏电流。我们可以看到,Intel、AMD和台积电都在自己的制程工艺规划中加入了HKMG技术,说明这项技术得到了三巨头的普遍认可。图4是Intel 45nm制程NMOS管的HKMG结构实物图:因为篇幅有限再加上这两个技术点非常复杂,这里就不对应变硅和HKMG进行展开描述了。


图4

        看到这里可能你会问,应变硅和HKMG技术不就可以让传统平面型晶体管一直延续下去了吗?非也,当制程下降到15nm以下后,传统平面型晶体管本身的技术壁垒将成为很难逾越的大山,除非在这段时间内又有新的“奇兵”技术出现。

传统平面型晶体管技术的瓶颈

        尽管应变硅和HKMG技术曾经相当有效,而且在过去的一段时间里也起到了成功延续传统平面型晶体管寿命的重要作用,但以栅极宽度为代表的关键尺寸的不断减小所带来的负面效应已经越来越明显。


图5

        首先,当栅极宽度减小到一定程度后,如图5所示,沟道的宽度(图中的L)也必然随之缩小,此时由于源、漏极区覆盖的耗尽层宽度(图中的XdS和XdD)在整个沟道中所占的比重增大,与沟道耗尽层重合程度也越来越大,这便会导致所谓的短沟道效应(SCE)。

什么是短沟道效应?

        随着MOSFET沟道长度不断缩短,使得MOSFET出现了一系列在长沟道模型中得不到反映的现象,而这些原来可以忽略的效应变得愈发显著,甚至成为影响性能的主导因素,这种现象的统称即为短沟道效应。短沟道效应的坏处多多,首先是容易造成栅极门限电压Vt的上升,使管子的功耗增加;其次是可造成热载流子效应,影响器件寿命,另外还有可能造成管子无法关断,沟道中载流子迁移率下降等问题。

        为了控制短沟道效应,人们不得不向沟道中掺杂磷、硼等杂质元素,这便导致用于控制管子开关的门限电压Vt的上升,同时还会降低沟道中空穴/电子流动的速度,造成管子速度的下降。而且用来向沟道中掺杂杂质的离子注入工艺也存在很难控制的问题,很容易造成管子门限电压过大等不良结果。其次,传统的SiGe PMOS应变硅技术也开始面临瓶颈,以Intel为例,截至目前为止,其应变硅技术在32nm制程节点中已经发展到了第4代。在Intel的第4代应变硅技术中,PMOS管漏源区内的eSiGe层掺杂的Ge元素比例也已经达到了40%的水平。很难再为沟道提供更高级别的应变。第三,栅极氧化物的厚度方面也将出现发展瓶颈问题。仍以Intel为例,其HKMG技术在32nm制程节点中已经发展到了第二代,其第二代HKMG技术中High-K绝缘层的厚度已经被减小到0.9nm的水平。


图6:图中下方Vt线为门限电压,上方两根线为管子饱和电流
和输入电流,制程越发展,工艺控制越困难。

        IBM研发中心的高管Bruce Doris表示,栅极氧化物厚度减薄的速度已经很难再跟上栅极宽度缩小的步伐。而Intel公司负责制程技术的经理Mark Bohr也表示,Intel对现有结构的部分耗尽式平面型晶体管技术能否继续沿用到15nm制程节点感到“非常悲观”。

        最后,其它一些传统平面型晶体管所面临的问题也将越来越难解决。工作电压的不断升高,使芯片的功耗控制变得越来越困难;而且关键尺寸的缩小还会导致漏/源极电阻的不断增大。

突破传统平面型晶体管技术瓶颈的思路和方向

        目前占主流地位的思路是放弃传统的平面型晶体管技术,想办法减小沟道区的厚度,消除沟道中耗尽层底部的中性层,让沟道中的耗尽层能够填满整个沟道区—这便是所谓的全耗尽型(Fully Depleted:FD)晶体管,而传统的平面型晶体管则属于部分耗尽型(Partialiy Depleted:PD)晶体管,两者之间的区别如图7所示:


图7:中性层消失后,沟道厚度降低,进一步抑制短沟道效应,
漏电流大大降低,同时还具有具有载流子迁移率增大,
电流驱动能力提高的优势,这都为进一步降低功耗打下了扎实的基础。

        不过,要制造出全耗尽型晶体管,要求沟道所处的硅层厚度极薄,这样才有可能形成全耗尽式的结构。传统的制造工艺,特别是传统基于体硅的制造工艺很难造出符合要求的结构,即便对新兴的SOI工艺而言,沟道硅层的厚度也很难控制在较薄的水平。另外一种相对较新的思路则是在晶体管的平面型工艺技术不作太大变化的条件下,转而开发全新的晶体管材料。

        不论是以上哪一种方式,都要求芯片制造商转而寻求其它的晶体管结构形式或制造材料,而这也意味着业界未来一段时间内的研发重心将从应变硅和HKMG等技术转向新型晶体管结构和新材料的研制方面。

        Gartner的分析师Dean Freeman为此表示,目前半导体业界所面临的情况与1980年代非常类似,当时业界为了摆脱面临的发展瓶颈,开始逐步采用CMOS技术来制造内存和逻辑芯片,从而开创了半导体业界的新纪元。

放弃传统平面型晶体管技术

        围绕如何实现全耗尽型晶体管和开发新型晶体管材料这两个中心思想,以Intel/IBM为首的CPU制造厂商发展出了三种解决方案,分别是转向立体型晶体管结构,转向全耗尽型ETSOI(FD-ETSOI)技术以及转向III-V族技术,以下我们便为大家一一介绍这三种方案。

解决方案一:转向立体型晶体管结构

a.什么是立体型晶体管

        立体型晶体管结构(有的材料中也称为垂直型晶体管)指的是管子的漏/源极和栅极的横截面并不位于同一平面内的技术,Intel的三门晶体管(Tri-gate)技术,以及IBM/AMD的Finfet技术均属立体型晶体管结构一类。其中Intel的三门晶体管技术尽管名字里面不含Finfet字样,但其实质仍属Finfet结构,只不过由于Intel采用的是三栅极配置的Finfet,而IBM/AMD准备的是使用双栅极配置的Finfet技术,因此为了区别于对手,同时又显示出自家技术的特色,因此便造成了两家立体结构晶体管技术命名上的区别。

        图8、图9是Intel公司三门晶体管结构的原理图,栅极纵剖图以及实物放大图片:


图8


图9

        IBM/AMD公司的FinFET结构则与Intel的三门结构大同小异,只不过栅极数量改为2,而且是基于SOI结构而已,其FinFET结构的纵剖图如图10。


图10:注意栅极数量与Intel三门结构的区别,
以及沟道底部SOI BOX结构与体硅结构中硅基体的区别。

        转向立体型晶体管结构之后,由于沟道区不再包含在体硅或SOI中,而是从这些结构中独立出来,因此可以采取蚀刻等方式制作出厚度极薄的高质量全耗尽型沟道,这样传统平面型晶体管所面临的许多问题均可迎刃而解。不过,从传统平面型与立体型晶体管的构造对比我们便可以看出,立体型晶体管所用的制造工艺与传统的平面型晶体管存在较大的差别,制造工艺的复杂程度也比后者高出许多,因此尽管有关的技术多年前便已经被提出,但要想在短时间内转向立体型晶体管技术难度是非常大的,各家在采用这种新技术之前也总是小心翼翼。接下来我们来了解一下Intel/AMD方面转向立体型结构的计划。

        按Intel的脾气,他们一向对在延续平面型晶体管技术寿命方面较有优势的SOI工艺保持抗拒的态度。不过最近他们的口风不再一贯式的强硬,Intel的制程技术经理Mark Bohr表示:“我们要找的是一种性价比最高的方案,不管是SOI或者其它的什么技术,只要某种技术能够带来额外的性能提升或较低的功耗,那么我们就会采用这些技术。”而Intel前技术经理Scott Thompson预计Intel最终会选择采用三门结构晶体管制程工艺,而其它的厂商则会因为FinFET结构的制程工艺复杂性而对FinFET望而却步。

b.Intel何时转向三门技术

        据Intel表示,在32nm制程的下一代22nm制程产品中,他们仍将继续采用传统基于体硅的平面型晶体管结构(此前曾有传言称 Intel会在22nm制程中转向立体结构的三门晶体管技术),他们计划于2011年年底正式推出22nm制程技术。而在2009年9月,Intel已经展示过一款采用22nm制程技术制造的SRAM芯片,这种芯片的存储密度为364Mb/inch2,内含29亿个晶体管,并且采用了Intel第三代Gate-last HKMG制程技术,栅极绝缘层和金属栅极的主要部分在制造工序的最后几个工步制造成型,避开前面的高温退火工步(45/32nm中使用的前代技术则只有金属栅极才在最后几个工步制造成型)。

        至于15nm制程节点,Intel目前则正在考虑要采用哪些新的制程技术以满足要求,Intel的制程技术经理Mark Bohr表示:“全耗尽技术对降低芯片的功耗非常有效。” Intel目前正在考虑除此之外的多种可行性方案,比如是转向三门晶体管技术或者是转向全耗尽+平面型晶体管技术等等。Intel预计会在今年年中就15nm制程节点将采用哪一种新技术做出最后的决定。

c.IBM/AMD何时转向新技术

        相比之下,IBM阵营方面则与Intel稍有不同,由于采用较为独特的SOI技术,加上最近他们在超薄ETSOI开发方面取得了一些进展,因此在延续平面型晶体管寿命方面具备一些得天独厚的优势。不过,出于行文流畅方面的考虑,我们准备将有关FD-ETSOI的说明放到文章的下一节阐述。这里我们可以先明确的一点是,IBM/AMD公司已经开始考虑要在22nm/15nm制程节点开始使用全耗尽型SOI技术(FD-ETSOI),不过FD-ETSOI的下一步(15nm或更高规格制程),则仍然会转向基于Finfet的立体型晶体管结构。

        据AMD公司的CPU代工生产商GlobalFoundries公司的高管Pellerin表示:“在ETSOI技术发展的下一步很可能会开始启用FinFET立体型晶体管结构,两者的关系就像我们从PD-SOI过渡到FD-E TSOI那样。我看不出来ETSOI和FinFET两种技术之间存在什么矛盾之处,而且采用平面型结构ETSOI技术所能达到的晶体管密度总会出现发展瓶颈,而FinFET则可以解决这种问题。”

2.解决方案二:转向全耗尽型

         ETSOI(FD-ETSOI)技术正如我们上文所介绍的那样,虽然立体型晶体管结构具有很多优点,但其制造工艺的复杂性则会令不少厂商望而却步,如果能继续延续平面型工艺的寿命,那么无论在风险还是成本方面的担忧都会大大减小。这便是IBM公司推出全耗尽型超薄SOI(FDETSOI)的目的所在。IBM 公司2009年12月份曾经展示了一种基于ETSOI(Extremely Thin SOI:超薄SOI)的22nm制程FD-ETSOI工艺,并在IEDM2009会议上展示了22nm FDETSOI晶体管制造流程图(图11):


图11

        IBM专家表示:“我们采用的是不会损害ETSOI层的就地掺杂技术(in-situ doping)。我们首先生成栅极隔离层;然后在漏源区用外延技术沉积生长出漏/源极,形成外延层并在漏/源极的生长过程中同时就地掺杂所需的杂质元素;此后我们会对晶体管进行加热处理,令漏源极中的掺杂原子向沟道方向扩散,形成扩散层(图11中的ext)。而加热处理过程中我们使用的尖峰退火技术(Spike Anneal)则不会对ETSOI层的结构造成不必要的损害。”那么这个如此强大的FD-ETSOI工艺,其晶体管结构是怎样的呢?


图12

        如图12可见,这种22nm FDETSOI工艺的本质是将位于埋入式氧化物(BOX)上方的SOI层的厚度缩小到极低的水平,使用这种技术之后,22nm制程中的SOI层的厚度仅有6.3nm,而传统的SOI层厚度通常在 20nm以上,发展到15nm制程,SOI层的厚度还可以进一步被缩小到5nm左右。极薄的SOI层厚度保证了全耗尽设计的实现。当然,如此薄厚度的SOI层制作起来并非易事,要想将整片晶圆上的SOI层厚度控制在一定的误差水平之内,其制作难度可想而知。据IBM表示,目前由Soitec公司提供,能用于制造ETSOI产品的SOI晶圆数量仍十分有限,不过他们已经可以把这种SOI层的厚度误差控制在±5埃左右。除了对晶圆厂商提出了较高的要求之外,FD-ETSOI技术还存在其它的难点,由于SOI层的厚度极薄,因此很容易受到损坏。而且为了避免对SOI层造成损坏,在制造漏/源极时不能采用传统破坏性较强的离子注入技术,必须采用就地掺杂技术(In-situ Doping)。

        隶属IBM技术同盟的GobalFoundries的技术开发经理John Pellerin表示,FD-SOI技术从应用结构上看与现有的PD-SOI技术非常相近,“我们只需要把SOI层的厚度变薄,并想办法解决ETSOI带来的一些问题即可,其它的部分则和传统的制造工艺基本相同。”当然ETSOI技术仍有许多其他的问题需要解决,比如如何减小器件的寄生电阻等。

        尽管凭借FD-ETSOI技术仍可暂时延长平面型晶体管工艺的寿命,但要真正将这种技术投入实用同样需要解决很多难题,故FD-ETSOI技术仍可算得上是对传统制造技术的一次较大变革。

2.继续ETSOI的辅助技术:SiC应变硅技术

        说到IBM的22nm FD-ETSOI技术,便不能不提其中采用的一种辅助型应变硅技术SiC。与我们前面提到的eSiGe类似,这也是一种应变硅技术,所不同的是eSiGe面向的对象是PMOS管中的沟道,而SiC则面向NMOS管中的沟道。

        与eSiGe能为PMOS管沟道的纵向方向施加应变力的道理相反,由于C原子的体积比Si原子小,因此SiC化合物的晶格常数比Si小,这样当把SiC层嵌入NMOS管的漏源极之后,便可对沟道纵向方向施加拉伸应变力,如图13所示:


图13

        IBM在描述自己的FD-ETSOI工艺时曾经提到,他们会在沉积NMOS管的漏源极时向极内掺杂碳杂质。而且另外一家IBM工艺技术联盟的成员Applied Mater ials公司也分别在IEDM2008和2009年的Semicon会展上两次强调了这种SiC应变硅技术的可行性。那么外界对SiC技术的评价如何呢?

        有趣的是,Intel过去也曾经对SiC技术进行了深入的研究,不过他们现在似乎完全改变了对待SiC应变硅技术的态度,Intel过去曾经表示他们将有望使用SiC应变硅技术,不过最近Intel公司的有关人员在IEDM2009会议中接受采访时则表示不愿意就Intel在SiC应变硅技术方面取得的进展发表任何评论。而会上代表Intel做有关Intel 32nm制程技术演讲的Paul Packan则在演讲后回答记者提问的环节干脆没有理会一位记者提出的有关SiC应变硅技术在32nm制程NMOS结构中应用状况的问题。

        GlobalFoundries公司的Pellerin表示:“我们正在关注SiC应变硅技术,并且正在考虑在我们的22nm制程及更高级别制程中使用这项技术。”在目前的工艺尺寸条件情况下,要想很好地控制漏源区的离子注入过程将是一项非常复杂的任务,而在IBM的FD-ETSOI工艺中,NMOS中使用的SiC应变硅技术则与PMOS中的SiGe应变硅技术一样是采用外延沉积实现的,不必再为如何控制离子注入而担忧。同时这位专家也表示:“如何在NMOS管中应用应变硅技术将是另外一个改善晶体管性能的关键技术。”

3.解决方案三:转向III-V族技术

        III-V族技术是另外一种很有希望的晶体管技术发展方向,这种方案的特点是采用位于元素周期表中III-V族元素组成的材料来替代现有MOSFET管的材料,因此人们便将这种技术形象地称为“III-V族”技术,也有将采用这种技术制作的场效应管称为“QWFET”的。图14是Intel在IDEM2009会展上展示的他们在使用这种技术制造的QWFET场效应管方面取得的新进展,当时他们向这种晶体管结构中引入了High-K栅极氧化物层,这种新的High-K栅氧化物层的加入,大大减小了QWFET的漏电现象。


图14

        从图14中可见,III-V族技术同样也可以在保证传统平面型晶体管制造工艺变化不大的基础上制造出关键尺寸符合发展要求的产品。

        在IEDM2009会议上,来自斯坦福大学的教授Krishna Saraswat还表示,当沟道宽度降至10nm左右时,必须采用新材料来制造沟道。据他估计,业界将首先开发出NMOS管使用III-V族元素构建沟道,PMOS管使用锗元素构建沟道的技术,然后再向PMOS/NMOS统一采用III- V族元素制造沟道的方向发展。转向使用III-V族元素将大大减小器件的工作电压和管子的能耗,可将管子的工作电压减小至仅0.5V。

总结

        回归到大家最关心的新制程技术在Intel/AMD产品的实际应用方面,笔者认为Intel和AMD会继续走自己的老路。其中Intel不太可能会使用ET SOI技术, IBM/GlobalFoundries/AMD则会继续将SOI发扬光大。其理由很简单,因为Intel如果采用三门晶体管技术,便可以绕开SOI。与Intel会尽可能地延长体硅制程寿命的作法截然相反的是,IBM/GlobalFoundries/AMD则会尽快转向FD-ETSOI技术,并尽全力延续FD-ETSOI的寿命。但从技术角度来分析,两者又是在统一中追求变化,都在向全耗尽型立体晶体管转变。当然,新技术的推出有时是无法按常理推测的,到底哪个方向是最正确的我们还不得而知。这就好比现在我们在不停地寻找方法或者更换交通工具,也许哪一天会去改造路面的材料甚至结构,甚至于将来的某一天去调整前进路上的空气阻力,或者突然有人告诉我们:“现在可以不走这条路了,我们发现了另一条路”,也许在陆地上,也许存在于水中、空中!在制程工艺不断发展之路上,“Intel”和“AMD”们也在不停地更换“工具”,以便让自己前进得更快、更舒服一些。

=====================================================================================

Another option is the vertical transistor. Even while the addition of many metal layers has turned the integrated circuit into a truly three-dimensional artifact, the transistor itself is still mostly laid out in a horizontal plane. This forces the device designer to jointly optimize packing density and performance parameters. By rotating the device so that the drain ends up on top, and the source at the bottom, these concerns are separated: packing density still is dominated by horizontal dimensions, while performance issues are mostly determined by vertical spacings (Figure 3.42). Operational devices of this type have been fabricated with channel lengths substantially below 0.1 μm. [Eaglesham00].
Integrated circuits integrating more than one billion transistors, clocked at speeds of tens of GHz, hence seem to be well under way. Whether this will actually happen is an open question. Even though it might be technologically feasible, other parameters have an equal impact on the feasibility of such an undertaking. A first doubt is whether such a part can be manufactured in an economical way. Current semiconductor plants cost over $5 billion, and this price is expected to rise substantially with smaller feature sizes. Design considerations also play a role. Power consumption of such a component might be prohibitive. The growing role of interconnect parasitics might put an upper bound on performance. Finally, system considerations might determine what level of integration is really desirable. All in all, it is obvious that the design of semiconductor circuits still faces an exciting future.


A Trial Poem

 

Young and reckless, unrestraint flickering in the eyes

Spending youth freely, all for this ideal of my own

Free of want, happiness flowing in the veins

Heaven lays down a great charge, and body and mind are weighed down with pain

SDA Graph

image

image

Antisymmetry means that when there is a loop between two nodes a and b (edges in both directions), a must equal b; otherwise it violates the antisymmetry definition.

 

image

image

image

A trail is a sequence of linked edges with no repeated edge: [e1, e2, …, en] in E, where ei != ej for i != j, and each edge starts where the previous one ends (e.g. e1.node2 = e2.node1).

A path is a sequence of linked edges that does not touch itself (no repeated vertices).

A cycle is a path whose two end vertices coincide.

image

image

from an adjacency matrix (g) point of view

complete graph is ~diag(ones(1,n)) or 1-diag(ones(1,n))

complement graph is cg=~g
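A quick numpy check of these two adjacency-matrix identities (n and the example graph g below are arbitrary illustration values):

--------------------

import numpy as np

n = 4
complete = 1 - np.eye(n, dtype=int)            # complete graph: all ones off the diagonal

g = np.array([[0, 1, 0, 0],                    # an arbitrary undirected graph
              [1, 0, 1, 0],
              [0, 1, 0, 0],
              [0, 0, 0, 0]])
cg = (1 - g) - np.eye(n, dtype=int)            # complement: flip the edges, keep the diagonal zero

print(complete)
print(cg)

--------------------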

image

Genetic Algorithm

optimization for fitness function

The fitness function is not necessarily the target optimization problem itself; often it is the output of a system on which fitness is evaluated.

The system can be viewed as a black box whose internal working mechanism is unknown.

For example, given a set of training examples and a neural network, we can view the network as such a system.

The input of the system is the set of weights for each neuron, and the outputs of the neurons are the result of the system.

The fitness function can be the error between the training examples and the network output when a particular set of weights is applied.

But in the neural-network case we can also extract information about the system by varying our input (the weights) to it.
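As a minimal sketch of this black-box view (the tiny one-neuron network, the toy training set and the crude mutation-only evolutionary loop below are all made-up illustrations, not a specific library), the genome is the weight vector and the fitness is just the negative training error returned by the system:

--------------------

import numpy as np

X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # toy training inputs
y = np.array([0., 1., 1., 1.])                            # toy targets (logical OR)

def fitness(w):
    # The black-box system: a one-neuron network; fitness = negative training error.
    out = 1.0 / (1.0 + np.exp(-(np.dot(X, w[:2]) + w[2])))   # sigmoid(w1*x1 + w2*x2 + bias)
    return -np.mean((out - y) ** 2)

np.random.seed(0)
pop = np.random.randn(20, 3)                              # population of weight genomes
for _ in range(100):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]               # keep the fitter half
    children = parents + 0.1 * np.random.randn(*parents.shape)   # mutate
    pop = np.vstack([parents, children])

print(max(fitness(w) for w in pop))                       # best fitness found so far

--------------------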

UWB pulse radio

transmitter

pull-up/pull-down transistors with a delayed gate input to generate the pulse

 

receiver

threshold analogue-to-binary converter + delay lines + delay-line vector sampler.

ACA memory hierarchy

The principle of locality states that programs access a relatively small portion of their address space at any instant of time.

  • Temporal locality (locality in time)
  • spatial locality (locality in space)

Viewed as a thread moving through space (2D) and time, temporal locality is the inertia of the thread, and spatial locality is a small additive disturbance around it.

Because of this property of programs, and the different speed/size trade-offs of memory technologies, we can mix different types of memory into a hierarchy and form a more efficient, non-uniform-access-time memory model for our programs.

image

The concept of a cache is that we speed up certain regions of the memory map; the minimum granularity we accelerate is often called a block (or line). Since the cached blocks are sparsely distributed over the memory map, a block may not be present in the cache, giving a cache miss; the miss rate is the fraction of memory accesses not found in a level of the memory hierarchy.

The hit time is the time required to access a level of the memory hierarchy (including the time to determine hit or miss); it is normally much shorter than access to the lower levels of the hierarchy.

The miss penalty is the time required to fetch a block into a level of the memory hierarchy from the level below, including the time to access the block, transmit it from one level to another, insert it in the level that experienced the miss, and pass it to the requestor. The details of the transfer are hidden from the requestor, which only experiences the extra latency and need not worry about data integrity.

direct mapped cache

a cache structure in which each memory location is mapped to exactly one location in the cache.

image

From the figure we can see that the cache simply partitions the address into two parts, the tag and the index: the index selects a block, and the tag is the information used to decide hit or miss.
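A tiny Python sketch of that address split for a hypothetical direct-mapped cache (the sizes are arbitrary illustration values: 4-byte blocks, 256 blocks):

--------------------

BLOCK_BYTES = 4      # offset field: 2 bits
NUM_BLOCKS = 256     # index field: 8 bits

def lookup(addr, tags, valid):
    # Return True on a hit for a direct-mapped cache.
    block_addr = addr // BLOCK_BYTES          # drop the byte offset
    index = block_addr % NUM_BLOCKS           # which cache block to check
    tag = block_addr // NUM_BLOCKS            # remaining upper address bits
    return valid[index] and tags[index] == tag

# example: an empty cache always misses
tags, valid = [0] * NUM_BLOCKS, [False] * NUM_BLOCKS
print(lookup(0x1234, tags, valid))            # -> False (miss)

--------------------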

This direct-mapped cache lowers memory latency in an evenly distributed manner, but at run time the regions that actually get the lower latency are scattered.

image

The diagram shows the memory partition for a direct-mapped cache: low access time is possible only if the location's tag matches the tag stored in its cache block. Note that only one of the competing regions can be cached in a given block at a time.

about the block size

image

Here granularity and sparsity are the key to cache performance. The block size is the granularity, and a bigger block introduces extra time for block replacement; some improvements can be used (depending on the access patterns of programs, e.g. instruction cache vs. data cache), such as early restart and requested-word-first schemes.

Below is the formula for calculating the cache resource consumption.

image

image

Cache miss handle

image

The use of a stall rather than an interrupt when a miss occurs is to balance the cost of

visual perception with deep learning

ACA 2010 MEng

This examination is partly based on the Intel Nehalem processor architecture, as
described in the article “Nehalem: Intel’s Future Processor and System”, by David
Kanter (Real World Technologies, 04-02-2008), which you should have available to
you in the examination. Where the article is incomplete, you are invited to speculate
using your understanding of the underlying architectural principles.


1a
Which parts of Nehalem’s branch predictor need to be replicated for each thread?

Another improved branch target prediction mechanism in Nehalem is the return stack buffer (RSB). When a function is called, the RSB records the address, so that when the function is returned, it will just pick up where it left off instead of ending up at the wrong address. A RSB can overflow if too many functions are called recursively and it can also get corrupted and produce bad return addresses if the branch predictor speculates down a wrong path. Nehalem actually renames the RSB, which avoids return stack overflows, and ensures that most misspeculation does not corrupt the RSB. There is a dedicated RSB for each thread to avoid any cross-contamination.

return stack buffer
b
Nehalem’s branch prediction may lead to branch execution time ranging from
zero cycles to many. What is the branch execution time when a level-1 BTB hit
occurs?

When we have a BTB hit, the BTB predicts the target PC address; this means we can perform branch prediction in the fetch stage.

If we predicted branches in the decode stage, there would be one cycle of wasted fetch after every branch instruction, because we would not know what to fetch until the branch finishes decode. If we predict branches in the fetch stage, there are no cycles of wasted fetch, so a correctly predicted level-1 BTB hit effectively costs zero cycles.


c
What is the worst case?

A BTB hit but a misprediction: the wrongly fetched instructions must be flushed and execution restarted from the correct target, costing many cycles.


d
What intermediate cases might occur?

BTB miss, but a taken branch.

2a
What happens when Nehalem’s ROB is full?

when ROB is full, we stop issuing instructions until an entry is made free.

The 128 entry ROB is statically partitioned between both threads, which allows each thread to speculate equally far through the instruction stream.


b
What two things happen in the Nehalem microarchitecture when an instruction is
committed?

once an instruction commits, the entry in ROB is reclaimed and the register or memory destination is updated.

if speculation was wrong, ROB is flushed and execution is restarted at the correct successor of the branch.


c
Nehalem’s level-1 data cache has an access latency of four cycles. If this were
increased, for example to six, its capacity could be much larger. What would be
the disadvantages of doing this?

An increased cache latency means a longer execution time for every load/store operation, so we would be trading memory access speed for capacity. With a small number of threads running, competition for the cache is relatively low, so the slower access, rather than the capacity, would become the performance bottleneck.

3
Consider the following code fragment:
float A[N],B[N],C[N];
S1:
for (i=0; i<N; i++) {
S2
if (A[i] > 0) {
S3:
C[i] = A[i] + B[i];
}
}
a
Explain (in outline terms) how this loop could be executed using SSE
instructions.

SSE originally added eight new 128-bit registers known as XMM0 through XMM7.

pseudo code for SSE (processing four floats per iteration):

for i = 0 : N/4 - 1
    A[i*4 : i*4+3] -> xmm0
    B[i*4 : i*4+3] -> xmm1
    if (xmm0 > 0)               ; element-wise condition (see the note below)
        xmm0 = xmm0 + xmm1
    xmm0 -> C[i*4 : i*4+3]
end

-----------------------------

instruction used

movaps, addps
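Just to illustrate the data-parallel pattern, here is the same loop in plain NumPy (this is not actual SSE; in real SSE the per-element condition would typically become a compare and mask, e.g. cmpps followed by a masked add, rather than a scalar branch):

--------------------

import numpy as np

N = 16
A = np.random.randn(N).astype(np.float32)
B = np.random.randn(N).astype(np.float32)
C = np.zeros(N, dtype=np.float32)

mask = A > 0                   # vector compare, one lane per element
C[mask] = A[mask] + B[mask]    # vector add applied only where the mask is set
print(C)

--------------------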


b
It has been proposed that the SSE-like instructions in some future designs might
operate on much longer registers.
(i) What performance advantage might arise, in suitable applications?
(ii) What problem arises with the example above? How might it be minimised,
in suitable applications?

(i) The forthcoming AVX (Advanced Vector Extensions) increases the SIMD vector registers from 128 bits to 256 bits. This doubles the amount of data parallelism available for vector computation, which suits floating-point-intensive calculations in multimedia, scientific and financial applications.

(ii) The per-element conditional execution that compares the vector register becomes more of a problem as the register gets longer, since more lanes are processed (or masked out) per compare. In some cases we can preprocess the data vector so that the conditional execution is eliminated.

loop stream detector

The loop stream detector is located inside the IDQ to improve power consumption
and front end efficiency for loops with a short sequence of instructions.
The instruction decoder supports micro-fusion to improve front end throughput,
increase the effective size of queues in the scheduler and re-order buffer (ROB). The
rules for micro-fusion are similar to those of Intel Core microarchitecture.
The instruction queue also supports macro-fusion to combine adjacent instructions
into one micro-ops where possible. In previous generations of Intel Core microarchitecture,
macro-fusion support for CMP/Jcc sequence is limited to the CF and ZF flag,
and macrofusion is not supported in 64-bit mode.

Written at graduation

Graduation: the day has finally come.

If time flies like an arrow, then the years are the distillation of its passing.

If life is a play, then the years are the insight gained from living it.

The things I want to say, the things I should say, now is the time. Let me write a summary now and look back at myself when I am 32.

On life

Four years ago I arrived at IC naive, carrying ideals and passion, like a piece of red-hot raw iron full of restless energy.

Four years later I leave IC calm, still carrying ideals and passion, like a piece of forged steel with a tempered, resolute character.

These years have been, for me, a priceless treasure of life; from the meaning of life down to how to act and treat people, my thinking has become much clearer.

On ideals

My ideal used to be starting a business. The ideal has not changed, but my understanding of what that means is now worlds apart from before.

I used to understand entrepreneurship as nothing more than achieving fame and success and building a prosperous family estate; whether I would remember where it all came from depended on whether I felt the need.

Now I understand it as realizing an ideal that creates value.

On friendship

About friendship, I used to think that like attracts like: if I worked hard, strove to improve and buried myself in study, people would naturally like me.

Later I found that this view is only half right. Like does attract like, but friendship really comes from a connection of the heart, a bond of affection and loyalty.

Often personal ability is only the beginning of a friendship; real feeling is built on mutual reliance.

So I must learn to communicate more, to understand what is in other people's hearts and to express my own feelings, so that friends can rely on me and trust me.

On love

The heart of a geek

Once upon a time we were all naive children, and curiosity about the world and society gave us our own ideas.

Communication

Words, expressions, actions.

Exchange

Synchronous, asynchronous, random.

Knowing people

Strengths and weaknesses, habits, ideals.

Employing people

The right time, the right place, the right people.

Management

Slack and tension, rules, principles.

Giving back

Selflessness, gratitude, the cycle.

Sunday, March 7, 2010

Jacobi iterations and vector rotation

The Jacobi method [see wiki] for solving linear systems is based on the fact that each iteration multiplies the vector by the iteration matrix, rotating and scaling it, and the process converges if the magnitudes of all eigenvalues are less than 1.

From a practical point of view, the speed of convergence is determined by the spectral radius [see wiki] of the iteration matrix, and the smaller the spectral radius the better.
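A minimal numpy sketch of the Jacobi iteration and its spectral-radius convergence test (the 2x2 system below is an arbitrary diagonally dominant example):

--------------------

import numpy as np

A = np.array([[4.0, 1.0], [2.0, 5.0]])           # an arbitrary diagonally dominant system
b = np.array([1.0, 2.0])

D_inv = np.diag(1.0 / np.diag(A))
M = np.dot(D_inv, np.diag(np.diag(A)) - A)       # Jacobi iteration matrix
c = np.dot(D_inv, b)

print(max(abs(np.linalg.eigvals(M))))            # spectral radius (~0.32 here), < 1 so it converges

x = np.zeros(2)
for _ in range(50):
    x = np.dot(M, x) + c                         # x_{k+1} = D^{-1} (b - (A - D) x_k)
print(x, np.dot(A, x))                           # x now (approximately) solves A x = b

--------------------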

CORDIC computations are also a sequence of rotations, but the rotation matrix is adaptive. The spectral radius of a true rotation matrix is always 1; when rotating using shift-and-add the spectral radius is slightly larger than 1, but as the rotation angle goes to 0 the spectral radius goes to 1, so the resulting vector converges to the set angle/point. For more information about CORDIC please check these papers.

I have also written a Matlab script for visualizing the path along which the vector converges or diverges. The red curve shows the path, and the green/blue lines are the lines defined by the eigenvectors. Note that the matrix is random.

%% eigenvalue rotation divergence/convergence
N=100;                   % number of iterations
r=rand(2,2);             % random 2x2 iteration matrix

[l v]=eig(r);            % l: eigenvectors, v: diagonal matrix of eigenvalues
x=N*ones(N+2,2);         % iterate storage; the starting vector is [N N]
eigenvalue_max=max(max(abs(v)))
if eigenvalue_max<1.1
    for i=1:N+1
        x(i+1,:)=r*x(i,:)';
    end
end

%compass(x(:,1),x(:,2));
plot(x(:,1),x(:,2),'r'); axis equal; hold on
plot([-N:N]'./l(1,1),[-N:N]'./l(2,1),'g');
plot([-N:N]'./l(1,2),[-N:N]'./l(2,2),'b'); hold off;


eigen_rotation1



The path of convergence sometimes oscillates on its way down to 0; this is another interesting fact about eigenvalues. So far I am still not sure of the exact reason, but I have some ideas; maybe I will post again when I have time to prove it.

Saturday, March 6, 2010

Pink noise beat analysis -part1

Pink noise [see wiki] is a very interesting phenomenon whose frequency spectrum follows a 1/f curve.

Many musical instruments' harmonics often show a similar 1/f-like curve as well. An interesting fact: if the spectrum curve is smooth (no leakage [see wiki] or holes), they no longer sound like noise; otherwise they sound noisier.

Below is a sample pink beat I found at http://www.brusi.com/downloads.shtml; a download link is provided below, so you can listen to it first and then read on.

in matlab, the periodogram analysis shows the PSD like this,

pink beat

and the time domain waveform as below

pink beat_time

to compare it with another base generated by Maxim Vedenev

from http://dj-toolbox.9f.com/

drum beat_time

and PSD

drum beat_psd

Personally, I would say the pink beat is richer in dynamics and sounds more attractive, while the drum beat is purer but leakier in the spectrum; but that is not the end of the story.
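If you want to reproduce this kind of PSD plot in Python instead of Matlab, here is a small sketch (SciPy and Matplotlib assumed installed; 'pink_beat.wav' is a placeholder filename for whichever clip you downloaded):

--------------------

from scipy.io import wavfile
from scipy.signal import welch
import matplotlib.pyplot as plt

fs, x = wavfile.read('pink_beat.wav')    # placeholder filename
if x.ndim > 1:
    x = x[:, 0]                          # keep one channel
x = x.astype(float)

f, pxx = welch(x, fs=fs, nperseg=4096)   # Welch-averaged periodogram PSD estimate
plt.semilogy(f, pxx)
plt.xlabel('frequency (Hz)')
plt.ylabel('PSD')
plt.show()

--------------------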

I will post a second blog that explains the windowing effect on the PSD and spectral smoothing. Thanks for reading.

Thursday, March 4, 2010

Virtual Box and Fedora 12 Guest

Installing Guest Additions on Virtualbox enables you to use some very powerful  features such as seamless integration and full screen resolution mode.

for a fresh fedora 12 install

  1. su -
  2. yum -y install dkms gcc kernel-devel kernel-headers
  3. reboot
  4. cd /media/VBOXADDITIONS_3.1.4_57640 (version 3.1.4)
  5. ./VBoxLinuxAdditions-x86.run (for 32bit)
  6. reboot
  7. done!

for more detail

http://digitizor.com/2009/05/26/how-to-install-virtualbox-guest-additions-for-a-linux-guest/

and..

Fedora 12 Installation and Post-Installation Guide

http://www.my-guides.net/en/content/view/174/26/

Wednesday, March 3, 2010

Asynchronous FIFO design

Many people have posted Peter Alfke's paper on the forums:

Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons

but hardly anyone discusses a concrete design, probably to protect their know-how. Here is a schematic as a gift; as for the Verilog, it is part of a company project, so I am not in a position to share it.

 

fifo

read_ptr and write_ptr are nothing special: each is a 4-bit Gray-code counter.

n0(V) is the pointer comparison.

dir1 and dir0 give the quadrant result.

Then the empty/full evaluation is performed.

What remains is the pointer control.
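For reference, here are the binary/Gray conversions behind such a Gray-code pointer counter, as a small Python sketch (just the textbook formulas, not the Verilog from the project):

--------------------

def bin_to_gray(b):
    return b ^ (b >> 1)

def gray_to_bin(g):
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b

# successive 4-bit Gray codes differ in exactly one bit,
# which is what makes asynchronous pointer comparison safe
for i in range(16):
    print(i, format(bin_to_gray(i), '04b'))

--------------------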

enjoy~

Wednesday, February 10, 2010

The beauty of swans

She plays on the lake and soars through the sky, flying over Everest, resting on the Great Wall.

I admire her gentleness and grace, unafraid of high mountains or deep seas.

A lifetime of roaming, a lifetime of staying together.

Love, the lifelong union of partners, finds a poetic sublimation in the swan.

How to live vividly, how to be moved: I think of swans.

Perhaps it is this purity that brings us back to something primal and makes us reflect on ourselves.

The world is not perfect, and life is full of harsh realities.

Facing gain and loss, what should I do?

Sometimes faith really is the only antidote,

hope, something to entrust to the future.

May tomorrow be better.

May love be lovable, and deeply loved.

brain, memory, learning and creativity

understanding the architecture

  • sensors
  • filters
  • processors

 

hierarchy of information

  • vision
  • voice
  • emotions
  • language
  • semantics

The higher the information level, the faster the memory for it, but the smaller its capacity in our brain. Visual memory has huge capacity but is quite slow to erase and update. So we can use our permanent memory to generate visual information and store it into visual memory, which can hold a great deal, with the drawback that it cannot be erased and updated quickly; but don't worry, these are just buffers, so eventually you can reuse them.

A good strategy is to render information (compact and structured) into a visual form and then take a look at your picture. Then you will be able to remember it for a much longer time and access it on demand.

So the key is picture or voice generation. How do we generate them? Any guidelines?

Imagine you have two pictures to look at: the more interesting a picture is, the more likely you are to remember it longer. The same goes for voice. So how exactly do we find things interesting? One factor might be the relation between the new information and our knowledge, that is, how novel it is and how well it fits our knowledge and rules.

This process of producing a picture shows some aspects of information synthesis and creativity, which I will talk about in the later part.

algorithm of learning

The idea of learning is the opposite of synthesis: we extract level by level from the raw information and finally discover the gold. It is much like a filtering process, but with more sophistication and recursion.

So once we have learned something, say a language, we can use it to learn further knowledge, and so on. Inside our brain we have a hierarchical knowledge base that we keep building up, kind of like a graph that models the environment we live in. But it is not firmware, which means we can change parts of it if they do not model things correctly.

 

information synthesis and creativity

Once we have a model of the world in our brain, we can run simulations inside it and generate outputs which, after some refinement or optimization, we can speak out or write down. That is the essence of information synthesis. But how creatively that information is generated involves much more.

Think about generating a joke; that obviously requires much more creativity than small talk.

Jokes are a great example because they follow certain formulas, which makes them good material for studying creative information synthesis. Of course there are many different kinds of jokes; some are so-called stupid jokes and some are called intelligent jokes. Here we look at the intelligent ones: dramatic change yet logically reasonable, dirty yet beautiful, push and pull.

So it seems to be all about good balance, or symmetry, or many other principles such as recursion, reflection, etc., which we perceive as our fundamental knowledge, which in turn is the part of our model that agrees most with nature!

electron and atomic network

A first subtle question for the physicists:

What is the time that an electron needs to jump from one energy state to the next? [a professor's answer]


 


I am an EECS person, but I know how important material structure is for my circuits and systems to perform. I know the carbon network is the future, but in order to harness its power I have to understand how the network, or bonds, form between atoms (carbon, for example).



I started with quantum mechanics. The first thing I learned was E = hv. That part is fine; discrete energy is not a problem for me to accept, but the idea that energy is related to frequency took me an hour to figure out. The real reason is that classical mechanics 'thinks' of energy as the area under the signal in the amplitude-time plane, which is not actually true for EM waves; for EM waves, the energy is the area under the spiral of the EM vector in E-M-T 3D space. The frequency then does contribute to the area and is therefore related to the energy.



The second concept, then, is the discrete frequency/energy of photons. After a bit of thought, I take it to be a consequence of standing waves, but what is the form of the standing wave? Think back to the very first question I asked: when an electron jumps or drops from one energy state to another, how long does it take? If the answer is 0, the frequency would be infinite and there would be no way we could see it, either by eye or by instrument! If the time is 1/f, then the length of the gap would tell us the time. But note: for H, the 3->2 transition gives a 656.3 nm gap, while a transistor in today's semiconductor processes has a gap of only 28 nm!



So how about looking at it from a different perspective: a standing wave on an energy-band ring, with 2*pi*radius = 656.3 nm, which gives a radius of 104.45 nm, still a bit too big. How about 4/3*pi*(radius)^3 = 656.3, which gives 5.39 nm for the 3->2 gap of H? Given that the H-H (base band) bond is 31 ± 5 pm, the 3->2 band should be much bigger in order to generate the 656.3 nm light, which is the lowest-energy Balmer line of hydrogen.
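Just to double-check the arithmetic in that paragraph (only the two radius estimates, not the physical model):

# quick arithmetic check of the two radius estimates above
import math

wavelength_nm = 656.3                                        # H 3->2 (Balmer-alpha) line
r_ring = wavelength_nm / (2 * math.pi)                       # from 2*pi*r = 656.3 nm
r_sphere = (3 * wavelength_nm / (4 * math.pi)) ** (1 / 3)    # from 4/3*pi*r^3 = 656.3
print(r_ring, r_sphere)                                      # about 104.45 and 5.39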



I will make more figures for the explanation, but for now, let's think about the consequences of this electron model for the atom. First, the electron no longer has those rigid energy bands; instead, it is like a spherical standing wave! Imagine a water sphere in space; the surface standing wave is what I am talking about.



What about the double-slit experiment? Think of the electron as a cloud, or a loosely coupled network like water, and you can get the right result.



So what is the electron then, and why can it be used to form bonds between atoms? Well, I like to imagine two positively charged water drops, each surrounded by a layer of oil and another layer of negatively charged water, both floating in space; they should form a bond if the amount of charge in each layer is right.



One last question: is the speed of light really constant? I would say the electron's standing-wave motion is the key. Currently c = wavelength * frequency; if the wavelength is set by the radius of the sphere, then the outer layers have a lower frequency and the inner layers a higher frequency, so the product c stays constant.



In our world, matter seems to either reflect light or slow it down, and the maximum speed is reached when there is no atom in its way.



But what if the atom is antimatter? Would this pattern reverse and, say, speed light up? You can also check the article below:



http://www.sciencedaily.com/releases/2010/01/100126175921.htm         



Water standing wave




new atomic network formation

Although ice melts and water freezes under equilibrium conditions at 0°C, water can be supercooled under homogeneous conditions in a clean environment down to –40°C without freezing. The influence of the electric field on the freezing temperature of supercooled water (electrofreezing) is of topical importance in the living and inanimate worlds. We report that positively charged surfaces of pyroelectric LiTaO3 crystals and SrTiO3 thin films promote ice nucleation, whereas the same surfaces when negatively charged reduce the freezing temperature. Accordingly, droplets of water cooled down on a negatively charged LiTaO3 surface and remaining liquid at –11°C freeze immediately when this surface is heated to –8°C, as a result of the replacement of the negative surface charge by a positive one. Furthermore, powder x-ray diffraction studies demonstrated that the freezing on the positively charged surface starts at the solid/water interface, whereas on a negatively charged surface, ice nucleation starts at the air/water interface.

quantum entanglement and six degrees of connection

If you feel stressed, it actually means your other life is calling you. Isn't that amazing?

We all live two lives within the span of our existence: in fact, 1/3 is spent in the normal world, 1/3 in the dream world, and 1/3 doing nothing.

An interesting fact is that the dream world is non-causal while the normal world is causal: a duality just like the wave-particle duality in physics.

When you try to observe which life you are living, the duality disappears; and that is how we mostly think about the normal world and its duality pair.

So, once we accept the two-lives theory, we can draw further conclusions about the "real world".

Fact 1: we are all connected, and our lives depend on it. When things are connected, they share a common life, a duality life if you will. Examples include the electron cloud, C60, VLSI circuits, neural networks (i.e. our brain), society, etc.

Fact 2: the "Secret" is right, what you think determines reality, and the secret of the "Secret" is what to think. It's up to your perspective: still thinking in the normal world? Try to see it as a duality.

Fact 3: entanglement in quantum physics is limited by distance, because the connection is limited in space. But what about entanglement in our society? Although we don't know everyone in society, we are all connected; that's why six degrees of connection comes to mind. You may ask: what about someone who has no connection at all? Then he/she is not living, from our perspective of "living".