http://www.jianshu.com/p/009e31c93d7c
Author: 金戈大王 (Jin Ge Da Wang), 2016.03.26 00:19
Building Your Own Instruction Set Architecture from Scratch
Reading the Classics: 《深入理解计算机系统》 (Computer Systems: A Programmer's Perspective), part 06
In this article we attempt something bold: implementing a brand-new instruction set architecture from scratch, as a way to understand in depth how a processor works.
Overview of instruction set history
The Y86 instruction set
The instruction set and its encoding
The hardware control language HCL
Memory and the clock
Executing instructions in stages
SEQ's state-change cycle
Implementing each SEQ stage
General principles of pipelining
Pipeline hazards
Toward a more complete design
The gap to real instruction set architectures
Overview of instruction set history
Before starting our creative journey, let's survey the instruction set architectures that have appeared historically.
The instructions a processor supports, together with their byte-level encodings, are called its instruction set architecture (ISA).
The most familiar is x86, since the personal computers we use every day are built on x86 processors. The two largest processor manufacturers, Intel and AMD, both ship product lines based on x86. Starting with the Intel i386, x86 entered the 32-bit era, known as IA32 (Intel Architecture, 32-bit). When 32 bits were no longer enough, Intel moved into 64-bit processors with IA64. That architecture, however, is not the 64-bit design we use today: it was entirely unrelated to x86 and not backward compatible. Despite high potential performance, its poor compatibility earned it a cold market reception. AMD seized the opportunity and was first to introduce the x86-64 architecture, which supports 64-bit execution while remaining backward compatible, and thereby gained the upper hand in its competition with Intel. Intel, for its part, did not cling to IA64; it decisively switched to x86-64 and gradually won back the market share it had lost. Although AMD later named its architecture AMD64 and Intel named its Intel 64, people still habitually refer to both as x86-64.
The Y86 instruction set
As a tribute to the great x86 instruction set architecture, we name our own instruction set Y86. Its design ideas are borrowed wholesale from x86; it amounts to a simplified x86.
Designing an instruction set architecture from scratch takes three steps: first, define the instruction set and its encoding; then, split each instruction's execution into stages, each doing only one or two simple jobs; finally, combine hardware units with suitable logic circuits to implement each stage's work. The rest of this article walks through the process in detail.
The instruction set and its encoding
A simple instruction set does not need many instructions; basic data movement and control flow are enough. The figure below lists every instruction in the Y86 instruction set along with its encoding.
[Figure: the Y86 instruction set]
These are all very basic instructions, though they may look a little odd: the x86 movl instruction has been replaced by four separate instructions, rrmovl, irmovl, rmmovl, and mrmovl, each of which names the kinds of its operands, avoiding the complications of general addressing modes.
Instruction lengths range from 1 to 6 bytes; this variable-length encoding keeps program code compact. The high 4 bits of the first byte are the instruction code, which distinguishes instructions; the low 4 bits are either 0 or fn. fn is called the function code and distinguishes operations within one instruction. As the figure below shows, the function code means different things in different instructions: in arithmetic instructions it selects add, subtract, AND, or XOR; in jump instructions it selects the branch condition; in conditional-move instructions it selects the move condition.
[Figure: function codes of the Y86 instruction set]
For most instructions, the second byte holds register identifiers; see the figure below:
[Figure: Y86 program register identifiers]
Each register corresponds to one number, with F meaning "no register operand".
Finally, some instructions carry a four-byte immediate.
An example will make the encoding concrete. For the instruction
rmmovl %esp, 0x12345(%edx)
the encoding is
40 42 45 23 01 00
Reading left to right: 40 is the instruction byte; 42 packs the identifier 4 for %esp and 2 for %edx; and 45 23 01 00 is the offset 0x12345 as stored on a little-endian machine.
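The encoding rule above can be sketched in a few lines of Python. This is our own illustration, not code from the book; the opcode 0x40 and the register numbers are taken from the tables above.

```python
import struct

# Register identifiers from the table above (F = 0xF means "no register").
REG = {"%eax": 0, "%ecx": 1, "%edx": 2, "%ebx": 3,
       "%esp": 4, "%ebp": 5, "%esi": 6, "%edi": 7}

def encode_rmmovl(ra, rb, offset):
    """Encode `rmmovl ra, offset(rb)` as 6 bytes: opcode 0x40,
    then rA and rB packed into one byte, then the 4-byte
    little-endian displacement."""
    return bytes([0x40, (REG[ra] << 4) | REG[rb]]) + struct.pack("<I", offset)

print(encode_rmmovl("%esp", "%edx", 0x12345).hex())  # 404245230100
```

Running this reproduces exactly the byte sequence shown above, including the little-endian layout of the offset.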
The hardware control language HCL
The hardware units of a processor (the ALU, the program counter, and so on) must be connected by logic circuits with specific functions; at design time we describe these logical relationships in a structured language.
HCL (Hardware Control Language) is a C-like language for describing the processor's control logic.
As a simple example, the following combinational logic circuit:
[Figure: a combinational circuit]
can be written in HCL as
e = (a && !(b||c)) || (!d && !(b||c))
This expression captures the logical relationship between output and inputs. No matter how complex a combinational circuit is, it can be built from basic AND, OR, and NOT gates. HCL will be used heavily in what follows.
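Because a combinational circuit is just a function of its inputs, the HCL expression above can be transcribed directly into Python and checked exhaustively. This is our own sketch; note the common subterm !(b||c) can be factored out:

```python
from itertools import product

def e(a, b, c, d):
    # Direct transcription of the HCL expression above.
    return (a and not (b or c)) or (not d and not (b or c))

def e_factored(a, b, c, d):
    # Equivalent circuit with the shared !(b||c) gate factored out,
    # which a synthesizer would likely do for us.
    return (a or not d) and not (b or c)

# Exhaustively verify the two forms agree on all 16 input patterns.
assert all(e(*bits) == e_factored(*bits)
           for bits in product([False, True], repeat=4))
```

Sharing the !(b||c) gate is exactly the kind of simplification that separates the logical specification (HCL) from its physical implementation.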
Memory and the clock
Attentive readers will have noticed the phrase "no matter how complex a combinational circuit is". Why single out combinational circuits? Because there is another kind: sequential circuits.
A combinational circuit merely computes a function; different inputs produce different outputs, and the circuit stores no information of its own. A sequential circuit is different: it stores state, and it responds to its inputs under the control of a clock signal.
Now the key point. A processor contains two kinds of storage devices:
Clocked registers (registers for short) store a single bit or word. The clock signal controls when a register loads its input value.
Random-access memories (memories for short) store multiple words, with an address selecting which word to read or write. Two such memories matter here: the processor's virtual memory system and the register file. The former is main memory in the usual sense; the latter holds the general-purpose registers that our eight register identifiers refer to.
The figure below shows how a register works: its output stays at the current state until a rising clock edge, at which point the new input becomes the register's state.
[Figure: register operation]
The register file can be viewed as the following functional block:
[Figure: the register file]
It has two read ports and one write port and supports simultaneous reads and writes. Notably, reads from the register file are immediate while writes are clocked: the outputs valA and valB track srcA and srcB continuously, but the value valW is written only on the rising edge of the clock. On reflection, the register file's read/write behavior is exactly that of a register, just with an added address-selection step.
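The "reads are immediate, writes wait for the clock edge" behavior can be modeled with a toy Python class. This is a behavioral sketch of our own, not the real circuit:

```python
class RegisterFile:
    """Toy model of the register file: reads are combinational
    (they immediately reflect srcA/srcB), while writes only take
    effect on the clock's rising edge."""
    def __init__(self, n=8):
        self.regs = [0] * n
        self.pending = None            # (dstW, valW) latched for the next edge

    def read(self, srcA, srcB):        # combinational: no clock involved
        return self.regs[srcA], self.regs[srcB]

    def write(self, dstW, valW):       # merely *requests* a write
        self.pending = (dstW, valW)

    def clock_rising_edge(self):       # the write becomes visible here
        if self.pending is not None:
            dst, val = self.pending
            self.regs[dst] = val
            self.pending = None

rf = RegisterFile()
rf.write(4, 100)                       # write %esp (id 4)
print(rf.read(4, 0))                   # (0, 0): the write has not landed yet
rf.clock_rising_edge()
print(rf.read(4, 0))                   # (100, 0)
```

The gap between `write` and `clock_rising_edge` is precisely the window that will later let the pipeline read a stale register value, so this detail matters.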
Executing instructions in stages
Viewed from outside, an instruction is the indivisible unit of a program; inside the processor, however, executing one instruction is split into several stages, and that is what makes efficient hardware possible. In the Y86 architecture we divide each instruction's execution into six stages.
Fetch: read the current instruction at the address in the PC and split it according to the instruction encoding, producing icode, ifun, rA, rB, valC, and so on.
Decode: read the register values valA and valB named by rA and rB.
Execute: the ALU performs an operation that depends on the instruction (a simple arithmetic operation, an address computation, etc.), producing valE and possibly setting the condition codes.
Memory: read data from memory, or write data to it; a value read from memory is called valM.
Write back: write results produced earlier back into the register file.
PC update: set the PC to the address of the next instruction.
These steps may look arbitrary at first. On closer inspection, each stage involves only one or two hardware units, its outputs are fully determined by its inputs, and it can complete within one clock cycle. The stages are linked by their signals: the decode stage's output valA feeds the execute stage, the execute stage's output feeds write back, and so on, so simple combinational circuits suffice to wire the hardware units together into the machine we want.
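The six stages above can be sketched as small Python functions whose outputs feed the next stage's inputs. This is a deliberately minimal trace of our own for a single rrmovl instruction (which skips the memory stage); the function names and data layout are illustrative, not the book's:

```python
def fetch(mem, pc):
    icode, ifun = mem[pc] >> 4, mem[pc] & 0xF
    rA, rB = mem[pc + 1] >> 4, mem[pc + 1] & 0xF
    valP = pc + 2                      # rrmovl is 2 bytes long
    return icode, ifun, rA, rB, valP

def decode(regs, rA, rB):
    return regs[rA], regs[rB]          # valA, valB

def execute(valA):
    return valA + 0                    # ALU acts as an adder: valE = valA + 0

def write_back(regs, rB, valE):
    regs[rB] = valE

regs = [0] * 8; regs[2] = 7            # %edx = 7
mem = {0: 0x20, 1: 0x23}               # rrmovl %edx, %ebx (icode 2, rA=2, rB=3)
icode, ifun, rA, rB, valP = fetch(mem, 0)
valA, valB = decode(regs, rA, rB)
write_back(regs, rB, execute(valA))
pc = valP                              # PC update
print(regs[3], pc)                     # 7 2
```

Each function reads only its inputs and produces only its outputs, which is exactly why each stage fits in one clock cycle of combinational logic.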
To make each stage's role clearer, let's work through an example in detail.
[Figure: stage-by-stage execution of three instructions]
The figure shows the staged execution of three instructions: OPl rA, rB; rrmovl rA, rB; and irmovl V, rB. In the fetch stage, M denotes memory, and M1 with PC as base denotes reading one byte of memory at address PC. Because the instructions differ in length, the fetch stage does slightly different work for each; at the end of fetch it computes valP, the new PC value. The decode stage reads register values from the register file, with R denoting the value of register rA. In the execute stage, OPl sets the condition codes CC, while the other two instructions leave them untouched. None of these three instructions uses the memory stage. Finally, the PC-update stage assigns valP to the PC.
When I first read this I had a nagging question: weren't the stages each supposed to do one simple thing? Yet different instructions seem to do different things in the same stage. Of the three instructions above, only OPl sets the condition codes in the execute stage, while the other two do not; why? And in later examples in the book, the PC-update stage does not always assign valP to the PC: instructions such as call and ret assign valC or valM instead. How is that arranged?
Did the same questions occur to you? Of course each stage must respond differently to different instructions; how else could it serve their different functions? This is exactly the job of the HCL control logic introduced earlier: it governs what each instruction does in each stage.
Before examining that control logic in detail, here is the complete hardware structure.
[Figure: the SEQ hardware structure]
Note the different colors of the blocks and the different weights of the wires; they mean different things. Green blocks are basic hardware units (the ALU, the register file, the PC), most of which we have already met. Gray blocks are the focus of what follows: combinational logic described in HCL, connecting the green blocks and implementing particular selections or logical operations. White circles carry no special meaning; they merely label signal wires. There are also three kinds of wires: thick solid lines carry word-wide signals, thin solid lines carry signals a byte wide or narrower, and dashed lines carry single bits.
From bottom to top the figure shows the fetch, decode (and write-back), execute, memory, and PC-update stages introduced above. Decode and write back both operate on the register file, so they are drawn in the same place. The circled signals are the intermediate values each stage produces. These values often play different roles in different instructions, so one signal frequently fans out into two. For example, once valA is produced it splits into two wires, one to the ALUB control logic and one to the Data control logic; likewise valM splits into a wire to the New PC logic and a wire to the register file's input. The point to grasp is that fanning a signal out to two places means both receivers can read its value, not that both will use it; the receiving end's control logic decides whether the value is used, as described below.
SEQ's state-change cycle
I left the title of the previous figure unexplained, a deliberate loose end. SEQ stands for Sequential, so "the SEQ hardware structure" means the sequential implementation of the hardware. What! Is there some other kind of implementation? There certainly is, and we will come back to it. SEQ's structure forces instructions to execute strictly one after another: the next instruction cannot start until the previous one has finished. This makes the processor extremely inefficient, because each instruction must pass through all the stages within a single clock cycle, and the inherent circuit delays make that pass take a long time, capping the clock frequency. But why must an instruction traverse all the stages in one clock cycle?
Because the sequential logic elements in SEQ (the memory, the register file, the condition codes CC, and the program counter) write their data only on the rising edge of the clock. At the moment one instruction ends and the next begins, the rising clock edge triggers all of these units to update. If a new value has not been produced by the time the next rising edge arrives, the instruction has effectively not executed, or executed only halfway. The clock frequency therefore cannot be pushed too high, or instruction execution falls into disorder.
The figure below shows, over two instruction cycles, how the clock drives the state changes of each hardware unit.
[Figure: tracing two SEQ execution cycles]
Notice that everything outside the four sequential elements is treated as one big combinational circuit. When cycle 3 begins, the combinational logic starts computing; before cycle 3 ends, all results are ready to be written into the memory and the other stateful units. When cycle 4 begins, the memory, register file, CC, and program counter take on their new values; at the same instant the combinational logic reads those new values and begins computing again, and so on. SEQ's state thus changes exactly once per clock cycle.
Implementing each SEQ stage
The SEQ structure diagram above is only a sketch; some details are missing. Let's now analyze SEQ's concrete implementation stage by stage.
Fetch stage:
[Figure: the SEQ fetch stage]
After the instruction bytes are read from memory they are split into two parts, labeled Split and Align in the figure. Split separates the first byte into icode and ifun; Align separates the remaining bytes into rA, rB, and valC. All of that is straightforward; the interesting part is the PC-increment logic. How much the PC advances depends on the instruction's length, which in turn depends on whether the instruction contains a register byte and whether it contains the constant valC. The two combinational circuits Need valC and Need regids make exactly these judgments.
Take Need regids as an example; its HCL description is:
bool need_regids =
    icode in { IRRMOVL, IOPL, IPUSHL, IPOPL, IIRMOVL, IRMMOVL, IMRMOVL };
That is, need_regids is true when icode is one of the seven instruction codes listed in the braces; these seven instructions carry a register byte. need_valC is determined by the same enumeration technique: consult the encoding table above, find the instructions containing valC, and list them in the braces.
With need_regids and need_valC determined, the PC increment logic computes the new PC as follows, which in effect just adds the instruction's length:
newPC = oldPC + 1 + need_regids + 4*need_valC
Now we can see what the gray boxes are: combinational circuits describable in HCL. In a real design, these HCL statements are turned into actual combinational logic by synthesis. HCL is a useful abstraction here, separating principle from implementation and simplifying the design.
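The length formula above can be checked directly in Python. A sketch of our own, with icode values taken from the Y86 encoding table (halt=0, nop=1, rrmovl=2, irmovl=3, rmmovl=4, mrmovl=5, OPl=6, jXX=7, call=8, ret=9, pushl=A, popl=B):

```python
# Instructions that carry a register byte, and those that carry a
# 4-byte constant, keyed by icode.
NEED_REGIDS = {0x2, 0x3, 0x4, 0x5, 0x6, 0xA, 0xB}   # rrmovl, irmovl, rmmovl, mrmovl, OPl, pushl, popl
NEED_VALC   = {0x3, 0x4, 0x5, 0x7, 0x8}             # irmovl, rmmovl, mrmovl, jXX, call

def new_pc(old_pc, icode):
    # Same formula as above: 1 opcode byte, maybe a register byte,
    # maybe a 4-byte constant.
    return old_pc + 1 + (icode in NEED_REGIDS) + 4 * (icode in NEED_VALC)

print(hex(new_pc(0x100, 0x4)))   # rmmovl is 6 bytes: 0x106
```

Checking a few cases: a nop (icode 1) advances the PC by 1, a jXX (icode 7) by 5, and an rmmovl (icode 4) by 6, matching the encoding table.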
Decode and write-back stages:
[Figure: the SEQ decode and write-back stages]
Both stages deal with reading and writing the register file. The signals icode, rA, and rB from the fetch stage enter here as inputs, and combinational circuits turn them into the register file's inputs. The goal: in the decode stage, for instructions that use particular registers, read those registers' values from the register file, with the addresses chosen by srcA and srcB and the results coming out as valA and valB; in the write-back stage, write the execute-stage result valE or the memory-stage result valM back into particular registers, whose addresses are chosen by dstE and dstM. Taking the combinational circuit srcA as an example, its HCL is:
int srcA = [
    icode in { IRRMOVL, IRMMOVL, IOPL, IPUSHL } : rA;
    icode in { IPOPL, IRET } : RESP;
    1 : RNONE;  # no register needed
];
The bracketed form works like a C switch statement: if the first condition holds, the expression evaluates to rA and the remaining clauses are ignored; otherwise the second condition is tried, yielding RESP if it holds; otherwise the result is RNONE, meaning the register file need not be read. So in the decode stage, for the four instructions in the first clause, register rA is read into valA; for the two instructions in the second clause, the stack-pointer register RESP is read into valA; otherwise no register is read.
The circuits srcB, dstE, and dstM are analogous; their HCL can be derived from the SEQ structure and the instruction encodings, so we omit them here.
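An HCL case expression is just "first matching condition wins", which is easy to mimic in Python. A sketch of our own (the helper `hcl_case` and the numeric constants are ours; the clause structure mirrors the srcA logic above):

```python
def hcl_case(*clauses):
    """Evaluate an HCL-style case expression: return the value of
    the first clause whose condition is true."""
    for cond, val in clauses:
        if cond:
            return val
    raise ValueError("no clause matched")

# icode values from the encoding table; RESP is the stack pointer's
# identifier, RNONE means "no register".
IRRMOVL, IRMMOVL, IOPL, IPUSHL, IPOPL, IRET = 0x2, 0x4, 0x6, 0xA, 0xB, 0x9
RESP, RNONE = 4, 0xF

def src_a(icode, rA):
    # Mirrors the srcA HCL above.
    return hcl_case(
        (icode in {IRRMOVL, IRMMOVL, IOPL, IPUSHL}, rA),
        (icode in {IPOPL, IRET}, RESP),
        (True, RNONE),
    )

print(src_a(IPOPL, 3))   # 4: popl reads %esp, regardless of rA
```

The `1 : ...` default clause of HCL corresponds to the final `(True, ...)` entry; every well-formed HCL case expression ends with one.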
Execute stage:
[Figure: the SEQ execute stage]
The ALU takes two operands plus an alufun signal, which tells the ALU which operation (add, subtract, AND, XOR) to perform on them.
Taking the first operand aluA as an example, its HCL is:
int aluA = [
    icode in { IRRMOVL, IOPL } : valA;
    icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
    icode in { ICALL, IPUSHL } : -4;
    icode in { IRET, IPOPL } : 4;
    # other instructions do not use the ALU
];
So aluA is sometimes valA, sometimes valC, sometimes -4 or 4, depending entirely on the instruction type.
The HCL for alufun is:
int alufun = [
    icode == IOPL : ifun;
    1 : ALUADD;
];
Only for an IOPL instruction (that is, an arithmetic instruction) does ifun choose the operation; in every other case the ALU serves as an adder. That explains why aluA takes -4 or 4 above: aluA is one addend, and the other addend, as the figure shows, can only come from valB. We have not given the decode-stage HCL for valB, but in these four cases it reads RESP, the stack pointer. So for ICALL and IPUSHL the ALU computes %esp - 4, and for IRET and IPOPL it computes %esp + 4.
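A behavioral model of the ALU makes the "adder by default" idea concrete. A sketch of our own, with 32-bit wrap-around and the ZF/SF condition codes derived from the result (the operand order valB OP valA follows the convention for subtraction described in the book):

```python
ALUADD, ALUSUB, ALUAND, ALUXOR = 0, 1, 2, 3

def alu(alufun, aluA, aluB):
    """Compute valE = aluB OP aluA with 32-bit wrap-around, and
    derive the zero (ZF) and sign (SF) condition codes."""
    ops = {
        ALUADD: lambda a, b: b + a,
        ALUSUB: lambda a, b: b - a,
        ALUAND: lambda a, b: b & a,
        ALUXOR: lambda a, b: b ^ a,
    }
    valE = ops[alufun](aluA, aluB) & 0xFFFFFFFF
    zf, sf = valE == 0, bool(valE >> 31)
    return valE, zf, sf

# pushl: the ALU, used as an adder, computes the new stack pointer
# %esp - 4 by adding aluA = -4 to aluB = %esp.
print(hex(alu(ALUADD, -4, 0x1000)[0]))   # 0xffc
```

Note that in SEQ only IOPL actually latches the condition codes into CC; the values are computed here regardless, and control logic decides whether to keep them.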
Memory stage:
[Figure: the SEQ memory stage]
Mem read and Mem write decide whether the current instruction reads or writes memory; Mem addr and Mem data supply the address and data for the access. Taking Mem addr as an example, its HCL is:
int mem_addr = [
    icode in { IRMMOVL, IPUSHL, ICALL, IMRMOVL } : valE;
    icode in { IPOPL, IRET } : valA;
    # other instructions do not access memory
];
PC-update stage:
[Figure: the SEQ PC-update stage]
The new PC value is selected from among valC, valM, and valP; the HCL for New PC is:
int new_pc = [
    icode == ICALL : valC;
    icode == IJXX && Cnd : valC;
    icode == IRET : valM;
    1 : valP;
];
General principles of pipelining
So far this has all been prologue; now we finally reach the main act. (Quite a long prologue, admittedly.)
"SEQ's state-change cycle" planted a question; time to answer it. SEQ's clock frequency is too low, and we need a way to raise it. Two approaches suggest themselves: shorten each instruction's execution time, or let multiple instructions execute at once. The first is infeasible, since an instruction's execution time is hard to compress; it is fixed by the inherent properties of the circuit. That leaves the second approach: pipelining.
A vivid analogy first. Picture a conveyor-belt buffet: dishes ride the belt past the diners, and each diner takes whatever they like. If a plate of food is an instruction and the diners along the belt are the stages of execution, then SEQ is a restaurant that places one plate on the belt at a time, waiting until it reaches the far end before adding the next; its diners would starve. A real restaurant places plate after plate on the belt, and as each diner passes one plate along, the next arrives, raising throughput dramatically.
Processor pipelining works the same way: every stage has an instruction executing in it, so with six stages six instructions run at once, raising throughput to six times SEQ's. Sounds great, but things are not nearly so simple. The most immediate question: will the instructions interfere with one another?
Look back at the SEQ structure diagram: wires often cross stage boundaries. For instance, valC produced in the fetch stage is wired directly to the New PC logic in the PC-update stage. Under pipelining this breaks: a later instruction overwrites the valC of an earlier one, so by the time the earlier instruction reaches PC update and reaches back for its valC, the value is no longer the one produced during its own fetch. What to do?
The remedy is easy to see: save, per instruction, every value it may need later. Equivalently, add a bank of registers at each stage boundary, loaded at the start of each stage with the values belonging to the current instruction. These registers inserted between stages are called pipeline registers.
Our processor architecture now becomes PIPE- (Pipeline-minus, the minus marking it as not yet final), shown below.
[Figure: the PIPE- hardware structure]
Two things changed relative to SEQ. First, the PC-update stage has been folded into fetch: the PC is updated just before fetching. Second, pipeline registers sit between every pair of stages. These registers are clocked: at the start of each cycle they take on new data, handing the current instruction's state to the next stage.
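The flow through pipeline registers can be pictured with a toy simulation of our own (the five-stage list reflects PIPE-, where PC update is folded into fetch; instruction names are placeholders):

```python
STAGES = ["fetch", "decode", "execute", "memory", "write-back"]

def run(instructions, cycles):
    """Advance a toy 5-stage pipeline: each cycle, every in-flight
    instruction moves one stage to the right, and (if one is
    waiting) a new instruction enters fetch. The list `pipe` plays
    the role of the pipeline registers."""
    pipe = [None] * len(STAGES)          # pipe[i] = instruction in stage i
    history = []
    for _ in range(cycles):
        pipe = [instructions.pop(0) if instructions else None] + pipe[:-1]
        history.append(pipe)
    return history

for t, snapshot in enumerate(run(["I1", "I2", "I3"], 4)):
    print(f"cycle {t}:", snapshot)
```

By cycle 2, three instructions are in flight simultaneously; once the pipeline is full, one instruction completes every cycle, which is exactly the throughput gain over SEQ.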
Pipeline hazards
Are we done? Not yet. A careful look at PIPE- reveals lingering problems. The pipeline registers isolate the instructions' data from one another, but dependencies between instructions remain, in two forms:
Data dependence: an earlier instruction writes exactly the register or memory word that a later instruction reads. In PIPE-, when the later instruction reads its registers in the decode stage, the earlier one has only just reached execute, so the new value is not yet in the register file; reading now returns the stale value, violating sequential program semantics.
Control dependence: when an instruction is a jump, call, or return, the next instruction's address cannot be determined in advance; it depends on the current instruction's result, so the pipeline may have to be interrupted.
These dependencies can make the pipeline compute wrong answers, a phenomenon called a pipeline hazard. Consider data hazards first. The figure below traces a code fragment through the stages.
[Figure: execution of code fragment prog1]
Three nop instructions sit between irmovl $3, %eax and addl %edx, %eax. Thus the former completes its write-back stage before the latter begins decode, guaranteeing the write lands before the read: no data hazard occurs.
Now look at the next figure.
[Figure: execution of code fragment prog2]
With one nop removed, things immediately go wrong. The write-back stage of the instruction at 0x006 coincides with the decode stage of the instruction at 0x00e, but the register write only takes effect at the start of cycle 7, so decode still reads the old value: a data hazard.
Removing the remaining two nops would, predictably, cause still worse hazards; we won't trace those cases here. Instead, consider how to avoid data hazards.
Again there are two remedies:
Stalling: akin to inserting nop instructions, the processor automatically injects bubbles where a data hazard is about to occur, holding the current instruction in place for a clock cycle.
[Figure: prog2 executed with stalling]
As shown above, when the addl instruction reaches decode and a coming data hazard is detected, a bubble is injected and addl repeats its decode stage for one more cycle.
If all the nops are removed, inserting bubbles still solves the hazard; it just takes more of them, as shown below.
[Figure: prog4 executed with stalling]
Forwarding: stalling has an obvious drawback. It hurts performance, padding the program with useless cycles. Forwarding, by contrast, puts every cycle to full use.
Let's see how forwarding works on the same code fragment.
[Figure: prog2 executed with forwarding]
When addl reaches decode, irmovl is in write back; the value has not yet been written to the register, so a naive read would hit the data hazard. But there is a neat way out: since the write-back value must wait for the next cycle anyway, simply forward it straight to the decode stage. Decode then skips the register read and uses the forwarded value directly.
What about code fragment prog3?
[Figure: prog3 executed with forwarding]
prog3 differs from prog2 by one fewer nop, so when addl reaches decode, irmovl has only reached the memory stage. That turns out not to matter for forwarding: irmovl does not touch memory, and the value it will write back in the next stage already exists. It is M_valE (notation: M_valE is the valE held in the memory stage's pipeline register; see the PIPE- structure diagram above), so M_valE is forwarded to decode and all is well.
And code fragment prog4?
[Figure: prog4 executed with forwarding]
Now there are no nops at all; addl directly follows irmovl, so when addl is in decode, irmovl is only in execute. Remarkably, forwarding still works. First, the register value we ultimately need is exactly what the execute stage computes. Second, one might worry that the ALU's propagation delay keeps the value from reaching decode in time; it does not, because even if decode received the value early, it would still wait until the start of the next cycle to latch it into the execute stage's pipeline register. The value need only be ready before the next cycle begins, and that condition always holds.
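The decode stage's choice of valA can be sketched as "take the newest in-flight value whose destination matches, else fall back to the register file". This is a simplification of our own; the real PIPE consults five forwarding sources in a fixed priority order, which the ordering of the `fwd` list stands in for here:

```python
RNONE = 0xF   # "no register" identifier

def forward_val_a(srcA, reg_valA, fwd):
    """Pick valA for the instruction in decode. `fwd` lists
    (dst, val) pairs from the later pipeline stages, newest first
    (execute, then memory, then write-back); the first match wins."""
    if srcA == RNONE:
        return None                # this instruction reads no register
    for dst, val in fwd:
        if dst == srcA:
            return val             # forwarded: skip the register file
    return reg_valA                # no producer in flight: normal read

# irmovl $10,%edx sits in execute, about to produce valE=10 for reg 2;
# addl %edx,%eax in decode asks for reg 2 and gets 10 forwarded.
print(forward_val_a(2, 0, [(2, 10), (RNONE, None), (RNONE, None)]))  # 10
```

"Newest first" matters: if two in-flight instructions both target srcA, the later one in program order (earlier in the pipeline) holds the correct value.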
Magical, isn't it? Forwarding resolves the data hazard without costing any program efficiency. But nothing is perfect: the examples so far are all irmovl followed by addl on the same register, and real programs combine instructions in far more ways. Can forwarding solve every case? Consider the following example.
[Figure: a load/use data hazard]
The two lines at 0x018 and 0x01e of prog5 form what is called a load/use data hazard: mrmovl loads a value from memory into %eax, and the very next instruction addl uses %eax. Forwarding alone, sending mrmovl's execute-stage value to addl, yields the wrong result, for a reason that is easy to see: mrmovl does not obtain the correct value until its memory stage, so forwarding out of its execute stage cannot possibly work. The fix combines stalling with forwarding: stall one cycle, after which mrmovl has reached the memory stage and can forward the correct value to addl.
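The load/use case can be detected with a one-line predicate. A sketch of our own (the signal names follow the book's PIPE convention: E_ prefixes the execute stage's pipeline register, d_ the decode-stage selections; this is only the detection condition, not the stall machinery):

```python
IMRMOVL, IPOPL = 0x5, 0xB   # the two instructions that read memory into a register

def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """True when the instruction in execute is a load whose target
    register is read by the instruction in decode. This is the one
    data-hazard case forwarding alone cannot fix, so the pipeline
    must stall for one cycle."""
    return E_icode in {IMRMOVL, IPOPL} and E_dstM in {d_srcA, d_srcB}

# mrmovl ...,%eax (reg 0) in execute, addl %eax,%ebx in decode:
print(load_use_hazard(IMRMOVL, 0, 0, 3))  # True
```

When this predicate fires, the fetch and decode stages hold their state for a cycle and a bubble enters execute; one cycle later the loaded value is in the memory stage and ordinary forwarding takes over.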
Having solved all these problems, we can finally present the final hardware structure.
[Figure: the PIPE hardware structure]
What PIPE adds over PIPE- is precisely the forwarding circuitry introduced to cure data hazards; the receiving ends of the forwards sit almost entirely in the decode stage.
Toward a more complete design
Everything deserves to be done well, and the PIPE we now have is still not perfect: some crucial details remain unhandled.
Exception handling: exceptions are a vital aspect of any processor. Many instructions can fault mid-execution, for example on an invalid memory address or an illegal instruction encoding. When a program faults, it should halt immediately, and from the outside the effect should be a clean stop exactly at the faulting point: everything before it fully executed, everything after it not executed at all. That sounds simple but is not easy in PIPE, because several instructions are in flight at once: when one instruction faults in some stage, later instructions have typically executed partway, and achieving the "not executed at all" effect means cancelling the effects they have already produced, which requires stronger control logic.
Control hazards: the control dependencies mentioned in the previous section give rise to control hazards. At a conditional jump, the processor must predict the branch; on a misprediction, the instructions executed down the wrong path must be cancelled and the correct path executed instead. At a return from a subroutine, the return address must be fetched from memory, so the next instruction cannot begin until the return reaches its memory stage. These special cases all require explicit treatment in the control logic.
Covering the implementation of these two mechanisms in detail would take many more pages; interested readers can consult the book's website.
The gap to real instruction set architectures
This article has walked through the design of the Y86 instruction set architecture, and even at this coarse level of detail it has run long. Yet next to a real ISA such as x86, its complexity is trivial. We specified only a very simple instruction set and built a bare-bones implementation. A real instruction set contains far more instructions, including multi-cycle ones such as floating-point operations, which cannot finish in one cycle and need extra hardware support. We also treated Y86's memory as an ideal storage unit, assuming every access completes within one clock cycle; in reality CPU and memory speeds differ by orders of magnitude, and a multi-level cache hierarchy is needed to keep accesses fast. Modern processors further employ multiple issue and out-of-order execution: instead of marching stage by stage as in Y86, many instructions execute simultaneously, and not necessarily in program order. In recent years processors have gone multicore, gaining still more processing power and making code-level parallel execution the norm. We cannot know what techniques future processors will adopt, only that they will grow ever more complex. But the fundamentals do not change: with the basic principles of processors and instruction sets understood, even the most elaborate system can be seen as a step-by-step extension of these basic forms, and grasping the core is what matters most.
http://www.jianshu.com/p/009e31c93d7c
100 Author Jin Ge Da Wang 2016.03.26 00:19
Wrote 72,243 words, followed by 172 people, and got 339 likes
Making Your Own Instruction Set Architecture from Scratch
Word count 8,805 Read 1,338 Comments 6 Likes 35
Reading Classics - "Deep Understanding of Computer Systems" 06
In this article, we are going to do a bold thing, to implement a brand new instruction set architecture from scratch, so as to deeply understand the working principle of the processor.
Overview of Instruction Set Development History
Y86 Instruction Set
Instruction Set and Its Encoding
Hardware Control Language HCL
Memory and Clock
Phased Execution of Instructions
State Change Cycles of SEQ
Implementation of Each Stage of SEQ
General Principles of Pipelining
Pipeline Hazards
More Perfect Design
Gap with Real Instruction Set Architectures
Overview of Instruction Set Development History
Before starting our creative journey, let's first understand what instruction set architectures have been in history.
The instructions supported by a processor and the byte-level encoding of the instructions are called its Instruction Set Architecture (ISA).
The most familiar one is the x86 architecture, because the personal computers we use daily use processors based on the x86 architecture. The two largest processor manufacturers in the world, Intel and AMD, both have a series of products based on the x86 architecture. Starting from the Intel i386 processor, the x86 architecture entered the 32-bit era, called the IA32 architecture (Intel Architecture 32bit). Later, 32 bits were no longer enough for our needs, and Intel began to enter the field of 64-bit processors and proposed the IA64 architecture. However, this architecture is not the 64-bit processor we are using now, but a new processor architecture completely unrelated to x86, and does not maintain backward compatibility. Although it can achieve very high performance, due to poor compatibility, the market response is cold. At the same time, AMD seized the opportunity and first proposed the x86-64 processor architecture, which supports 64 bits while maintaining backward compatibility, and took the initiative in the market competition with Intel. Of course, Intel will not be stubborn, they decisively gave up IA64 and began to shift to the x86-64 architecture, and gradually recovered the lost market share. Later, although AMD named its architecture AMD64 and Intel named its architecture Intel64, people still habitually call them x86-64 collectively.
Y86 Instruction Set
To pay tribute to the great x86 instruction set architecture, we name our own instruction set architecture Y86. In fact, the design concept of Y86 is completely borrowed from x86, which is equivalent to a simplified x86 architecture.
To design an instruction set architecture from scratch, we need to first define the instruction set and instruction set encoding, then divide each instruction into several stages for step-by-step execution, each stage only needs to do one or two simple tasks, and then combine the hardware devices with appropriate logic circuits to implement the work of each stage of the instruction. Now we will explain the specific implementation process in detail.
Instruction Set and Its Encoding
For a simple instruction set, not too many instructions are needed, and basic data transfer and process control are enough. The following figure lists all the instructions in the Y86 instruction set and the encoding of each instruction.
Y86 Instruction Set
These are very basic instructions, but they look a bit strange because we replaced the movl instruction in x86 with four independent instructions: rrmovl, irmovl, rmmovl, and mrmovl. Each instruction indicates the source of the operand, which avoids the trouble of various addressing methods.
It can be seen that the lengths of each instruction range from 1 byte to 6 bytes, and this encoding can reduce the space occupied by the program code. The high 4 bits of the first byte are used as the instruction code to distinguish different instructions, and the low 4 bits are either 0 or fn. fn is called the function code, which is used to distinguish different operations. As shown in the following figure, different function codes have different meanings in different instructions. In arithmetic instructions, they represent addition, subtraction, AND, and XOR respectively; in branch jump instructions, they represent different jump conditions; in conditional transfer instructions, they represent different transfer conditions.
Function Codes of Y86 Instruction Set
The second byte, for most instructions, stores the register identifier. Please see the following figure:
Program Register Identifiers in Y86
Each register corresponds to a number one by one, and F represents no register operand.
Finally, some instructions also contain a four-byte immediate number.
Let's take an example to help us better understand the instruction encoding. For example, for the following instruction
rmmovl %esp, 0x12345(%edx)
The corresponding encoding is
40 42 45 23 01 00
Among them, from left to right, 40 is the instruction code, 42 are the 4 corresponding to register %esp and 2 corresponding to register %edx, and 45230100 is the representation of the offset 0x12345 on a little-endian machine.
Hardware Control Language HCL
The various hardware devices in the processor (such as ALU, program counter) usually need specific functional logic circuits to connect. In the design stage, we use a structured language to describe these logical relationships.
HCL (Hardware Control Language) is a hardware control language similar to C, used to describe the control logic of the processor.
Take a simple example, for the following combinational logic circuit:
Combinational Circuit
It can be expressed in HCL language as
e = (a && !(b||c)) || (!d && !(b||c))
This sentence describes the logical relationship between the output and the input. No matter how complex the combinational circuit is, it can be implemented with the most basic AND, OR, and NOT gates. HCL will be applied a lot later.
Memory and Clock
Careful readers may have noticed that the previous paragraph mentioned "no matter how complex the combinational circuit". Why is combinational circuit emphasized specially? Because there is another kind of circuit - sequential circuit.
Everyone should have basic circuit knowledge. Combinational circuit only completes the function of a function. Different inputs lead to different outputs, and the circuit itself does not store any information. Sequential circuits are different. They can store information and respond to inputs under the control of clock signals.
Next, the key point comes. There are two types of storage devices in the processor:
Clock register (referred to as register) stores a single bit or word. The clock signal controls the register to load the input value.
Random access memory (referred to as memory) stores multiple words, and the address is used to select which word to read and write. The memory mentioned here can be divided into two types: the virtual memory system of the processor and the register file. The former is the usual memory system, and the latter is the general-purpose register corresponding to the 8 register identifiers in our instruction set.
The following is the working principle of the register. The output of the register always remains in the current state until the rising edge of the clock, and the new input will become the current register state.
Register Operation
The register file can be regarded as such a functional block:
Register File.png
It has two read ports and one write port, and supports simultaneous reading and writing operations. It is worth noting that the read operation of the register file is immediate, while the write operation is clock-based. That is to say, the read values valA and valB change with the changes of srcA and srcB at any time, and the value valW to be written can only be written at the rising edge of clock. Think carefully, the read and write characteristics of the register file are exactly the same as those of the register, except that there is an additional address selection operation.
Phased Execution of Instructions
Although macroscopically, the instruction has become an indivisible basic element of the program. But in the processor, the execution of an instruction still needs to be divided into multiple stages, so as to improve the processing efficiency of the hardware. In the Y86 architecture, we divide the execution of each instruction into 6 stages.
Fetch: Take the current instruction to be executed from the PC, and decompose it according to the instruction encoding to get values such as icode, ifun, rA, rB, valC.
Decode: Take out the values valA and valB of the corresponding registers according to rA and rB.
Execute: The ALU performs different operations under different instructions, including simple operations, address addition and subtraction, etc. The operation result is valE, and the condition code will be affected during the operation.
Memory access: Read data from memory or write data to memory. The read value is valM.
Write back: Write the previously generated result back to the register file.
Update PC: Set the PC to the address of the next instruction.
These steps seem messy now, and I don't know what their use is. But upon careful analysis, it can be seen that each stage only does one or two things related to hardware, and the output is determined by the input, which can be completely done in one clock cycle. The connection between each stage is the input and output of various signals. For example, the output valA of the decoding stage can be used as the input of the execution stage, and the output of the execution stage can be used as the input of the write-back stage, so these hardware units can be connected with simple combinational circuits to realize the functions we need.
In order to help everyone understand the role of each stage more clearly, we use an example to explain in detail.
Phased Implementation of Instructions
The above figures are respectively the phased execution processes of the three instructions: OPl rA, rB, rrmovl rA, rB, and irmovl V, rB. In the fetch stage, M represents the memory, and M1 represents taking out 1-byte data from the memory with PC as the base address. Since the lengths of each instruction are different, the things done in the fetch stage are also different. At the end of this stage, the new value valP of the PC will be calculated. The decoding stage is to take out the value of the register from the register file, and R represents the value of register rA. For the OPl instruction, the execution stage will set the status code CC, while the latter two instructions will not affect the status code. There is no memory access involved in these three instructions in the memory access stage. The final update PC stage assigns the value of valP to the PC.
When I read this, I have a big question: it is said that each stage only does one simple thing, but the things done by different instructions in the same stage seem different. For example, in the three instructions just now, only the OPl instruction sets the status code in the execution stage, while the other two do not. Why is this? Including other examples given in the book later, the update PC stage does not necessarily assign the value of valP to the PC. Some instructions such as call and ret will assign the value of valC or valM to the PC. How is this done?
Have you also thought about these problems? Obviously, it is natural that each stage has different responses to different instructions, otherwise how to adapt to the different functions of each instruction. The HCL hardware control language mentioned earlier is to complete this task, controlling the tasks that each instruction should complete in each stage.
Well, before explaining in detail how to use HCL control logic, let's first give the complete hardware structure diagram.
SEQ Hardware Structure
We need to pay attention to the different colored squares and different thicknesses of lines in the figure, which represent different meanings. The green blocks represent basic hardware units, such as ALU, register file, PC, which we have basically come into contact with. The gray squares will be the key points of our next research. They are combinational logic circuits described by HCL, used to connect green blocks and implement specific selections or logical operations. The white circles have no special meaning, just used to identify the names of signal lines. There are also three types of lines in the figure. The thick solid line represents a signal line with a word length, the thin solid line represents a signal line with a width of 1 byte or narrower, and the dashed line represents a signal line with a single bit.
The figures from bottom to top are the fetch, decode (write back), execute, memory access, and update PC stages introduced just now. Since the decode and write back stages are both operations on the register file, they are drawn in the same position in the figure. The signals marked with circles are the intermediate values generated in each stage mentioned earlier. These values usually play different roles in different instructions, so there will be a situation where one signal branches into two signals. For example, after valA is generated in the figure, it is divided into two lines, one leading to the ALUB control logic and the other leading to the Data control logic. Another example is that after valM is generated in the figure, it is divided into two lines, one leading to the New PC control logic and the other leading to the input end of the register file. We need to understand that a signal is divided into two signals, which means that both receiving ends can read the value of the signal, but reading the value does not mean using the value. The control logic of the receiving end determines whether to use the value, which will be described in detail below.
State Change Cycles of SEQ
I didn't explain the title of the previous figure, but actually left a question. The meaning of SEQ is Sequential (sequential), "SEQ Hardware Structure" means "sequential hardware structure" or "sequential implementation of hardware structure". What! Is there any other way to implement it? The answer is of course, we will uncover the mystery later. The hardware structure of SEQ makes the instructions must be executed one by one in sequence, and the start of the next instruction must be later than the end of the previous instruction. This leads to extremely low processor efficiency, because an instruction must pass through all stages in one clock cycle, and due to the inherent factors of circuit delay, the time required to pass through all stages is very long, which also limits the clock cycle from being improved. However, why must an instruction pass through all stages in one clock cycle?
Because for sequential logic circuits, such as the memory, register file, CC, and program counter in SEQ, they only write data at the rising edge of the clock signal. When the previous instruction ends and the next instruction starts, the rising edge of the clock signal triggers the update of these hardware units. If the new value to be updated has not been generated before the rising edge of the next clock cycle, this instruction is equivalent to not being executed or executed halfway. Therefore, the clock cycle cannot be raised too high, otherwise it will cause instruction execution disorder.
The following figure shows the state changes of each hardware unit controlled by the clock during the process of two instruction cycles.
Tracking Two Execution Cycles of SEQ
It can be seen that in the figure, the other parts outside the four sequential logic circuits are regarded as a whole of the combinational logic circuit. When cycle 3 starts, the combinational logic circuit starts to run, and all results are obtained before cycle 3 ends, ready to be written to devices such as memory. When cycle 4 starts, the values of the memory, register file, CC, and program counter are updated, and at the same time, these new values are read by the combinational logic circuit and start to calculate the results, and so on. Therefore, the state of SEQ changes once every clock cycle.
Implementation of Each Stage of SEQ
The SEQ hardware structure diagram given earlier is only a general implementation, and some details are not given. Now, let's analyze the specific implementation of SEQ stage by stage.
Fetch Stage:
SEQ Fetch Stage
After the instruction is taken from the memory, it is divided into two parts by bytes: Split and Align. Split is further divided into icode and ifun. Align is divided into rA, rB, and valC, which are all easy to understand. The key is the logic of PC increment. How much the PC is incremented depends on the length of this instruction. The length of this instruction also depends on whether the instruction contains a register identifier and whether it contains a constant valC. The two combinational circuits Need valC and Need rigids in the figure are used to make this judgment.
Take Need rigids as an example. Its HCL language description is as follows:
bool need_rigids =
icode in { IRRMOVL, IOPL, IPUSHL, IPOPL, IIRMOVL, IRMMOVL, IMRMOVL };
That is to say, when icode is one of the 7 instruction codes in the parentheses, need_rigids is true. That is to say, these 7 instructions contain register identifiers. Similarly, need_valC can also be determined by this enumeration method. Just find the instructions containing valC from the previous instruction set encoding table and put them in the parentheses.
After need_rigids and need_valC are determined, the new PC value will be calculated according to the following formula for PC increment. In fact, it is adding the length of this instruction:
newPC = oldPC + 1 + need_rigids + 4*need_valC
Now we understand that the combinational circuit represented by the gray box can be described in HCL language. In the actual circuit, these HCL statements will be synthesized into real combinational logic circuits through synthesis. Here, HCL is a good abstraction, separating the principle from the specific implementation, which is convenient for our design.
Decode and Write Back Stages:
SEQ Decode and Write Back Stages
Both of these stages are related to the reading and writing of the register file. The signals icode, rA, and rB obtained from the fetch stage are used as input signals here, and some combinational circuits are used to generate the input of the register file. Our purpose is that in the decoding stage, for commands that need to use specific registers, take out the values of these registers from the register file, the address is determined by srcA and srcB, and the result output is valA and valB; in the write back stage, write the result valE of the execution stage or the result valM of the memory access stage back to the specific register, and the address of the register is determined by dstE and dstM. Take the combinational circuit srcA as an example, its HCL expression is:
int srcA =
;
The square brackets are similar to the switch statement in C language. When the condition before the first semicolon is satisfied, rA is returned, and the following two conditions are no longer considered; otherwise, it is judged whether the second condition is satisfied, and if satisfied, RESP is returned; otherwise, RNONE is returned, indicating that no register file needs to be read. It can be seen from this that in the decoding stage, when the instruction is one of the four before the first semicolon, the value of the rA register will be read and put into the result valA; when the instruction is one of the two before the second semicolon, the value of the RESP register will be read and put into the result valA; otherwise, no register needs to be read.
Similar to srcA, there are three combinational logic circuits: srcB, dstE, and dstM. Their HCL expressions can be analyzed from the SEQ hardware structure and instruction set encoding, and will not be described one by one.
Execution Stage:
SEQ Execution Stage
The ALU needs two operands and an alufun signal. The alufun signal is used to indicate what kind of logical operation (addition, subtraction, AND, XOR) the ALU performs on the two operators.
Take the first operand aluA as an example. Its HCL description is as follows:
int aluA =
;
It can be seen that the operand aluA sometimes takes valA, sometimes takes valC, sometimes takes -4 or 4, which is completely determined by the instruction type.
The HCL description of the alufun signal is as follows:
int alufun =
;
Only when the instruction is an IOPL instruction (that is, an arithmetic instruction), alufun is determined by ifun. In other cases, the ALU is used as an adder. This also explains why aluA will take -4 or 4 just now. Therefore, aluA is used as one addend of the adder, and the other addend can only come from valB as can be seen from the figure. Although we have not given the HCL of valB in the decoding stage, we can tell you that the output of valB in these four cases is RESP. Therefore, for ICALL and IPUSH, it is to make the stack pointer esp-4, and for IRET and IPOPL, it is to make the stack pointer esp+4.
Memory Access Stage:
SEQ Memory Access Stage
Mem. read and Mem. write indicate whether the current instruction reads from or writes to memory; Mem. addr and Mem. data supply the address and data for those operations. Taking Mem. addr as an example, the HCL description is as follows:
int mem_addr = [
    icode in { IRMMOVL, IPUSHL, ICALL, IMRMOVL } : valE;
    icode in { IPOPL, IRET } : valA;
    # Other instructions don't need an address
];
Update PC Stage:
SEQ Update PC Stage
The new PC value is selected from valC, valM, and valP. The HCL description of New PC is as follows:
int new_pc = [
    # Call: use the instruction constant
    icode == ICALL : valC;
    # Taken branch: use the instruction constant
    icode == IJXX && Cnd : valC;
    # Return: use the value read from the stack
    icode == IRET : valM;
    # Default: use the incremented PC
    1 : valP;
];
General Principles of Pipelining
So far this has all been prelude; now we finally reach the main topic. (The prelude was indeed a bit long, haha.)
We planted a hint back in "State Change Cycles of SEQ", and now it is time to reveal the answer: the clock frequency of SEQ is too low, so we need ways to raise it. Two approaches come to mind: shorten the execution time of each instruction, or let multiple instructions execute simultaneously. The first is not feasible, because each instruction's execution time is hard to compress; it is determined by the inherent properties of the circuit. So we can only take the second approach, namely pipelining.
Let's first describe pipelining with a vivid metaphor. Imagine a self-service restaurant with a conveyor belt: plates of food ride the belt past the customers, who take whatever they like. If we regard each plate as an instruction and the customers along the belt as the stages of instruction execution, then SEQ is like putting only one plate on the belt at a time, adding the next plate only after the current one reaches the end of the belt. If a restaurant really worked this way, the customers would probably starve. In reality, plates are placed on the belt one after another, and each customer sends one plate off and immediately welcomes the next, which greatly improves efficiency.
Pipelining in a processor works the same way. Each stage has an instruction executing in it, so with 6 stages there can be 6 instructions in flight at once, raising throughput to 6 times that of SEQ. Sounds powerful, doesn't it? However, things are far from that simple. The most immediate question: will multiple instructions interfere with each other?
Let's look back at the SEQ hardware structure diagram. Signals often cross stages: for example, valC produced in the fetch stage feeds directly into New PC in the update-PC stage. In a pipeline this causes trouble, because later instructions overwrite the valC produced by earlier ones; by the time an earlier instruction reaches the update-PC stage and reads valC, it is no longer the value produced by its own fetch stage. What to do?
The solution is easy to think of: simply save the values that each instruction may need later. This amounts to adding a group of registers to each stage and, at the start of each stage, loading them with the values belonging to the current instruction. These registers inserted between stages are called pipeline registers.
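The behavior of such a pipeline register can be sketched in Python. This is a simplification: the real registers also support stall and bubble controls, which come up later.

```python
class PipelineRegister:
    """State handed from one stage to the next; the visible outputs
    change only at the rising edge of the clock."""
    def __init__(self, **fields):
        self.out = dict(fields)    # what the downstream stage currently sees
        self.nxt = dict(fields)    # what the upstream stage has computed
    def clock(self):
        self.out = dict(self.nxt)  # rising clock edge: latch the new values

D = PipelineRegister(valC=0)
D.nxt["valC"] = 0x12345        # fetch produces a new valC...
before = D.out["valC"]         # ...but the next stage still sees the old one
D.clock()                      # clock cycle boundary
after = D.out["valC"]          # now the next stage sees the new value
```

Separating `out` from `nxt` is what lets every stage work on a different instruction within the same cycle.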
Now our processor architecture is updated to PIPE- (Pipeline-, the minus sign means non-final version), as shown in the following figure.
PIPE- Hardware Structure
There are two changes compared with SEQ. First, the update-PC stage is merged with the fetch stage, and the PC is updated before fetching. Second, pipeline registers are inserted between every two stages. These registers are clocked: their contents are updated at the start of each clock cycle, passing the state of the current instruction on to the next stage.
Pipeline Hazards
Are we done now? No. A careful analysis of PIPE- reveals remaining problems. Although the pipeline registers isolate the data of different instructions, dependencies between instructions remain, in two forms:
Data dependency: the register or memory location written by an earlier instruction is exactly the one a later instruction needs to read. In PIPE-, when the later instruction reads its registers in the decode stage, the earlier instruction has only just reached the execute stage, so the new value has not yet been written back. If the later instruction reads the register at this point, it gets the old value, violating the semantics of sequential execution.
Control dependency: when an instruction is a jump, call, or return, the address of the next instruction cannot be determined in advance; it depends on the result of the current instruction, so the pipeline may have to be interrupted.
These dependencies can cause the pipeline to compute wrong results, a phenomenon called a pipeline hazard. Let's first consider data hazards. The figure below shows the staged execution of a code fragment.
Execution Process of prog1 Code Segment
Three nop instructions are inserted between irmovl $3, %eax and addl %edx, %eax. This way, the former completes its write-back stage just as the latter begins its decode stage, guaranteeing that the register is read only after it has been written. No data hazard occurs.
Now look at the following figure.
Execution Process of prog2 Code Segment
Now one nop has been removed, and the situation immediately deteriorates. The write-back stage of instruction 0x006 and the decode stage of instruction 0x00e fall in the same cycle, but since the register write does not take effect until the start of cycle 7, the decode stage still reads the old value: a data hazard occurs.
If the remaining two nops are removed as well, even more serious data hazards will obviously occur; we will not verify that here. Next, let's consider how to avoid data hazards.
There are two solutions:
Stalling: much like inserting nop instructions, the processor automatically inserts a bubble between instructions where a data hazard would occur, pausing the currently executing instruction for one clock cycle.
Execution Process When prog2 Uses Stalling
As the figure shows, when the addl instruction reaches the decode stage it detects that a data hazard is about to occur, so a bubble is inserted and addl repeats the decode stage for one extra clock cycle.
If all the nops are removed, inserting bubbles can still resolve the data hazard, but several bubbles are needed, as shown in the figure below.
Execution Process When prog4 Uses Stalling
Forwarding: stalling has a drawback: it reduces execution efficiency, since the inserted bubbles do no useful work and are pure wasted time. Forwarding makes better use of each cycle.
Still take the previous code segment as an example to explain how forwarding works.
Execution Process When prog2 Uses Forwarding
As the figure shows, when addl reaches the decode stage, irmovl reaches the write-back stage. Since the register has not been written yet, reading it now would cause a data hazard. But a clever trick avoids it: since the write-back stage cannot update the register until the start of the next cycle anyway, we might as well forward the value to be written directly to the decode stage. The decode stage then no longer reads the register file and simply uses the forwarded value.
Next, what about the prog3 code segment?
Execution Process When prog3 Uses Forwarding
The difference between prog3 and prog2 is one fewer nop, so when addl reaches the decode stage, irmovl has only reached the memory access stage. But this does not hinder forwarding, because irmovl does not touch memory: the value destined for the register in the next stage already exists as M_valE (a note on naming: M_valE means the valE field in the M-stage pipeline register; see the PIPE- hardware structure diagram above). So we can forward M_valE directly to the decode stage.
Next, what about the prog4 code segment?
Execution Process When prog4 Uses Forwarding
Now no nop remains; addl immediately follows irmovl, so when addl reaches the decode stage, irmovl has only reached the execute stage. Surprisingly, forwarding still works. First, the register value ultimately needed is exactly the result computed in the execute stage. Second, we must ask whether the execute stage can produce that result in time to be forwarded to the decode stage. It can: even though the decode stage obtains the value within this cycle, it does not latch it into the execute stage's pipeline register until the start of the next cycle. So as long as the ALU finishes before the next cycle begins, the forwarded value always arrives in time.
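Putting the three cases together, the decode stage's choice of valA can be sketched as a priority chain in Python. This is a simplification of the actual PIPE selection logic; the e_/M_/W_ prefixes follow the pipeline-register naming convention above, and the register IDs are illustrative.

```python
RNONE = 0xF   # "no register" ID, as in the register table above

def select_valA(srcA, reg_file, e_dstE, e_valE, M_dstE, M_valE, W_dstE, W_valE):
    """Pick the freshest value for source register srcA: forward from the
    youngest in-flight instruction that writes it, else read the register file."""
    if srcA == RNONE:
        return None              # this instruction reads no register here
    if srcA == e_dstE:
        return e_valE            # ALU result computed in this very cycle
    if srcA == M_dstE:
        return M_valE            # value waiting in the M-stage pipeline register
    if srcA == W_dstE:
        return W_valE            # value about to be written back
    return reg_file[srcA]        # no hazard: the register file is up to date
```

Checking the execute stage first matters: when several in-flight instructions write the same register, the youngest one carries the value the program order demands.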
Doesn't that feel magical? Forwarding solves the data hazard without sacrificing efficiency. Wonderful. But nothing is perfect: the example above is just irmovl followed by addl on the same register, while real programs contain far more combinations. Can forwarding solve them all? Consider the following example.
Load/Use Data Hazard
The two instructions at 0x018 and 0x01e in prog5 form what is called a load/use data hazard: mrmovl loads data from memory into %eax, and addl immediately uses %eax. If we simply forwarded mrmovl's execute-stage value to addl, we would get a wrong result. The reason is easy to see: mrmovl does not obtain the correct value for %eax until its memory access stage, so forwarding from the execute stage to the decode stage is impossible. The fix combines stalling with forwarding: first stall one cycle, then, once mrmovl reaches the memory access stage, forward the value to addl correctly.
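The detection condition can be sketched as follows, mirroring the idea just described: when the instruction in the execute stage is a memory load whose destination register is about to be read in decode, the control logic stalls for one cycle. The instruction codes come from the Y86 encoding table above.

```python
IMRMOVL, IPOPL = 0x5, 0xB   # the two instructions that load from memory

def load_use_hazard(E_icode, E_dstM, d_srcA, d_srcB):
    """True when the execute-stage instruction is loading from memory into a
    register that the decode-stage instruction wants to read this cycle.
    The control logic then stalls one cycle before forwarding can succeed."""
    return E_icode in (IMRMOVL, IPOPL) and E_dstM in (d_srcA, d_srcB)
```

Only these two instruction codes trigger the stall; for every other producer, plain forwarding already suffices.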
Well, after solving so many problems, we can finally give our final version of the hardware structure diagram.
PIPE Hardware Structure
What has been added relative to PIPE- is the forwarding circuitry that resolves data hazards; the receiving ends of the forwarded values are mostly in the decode stage.
More Perfect Design
We always strive for perfection, and our current PIPE is not yet perfect: some key details remain unconsidered.
Exception handling: an extremely important aspect of a processor. Many instructions can encounter exceptions during execution, such as an invalid memory address or an invalid instruction encoding. When an exception occurs, execution should stop immediately, and from the outside it must appear that all earlier code executed completely and no later code executed at all. What sounds simple is not so easy in PIPE, because several instructions are in flight at once: when one instruction raises an exception in some stage, later instructions may already be partially executed. To make them appear never executed, their effects must be cancelled, which requires strengthening the control logic.
Control hazards: the control dependencies mentioned under pipeline hazards lead to control hazards. A conditional jump requires branch prediction, and when the prediction is wrong, the wrongly executed instructions must be cancelled and the correct branch executed instead. A subroutine return must fetch the return address from memory, so the next instruction cannot begin until the return reaches its memory access stage. These special cases demand careful handling in the control logic.
Explaining the implementation of these two parts in detail would take considerable space; interested readers can visit the book's official website to learn more.
Gap with Real Instruction Set Architectures
This article has walked through the design of the Y86 instruction set architecture. Rough as the description is, it still took considerable space; yet compared with the complexity of a real ISA such as x86, it is tiny. We specified only a very simple instruction set and completed a simple implementation.

A real instruction set contains many more instructions, including multi-cycle ones such as floating-point operations, which cannot finish in one cycle and require additional hardware units. We also treated memory in Y86 as an ideal storage unit, assuming data access completes in one clock cycle; in reality CPU and memory speeds differ by orders of magnitude, and a memory hierarchy of multiple cache levels is usually needed to speed up access. Modern processors also use multiple-issue and out-of-order execution: instructions no longer proceed stage by stage as in Y86, but execute simultaneously and not necessarily in the order they appear in the code. In recent years processors have moved toward multiple cores, whose greater processing power has also made code-level parallel execution a trend.

We cannot know what new technologies future processors will adopt, only that they will grow ever more complex. Yet however things change, understanding the basic principles of the processor and its instruction set helps us see through it all: no matter how complex a system is, it grows step by step from a basic form, and grasping the core is what matters most.