CPU Pipeline Organization

Transcript

1 CPU Pipeline Organization. Reference: D. A. Patterson, J. L. Hennessy, Struttura e Progetto dei Calcolatori: L'interfaccia Hardware-Software, 2nd edition, Zanichelli editore, ISBN: Calcolatori Elettronici 1

2 Sequential execution 01_Esecuzione_sequenziale.exe Time = 4 * ( ) = 6 h

3 Pipelined execution 02_Esecuzione_pipeline.exe

4 Pipelining. Pipelining does not reduce the latency of a single task; it increases the throughput of the whole workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously. Potential speedup = number of pipeline stages. Unbalanced pipeline stage lengths reduce the speedup.
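The effect of the slowest stage can be checked with a short calculation. The sketch below is a simplified model, not taken from the slides: it assumes the stages advance in lock step, so every stage slot lasts as long as the slowest stage, and the stage times and task count are hypothetical values chosen for illustration.

```python
def sequential_time(stage_times, n_tasks):
    """Total time when each task finishes every stage before the next task starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(stage_times, n_tasks):
    """Total time when tasks overlap and all stages advance in lock step.

    Assumption: each stage slot lasts as long as the slowest stage.
    The first task needs len(stage_times) slots; each later task adds one slot.
    """
    slot = max(stage_times)
    return (len(stage_times) + n_tasks - 1) * slot


# Hypothetical stage times (arbitrary time units) with the same total work.
balanced = [10, 10, 10, 10, 10]
unbalanced = [5, 5, 20, 10, 10]

for name, stages in (("balanced", balanced), ("unbalanced", unbalanced)):
    seq = sequential_time(stages, 100)
    pipe = pipelined_time(stages, 100)
    print(f"{name}: sequential={seq}, pipelined={pipe}, speedup={seq / pipe:.2f}")
```

With five balanced stages, 100 tasks finish about 4.8 times sooner than in sequential execution. With the same total work split unevenly the gain drops to about 2.4, because the 20-unit stage sets the pipeline rate.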

5 Pipelined instruction execution
         T0      T1      T2      T3      T4      T5      T6      T7      T8      T9      T10     T11
Instr 0  IFetch  IDec    IExe    IMem    IWrB
Instr 1          IFetch  IDec    IExe    IMem    IWrB
Instr 2                  IFetch  IDec    IExe    IMem    IWrB
Instr 3                          IFetch  IDec    IExe    IMem    IWrB
Instr 4                                  IFetch  IDec    IExe    IMem    IWrB
Instr 5                                          IFetch  IDec    IExe    IMem    IWrB
Instr 6                                                  IFetch  IDec    IExe    IMem    IWrB
Instr 7                                                          IFetch  IDec    IExe    IMem    IWrB

6 Pipeline performance. The execution time of a single instruction does not decrease. The average execution time per instruction is reduced by a factor of N (ideal case). Throughput improves by a factor of N. One instruction completes per clock cycle: CPI_Pipe = 1. Why? (CPI_Sequential = N)

7 Pipeline performance. $\mathrm{CPUtime}_{Pipe} = IC \cdot CPI_{Pipe} \cdot T_{CK} = IC \cdot 1 \cdot T_{CK}$. $\mathrm{CPUtime}_{Seq} = IC \cdot N \cdot T_{CK} = N \cdot \mathrm{CPUtime}_{Pipe}$. $\mathrm{Speedup} = \frac{\mathrm{CPUtime}_{Seq}}{\mathrm{CPUtime}_{Pipe}} = N$. But this is true only in the ideal case!
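As a quick numeric check of the ideal-case formulas, the sketch below evaluates both CPU times. The instruction count and clock period are hypothetical values, and the function name cpu_time is introduced here only for illustration.

```python
def cpu_time(ic, cpi, t_ck):
    """CPU time = instruction count x cycles per instruction x clock period."""
    return ic * cpi * t_ck


N_STAGES = 5        # pipeline depth N (five stages, as in the DLX)
IC = 1_000_000      # hypothetical instruction count
T_CK = 2e-9         # hypothetical clock period in seconds

seq = cpu_time(IC, N_STAGES, T_CK)  # sequential machine: CPI = N
pipe = cpu_time(IC, 1, T_CK)        # ideal pipeline: CPI = 1
speedup = seq / pipe

print(f"sequential: {seq:.4f} s, pipelined: {pipe:.4f} s, speedup: {speedup:.2f}")
```

The ratio equals N only because both machines are given the same clock period and the pipeline never stalls, which is the ideal case described in the slide.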

8 Multicycle sequential DLX [Figure: datapath with PC, memory, instruction register, register file, sign extension, 2-bit left shifters, ALU, Target register, multiplexers, and the control unit. Control signals: RegDest, RegWrite, ALUSelB, ALUSelA, ALUop, TargetWrite, PCSource, Mem2Reg, IRWrite, MemWrite, MemRead, IorD, PCWrite, PCWriteCond]

9 Sequential DLX 03_Datapath_DLX_sequenziale.exe

10 Pipelined DLX [Figure: five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; PC with +4 adder, instruction memory, register file, 16-to-32-bit sign extension, multiplexers, ALU, Zero? test, branch-taken signal, data memory]

11 Pipelined DLX execution 04_Datapath_DLX_pipeline.exe

12 DLX pipeline (stages IF, ID, EX)
IF, any instruction:
  IF/ID.IR ← Mem[PC];
  IF/ID.NPC, PC ← (if EX/MEM.cond {EX/MEM.ALUOutput} else {PC+4});
ID, any instruction:
  ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
  ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
  ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16..31;
EX, ALU instruction:
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.ALUOutput ← ID/EX.A func ID/EX.B; or EX/MEM.ALUOutput ← ID/EX.A op ID/EX.Imm;
  EX/MEM.cond ← 0;
EX, load or store instruction:
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.ALUOutput ← ID/EX.A + ID/EX.Imm;
  EX/MEM.cond ← 0;
  EX/MEM.B ← ID/EX.B;
EX, branch instruction:
  EX/MEM.ALUOutput ← ID/EX.NPC + ID/EX.Imm;
  EX/MEM.cond ← (ID/EX.A op 0);

13 DLX pipeline (stages MEM, WB)
MEM, ALU instruction:
  MEM/WB.IR ← EX/MEM.IR;
  MEM/WB.ALUOutput ← EX/MEM.ALUOutput;
MEM, load or store instruction:
  MEM/WB.IR ← EX/MEM.IR;
  MEM/WB.LMD ← Mem[EX/MEM.ALUOutput]; or Mem[EX/MEM.ALUOutput] ← EX/MEM.B;
WB, ALU instruction:
  Regs[MEM/WB.IR16..20] ← MEM/WB.ALUOutput; or Regs[MEM/WB.IR11..15] ← MEM/WB.ALUOutput;
WB, load instruction:
  Regs[MEM/WB.IR11..15] ← MEM/WB.LMD;

14 Control unit for the pipeline [Figure: the Control block decodes the instruction and feeds the EX, M, and WB control fields through the pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB] The control unit in the ID stage produces the control signals that will be used in the subsequent EX, MEM, and WB stages. Signals not used in a stage are passed on to the following stages.

15 Datapath + control unit [Figure: pipelined datapath with PC, +4 adder, instruction memory, register file, sign extension of Instruction[15-0] from 16 to 32 bits, shift left 2, branch adder, ALU with ALU control, data memory, and pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB carrying the EX, M, and WB control fields. Control signals: PCSrc, RegWrite, ALUSrc, ALUOp, RegDst, Branch, MemWrite, MemRead, MemtoReg. The destination register is selected between Instruction[20-16] and Instruction[15-11]]

16 Resources involved 05_Risorse_coinvolte.exe

17 Problems. Potential resource conflicts! 1. With separate instruction and data memories (IM and DM) there is no conflict, but for the same clock cycle time the memory must be 5 times faster! 2. The ID and WB stages use the same register file in the same clock period for different instructions! 3. The PC must be updated every clock period to fetch one instruction per clock. What happens for a branch whose outcome is known only in the MEM stage? Registers A, B, and Imm are used in the EX stage (instruction I) and, in the same clock cycle, written by the decode stage for instruction I+1! The IR register is written in the IF stage (instruction I) while, in the same clock cycle, its content for instruction I-4 is needed in the WB stage!

18 Limits to pipelined execution: hazards. Hazards prevent an instruction from executing in its expected clock cycle. Structural hazards: the hardware resources do not support some combinations of instructions. Data hazards: an instruction depends on the result of an instruction that is still in the pipeline. Control hazards: pipelining of branches and other instructions that change the PC. The simplest solution is to insert stall clock cycles into the pipeline until the hazard is resolved, by inserting one or more bubbles into the pipeline.

19 Structural hazard. Example of a structural hazard when a single memory holds both instructions and data. 06_Structural_hazard_bubble.exe

20 Structural hazard: solutions. Insert a bubble, a nop (no operation), in the ID stage and hold the instruction in the fetch stage. Duplicate the hardware resources.

21 Structural hazard: stall 08_Structural_hazard_Implementazione_stallo.exe

22 Data hazard (time in clock cycles; program execution order in instructions)
                    CC1  CC2  CC3  CC4  CC5  CC6
ADD R1, R2, R3      IM   Reg  ALU  DM   Reg
SUB R4, R1, R5           IM   Reg  ALU  DM   Reg
AND R6, R1, R7                IM   Reg  ALU  DM
OR  R8, R1, R9                     IM   Reg  ALU
XOR R10, R1, R11                        IM   Reg

23 Data hazard: incorrect execution 09_Data_hazards_Esecuzione_errata.exe

24 Data hazard: correct execution 10_Data_hazards_Esecuzione_corretta-Introduzione_bolle.exe

25 Data hazard: inserting stalls
add r1, r2, r3    IF    ID    EX    MEM   WB
sub r4, r1, r2          IF    ID    stall stall EX    MEM   WB

add r1, r2, r3    IF    ID    EX    MEM   WB
subi r3, r2, 10         IF    ID    EX    MEM   WB
addi r4, r1, 5                IF    ID    stall EX    MEM   WB

26 Data hazard: inserting stalls. One solution to data hazards is to insert stall clock cycles. Because the data hazard is detected in the ID stage, when a stall is inserted for a data hazard: the instruction in the ID stage is held by preventing the IF/ID register from being updated; the instruction in the IF stage is held by not updating the PC; the control signals of a nop are written into the ID/EX register. The stall cycles are repeated until the destination register has been updated. The number of stall cycles depends on the distance between the instructions.

27 Inserting stalls for data hazards [Figure: pipelined DLX datapath holding instructions I+1, I, I-1, I-2 in IF/ID, ID/EX, EX/MEM, MEM/WB; the PC and IF/ID updates are blocked and a NOP is multiplexed into ID/EX; the source field IR6..10 is compared against the destination fields of the EX/MEM and MEM/WB instructions]

28 Data hazard: inserting stalls 11_Data_hazards_Datapath_Implementazione_introduzione_stalli.exe

29 Data hazard: inserting stalls. i: add r1, r2, r3; i+1: sub r4, r1, r2. Assuming that r1 is written in the first half of the clock cycle and read in the second half, instructions at distance 1 need 2 stall cycles. i: add r1, r2, r3; i+1: subi r3, r2, 10; i+2: addi r4, r1, 5. Under the same assumption, instructions at distance 2 need 1 stall cycle.
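The stall counts for the two instruction sequences can be reproduced with a small model. The sketch below is an illustration under stated assumptions, not the slides' hardware: a five-stage pipeline without forwarding, a register file written in the first half of the cycle and read in the second half (so a value can be read in ID during the cycle its producer is in WB), and a hypothetical encoding of each instruction as a (destination, sources) pair.

```python
def stall_counts(program):
    """Return the stall cycles inserted before each instruction.

    program: list of (dest, sources) tuples in program order.
    Cycle 0 is the IF cycle of the first instruction.
    """
    wb_cycle = {}   # register -> cycle in which its latest writer is in WB
    prev_id = 0     # cycle in which the previous instruction completed ID
    stalls = []
    for dest, sources in program:
        earliest_id = prev_id + 1  # ID slot the instruction gets with no stall
        # Operands can be read in the same cycle the producer is in WB.
        ready = max((wb_cycle.get(r, 0) for r in sources), default=0)
        id_cycle = max(earliest_id, ready)
        stalls.append(id_cycle - earliest_id)
        if dest is not None:
            wb_cycle[dest] = id_cycle + 3  # ID -> EX -> MEM -> WB
        prev_id = id_cycle
    return stalls


# add r1, r2, r3 ; sub r4, r1, r2   (dependence at distance 1)
distance_1 = [("r1", ("r2", "r3")), ("r4", ("r1", "r2"))]
# add r1, r2, r3 ; subi r3, r2, 10 ; addi r4, r1, 5   (dependence at distance 2)
distance_2 = [("r1", ("r2", "r3")), ("r3", ("r2",)), ("r4", ("r1",))]

print(stall_counts(distance_1))
print(stall_counts(distance_2))
```

The model also shows why a dependence at distance 3 costs nothing under these assumptions: the consumer reaches ID in the same cycle the producer writes back.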

30 Data hazard: forwarding. The instruction that needs the result advances to the execute stage without waiting for the destination register rD to be updated. The value is used as soon as it is produced.

31 Forwarding for data hazards (time in clock cycles; program execution order in instructions)
                    CC1  CC2  CC3  CC4  CC5  CC6
ADD R1, R2, R3      IM   Reg  ALU  DM   Reg
SUB R4, R1, R5           IM   Reg  ALU  DM   Reg
AND R6, R1, R7                IM   Reg  ALU  DM
OR  R8, R1, R9                     IM   Reg  ALU
XOR R10, R1, R11                        IM   Reg

32 Forwarding implementation [Figure b, with forwarding: pipeline registers ID/EX, EX/MEM, MEM/WB; register file; ForwardA and ForwardB multiplexers on the ALU inputs; data memory; forwarding unit with inputs Rs, Rt, Rd, EX/MEM.RegisterRd, MEM/WB.RegisterRd]
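The selection logic of a forwarding unit like the one in the figure can be sketched as follows. It follows the conditions commonly given for MIPS-style five-stage pipelines and is an assumption about the slide's unit, not a transcription of it: register 0 is treated as hard-wired to zero and never forwarded, and the function name, parameter names, and return labels are hypothetical.

```python
def forward_select(src_reg, ex_mem_rd, ex_mem_reg_write, mem_wb_rd, mem_wb_reg_write):
    """Choose where one ALU operand should come from.

    src_reg: source register number of the instruction now in EX (Rs or Rt).
    Returns "EX/MEM", "MEM/WB", or "REG" (the value read from the register file).
    """
    # The most recent producer (one instruction ahead) takes priority.
    if ex_mem_reg_write and ex_mem_rd != 0 and ex_mem_rd == src_reg:
        return "EX/MEM"
    # Otherwise use the producer two instructions ahead, if any.
    if mem_wb_reg_write and mem_wb_rd != 0 and mem_wb_rd == src_reg:
        return "MEM/WB"
    return "REG"


# ADD R1, R2, R3 followed by SUB R4, R1, R5: R1 comes from EX/MEM.
print(forward_select(1, ex_mem_rd=1, ex_mem_reg_write=True,
                     mem_wb_rd=0, mem_wb_reg_write=False))
```

The same function is evaluated once per ALU input, producing the ForwardA and ForwardB selections.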

33 Forwarding for lw (time in clock cycles; program execution order in instructions)
                    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9
lw  $2, 20($1)      IM   Reg  ALU  DM   Reg
and $4, $2, $5           IM   Reg  ALU  DM   Reg
or  $8, $2, $6                IM   Reg  ALU  DM   Reg
add $9, $4, $2                     IM   Reg  ALU  DM   Reg
slt $1, $6, $7                          IM   Reg  ALU  DM   Reg

34 Inserting a stall for lw (time in clock cycles; program execution order in instructions). The repeated Reg and IM stages mark the bubble.
                    CC1  CC2  CC3  CC4  CC5  CC6  CC7  CC8  CC9  CC10
lw  $2, 20($1)      IM   Reg  ALU  DM   Reg
and $4, $2, $5           IM   Reg  Reg  ALU  DM   Reg
or  $8, $2, $6                IM   IM   Reg  ALU  DM   Reg
add $9, $4, $2                          IM   Reg  ALU  DM   Reg
slt $1, $6, $7                               IM   Reg  ALU  DM   Reg

35 Software Scheduling. Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
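The benefit of the reordering can be counted with a short script. This is a sketch under stated assumptions, not part of the slides: the pipeline has full forwarding, so the only remaining data stall is one cycle when an instruction uses the result of the load immediately before it, and each instruction is encoded as a hypothetical (opcode, destination, sources) tuple.

```python
def load_use_stalls(program):
    """Count one stall for every instruction that reads the register
    loaded by the instruction immediately before it."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
    ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]

print("slow code stalls:", load_use_stalls(SLOW))
print("fast code stalls:", load_use_stalls(FAST))
```

In the slow sequence both ADD and SUB directly follow the load of one of their operands. In the fast sequence an independent instruction separates every load from its first use.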

36 Compiler Avoiding Load Stalls [Figure: bar chart of the percentage of loads stalling the pipeline, scheduled vs. unscheduled code, for gcc, spice, and tex]

37 Control hazard. A branch causes 3 stall clock cycles while the pipeline waits for the branch outcome (the PC is updated in the MEM stage).
Branch instruction      IF    ID    EX    MEM   WB
Branch successor              IF    stall stall stall IF    ID    EX    MEM   WB
Branch successor + 1                                        IF    ID    EX    MEM   WB
Branch successor + 2                                              IF    ID    EX    MEM
Branch successor + 3                                                    IF    ID    EX
Branch successor + 4                                                          IF    ID
Branch successor + 5                                                                IF

38 Control hazard. The number of stall cycles can be reduced to a single cycle by moving the branch test earlier, into the ID stage. [Figure: pipelined DLX datapath with the branch-target adder and the Zero? test moved into the ID stage; pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; PC, +4 adder, instruction memory, register file, sign extension, ALU, data memory]

39 Inserting stalls for control hazards [Figure: pipelined DLX datapath holding instruction I+1, the branch, I-1, and I-2 in IF/ID, ID/EX, EX/MEM, MEM/WB; a NOP is multiplexed into the pipeline in place of the instruction fetched after the branch]

40 Control hazard: stalls 12_Control_hazard_Datapath_Implementazione_introduzione_stalli.exe

41 Branch frequency [Figure: percentage of instructions executed that are forward conditional branches, backward conditional branches, and unconditional branches, for the benchmarks compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor]

42 Branch frequency [Figure: fraction of all conditional branches that are backward taken and forward taken, for the benchmarks compress, eqntott, espresso, gcc, li, doduc, ear, hydro2d, mdljdp, su2cor]

43 Four Branch Hazard Alternatives
#1: Stall until branch direction is clear.
#2: Predict Branch Not Taken. Execute successor instructions in sequence. Squash instructions in pipeline if branch actually taken. Advantage of late pipeline state update. 47% of DLX branches not taken on average. PC+4 already calculated, so use it to get next instruction.
#3: Predict Branch Taken. 53% of DLX branches taken on average. But the branch target address has not yet been calculated in DLX, so DLX still incurs a 1 cycle branch penalty. Other machines: branch target known before outcome.
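The cost of the first three alternatives can be compared with a short calculation. This is a sketch under stated assumptions: the branch frequency of 20% is a hypothetical value, the penalty is one cycle as stated for DLX above, and the 53% taken fraction is the DLX average quoted in the slide. The function name and parameters are introduced here for illustration.

```python
def branch_cpi(branch_freq, penalty_taken, penalty_not_taken, taken_fraction=0.53):
    """CPI of a pipeline whose only stalls come from branches.

    penalty_taken / penalty_not_taken: stall cycles paid when the branch
    is taken or not taken under the chosen scheme.
    """
    avg_penalty = (taken_fraction * penalty_taken
                   + (1.0 - taken_fraction) * penalty_not_taken)
    return 1.0 + branch_freq * avg_penalty


BRANCH_FREQ = 0.20  # hypothetical fraction of executed instructions that are branches

schemes = {
    "stall": branch_cpi(BRANCH_FREQ, 1, 1),              # always pay one cycle
    "predict not taken": branch_cpi(BRANCH_FREQ, 1, 0),  # pay only when taken
    "predict taken (DLX)": branch_cpi(BRANCH_FREQ, 1, 1),  # target not ready early
}

for name, cpi in schemes.items():
    print(f"{name}: CPI = {cpi:.3f}")
```

Under these assumptions predict-not-taken saves the penalty on the 47% of branches that fall through, while predict-taken gains nothing on DLX because the target address is not available before the outcome.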

44 Control hazard. The predict-not-taken scheme and the pipeline sequence when the branch is untaken (top) and taken (bottom).
Untaken branch instr.   IF    ID    EX    MEM   WB
Instruction i+1               IF    ID    EX    MEM   WB
Instruction i+2                     IF    ID    EX    MEM   WB
Instruction i+3                           IF    ID    EX    MEM   WB
Instruction i+4                                 IF    ID    EX    MEM   WB

Taken branch instr.     IF    ID    EX    MEM   WB
Instruction i+1               IF    idle  idle  idle  idle
Branch target                       IF    ID    EX    MEM   WB
Branch target + 1                         IF    ID    EX    MEM   WB
Branch target + 2                               IF    ID    EX    MEM   WB

45 Four Branch Hazard Alternatives
#4: Delayed Branch. Define branch to take place AFTER a following instruction:
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n
  branch target if taken
A 1 slot delay allows proper decision and branch target address in a 5 stage pipeline. DLX uses this.

46 Control hazard: delay slot. The behavior of a delayed branch is the same whether or not the branch is taken.
Untaken branch instr.       IF    ID    EX    MEM   WB
Branch-delay instr. (i+1)         IF    ID    EX    MEM   WB
Instruction i+2                         IF    ID    EX    MEM   WB
Instruction i+3                               IF    ID    EX    MEM   WB
Instruction i+4                                     IF    ID    EX    MEM   WB

Taken branch instr.         IF    ID    EX    MEM   WB
Branch-delay instr. (i+1)         IF    ID    EX    MEM   WB
Branch target                           IF    ID    EX    MEM   WB
Branch target + 1                             IF    ID    EX    MEM   WB
Branch target + 2                                   IF    ID    EX    MEM   WB

47 Control hazard: delay slot
(a) From before:
    ADD R1, R2, R3
    if R2 = 0 then
        [delay slot]
  becomes
    if R2 = 0 then
        ADD R1, R2, R3
(b) From target:
    SUB R4, R5, R6
    ADD R1, R2, R3
    if R1 = 0 then
        [delay slot]
  becomes
    SUB R4, R5, R6
    ADD R1, R2, R3
    if R1 = 0 then
        SUB R4, R5, R6
(c) From fall through:
    ADD R1, R2, R3
    if R1 = 0 then
        [delay slot]
    SUB R4, R5, R6
  becomes
    ADD R1, R2, R3
    if R1 = 0 then
        SUB R4, R5, R6

48 Delayed Branch. Where to get instructions to fill the branch delay slot? From before the branch instruction. From the target address: only valuable when branch taken. From fall through: only valuable when branch not taken. Cancelling branches allow more slots to be filled. Compiler effectiveness for a single branch delay slot: fills about 60% of branch delay slots; about 80% of instructions executed in branch delay slots are useful in computation; about 50% (60% x 80%) of slots usefully filled.

49 Pipeline performance with hazards. Speedup relative to the sequential version: $\mathrm{CPUtime} = IC \cdot CPI \cdot T_{CK}$, so $\mathrm{Speedup} = \frac{\mathrm{CPUtime}_{Seq}}{\mathrm{CPUtime}_{Pipe}} = \frac{CPI_{Seq}}{CPI_{Pipe}} \cdot \frac{T_{CK,Seq}}{T_{CK,Pipe}}$, with $CPI_{Pipe} = CPI_{Ideal} + \text{stall cycles per instruction}$. Assumptions: balanced stages, no pipeline overhead, $T_{CK,Seq} = T_{CK,Pipe}$. Then $\mathrm{Speedup} = \frac{CPI_{Seq}}{1 + \text{stall cycles per instruction}} = \frac{\text{pipeline stages}}{1 + \text{stall cycles per instruction}}$.

50 Pipeline performance: example 1 (1/2). Given a program with the following instruction mix: Lw 20%, Sw 10%, Branch 20%, ALU 50%. Assuming a single memory, forwarding, the branch outcome computed in the Decode stage, and 50% of the Lw instructions followed by an instruction that depends on them, compute the speedup of the pipelined architecture over a sequential one. $CPI_{Pipe} = CPI_{Ideal} + \text{stall cycles per instruction} = 1 + f_{StructHazard} \cdot Stalls_{StructHazard} + f_{Branch} \cdot Stalls_{Branch} + f_{DataHazard\_Lw} \cdot Stalls_{DataHazard\_Lw}$

51 Pipeline performance: example 1 (2/2). Since $f_{StructHazard} = f_{Lw} + f_{Sw} = 0.2 + 0.1 = 0.3$ with $Stalls_{StructHazard} = 1$; $f_{Branch} = 0.2$ with $Stalls_{Branch} = 1$; $f_{DataHazard\_Lw} = 20\% \cdot 50\% = 0.1$ with $Stalls_{DataHazard\_Lw} = 1$: $CPI_{Pipe} = 1 + 0.3 \cdot 1 + 0.2 \cdot 1 + 0.1 \cdot 1 = 1 + 0.6 = 1.6$. $\mathrm{Speedup} = \frac{CPI_{Seq}}{CPI_{Pipe}} = \frac{\text{pipeline stages}}{1 + \text{stall cycles per instruction}} = \frac{5}{1.6} = 3.125$
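The arithmetic of example 1 can be checked with a few lines of code. The helper below is a sketch whose name and interface are introduced here for illustration; it assumes, as the slides do, an ideal CPI of 1, equal clock periods, and a sequential CPI equal to the number of stages.

```python
def pipeline_speedup(n_stages, hazards):
    """Return (CPI of the pipeline, speedup over the sequential machine).

    hazards: iterable of (frequency, stall_cycles) pairs, one per hazard class.
    """
    cpi_pipe = 1.0 + sum(freq * stalls for freq, stalls in hazards)
    return cpi_pipe, n_stages / cpi_pipe


example_1 = [
    (0.20 + 0.10, 1),  # structural hazard: every Lw and Sw with a single memory
    (0.20, 1),         # branches resolved in the Decode stage
    (0.20 * 0.50, 1),  # Lw followed by a dependent instruction
]

cpi, speedup = pipeline_speedup(5, example_1)
print(f"CPI = {cpi:.2f}, speedup = {speedup:.3f}")
```

Passing the hazard classes as a list makes it easy to see how the result changes when one assumption is altered, for example removing the structural hazard by using two memories.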

52 Pipeline performance: example 2 (1/2). Given a program with the following instruction mix: Lw 25%, Sw 15%, Branch 20%, ALU 40%. Assuming two memories, no forwarding, the branch outcome computed in the MemAccess stage, 30% of the ALU instructions having a data dependence on an instruction at distance 1, and 10% of the ALU instructions having a data dependence on an instruction at distance 2, compute the speedup of the pipelined architecture over a sequential one. $CPI_{Pipe} = CPI_{Ideal} + \text{stall cycles per instruction} = 1 + f_{Branch} \cdot Stalls_{Branch} + f_{DataHazard\_ALU\_Distance1} \cdot Stalls_{DataHazard\_ALU\_Distance1} + f_{DataHazard\_ALU\_Distance2} \cdot Stalls_{DataHazard\_ALU\_Distance2}$

53 Pipeline performance: example 2 (2/2). Since $f_{Branch} = 0.2$ with $Stalls_{Branch} = 3$; $f_{DataHazard\_ALU\_Distance1} = 40\% \cdot 30\% = 0.12$ with $Stalls_{DataHazard\_ALU\_Distance1} = 2$; $f_{DataHazard\_ALU\_Distance2} = 40\% \cdot 10\% = 0.04$ with $Stalls_{DataHazard\_ALU\_Distance2} = 1$: $CPI_{Pipe} = 1 + 0.2 \cdot 3 + 0.12 \cdot 2 + 0.04 \cdot 1 = 1 + 0.6 + 0.24 + 0.04 = 1.88$. $\mathrm{Speedup} = \frac{CPI_{Seq}}{CPI_{Pipe}} = \frac{\text{pipeline stages}}{1 + \text{stall cycles per instruction}} = \frac{5}{1.88} \approx 2.66$
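Breaking the stall cycles of example 2 down by hazard class shows where the lost performance goes. The sketch below uses the figures from the slide; the dictionary keys and the function name are hypothetical labels introduced for illustration.

```python
def stall_breakdown(hazards):
    """Map each hazard class to its stall cycles per instruction."""
    return {name: freq * stalls for name, (freq, stalls) in hazards.items()}


N_STAGES = 5

example_2 = {
    "branch": (0.20, 3),                # outcome known in the memory-access stage
    "alu_distance_1": (0.40 * 0.30, 2),  # no forwarding, dependence at distance 1
    "alu_distance_2": (0.40 * 0.10, 1),  # no forwarding, dependence at distance 2
}

breakdown = stall_breakdown(example_2)
cpi_pipe = 1.0 + sum(breakdown.values())
speedup = N_STAGES / cpi_pipe

for name, cycles in breakdown.items():
    print(f"{name}: {cycles:.2f} stall cycles per instruction")
print(f"CPI = {cpi_pipe:.2f}, speedup = {speedup:.2f}")
```

Branches account for 0.6 of the 0.88 stall cycles per instruction, so resolving the branch earlier would help more than adding forwarding in this instruction mix.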