1 Istituto Nazionale di Fisica Nucleare (Italian National Institute for Nuclear Physics). Progetto PetAPE, APE Group Development Team, INFN Roma.

2 The APE project in the immediate future. Three complementary (and synergic) research lines are active:
Custom networks for PC clusters, apenet+: RM1, RM2.
New-generation commodity processors (Intel) interconnected by 3D toroidal networks implemented on FPGA, AURORA: MIB, FE, PR + industrial partner + PAT (province of Trento); QPACE: IBM Cell (Lele's talk).
Computing systems scalable to tens of Teraflops per rack, based on custom low-power, high-performance multi-tile processors with an integrated 3D torus interconnection network, PetApe: RM1, RM2 + European academic partners, industrial partners.

3 From APE1 to apenext. The complex "normal operation" computed by every APE generation is (A+iB)*(a+ib)+(C+iD) = (Aa-Bb+C) + i(Ab+Ba+D). [dataflow diagram of the multiplier/adder tree: partial products Aa, Ab, Ba, Bb combined with C and D]
APE1 (1988): 1 GFlops. APE100 (1992): 25 GFlops, SP, real. APEmille (1999): 128 GFlops, SP, complex. apenext (2004): 800 GFlops, DP, complex.
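The normal operation above is the complex multiply-accumulate that every APE generation executes as a single pipelined instruction. A minimal C sketch of the same arithmetic (illustrative only, plain C rather than APE assembly; the variable names are ours):

    #include <stdio.h>

    /* Complex "normal operation" r = A*a + C with A = Ar + i*Ai, a = ar + i*ai,
     * C = Cr + i*Ci: 4 multiplications and 4 additions, i.e. 8 flops per result. */
    static void normal_op(double Ar, double Ai, double ar, double ai,
                          double Cr, double Ci, double *rr, double *ri)
    {
        *rr = Ar * ar - Ai * ai + Cr;   /* real part:      Aa - Bb + C */
        *ri = Ar * ai + Ai * ar + Ci;   /* imaginary part: Ab + Ba + D */
    }

    int main(void)
    {
        double rr, ri;
        normal_op(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, &rr, &ri);
        printf("(%g, %g)\n", rr, ri);   /* (1+2i)*(3+4i)+(5+6i) = 0 + 16i */
        return 0;
    }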

4 apenet: a 3D torus for PC clusters

5 The SHAPES project and the DNP (Distributed Network Processor): EU funded (1 MEuro). INFN Roma designs and develops a novel network interconnect for the FP6 SHAPES project, based on the APENet 3D torus topology: 6 links, 10-port crossbar switch, multi-hop packet routing, RDMA HW support, custom SERDES. A library of customizable components for ASIC and FPGA integration.
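The feature list above (3D torus links, 10-port crossbar, multi-hop routing, RDMA support) implies that every packet must carry at least the destination torus coordinates and, for RDMA, a remote buffer address. The C struct below is only an illustrative guess at such a header; it is our assumption, not the actual DNP packet or register layout:

    #include <stdint.h>

    /* Hypothetical torus packet header -- an illustrative sketch, NOT the real
     * SHAPES/DNP wire format.  Multi-hop routing needs destination coordinates;
     * RDMA support needs a target address on the remote node. */
    struct dnp_packet_header {
        uint8_t  dst_x, dst_y, dst_z;   /* destination node on the 3D torus     */
        uint8_t  port;                  /* injection/extraction port (crossbar)  */
        uint16_t flags;                 /* e.g. RDMA write vs. plain message     */
        uint16_t payload_len;           /* payload size in bytes                 */
        uint64_t rdma_target_addr;      /* remote buffer address for RDMA        */
    };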

6 LQCD. Our guess for the LQCD roadmap: today, 24³x48 or 32³x64 lattices, N_f = 2, m_π = 300 MeV; then 48³x96, N_f = 3, m_π = 250 MeV; then 64³x128, N_f = 3, m_π = 200 MeV. A custom MPP computer coming out in 2011 has to target 64³x128 on a single rack, with enough RAM for O(10) propagators (1 TB), and be power efficient.

7 LQCD memory requirements. Gauge configuration: Vol4D*3*4*4. Propagator: Vol4D*3*4*3*4 = 4 * gauge configuration. About 10 propagators are needed for both gauge production and measurements. Total: Vol4D*3*3*4*(1 + 10*Prop) = 790 GB on 64³x128.
FP requirements on 64³x128: Dirac operator = 1370 FP * Vol4D = 46 GFlop; one trajectory ~ 1000 Dirac operator applications = ~50 TFlop, i.e. 33 s on one APENET+ rack.
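A rough consistency check of these numbers, assuming double-precision (8-byte) storage and the standard counting of 288 real numbers per site for a propagator and 72 per site for the gauge field (our counting, which may differ in detail from the slide's formulas):

\begin{aligned}
V &= 64^3 \times 128 \approx 3.36 \times 10^{7}\ \text{sites},\\
\text{one propagator} &\approx 288 \times 8\,\text{B} \times V \approx 77\ \text{GB},\qquad
\text{gauge field} \approx 72 \times 8\,\text{B} \times V \approx 19\ \text{GB},\\
10\ \text{propagators} + \text{gauge field} &\approx 773\ \text{GB} + 19\ \text{GB} \approx 790\ \text{GB},\\
\text{Dirac operator} &\approx 1370\ \text{flop/site} \times V \approx 46\ \text{Gflop},\qquad
1000\ \text{applications} \approx 46\ \text{Tflop}.
\end{aligned}

At the ~1.5 TFlops sustained per APENET+ rack quoted on the next slide, 46 Tflop takes about 31 s, consistent with the 33 s above.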

8 LQCD mapping on future APE machines.
2009: cluster code, Intel 8-core Xeon processors, 25% efficiency (Luescher SSE), APENET+ 3D torus network on PCIe 2.0. At 3.0 GHz (.25*4FP*8*Clk) = 12 GFlops sustained; 1 rack = 128*12 = 1.5 TFlops sustained.
2010: current APE LQCD codes, Pet-Ape 8-core APOTTO processor, >50% efficiency. At 500 MHz (.5*8FP*8*Clk) = 16 GFlops sustained; 1 rack = 1024*16 = 16 TFlops sustained.
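Expanding the shorthand (efficiency x flops per cycle per core x cores x clock) for the APOTTO case, and the two rack totals:

\begin{aligned}
\text{APOTTO node:}\quad & 0.5 \times 8\ \text{flop/cycle} \times 8\ \text{cores} \times 0.5\ \text{GHz} = 16\ \text{GFlops sustained},\\
\text{PetApe rack:}\quad & 1024 \times 16\ \text{GFlops} \approx 16\ \text{TFlops sustained},\\
\text{APENET+ rack:}\quad & 128 \times 12\ \text{GFlops} \approx 1.5\ \text{TFlops sustained}.
\end{aligned}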

9 Industrial applications: APE exploitation.
Medical ultrasound scanners: functional medical diagnosis through fast, better-quality imaging requires at least a Teraflops board (complex domain).
Voice commands in noisy environments (cars, airplanes, railways, public places, homes): low-end applications need shapotto/apotto-like chips.
Robotics: the ears and mouth of robots; the industrial revolution of the next decade (together with energy and sustainable growth).
High-quality audio entertainment (multi-loudspeaker wave field synthesis, i.e. sound holograms, complex domain): cinemas, home theaters.

10 This proposal. Objective: provide adequate computational resources to INFN theoretical groups, now and beyond. Two projects, but synergic deliveries and a unified research line (the interconnection network).
ApeNet+ addresses LQCD requirements: buy the best cluster on the market and add a custom 3D torus network (based on DNP components), a PCI Express card based on the APENet design, updating the PHY if needed.
Pet-Ape addresses LQCD requirements beyond 2011: custom VLSI processor, 3D torus network (based on DNP components), custom system engineering.

11 PetApe & apeNET+ roadmap. [Gantt chart spanning about three and a half years: Distributed Network Processor (DNP beta and first release), APEnet+ (RTL, 4-node system, 128-node system), APOTTO (RTL, synthesis, place & route, prototype), PetApe (board, rack, populated board with 32 nodes, populated rack with 1024 nodes)]
DNP: 3D network processor IP library (outcome of the SHAPES FP6 project).
APENET+: PCI Express card based on the DNP IP library, implementing the 3D network on commodity clusters (a DNP testbed on a real production system).
APOTTO: 32 GFlops single-precision (8 GFlops double-precision) multi-tile, very low power, VLIW processor (Atmel MagicV and INFN apenext derivative).
Pet-Ape: PetaFlops-range computer based on APOTTO and DNP processors.

12 PetApe & apeNET+ deliveries. [same Gantt chart as the previous slide, annotated with the delivery milestones]
Q3 2009: APENET+ 6 TFlops rack.
Q1 2010: APOTTO prototypes.
Q2 2010: APENET+ 100 TFlops system.
Q2 2010: PetApe boards.
Q2 2011: Pet-Ape 32 TFlops rack.
Q3 2012: Pet-Ape 1 PFlops system.

13 Why apenet+. Feasibility proven (previous generation); natural DNP test-bed. It is a scalable, modular and cost-effective interconnection technology, with minimal cost and restricted time to market for system updates: PC update (procurement), APENet+ firmware update (manpower), APENet+ HW update (minimal cost, reduced development effort). The 3D torus fits the requirements of many scientific applications (DD-HMC, GROMACS, Gadget2), and it is exploited even better in a multi-core environment, where a 4th-dimension coordinate is assigned to each core (see the sketch below).
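One way such a core-as-fourth-dimension mapping could look in practice; this helper is a hypothetical sketch of ours, not part of the apenet+ firmware or API:

    #include <stdio.h>

    /* Illustrative sketch (our assumption, not apenet+ code): map a 4D lattice
     * coordinate onto a 3D torus of multi-core nodes.  The x, y, z coordinates
     * select the node; the t coordinate is split among the cores of that node. */
    #define NX 32      /* example torus size in x */
    #define NY 8       /* example torus size in y */
    #define NZ 4       /* example torus size in z */
    #define NCORES 8   /* cores per node: the "4th dimension" */

    typedef struct { int node_x, node_y, node_z, core; } placement;

    static placement map_site(int x, int y, int z, int t, int lt_per_core)
    {
        placement p;
        p.node_x = x % NX;
        p.node_y = y % NY;
        p.node_z = z % NZ;
        p.core   = (t / lt_per_core) % NCORES;  /* contiguous t-slices per core */
        return p;
    }

    int main(void)
    {
        placement p = map_site(3, 5, 2, 19, 16);  /* 16 t-slices per core */
        printf("node (%d,%d,%d), core %d\n", p.node_x, p.node_y, p.node_z, p.core);
        return 0;
    }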

14 Custom vs PC clusters. [comparison table, numeric values not recoverable: PC Cluster 1U twin vs Pet-Ape DP 2011 vs Pet-Ape SP 2011, over GFlops peak/processor, TFlops peak/rack, TFlops sustained/rack, KEuro/rack, KEuro/TFlops sustained, kW/rack, kW/TFlops sustained, QCD sustained ratio, and, for a PFlops-sustained machine, power (kW), floor area (m²) and cost (MEuro)]
PetApe vs apenext: a factor of 100 in sustained performance. APE vs PC: a factor of 4-10 in sustained performance.

15 ApOtto. [FPU operation table, numeric entries not recoverable: FP word type, word size (bits), flops per cycle and computing power at 500 MHz for the MAC V, MAC C and MAC D variants]
ApOtto is a multi-tile (8+1) processor (45 nm, see the next slide): one RISC core + 8 VLIW FP cores. Complex MAC in single precision, real MAC in double precision. Hierarchical network with a DNP-based network controller: on-chip, high-bandwidth NoC; off-chip, point-to-point 3D torus. Hierarchical memory: on-chip, on-tile buffers (multiport register file, 128 KB memory, DDM); 1 (up to 4) Gb of local memory bank per tile (DXM); shared on-chip memory (scratch pad). Design re-use: apenext J&T, ATMEL MagicV.

16 ApOtto key numbers. APOTTO tile microarchitecture: complex SP and real DP for high efficiency, very low power and dense systems. Target clock frequency 0.5 GHz; 32(40)/8 GFlops (SP/DP) aggregated performance. DXM aggregated peak bandwidth: 18(36) GB/s. Tile-to-tile peak bandwidth (NoC): 18 GB/s. 3DT aggregated peak bandwidth: 6 GB/s. 3DT channel size: 20 wires (5 lines x 2 (bidirectional) x 2 (differential)). Preliminary die size estimate (45 nm): < 40 mm², each tile less than 4 mm². Power consumption estimate: ~8 W. Pinout: ~600 functional pins plus power pins (360 pins for the memory interface, 120 pins for the 3DT, 100 pins for I/O and peripherals). Processor package: ~3.5 x 3.5 cm².
[area and power breakdown table, numeric values not recoverable: columns for complex-single, complex-single + real-double and complex-double variants; rows for the Magic core, register file, PM (single-port memory), DM (dual-port memory), single-tile area, multi-tile (8) area, DNP size, multi-tile total area (including DNP and P&R overhead), multi-tile power (W), multi-tile module power including memory banks and DC/DC efficiency (W); rack performance SP(DP): 32 (-), 32 (8), 32 (32) TFlops; rack consumption (kW); W/GFlops]
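The pin budget above is internally consistent; spelling out the arithmetic:

\begin{aligned}
\text{3DT channel} &= 5\ \text{lines} \times 2\ (\text{bidirectional}) \times 2\ (\text{differential}) = 20\ \text{wires},\\
\text{6 channels} &= 6 \times 20 = 120\ \text{pins for the 3DT},\\
\text{functional pins} &= 360\ (\text{memory}) + 120\ (\text{3DT}) + 100\ (\text{I/O}) = 580 \approx 600.
\end{aligned}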

17 Modul8+: APOTTO integration. Modul8+ is the elementary building block, hosting one multi-tile processor + DXM memory chips + glue logic, with spare area for future enhancements. Double-sided board, 12 cm x 4 cm (L x H): processor + glue logic + spare area on the upper side, memories + connectors on the motherboard side. 6 full-bidirectional, differential, LVDS-based channels, for a total of 60 differential pairs at 1 Gb/s (120 pins). Feasibility demonstrated using SAMTEC QTE-family connectors: 70 differential pairs (140 pins) on 10 cm connectors, tested up to 8.5 GHz; 2 lanes allow hosting the 3DT + general I/O. Estimated power consumption less than 13 W: ApOtto ~8 W, DDRx ~400 mW, plus glue logic. [board layout sketch: upper side with the multi-tile processor and off-module connector areas; motherboard side with the DXM/RXM memory chips, spare area for Modul8+ customization, and off-module connector areas]

18 System integration: TeraMotherboard+. The TeraMotherboard+ hosts 32 Modul8+ for 1 TFlops of peak performance. Board size 50 cm x 48 cm, 32 Modul8+. Simple system: DC/DC converters + modules. Regular system: very effective 3D signal routing on a limited number of layers (estimated 6-7 layers for LVDS routing).
System integration: motherboard stacking assembly, single face (Modul8+ placed on one side only, female stacking connectors on the other side, vertical or horizontal stacking). Topology: TB+ 1x8x4; system 32x8x4. X off-board connections placed on the TB+: a total of 32 X+ and 32 X- links, i.e. 1280 wires per TB+. SAMTEC HD-MEZZ: 50 pins per linear cm, so the stacking connector stays below 30 cm. Alternative, traditional APE approach: backplane + front cables (backplane connector density to be verified).
[board floorplan: an 8x4 grid of M8+ modules interleaved with rows of DC/DC converters, back connector area (power supply), front connector area (I/O), 3DT connector areas for TeraMotherBoard stacking, with the X+/Y+/Z+ directions indicated]
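Using the 20-wire channel size quoted on slide 16, the off-board wire count and the connector-length bound work out as follows:

\begin{aligned}
32\ X^{+} + 32\ X^{-} &= 64\ \text{links} \times 20\ \text{wires/link} = 1280\ \text{wires per TB+},\\
\text{connector length} &\approx \frac{1280\ \text{wires}}{50\ \text{pins/cm}} \approx 26\ \text{cm} < 30\ \text{cm}.
\end{aligned}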

19 System power consumption. TB+ power consumption: 0.4 kW; a 32-TB+ system therefore draws about 13 kW (32 x 0.4 kW). (Relatively) high, but made of many low-power devices rather than a few high-power devices (no hot spots): the main heat sources (Modul8+ and DC/DC converters) are homogeneously distributed over the whole surface of the motherboard. [repeats the power/performance table from slide 16]

20 System cooling analysis. The PetAPE system has 32 TeraMotherboard+ arranged in parallel with a pitch of 35.0 mm; each TeraMotherboard+ acts as a mid-plane hosting 32 Modul8+ and 32 DC/DC converters for the Modul8+ power supply. Cooling requirements were analysed (Ellison equation) using a simplified TB+ profile (3 air-flow channels) with real dimensions. Hypotheses: only components with significant volume are modelled (DC/DC converters and connectors), low-profile components (height < 1.5 mm) are merged into the board, and all dissipation is removed by air-flow convection.
[table, most values not recoverable: per-channel power consumption (W), airflow resistance, power dissipation percentage and airflow percentage for Channel 1 (under Modul8+), Channel 2 (over Modul8+) and Channel 3 (over the DC/DC converters); the recoverable percentages are 17.76%, 60.91% and 21.22%]
The airflow percentages match the power dissipation percentages almost perfectly; a total airflow of 1 m³/s is required.

21 PetApe SW development environment. Single Program Multiple Data, C/C++ & MPI programming environment (a minimal sketch follows below): painless recompilation of legacy code, so the programmer can focus on the optimization of critical computational kernels. Code annotation to manage the 2-level memory hierarchy; pipeline-friendly coding of computational loops; explicit usage of predication statements; use of intrinsics (the compiler should support auto-vectorization, but...). Libraries: BLAS, LAPACK; QCD libraries: DD-HMC, QDP++?, FermiQCD?, Chroma?. Advanced tools (from the EU SHAPES project): optimizing task scheduler, parallel platform simulator.
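A minimal sketch of the SPMD C + MPI style described above, under our own simplifying assumptions (a one-dimensional torus and a placeholder kernel); this is generic MPI code, not PetApe-specific:

    #include <mpi.h>
    #include <stdio.h>

    /* Every rank runs the same program on its own slice of the lattice and
     * exchanges boundaries with its torus neighbours.  Memory-hierarchy
     * annotations, predication and intrinsics would be layered on top of a
     * kernel like this one. */
    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        double halo_send = (double)rank, halo_recv = 0.0;
        int next = (rank + 1) % size, prev = (rank + size - 1) % size;

        /* Boundary exchange along one torus direction. */
        MPI_Sendrecv(&halo_send, 1, MPI_DOUBLE, next, 0,
                     &halo_recv, 1, MPI_DOUBLE, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Local "computational kernel": just a placeholder update here. */
        double local = 0.5 * (halo_send + halo_recv);
        printf("rank %d: local result %g\n", rank, local);

        MPI_Finalize();
        return 0;
    }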

22 PetAPE collaboration. INFN Roma 1, INFN Roma 2 APE group: apenet development collaboration, technological and scientific staff. SHAPES partnership. ATMEL Roma, which shares interests and people with INFN: RISC+DSP architectures, industrial applications; provides a silicon-proven FP engine (MagicV) + RISC integration. UniRoma1, Dipartimento di Ingegneria Elettronica, Prof. Olivieri and Prof. Trifiletti: silicon back-end (floorplan, synthesis and P&R), chip testability experts, access to ST advanced (65-45 nm) silicon technology. ST (Grenoble) + Università di Cagliari, Dipartimento di Ingegneria Elettronica, Prof. Raffo: SPIDERGON NoC architecture, silicon foundry access.

23 PetAPE collaboration (2). SHAPES partnership (continued): ETHZ Zurich, Prof. Thiele: coarse-grain parallelism and automatic mapping/scheduling. TIMA and THALES: HdS (hardware-dependent software) and RTOS integration. TARGET Compiler Technologies: retargetable compilers, fine-grain parallelism. RWTH Aachen University: fast simulation of heterogeneous multi-processor systems (SystemC). Up to Q the collaboration is funded by the EU; participation in the next EU FP7 call to obtain additional funding. Eurotech, APW, etc. can be used as engineering service providers. Preliminary contacts with Finmeccanica: INFN + Finmeccanica could be an opportunity to attract government funding.

24 PetApe cost (NRE + mass production). Projected NRE expenditure for the custom machine (figures in KEuro), broken down over Year 1 / Year 2 / Year 3 (per-year and per-item figures not recoverable):
ASIC development: tool acquisition (HW + SW), ASIC masks, foundry support, ASIC production and test (40 prototypes), front-end senior engineer (3 man-years), back-end senior engineer (2 man-years). Total ASIC NRE: 775.
System development: electronics NRE (PB, module), rack NRE (+ prototype). Total system development: 350.
Software: SW tool acquisition, additional personnel (5 man-years). Total software: 400.
INFN junior researchers: 6 researchers x 3 years. Total additional INFN personnel: 600.
Overall project total: 2125.
Notes: ST 45 nm shuttle cost = 10 KEuro/mm²; senior engineer for chip RTL design and test; 1 back-end expert (chip-design phase only); engineering outsourcing; CAD/simulator licenses; compiler, OS, libraries etc. (no outsourcing); VHDL, HW/SW and system design.
Production cost of the system (1024 nodes, 32 TFlops): quantities, unit costs and totals for the processor, memory (4 Gb chips), processor module, PB and rack are not recoverable; cost per GFlops: 4.79 Euro.
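The surviving NRE subtotals add up to the quoted project total:

\[
775\ (\text{ASIC}) + 350\ (\text{system}) + 400\ (\text{software}) + 600\ (\text{INFN personnel}) = 2125\ \text{KEuro}.
\]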

25 PetaFlop computing centre operating cost. [figure only; not recoverable from the transcript]

26 Only a joke. Prof. Bachem (spokesman of the PRACE project): a Petaflops computing centre costs 120 MEuro of HW investment plus 60 MEuro of infrastructure, i.e. 50% of the machine cost. The apeNEXT computing centre: 6 MEuro HW investment, less than 1 MEuro for infrastructure (15%). A Pet-Ape Petaflops centre: 6 MEuro HW investment; according to the experts you have to pay 3 MEuro for infrastructure, while according to APE experience you can do it with about 1.5 MEuro. You gain 1.5 MEuro (to spend on NRE): NRE almost for free!

27 In the end will only one remain? (Highlander). Referee committee on supercomputing (Parisi, Fucci, Santangelo): "The committee considers the presentation of two proposals of excellent quality for the construction of supercomputers a sign of great scientific vitality. It deems it important for the Institute to keep doing research in this cutting-edge field, both for its educational value and for the various spin-offs into the Italian industrial world. Decisions should be taken considering not only the mere financial convenience for the Institute, but also a long-term cultural perspective."
But money is scarce: "Despite the great technological interest of the APE group's project, we do not feel we can recommend right now a commitment of 9 (actually 6) million Euro, but we believe the APE group's proposal (petape) contains remarkable points of scientific and technological interest."
And therefore: "As for the AURORA project, we recommend funding the first part of the project, leaving to a review to be carried out next year the intermediate decision of building a 130 Tflops machine. The construction of a 40-bit machine (PetApe is 40-bit with HW support for DP), if justified by the scientific programme, would be very interesting, also because it would be roughly a factor of 2 cheaper (and in general lighter) than 64-bit machines, thus widening by a significant factor the advantage of a custom machine over commercial processors. In any case we propose that the Institute approve right now the APE group's Apenet+ project for the construction of a four-node cluster prototype for the study of communications, with the requested funding of 390 KEuro, because we believe that the extraction of by-products from these research projects should always be encouraged at every stage."

28 Budget 2009, PetApe-ApeNet (KEuro; columns for consumables, personnel, inventory and travel; most entries not recoverable).
PetApe: ApOtto processor development; NRE for PetApe electronics boards: 100.
ApeNet+: firmware and software development: 60 and 5; NRE for ApeNet+ electronics boards: 40; production of 4 prototypes: 20.
Co-funding, SHAPES project: total for INFN 768 KEuro (of which 128 K overhead, fully paid); 320 KEuro up to the beginning of 2008, 320 KEuro at the end of the project (end 2009); 6 researchers, dissemination expenses (CASTNESS) and limited R&D; access to high-value IP (MagicV, ST NoC, ...).
Call 2Q2009 FET Proactive Initiative: Massive ICT systems (MASS-ICT). Contacts under way with Finmeccanica, Filas, ...

29 Milestones 2009.
Milestone 1, April 2009. Petape: execution of a significant LQCD kernel on the SystemC simulator of the multi-tile processor; the code will run in a distributed way on the 8 tiles of the processor. ApeNet+: completion of the ApeNet firmware development (V3) for multi-core architectures and porting to a PCIe prototyping board.
Milestone 2, July 2009. Petape: completion of the VHDL model of the ApOtto processor. ApeNet+: production of 4 prototypes of the ApeNet+ board.
Milestone 3, November 2009. Petape: tape-out of the ApOtto chip; the chip goes to the foundry to be produced in a limited number of prototypes (about 10). ApeNet+: realization of a prototype PC cluster based on 4 servers interconnected with ApeNet+, verified with elementary interconnection tests and minimal kernels of physics applications.
Goals for 1H 2010: March, ApOtto out of fab; April-May, start of PetApe board testing.

30 Where we stand. The budget request for 2009 is too high; negotiations are under way with INFN management to quantify the funding. Rumours say that 2009/2010 will be all roses. In this framework: we cannot go to chip prototyping in 2009, only at the beginning of 2010; a few months of delay reduce the temporal overlap with the EU co-funding, and key people could be lost from the project. By mid-2009 we will have to clarify the INFN commitment to this project and verify/assess the external co-funding shares.
Post scriptum: at the end of 2008 the contractual commitments with Eurotech for the apenext computing centre expire (1 technician for on-site HW assistance). Today the Rome APE group is committed (also financially!) to keeping the centre operational. If we want to keep the centre alive, for 2009 we need: 1 contract for a HW technician for the machine, and 1 contract for ordinary and extraordinary maintenance of the system software.