Lustre Performance Investigations on Theta
Francois Tessier, George Brown, Preeti Malakar, Rick Zamora, Venkat Vishwanath, Paul Coffman
(ftessier, gbrown, pmalakar, rzamora, venkat, pcoffman)@anl.gov
ALCF

Argonne Leadership Computing Facility

Overview

— Theta Lustre Overview
— Performance Characterizations using Cray MPI-IO within IOR
— HDF5 ECP Work - Custom Collective IO VFD
— Operations Metrics


Theta Overview


2004: Blue Gene/L at LLNL: 90-600 TF system, #1 on Top 500 for 3.5 years
2005: Argonne accepts 1 rack (1,024 nodes) of Blue Gene/L (5.6 TF)
2006: Argonne Leadership Computing Facility (ALCF) created
2008: ALCF accepts 40 racks (160k cores) of Blue Gene/P (557 TF)
2009: ALCF approved for 10-petaflop system to be delivered in 2012
2012: 48 racks of Mira, Blue Gene/Q (10 PF), in production at ALCF
2014: Development partnership for Theta and Aurora begins
2016: ALCF accepts Theta (10 PF), Cray XC40 with Xeon Phi (KNL)
2021: Aurora (>1 EF) will be delivered


Theta System Overview
• Architecture: Cray XC40
• Processor: 1.3 GHz Intel Xeon Phi 7230 SKU
• Cores/node: 64
• Racks: 24
• Nodes: 4,392
• Memory/node: 192 GB DDR4 SDRAM
• High-bandwidth memory/node: 16 GB MCDRAM
• SSD/node: 128 GB
• Aries interconnect with Dragonfly configuration
• Total cores: 281,088
• Total MCDRAM: 70 TB
• Total DDR4: 843 TB
• Total SSD: 562 TB
• 10 PB Lustre file system
• Peak performance of 11.69 petaflops


Lustre Specifications on Theta
• Lustre (lfs) 2.7.2.26
• Sonexion storage
  • 4 cabinets
  • 10 PB usable RAID storage
  • Total Lustre performance: write BW 172 GB/s, read BW 240 GB/s
  • 56 OSSes (1 OST per OSS); peak performance of 1 OST is 6 GB/s
  • Lustre client cache effects needed for much higher BW
• OSS cache disabled by Sonexion
  • Cray has seen issues with the RAID array bitmap being pushed out of memory due to the OSS cache consuming memory on the OSS nodes
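The logical layout behind these stripe settings can be sketched as a round-robin mapping from file offsets to OST objects. A minimal model (not Lustre code; real OST selection goes through the MDS allocator, this only shows the RAID-0-style arithmetic):

```python
# Sketch: where a byte offset in a striped file lands under round-robin
# Lustre striping, given a stripe count and stripe size.

def offset_to_ost(offset, stripe_count, stripe_size, ost_list):
    """Map a byte offset to (OST, offset within that OST's object)."""
    stripe_index = offset // stripe_size          # which stripe overall
    ost = ost_list[stripe_index % stripe_count]   # round-robin over the OSTs
    obj_stripe = stripe_index // stripe_count     # stripe number on that OST
    return ost, obj_stripe * stripe_size + offset % stripe_size

# Example: 48 OSTs with 8 MiB stripes (a configuration used in later slides).
osts = list(range(48))
ost, obj_off = offset_to_ost(400 * 2**20, 48, 8 * 2**20, osts)
```

With 48 OSTs and 8 MiB stripes, byte 400 MiB of the file is stripe 50 overall, so it lands on OST 2 at 8 MiB into that OST's object.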


Performance Characterizations using MPI-IO within IOR
• Lustre as a component of MPI-IO performance
• Collective vs independent, cache effects, shared files vs file-per-process
• HPC-IOR version
  • Enhanced for MPI-IO -e fsync support (MPI_File_sync)
  • http://xgitlab.cels.anl.gov/ExaHDF5/HPC-IOR
• All results show MAX bandwidth (best times) for each experiment
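The "MAX bandwidth (best times)" metric reduces to a one-liner: aggregate bytes moved divided by the fastest repetition's elapsed time. A minimal sketch (the process count, per-process size, and trial times below are illustrative, not measured values from this deck):

```python
# Sketch: MAX bandwidth across repeated trials = total bytes / best time.

def max_bandwidth_gbs(total_bytes, trial_seconds):
    """Bandwidth in GB/s using the fastest of the timed repetitions."""
    return total_bytes / min(trial_seconds) / 1e9

# e.g. 4096 processes writing 8 MiB each, best of three timed trials
total = 4096 * 8 * 2**20
print(round(max_bandwidth_gbs(total, [0.81, 0.62, 0.74]), 1))
```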


Shared File Stripe Size vs Count Effect on Performance (Independent I/O - No Lustre Client Cache Effects)


Shared File Stripe Size vs Count Effect on Performance (Independent I/O - No Lustre Client Cache Effects), 8 MB/proc


Shared File Stripe Size vs Count Effect on Performance (Independent I/O - No Lustre Client Cache Effects), 8 MB/proc
48 OSTs, 8 MB stripe: ~15 GB/s at 1 MB/proc vs ~60 GB/s at 8 MB/proc


Shared File Stripe Size vs Count Effect on Performance (Independent I/O – with Lustre Client Cache Effects)


File Per Process - Stripe Size vs Count Effect on Performance (Independent I/O – with Lustre Caching), 8 MB/proc


Shared File Stripe Size vs Count Effect on Performance (Collective I/O - No Lustre Client Cache Effects)


Shared File Stripe Size vs Count Effect on Performance (Collective I/O – with Lustre Client Cache Effects)


Impact of data size on Lustre Cache Performance for 48 OST / 16 MB Stripe – Collective vs Independent IO


Mitigation of Extent Lock Contention within Cray MPI-IO
• Each rank (client) needs its own lock when accessing striped data for a given file on an OST
• Concurrent access improves storage bandwidth, but if more than one rank concurrently accesses the same file on an OST, the resulting extent lock contention cancels out the performance improvement
• Cray MPI-IO currently has a limited mitigation for this: cray_cb_write_lock_mode=1 (shared lock locking mode)
  • A single lock is shared by all MPI ranks that are writing the file
• Lock-ahead locking mode (cray_cb_write_lock_mode=2) not yet supported by Sonexion
• Following slide run with: MPICH_MPIIO_HINTS=*:cray_cb_write_lock_mode=1:cray_cb_nodes_multiplier=:romio_no_indep_rw=true (the multiplier value is swept in the next slide)
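The knob being swept here can be reduced to simple arithmetic. As I read the Cray MPI-IO hints (an assumption worth verifying against the intro_mpi man page, not something this deck states explicitly), cb_nodes defaults to the stripe count for a Lustre file, and cray_cb_nodes_multiplier scales that up once the shared lock removes the one-aggregator-per-OST constraint:

```python
# Sketch of the aggregator count behind the hint string above.
# The hint names are real Cray MPI-IO hints; this helper is not Cray code,
# and the cb_nodes-defaults-to-stripe-count behavior is my assumption.

def collective_aggregators(stripe_count, cray_cb_nodes_multiplier=1):
    """cb_nodes defaults to the stripe count; with a single shared extent
    lock (cray_cb_write_lock_mode=1), the multiplier lets several
    aggregators target the same OST without lock contention."""
    return stripe_count * cray_cb_nodes_multiplier

# 48 OSTs, multiplier swept 1..4 as in the next slide's test
print([collective_aggregators(48, m) for m in range(1, 5)])
```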


IOR MPI-IO Collective Shared Lock Performance Tests
IOR on 256 nodes, 16 ppn, 48 OSTs, 1 MB stripe, 1 MB transfer size
[Chart: MPI-IO write vs raw file write bandwidth (GB/s) for cray_cb_nodes_multiplier = 1-4]
'Raw File Write' times taken from MPICH_MPIIO_TIMERS=1 trace
Raw file write scales linearly with the multiplier; MPI-IO 1.5x faster at multiplier 4

HDF5 ECP Work – Custom Collective IO Virtual File Driver
• Lustre as a component of HDF5 performance
• CCIO VFD vs MPI VFD


HDF5 Virtual File Layer

CCIO VFD


Parallel HDF5 CCIO VFD

Custom Collective IO Virtual File Driver
• Many HDF5 data access patterns get better performance with collective vs independent I/O (large scale, discontiguous data)
• Clone of the most commonly used H5FD_MPI VFD, supporting customized collective IO file algorithms outside of MPI
• Highly instrumented for detailed performance profiling
• Current performance enhancements over the MPI VFD:
  • Avoids the performance overhead of construction/deconstruction of MPI_Datatypes
    • The MPI VFD constructs an MPI_Datatype from the dataset selection; the MPI-IO implementation then needs to deconstruct it to get offset/length pairs
    • For highly discontiguous data this can be expensive
  • Implementation of a one-sided collective aggregation algorithm, detailed in the following slides
• Many targets for future optimization


Standard Two-Phase Collective MPI-IO
• Standard two-phase algorithm as it exists in MPICH MPI-IO (ROMIO)
• Actually has a 3rd (0th) initial collective meta-data planning phase where aggregators determine what data goes where and when
  • Involves send/recv and/or collectives (e.g. MPI_Alltoall)
• Data movement for the aggregation phase done with send/recv or collectives (MPI_Alltoallv)
• Done in 'rounds' defined by collective buffer size * number of aggregators
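The round structure above is worth making concrete: each round moves at most one collective buffer per aggregator, so the round count is the collective size divided by (buffer size x aggregators). A minimal sketch with illustrative numbers (not measurements from this deck):

```python
# Sketch: number of two-phase rounds = ceil(total / (cb_buffer_size * aggs)).
import math

def two_phase_rounds(total_bytes, cb_buffer_size, n_aggregators):
    """Each aggregator stages at most cb_buffer_size bytes per round,
    so rounds = total data / (buffer size * aggregator count)."""
    return math.ceil(total_bytes / (cb_buffer_size * n_aggregators))

# e.g. 32 GiB written collectively through 48 aggregators, 16 MiB buffers
print(two_phase_rounds(32 * 2**30, 16 * 2**20, 48))
```

This is also why larger collective buffers trade aggregator memory footprint against per-round synchronization cost.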

One-Sided Two-Phase Collective MPI-IO
• Currently implemented in MPICH MPI-IO (ROMIO) with support for Lustre (write only)
• No collective meta-data planning phase
• Data movement phase does RMA (MPI_Put) from compute ranks into aggregator collective buffers
  • Dependent on the architecture's RMA implementation for performance
• Aggregator memory footprint for the standard algorithm can be significant at scale
  • Applications can run out of memory
  • Can be detrimental to Lustre client cache effects
• Performance improvement varies: depending on IO pattern and architecture, can see 10x speedup or no speedup
• Vendors evaluating it
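The key idea, that each rank can compute its target aggregator from the file offset alone and put data there directly, with no planning exchange, can be shown with a toy model (not ROMIO code; a dict stands in for the aggregators' RMA windows):

```python
# Toy model of the one-sided data movement phase: every rank places its
# contribution straight into the right aggregator's collective buffer.

def one_sided_aggregate(contributions, cb_size):
    """contributions: {file_offset: payload} across all ranks.
    The target aggregator is derived from the offset alone
    (offset // cb_size), mimicking an MPI_Put into its window."""
    buffers = {}
    for off, data in contributions.items():
        agg = off // cb_size                                # target aggregator
        buffers.setdefault(agg, {})[off % cb_size] = data   # RMA-style put
    return buffers

# Three contributions, 64-byte file domains per aggregator
bufs = one_sided_aggregate({0: b"a", 70: b"b", 64: b"c"}, 64)
```

Because no rank needs to learn what any other rank holds, the MPI_Alltoall planning step of the standard algorithm disappears entirely.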

Parallel HDF5 Exerciser
• Performance-profiling C code exercising the most intensive HDF5 functions in common user scenarios, for both meta-data and raw data
• Created by the ExaHDF5 ECP team
• Includes concepts from other HDF5 performance benchmarks (IOR, VPIC-IO, FLASH-IO) and expands on them
• Highly customizable via many run-time options
  • Independent/collective IO for raw and meta-data, contiguous/chunked storage, multiple dimensions, discontiguous buffers/strides
  • Can craft many complex data access patterns
  • Can run at small and large scale
• Working on getting various HDF5 data access patterns from applications into the HDF5 Exerciser to reproduce and solve performance issues

HDF5 Exerciser One-Sided CCIO VFD vs Cray MPI-IO VFD

At 1-D, smaller messages benefit because of the relative increase in overhead for the collective meta-data planning phase; for larger message sizes, the degradation from 4 MB to 16 MB may be some sort of synchronization issue: each aggregator does 86 rounds of 16 MB writes, syncing (MPI_Barrier) with each round.

3+-dimensional datasets benefit the most from one-sided aggregation

Operations Metrics
• A few examples of the types of Lustre metrics being collected on Theta
• Working on direct correlation to IO performance on a per-job basis
• Currently an indirect correlation, shown in the following slides


Lustre Metrics
• The operations team records the majority of Lustre stats but focuses on monitoring a subset of them
• MDS
  • Monitor all typical metadata operations, e.g. opens, creates, unlinks, renames, (get|set)(x)attr
• OSS
  • Monitor reads/writes grouped by OST and OSS
  • Monitor number of files and space used
• Eventually be able to tie metrics directly back to a job id for direct correlation with user-achieved performance


MDT Metrics Dashboard


OST metrics during IOR large data test 18:36 to 19:05


HDF5 Exerciser metrics for 26 jobs on Cori KNL


OST, OSS, MDS Statistics via NERSC pytokio API

[Panels: write rate, read rate, OSS load, MDS load]

Acknowledgements This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration.


Questions?


Appendix


Lustre Architecture On Theta


• IO forwarding from compute node to LNet service node / router
• LNet: Aries NIC on the compute side, 2 IB links on the Object Storage Server (OSS) side
• OSS handles communication from the LNet router to the Object Storage Target (OST), which is the physical storage device
• Although there are 4 MDTs, only 1 currently has directories placed on it
