Tizona, a tool for managing workloads in HPC environments

What is Tizona?

Tizona is a tool for launching experiments on clusters with a job manager and collecting their results.

Tizona allows users to specify their set of experiments in a single JSON file; it automatically generates and submits the corresponding jobs to the cluster batch system.

Tizona is being developed at the Barcelona Supercomputing Center.

Using Tizona with your application

Tizona can use custom models to deal with each workload's unique characteristics. Models obey a simple interface and can implement functionality such as writing a configuration file or downloading a set of required files before launching the experiments.
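
To make this concrete, a custom model might look something like the sketch below. This is only an illustration: the class name, the write_config and fetch_inputs methods, and the URL are hypothetical, not Tizona's actual model interface.

    # Hypothetical sketch of a custom model; the class and method names
    # below are illustrative only, not Tizona's actual model interface.
    import json
    import os
    import urllib.request

    class MyAppModel:
        """Model for a hypothetical application called 'myapp'."""

        def __init__(self, params):
            # 'params' holds one expanded combination from the
            # experiment's "params" dict.
            self.params = params

        def write_config(self, working_dir):
            # Write a configuration file the application reads at startup.
            with open(os.path.join(working_dir, "myapp.cfg"), "w") as f:
                json.dump(self.params, f)

        def fetch_inputs(self, working_dir):
            # Download an input file needed before launching the experiment.
            url = "https://example.org/inputs/mesh.dat"  # placeholder URL
            urllib.request.urlretrieve(url, os.path.join(working_dir, "mesh.dat"))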

Tizona configuration

Tizona is configured through the config.json file at the root directory. This file holds the model- and host-specific configuration parameters.
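
As an illustration, a config.json could look roughly like the following. The keys shown here (app_dir, queue, max_nodes) are hypothetical and only convey the shape of the file, not a documented schema:

    {
        "models" : {
            "myapp" : { "app_dir" : "/home/user/myapp" }
        },
        "hosts" : {
            "mn4" : { "queue" : "bsc_cs", "max_nodes" : 64 }
        }
    }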

Tizona experiments

Tizona runs experiments described in JSON files containing a "params" dict. Within this dict, the value of each parameter is given as a scalar or as a list. If multiple parameters have lists as their values, Tizona generates all the possible combinations of those lists; for example, one parameter with 3 values and another with 2 values yield 6 experiments. It is up to the model code to detect valid or invalid parameter configurations within the job factory method. The example.json file shows how to create experiments.
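
As a sketch of the format (the parameter names below are made up; see example.json for a real case):

    {
        "params" : {
            "nodes" : [1, 2, 4],
            "nmess" : [100, 1000],
            "comp" : "on"
        }
    }

This file would expand into 3 x 2 = 6 experiments, one for each combination of nodes and nmess, all sharing comp = on.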

Hosts

The hosts/ folder contains descriptions of different hosts. Each host class is in charge of creating the corresponding job script and executing it.
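
For intuition, a host class might resemble the following sketch. The class name, method names, and SLURM directives are assumptions made for illustration; the real interface lives in the hosts/ folder.

    # Hypothetical sketch of a host class; the structure is illustrative
    # only and does not reflect Tizona's actual host interface.
    import subprocess

    class SlurmHost:
        """Host that submits jobs through the SLURM batch system."""

        def write_job_script(self, path, job):
            # Generate a batch script with scheduler directives derived
            # from the experiment parameters.
            with open(path, "w") as f:
                f.write("#!/bin/bash\n")
                f.write("#SBATCH --nodes=%d\n" % job["nodes"])
                f.write("#SBATCH --time=%s\n" % job["walltime"])
                f.write(job["command"] + "\n")

        def submit(self, path):
            # Hand the generated script over to the scheduler.
            subprocess.run(["sbatch", path], check=True)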

The config.json file determines the host type on which your jobs will run.

Launching jobs

Launching an experiment:

$ python launch.py --file experiments/example.json

The --file option accepts multiple files at once:

$ python launch.py --file experiments/example1.json experiments/example2.json 

Packing and batching

Multiple experiments can be packed into one or a few jobs using the --pack-params and --pack-size options.

When using --pack-params, supply a list of params as specified in the experiments' JSON "params" field. Experiments with the same values for those params will be coalesced into the same pack.

The --pack-size option controls the maximum number of experiments per job. It can be used alone or together with --pack-params.

The following packs experiments according to the number of nodes they need, with a maximum of 50 experiments per pack:

$ python launch.py --file experiments/example1.json experiments/example2.json --pack-params nodes --pack-size 50

Collecting Job Results

Parsing job statistics for CSV generation

The stats field of the experiment configuration allows you to specify bash commands that retrieve metrics from the output files.

    "stats" : {
        "time" : "grep Time %(stdout)s | rev | cut -d' ' -f1 | rev"
    }

Here we add a stat called time whose value is retrieved from the stdout of each experiment using that bash command.

The following placeholders will be replaced with experiment-specific values:

  • stdout
  • working_dir
  • name
  • app_dir
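
For instance, a stat can also read from a file inside the experiment's working directory; the memory.log file and the command below are hypothetical:

    "stats" : {
        "mem_peak" : "grep Peak %(working_dir)s/memory.log | awk '{print $2}'"
    }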

Create a CSV with all the experiments

CSV files can be created from the stats values defined in the JSON "stats" field described in the previous section.

The following command reads all the experiment files and creates a CSV file organized by the nmess and comp params, with the values of the time stat:

$ python csv.py --file experiments/examples*json --csv-params nmess comp --csv-stats time --csv-out output.csv

It is also possible to use SQL to process the CSV files:

$ python csv.py --file experiments/examples*json --csv-params nmess comp --csv-stats time --csv-out output.csv --csv-query "SELECT * from output"

Complex SQL queries involving other files can be written using the --csv-extra argument. The SQL query can then use CSV data stored in other files through JOIN clauses or subqueries:

$ python csv.py --file experiments/examples*json --csv-params nmess comp --csv-stats time --csv-out output.csv --csv-extra other_data.csv --csv-query "SELECT * from output INNER JOIN other_data ON output.param = other_data.param"