slurm (34) Versions 1.2.4

Installs/Configures slurm workload manager

Policyfile

cookbook 'slurm', '= 1.2.4', :supermarket

Berkshelf

cookbook 'slurm', '= 1.2.4'

Knife

knife supermarket install slurm
knife supermarket download slurm

slurm

Wrapper cookbook that can prepare a full slurm cluster, controller, compute and accounting nodes

Requirements

Requires the following cookbooks:

  • mariadb
  • shifter

Platforms

The following platforms are supported:

  • Ubuntu 18.04
  • Debian 9

Other Debian family distributions are assumed to work, as long as the slurm version from the package tree
is at least 17.02 due to hostname behaviour of slurmdbd.

Chef

  • Chef 14.0+

TODO

  • Support for RHEL family
  • Make cgroup.conf file dynamic
  • Add recipe to setup a dynamic resource allocation cluster
  • Install slurm from static stable sources, e.g. 17.11-latest, 18.08-latest
  • Refactor and remove code that can be used as a resource instead of a recipe
  • Remove static types of nodes and partitions and support dynamic generation, maybe by passing the Hash directly
  • Complete spec files

Usage

Check the .kitchen.yml file for the run_list, which can be applied with:

$ kitchen converge [debian|ubuntu|all]

The use case for this run_list is to set up a monolith that contains all of the slurm components.

Recipes

slurm::_disable_ipv6

  • Disable ipv6 on a Linux system.

slurm::_systemd_daemon_reload

  • Provides a way to force a daemon-reload on systemd, in order to refresh service unit files.

slurm::accounting

  • Installs and configures slurmdbd, slurm's accounting service.

slurm::cluster

  • TODO: sets up a dynamic resource allocation cluster.

slurm::compute

  • Installs and configures slurmd, slurm's compute service.

slurm::database

  • Installs and configures a MariaDB service.

slurm::default

  • Sets up slurm user and group
  • Installs packages common to all of slurm's services.

slurm::munge

  • Sets up munge user and group
  • Installs and configures munge authentication service.

slurm::plugin_shifter

  • Sets up shifter plugin for slurm.

slurm::server

  • Installs and configures slurmctld, slurm's controller service.

This is where the common configuration file shared between the slurmctld and slurmd services is generated.
Take a close look at the attributes below.

Attributes

The attributes are presented here in order of importance for assembling a whole infrastructure.

Common

# ========================= Data bag configuration =========================
default['slurm']['secret']['secrets_data_bag']                 # The name of the encrypted data bag that stores general secrets, such as the munge key

default['slurm']['secret']['service_passwords_data_bag']       # The name of the encrypted data bag that stores service user passwords, with
                                                               # each key in the data bag corresponding to a named Slurm service, like
                                                               # "slurmdbd", "slurmctl", "slurmd" (this may not be needed for slurm).

default['slurm']['secret']['db_passwords_data_bag']            # The name of the encrypted data bag that stores database passwords, with
                                                               # each key in the data bag corresponding to a named Slurm database, like
                                                               # "slurmdbd", "slurmctl", "slurmd"

default['slurm']['secret']['user_passwords_data_bag']          # The name of the encrypted data bag that stores general user passwords, with
                                                               # each key in the data bag corresponding to a user (this may not be needed for slurm).

# ========================= Slurm specific configuration =========================
default['slurm']['common']['conf_dir']                         # slurm configuration directory, usually '/etc/slurm-llnl'

default['slurm']['custom_template_banner']                     # String that is prepended to each slurm configuration file

default['slurm']['user']                                       # username to configure slurm as, usually 'slurm'

default['slurm']['group']                                      # group to configure slurm as, usually 'slurm'

default['slurm']['uid']                                        # Slurm user ID, common to all nodes, our default is 999, just below the regular user ID range

default['slurm']['gid']                                        # Slurm group ID, common to all nodes, our default is 999, just below the regular user ID range

default['proxy']['http']                                       # proxy address for use with apt, mariadb, and system environment

Munge

default['slurm']['munge']['key']                               # munge key location

default['slurm']['munge']['env_file']                          # munge environment file, to be used by systemd

default['slurm']['munge']['auth_socket']                       # munge communication socket location

default['slurm']['munge']['user']                              # username to configure munge as, usually 'munge'

default['slurm']['munge']['group']                             # group name to configure munge as, usually 'munge'

default['slurm']['munge']['uid']                               # MUNGE user ID, common to all nodes, our default is 998, just before Slurm's

default['slurm']['munge']['gid']                               # MUNGE group ID, common to all nodes, our default is 998, just before Slurm's

Monolith

default['slurm']['control_machine']                            # fqdn of the machine where slurmctld is running

default['slurm']['nfs_apps_server']                            # fqdn of the machine where the apps directory is made available through nfs

default['slurm']['nfs_homes_server']                           # fqdn of the machine where the home directory is made available through nfs

default['slurm']['apps_dir']                                   # path to the apps directory

default['slurm']['homes_dir']                                  # path to the home directory

default['slurm']['monolith_testing']                           # tells the cookbook if the setup should be that of a monolith or not, usually for testing, either true or false

Database

default['mysql']['bind_address']                               # address on which the mariadb server listens for connections, defaults to '0.0.0.0'

default['mysql']['port']                                       # port on which the mariadb server listens for connections, defaults to '3306'

default['mysql']['version']                                    # MariaDB version lock, defaults to '10.1'

default['mysql']['character-set-server']                       # database character set, defaults to 'utf8'

default['mysql']['collation-server']                           # database collation, defaults to 'utf8_general_ci'   

default['mysql']['user']['slurm']                              # user which slurm accounting service uses to connect to the database

Accounting

default['slurm']['accounting']['conf_file']                    # path to the slurmdbd configuration file, defaults to '/etc/slurm-llnl/slurmdbd.conf'

default['slurm']['accounting']['env_file']                     # path to the slurmdbd environment file location, defaults to '/etc/default/slurmdbd'

default['slurm']['accounting']['bin_file']                     # path to the slurmdbd binary, defaults to '/usr/sbin/slurmdbd'

default['slurm']['accounting']['pid_file']                     # path to the slurmdbd pid file, defaults to '/var/run/slurm-llnl/slurmdbd.pid'

default['slurm']['accounting']['systemd_file']                 # path to the slurmdbd systemd service unit file, defaults to '/lib/systemd/system/slurmdbd.service'

default['slurm']['accounting']['debug']                        # debug level, valid values from 0-7, defaults to '3'

default['slurm']['accounting']['conf']                         # Hash representing the slurmdbd configuration options

The default for ['slurm']['accounting']['conf'] is:
```
{
  AuthType: 'auth/munge',
  AuthInfo: node['slurm']['munge']['auth_socket'],
  DbdHost: node['hostname'],
  DebugLevel: node['slurm']['accounting']['debug'],
  LogFile: '/var/log/slurm-llnl/slurmdbd.log', # default is syslog
  MessageTimeout: '10',
  PidFile: node['slurm']['accounting']['pid_file'],
  SlurmUser: node['mysql']['user']['slurm'],
  StorageHost: node['hostname'],
  StorageLoc: 'slurm_acct_db',
  StoragePort: node['mysql']['port'],
  StorageType: 'accounting_storage/mysql',
  StorageUser: node['mysql']['user']['slurm'],
}
```

Take into account that overriding ['slurm']['accounting']['conf'] replaces all of its options.

Server

default['slurm']['server']['conf_file'] # path to the slurmctld and slurmd configuration file, defaults to '/etc/slurm-llnl/slurm.conf'

default['slurm']['server']['env_file'] # path to the slurmctld environment file, defaults to '/etc/default/slurmctld'

default['slurm']['server']['bin_file'] # path to the slurmctld binary file, defaults to '/usr/sbin/slurmctld'

default['slurm']['server']['pid_file'] # path to the slurmctld pid file, defaults to '/var/run/slurm-llnl/slurmctld.pid'

default['slurm']['server']['systemd_file'] # path to the slurmctld systemd service unit file, defaults to '/lib/systemd/system/slurmctld.service'

default['slurm']['server']['service_req'] # name of the storage service(s) that the slurm service should depend on to start
# this should be either empty or the name of the storage service client(s) that slurm might depend on (ceph, beegfs, lustre)

default['slurm']['server']['cgroup_dir'] # path to the cgroup plugin directory, defaults to '/etc/slurm-llnl/cgroup'

default['slurm']['server']['cgroup_conf_file'] # path to the cgroup configuration file, defaults to '/etc/slurm-llnl/cgroup.conf'

default['slurm']['server']['plugstack_dir'] # path to the slurm plugin directory, defaults to '/etc/slurm-llnl/plugstack.conf.d'

default['slurm']['server']['plugstack_conf_file'] # path to the slurm plugin configuration file, defaults to '/etc/slurm-llnl/plugstack.conf'

default['slurm']['shifter'] # Boolean, if true shifter will be installed

default['shifter']['imagegw'] # Boolean, if true the shifter image gateway will be installed and configured (assumes default['slurm']['shifter'] == true)

default['shifter']['imagegw_fqdn'] # String, Image Gateway FQDN, accessible hostname or ip address, defaults to node['slurm']['control_machine']
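
Since setting ['slurm']['accounting']['conf'] replaces every option at once, one way to change only a few of them is to start from a copy of the defaults and merge the differences in. The following is a minimal plain-Ruby sketch of that idea, not code from the cookbook: the constant and variable names are hypothetical, only a subset of the defaults is shown, and the node[...] lookups are replaced by literal placeholder values.

```ruby
# Subset of the default slurmdbd options; node[...] lookups are replaced
# with literal placeholder values for illustration only.
DEFAULT_ACCOUNTING_CONF = {
  AuthType: 'auth/munge',
  DebugLevel: '3',
  LogFile: '/var/log/slurm-llnl/slurmdbd.log',
  MessageTimeout: '10',
  StorageLoc: 'slurm_acct_db',
  StorageType: 'accounting_storage/mysql'
}.freeze

# Only the options that should differ from the defaults (hypothetical values).
custom_options = { DebugLevel: '5', MessageTimeout: '30' }

# Hash#merge returns a new Hash: every default, with the custom keys replaced.
accounting_conf = DEFAULT_ACCOUNTING_CONF.merge(custom_options)

accounting_conf.each { |option, value| puts "#{option}=#{value}" }
```

In a wrapper cookbook, a merged Hash like this would presumably be the value assigned to the attribute, so that options not being customised keep their defaults.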

Compute nodes

In the computes.rb attribute file you can see an example for the various slurm cluster settings.

For now we assume three types of partitions (and nodes):

  • small
  • medium
  • large

representing the capacity (memory) for each group. The nodes in each group are assumed to be homogeneous.

The properties of each group can be passed via the following attributes:

default['slurm']['conf']['nodes'][type]['count']
default['slurm']['conf']['nodes'][type]['properties']['cpus'] # amount of CPUs available in the node group, Integer
default['slurm']['conf']['nodes'][type]['properties']['mem'] # amount of RAM available in the node group, Megabytes
default['slurm']['conf']['nodes'][type]['properties']['sockets'] # number of sockets in node group, on private cloud systems it is usually the number of cpus
default['slurm']['conf']['nodes'][type]['properties']['cores_per_socket'] # number of cores per socket, on private cloud systems it is usually one
default['slurm']['conf']['nodes'][type]['properties']['threads_per_core'] # number of threads per core, on private cloud systems it is usually one
default['slurm']['conf']['nodes'][type]['properties']['weight'] # preference for being allocated work, the lower the weight, the higher the preference
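
To make the shape of these attributes concrete, here is a minimal plain-Ruby sketch that turns a Hash shaped like node['slurm']['conf']['nodes'] into slurm.conf NodeName lines. It is an illustration under assumptions, not the cookbook's template: the sample values, the helper name node_line, the <type>N host naming scheme and the mapping to slurm.conf parameters are all hypothetical choices made for this example.

```ruby
# Hypothetical node group settings, shaped like node['slurm']['conf']['nodes'].
NODE_GROUPS = {
  'small' => {
    'count' => 3,
    'properties' => {
      'cpus' => 2, 'mem' => 2048, 'sockets' => 2,
      'cores_per_socket' => 1, 'threads_per_core' => 1, 'weight' => 10
    }
  },
  'large' => {
    'count' => '1', # counts may also arrive as Strings
    'properties' => {
      'cpus' => 8, 'mem' => 16_384, 'sockets' => 8,
      'cores_per_socket' => 1, 'threads_per_core' => 1, 'weight' => 30
    }
  }
}.freeze

# Assumed mapping from the attribute names to slurm.conf node parameters.
SLURM_PARAMS = {
  'cpus' => 'CPUs', 'mem' => 'RealMemory', 'sockets' => 'Sockets',
  'cores_per_socket' => 'CoresPerSocket',
  'threads_per_core' => 'ThreadsPerCore', 'weight' => 'Weight'
}.freeze

# Returns a NodeName line for one group, or nil when the group is empty.
def node_line(type, group)
  count = Integer(group['count'])
  return nil if count < 1

  # Assumed naming scheme: <type>1, <type>2, ...; a single node gets no range.
  names = count == 1 ? "#{type}1" : "#{type}[1-#{count}]"
  params = SLURM_PARAMS.map { |attr, param| "#{param}=#{group['properties'][attr]}" }
  "NodeName=#{names} #{params.join(' ')}"
end

lines = NODE_GROUPS.map { |type, group| node_line(type, group) }.compact
puts lines
```

Converting the count with Integer() and skipping the range for a single node are choices made here to avoid node lists such as [1-1] or [1-0].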

At this time, this cookbook is designed to work either as a monolith (PoC) or to be deployed in a private cloud environment.

Data Bags

From the previous section we can see which data bags are required to exist. Each of the items must have a key with the same name as the data bag, where the secret value should be stored.
Within those data bags we have to create the following items:

DataBag             Item                            Keys
slurm_db_passwords  mysqlroot                       ---
slurm_db_passwords  node['mysql']['user']['slurm']  ---
slurm_secrets       munge                           ---

Each of the slurm_db_passwords items should be a plain-text password, generated with your favorite tool.

The munge key should be a base64 key, based on binary data generated by running any of the following:

  • $ create-munge-key -r on a system with munge installed (note that it will try to overwrite any existing key in /etc/munge/munge.key)
  • $ dd if=/dev/random bs=1 count=1024 > munge.key
  • $ dd if=/dev/urandom bs=1 count=1024 > munge.key

For more information on generating a munge key see the munge documentation.
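
As an alternative illustration of the same idea, the sketch below generates 1024 random bytes in plain Ruby and encodes them as base64, which is the form the data bag item is described as holding. It is only a sketch: the variable names are hypothetical, and whether the cookbook expects base64 with or without line breaks is an assumption (no line breaks are produced here).

```ruby
require 'securerandom'

# 1024 random bytes, the same size the dd commands above produce.
raw_key = SecureRandom.random_bytes(1024)

# The 'm0' pack directive encodes as base64 without line breaks,
# so the value fits on a single line of a data bag item.
encoded_key = [raw_key].pack('m0')

# Decoding restores the original binary key.
decoded_key = encoded_key.unpack1('m0')

puts encoded_key
```

The encoded string is what would be stored as the secret value of the munge item; decoding it yields the binary key again.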

Authors

Dependent cookbooks

mariadb ~> 2.0
shifter ~> 1.0

Contingent cookbooks

There are no cookbooks that are contingent upon this one.

slurm CHANGELOG

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog and this project adheres to Semantic Versioning.

This file is used to list changes made in each version of the slurm cookbook.

1.2.4

Fixed

  • node['shifter'] being hardcoded and overwriting the attribute set by wrapper cookbooks

1.2.3

Changed

  • ruby code to match a non-empty attribute

1.2.2

Changed

  • ruby code to match a Boolean

1.2.1

Changed

  • how we decide whether the image manager is installed; this is now controlled via an attribute

1.1.2

Fixed

  • count can now be a String, an integer or some type that can be converted to an integer

1.1.1

Fixed

  • wrong count comparison.

1.1.0

Removed

Fixed

  • slurm.conf node list appeared [1-1] when the type count was 1, still worked but not very appealing
  • slurm.conf node list appeared [1-0] when the type count was 0, which made the slurmctld service not start

1.0.6

Added

  • forgotten with_slurm option to shifter resources to generate the shifter_slurm.so file

1.0.5

Added

  • Edge case to not export nfs shares if testing monolith, as there seem to be some issues with nfs exports when using dokken

Changed

  • compute verifications to more friendly boolean expressions
  • reordered resource notifications

1.0.4

Fixed

  • chef service resource action

1.0.3

Added

  • NFS Kernel service explicit start, it is a bad practice to expect services to be running after the respective packages are installed

1.0.2

Changed

  • control machine address should be just the hostname, the name resolution is assumed to be solved locally in each node

1.0.1

Removed

  • support for Ubuntu Xenial

1.0.0

Changed

  • shifter dependency to major version 1

0.6.2

Fixed

  • Linting

0.6.1

Changed

  • Slurm and munge users are now regular users so that we can force the uid and gid values

Added

  • home directory for both slurm and munge users

0.6.0

Added

  • MUNGE user and group with pre-established uid and gid
  • SLURM user and group with pre-established uid and gid
  • Updated documentation

0.5.6

Removed

  • munge service nfs mount due to user uid mismatch between the controller and the compute nodes

0.5.5

Changed

  • Now using supermarket sources for all dependent cookbooks

0.5.4

Added

  • Chef logging (info) for compute information on mount stage

0.5.3

Fixed

  • Ruby syntax error on assignment

0.5.2

Added

  • subnet filtering to exports file, via the node['slurm']['nfs_network'] attribute
  • enabled option to chef mount resources
  • proper update to exportfs

Fixed

  • slurm.conf newlines and definitions
  • exports file generation
  • slurm variable apps_dir deprecated

0.5.1

Changed

  • apps directory is now slurm directory, making nodes mount the nfs share to the correct path

0.4.1

Added

  • TESTING.md

0.4.0

Added

  • Shifter support and dependency
  • Kitchen suite with shifter support
  • Older Ubuntu/Debian images

0.3.9

Changed

  • proxy is now passed as attribute
  • action for slurm services to :start

0.3.8

Changed

  • proxy string not ending with ";" anymore, gave false negatives in InSpec

0.3.7

Fixed

  • plugin_shifter recipe, had default instead of node

0.3.6

Changed

  • now using appropriate attribute names instead of node['fqdn']

0.3.5

Changed

  • now passing root password to reflect changes in mariadb cookbook, node['mariadb']['server_root_password'] is no longer used as default.

0.3.4

Changed

  • translating base64 munge key into binary

0.3.3

Removed

  • support for Ubuntu 16.04. The slurm version from apt repos is < 16 so slurmdbd fails to start because of hostname issues.

0.3.2

Added

  • support for monolith testing, setting node['slurm']['monolith_testing'] attribute to true configures slurm.conf file with an entry for the slurmctl too

Fixed

  • cgroup_allowed_devices_file.conf missing error
  • nfs mount resource does not apply to monolith
  • typo in slurm.conf property
  • Service resource commands for Slurm server

0.3.1

Added

  • Added apt_repository variable to mariadb_repository, changed its mirror to http://mirrors.up.pt/pub/mariadb/repo

Removed

  • Fully removed support for CentOS

0.3.0

Added

  • slurm controller automatic registration with the slurm accounting
  • NFS package installation for the slurm controller and compute nodes
  • NFS configuration for the slurm controller and compute nodes

Removed

  • disable ipv6 on the chef run list

Modified

  • .kitchen.yml sets up a mariadb database, a slurmdb daemon and a slurm controller in one single controller machine
  • changed proxy address to its fqdn, so it will resolve in either ipv4 or ipv6

Fixed

  • added some redundant apt update commands as in some cases the apt cache didn't seem to be updated

0.2.0

Added

  • working database recipe
  • recipe to disable ipv6 on linux systems

0.1.0

Initial release.

Added

  • created skeleton for the recipes of the different slurm components
  • created initial inspec tests
  • created initial chefspec tests
  • created a modified version of openstack-common get_password library
  • created test data bag skeleton and changed usual location for them, as well as the data bag secret
  • created some attributes, the data structure is still not set in stone

Known Issues

1.1.0

  • when running in travis, Ubuntu 18.04 vms do not start the munge service:
  dokken systemd[1]: Starting MUNGE authentication service...
  -- Subject: Unit munge.service has begun start-up
  -- Defined-By: systemd
  -- Support: http://www.ubuntu.com/support
  -- 
  -- Unit munge.service has begun starting up.
  dokken systemd[1]: munge.service: New main PID 3335 does not belong to service, and PID file is not owned by root. Refusing.
  dokken systemd[1]: munge.service: New main PID 3335 does not belong to service, and PID file is not owned by root. Refusing.
  dokken systemd[1]: munge.service: Start operation timed out. Terminating.
  dokken systemd[1]: munge.service: Failed with result 'timeout'.
  dokken systemd[1]: Failed to start MUNGE authentication service.

The user is created properly and has the right uid and gid, and the systemd unit file runs as the user defined by name.
When running locally with docker or vagrant, or launching on openstack, it runs fine...

Besides, the Debian 9 run in travis runs just fine. A mystery...
