Ansible roles to install a Spark Standalone cluster (HDFS/Spark/Jupyter Notebook) or an Ambari-based Spark cluster
Apache-2.0 License
This repository defines multiple Ansible roles that help deploy different flavors of a Spark cluster, as well as a data science platform based on the Anaconda and Jupyter Notebook stack.
You will need a driver machine with Ansible installed and a clone of this repository:
curl -O https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo rpm -i epel-release-latest-7.noarch.rpm
sudo yum update -y
sudo yum install -y ansible
pip install --upgrade ansible
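With Ansible installed, a clone of the repository completes the driver machine setup. The URL below is a placeholder, not the project's actual address:

```shell
# Clone this repository onto the driver machine; replace the placeholder
# URL with this repository's actual address.
git clone <repository-url> ansible-spark-cluster
cd ansible-spark-cluster
```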
To enable variable overriding from the host inventory, add the following configuration to your ~/.ansible.cfg file:
[defaults]
host_key_checking = False
hash_behaviour = merge
Ansible uses 'host inventory' files to define the cluster configuration: the nodes, and groups of nodes that serve a given purpose (e.g. the master node).
Below is a host inventory sample definition:
[all:vars]
ansible_connection=ssh
#ansible_user=root
#ansible_ssh_private_key_file=~/.ssh/ibm_rsa
gather_facts=True
gathering=smart
host_key_checking=False
install_java=True
install_temp_dir=/tmp/ansible-install
install_dir=/opt
python_version=2
[master]
lresende-elyra-node-1 ansible_host=IP ansible_host_private=IP ansible_host_id=1
[nodes]
lresende-elyra-node-2 ansible_host=IP ansible_host_private=IP ansible_host_id=2
lresende-elyra-node-3 ansible_host=IP ansible_host_private=IP ansible_host_id=3
lresende-elyra-node-4 ansible_host=IP ansible_host_private=IP ansible_host_id=4
lresende-elyra-node-5 ansible_host=IP ansible_host_private=IP ansible_host_id=5
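Before running any playbook, connectivity to every host in the inventory can be verified with Ansible's ping module. The inventory filename below is a placeholder:

```shell
# Ad-hoc connectivity check against all hosts in the inventory;
# replace <hosts inventory> with your actual inventory file.
ansible all -i <hosts inventory> -m ping
```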
Some role-specific configuration notes:
Note: ansible_host_id is only used when deploying a "Spark Standalone" cluster.
Note: Ambari currently supports only Python 2.x.
In this scenario, a minimal blueprint is used to deploy the components required to run YARN and Spark.
The sample playbook below can be used to deploy a Spark cluster using the HDP distribution:
- name: ambari setup
hosts: all
remote_user: root
roles:
- role: common
- role: ambari
ansible-playbook --verbose <deployment playbook.yml> -i <hosts inventory>
Example:
ansible-playbook --verbose setup-ambari.yml -c paramiko -i hosts-fyre-ambari
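Once the playbook finishes, the Ambari web UI should be reachable on the master node. The hostname below is a placeholder; 8080 is Ambari's default server port:

```shell
# Check that the Ambari server is responding; replace <master-host>
# with the master node's address. Prints the HTTP status code.
curl -s -o /dev/null -w "%{http_code}\n" http://<master-host>:8080
```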
In this scenario, a Standalone Spark cluster will be deployed with a few optional components.
- name: spark setup
hosts: all
remote_user: root
roles:
- role: common
- role: hdfs
- role: spark
- role: spark-cluster-admin
Note: When deploying Kafka, the Zookeeper role is required
ansible-playbook --verbose <deployment playbook.yml> -i <hosts inventory>
Example:
ansible-playbook --verbose setup-spark-standalone.yml -c paramiko -i hosts-fyre-spark
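After deployment, the standalone cluster can be smoke-tested by submitting the bundled SparkPi example. The Spark path assumes install_dir=/opt from the inventory above, the hostname is a placeholder, and 7077 is the default standalone master port:

```shell
# Submit the SparkPi example to the standalone master.
# Adjust the Spark path if install_dir differs from /opt.
/opt/spark/bin/spark-submit \
  --master spark://<master-host>:7077 \
  --class org.apache.spark.examples.SparkPi \
  /opt/spark/examples/jars/spark-examples_*.jar 100
```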
In this scenario, an existing Spark cluster is updated with the components necessary to build a data science platform based on the Anaconda and Jupyter Notebook stack.
- name: anaconda
hosts: all
vars:
anaconda:
update_path: true
remote_user: root
roles:
- role: anaconda
- name: notebook platform dependencies
hosts: all
vars:
notebook:
use_anaconda: true
deploy_kernelspecs_to_workers: false
remote_user: root
roles:
- role: notebook
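The playbooks above are run the same way as in the previous scenarios; the playbook filename below is an assumption:

```shell
# Run the Anaconda/Notebook playbooks against the same inventory.
ansible-playbook --verbose setup-notebook.yml -c paramiko -i <hosts inventory>
```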
License Notes
The Ambari role will install MySQL Community Edition, which is available under the GPL license.
The Notebook role will install R, which is available under the GPL v2 / GPL v3 licenses.
By deploying these packages via the Ansible utility scripts in this project, you are accepting the license terms for these components.