Overview

This is an Ansible Playbook designed for running a Raycrafter Master Server. It installs and configures applications that are needed for production deployments.

It deploys a Django project and sets up Gunicorn and Nginx to serve your site. PostgreSQL is used as the database backend for Django. Celery, with RabbitMQ as its message broker, provides the asynchronous task queue.

GridFTP is used for transferring files to/from the cluster.

On top of that a logging server is deployed. In this case it is Graylog, which depends on Elasticsearch and MongoDB.

Stack:

Tested with OS: Ubuntu 14.04 LTS x64

Architecture

This graph shows the interaction between Master Server and the Cray XC40 'Hornet'.

[Figure: architecture diagram (_images/architecture.png)]

Django has most of the business logic on the server side. It is served with Gunicorn and Nginx. You can add new jobs or query information via a REST API.

When the user submits a new job to Django, the job is stored in the PostgreSQL database. Because submitting jobs to the cluster takes a long time, the work is pushed to Celery workers. Django uses a message broker (RabbitMQ) to send tasks to Celery, and the results of the tasks are stored in the database. The workers ssh onto the cluster to submit jobs or to transfer input and output data. GridFTP, a secure, high-performance protocol, can be used for the data transfer. The Celery workers have to live on another machine for security reasons: according to the security guidelines of the HLRS, an external server cannot have ssh access (see “Richtlinien zur Anbindung externer Server 3.0 HWW und SICOS Version 1.0 vom 19.11.2013”).

Submitting a job to the cluster works by logging onto the cluster via ssh and executing the qsub command. The job stays in a queue until a MOM node processes it. aprun executes a program on the compute nodes.
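The submission step could look roughly like the following sketch. It is not the code the Celery workers actually run; the function names, the host, the user and the script path are hypothetical, and it assumes key-based ssh login and that qsub prints the job identifier on standard output.

```python
import shlex
import subprocess


def build_qsub_command(host, user, script_path):
    """Return the local argv that runs qsub on the cluster via ssh."""
    # The remote side is interpreted by a shell, so quote the path.
    remote_command = "qsub " + shlex.quote(script_path)
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", f"{user}@{host}", remote_command]


def submit_job(host, user, script_path, timeout=60):
    """Submit a job script and return whatever qsub printed (assumed: the job ID)."""
    result = subprocess.run(
        build_qsub_command(host, user, script_path),
        check=True,           # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.stdout.strip()
```

A Celery task would call something like submit_job("cluster.example.org", "alice", "jobs/render.pbs") and store the returned identifier in the database.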

On the compute nodes, a Python client is started. The client asks Django what job it should render and then executes the renderer. The output of the renderer can be transferred back by Celery workers.

All logs are aggregated by a Graylog server. The Python client on the cluster inspects the logs of the renderer and sends them back. Graylog also has a web interface, where you can create metrics and dashboards to quickly inspect the status of the jobs and servers. Logs are processed and indexed by Elasticsearch. MongoDB is used for some metadata by Graylog.

Note

The HLRS does not allow external servers with UDP connections. Logs have to be sent via TCP.
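As an illustration of sending logs over TCP, here is a minimal sketch that uses only the Python standard library. It assumes that a GELF TCP input is configured in Graylog; the helper names, the server name and the port 12201 are assumptions and not taken from this playbook. Over TCP, each GELF message is a JSON document terminated by a null byte.

```python
import json
import socket
import time


def encode_gelf(short_message, host, level=6, **extra):
    """Build one GELF 1.1 message as null-terminated JSON bytes."""
    record = {
        "version": "1.1",
        "host": host,
        "short_message": short_message,
        "timestamp": time.time(),
        "level": level,  # syslog severity, 6 = informational
    }
    # GELF expects additional fields to be prefixed with an underscore.
    for key, value in extra.items():
        record["_" + key] = value
    return json.dumps(record).encode("utf-8") + b"\x00"


def send_gelf(payload, server, port=12201, timeout=5.0):
    """Open a TCP connection to the (assumed) GELF input and send one message."""
    with socket.create_connection((server, port), timeout=timeout) as conn:
        conn.sendall(payload)
```

The client on the cluster could then call send_gelf(encode_gelf("frame rendered", "node-1", job_id=7), "logs.example.org") for each log line it wants to forward.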

Usage in Production

There are some important things to note and do before using this project in production. The current development status is more of a proof of concept.

Security

The current setup is insecure! I’m not a security expert. Let a professional check your setup. Submit improvements via pull requests or issues for extra kudos.

Some hints for where to look:

  • All default passwords and keys should be changed and encrypted!
  • Make sure that your firewall limits access, especially to your logging inputs! There is no authentication for these inputs right now.
  • Check the Nginx SSL settings. Use TLS and strong cipher suites.
  • Check that no unnecessary ports are available to the public.
  • Use encryption for all communications with the outside world. For now Graylog UDP inputs are not encrypted!
  • Check how your passwords are stored in Django and in the Graylog database. Use a strong salt and a slow hash algorithm.
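To make the last hint concrete, the sketch below shows what "strong salt and slow hash" can mean, using PBKDF2 from the Python standard library. It only illustrates the idea and is not how Django or Graylog store passwords internally; the function names and the iteration count are assumptions that you should tune for your hardware.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumption: pick the highest value your server tolerates


def hash_password(password, salt=None, iterations=ITERATIONS):
    """Return (salt, digest) for a password using PBKDF2-HMAC-SHA256."""
    if salt is None:
        salt = os.urandom(16)  # random per-password salt
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, iterations
    )
    return salt, digest


def verify_password(password, salt, expected, iterations=ITERATIONS):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)
```

The salt is stored next to the digest. A higher iteration count makes each guess more expensive for an attacker, at the cost of slower logins.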

Performance

The server might get very slow. A lot of the software is meant to be used in distributed systems, and a single machine might not be able to handle the workload. The logging server in particular might need a lot of configuration and might need to be distributed.

Nginx and Gunicorn settings are important for the performance of the website and REST API. The default settings are not optimal.
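As a starting point for tuning, a Gunicorn configuration file is plain Python. The sketch below uses the rule of thumb from the Gunicorn documentation of (2 x number of cores) + 1 workers; the bind address and the timeout are assumptions and are not the values this playbook deploys.

```python
# gunicorn.conf.py (sketch, values are assumptions)
import multiprocessing

# Listen only on localhost; Nginx proxies public traffic to this address.
bind = "127.0.0.1:8000"

# Rule of thumb from the Gunicorn documentation: (2 x cores) + 1.
workers = multiprocessing.cpu_count() * 2 + 1

# Seconds a worker may stay silent before it is killed and restarted.
timeout = 30
```

Measure under a realistic load before settling on these numbers, since the best worker count depends on how much time requests spend waiting for the database.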

Check the number of Celery workers and their settings. You might need more of them.

Reliability

Some services are not yet monitored by Supervisor. You can also configure more programs to log to the logging server.