Deploy MLflow on EC2 using Terraform with authentication

This tutorial aims to help the reader set up an MLflow server running on an EC2 instance, with an S3 bucket for storing artifacts remotely and MLflow's basic authentication enabled, using Terraform to deploy the infrastructure as code.

After completing this guide, you will have:

  • An MLflow server running on an EC2 instance

  • HTTPS redirection from a custom domain for accessing the UI

  • An S3 bucket for remotely storing artifacts

  • A persistent EBS volume for storing the MLflow backend files

  • MLflow's basic authentication enabled on the server

Prerequisites

  • Ansible CLI should be installed on your machine

  • Terraform CLI should be installed on your machine

  • The AWS CLI should be installed on your machine

  • The AWS profile you wish to use for the deployment should be configured on your machine

  • The GitHub repository should be cloned to your machine

Setting up Environment Variables

In the repository root directory, create a .env file.

  • Copy and paste the .env.template file contents into the .env file.

  • Replace the placeholder values with the values you want to use.

  • Export the environment variables.

    • Either source this file, or copy and paste its contents into the terminal.
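A minimal sketch of what the exported variables might look like (the variable names below are illustrative assumptions; the authoritative list lives in .env.template):

export AWS_PROFILE=my-aws-profile                                  # assumed: the AWS profile used for the deployment
export AWS_REGION=eu-west-1                                        # assumed: the region you are deploying into
export mlflow_artifacts_bucket_name=my-mlflow-artifacts-bucket     # example value for the artifacts bucket variable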

Bootstrapping Terraform state tracking

Init the Terraform project

In your terminal navigate to the terraform/bootstrap directory of the repository:

cd terraform/bootstrap

Init the project:

terraform init

Deploy the state tracking resources

Plan the resources to check for errors:

terraform plan

Deploy the resources:

terraform apply -auto-approve

Set up the remote backend

In the terraform/bootstrap directory, create a backend.tf file.

Copy and paste the backend.tf.template file contents from the terraform/bootstrap directory into the backend.tf file.

Replace the placeholder values with your values from the .env file.

  • These values cannot be passed in as Terraform variables, so they must be hard-coded here (see the sketch below).
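A hedged sketch of what the resulting backend.tf might look like (the bucket, key, and region values are placeholders; use the values from your .env file and the fields defined in backend.tf.template):

cat > backend.tf <<'EOF'
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"    # placeholder: your state bucket name
    key    = "bootstrap/terraform.tfstate"  # placeholder: the bootstrap state key
    region = "eu-west-1"                    # placeholder: your AWS region
  }
}
EOF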

Re-init the project and re-apply:

terraform init
terraform apply -auto-approve

Deploying the MLflow server

Set up the remote backend

In the terraform/mlflow directory, create a backend.tf file.

Copy and paste the backend.tf.template file contents from the terraform/mlflow directory into the backend.tf file.

Replace the placeholder values with your values from the .env file.

  • The main difference between the two backend files is the key field.

  • It is crucial that you use the correct key for each backend or there will be state issues.
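As a rough sketch, the mlflow backend should point at the same bucket and region as the bootstrap backend but use its own key (the values below are placeholders; take the real ones from your .env file and backend.tf.template):

cat > backend.tf <<'EOF'
terraform {
  backend "s3" {
    bucket = "my-terraform-state-bucket"    # same state bucket as the bootstrap backend
    key    = "mlflow/terraform.tfstate"     # a distinct key so the two states do not collide
    region = "eu-west-1"
  }
}
EOF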

Init the Terraform project

terraform init

Using targeted deployments

We will be using the terraform apply -target=module.<module_name> -auto-approve pattern throughout. Terraform will emit warnings about using -target.

Ignore these warnings; we want to be able to apply each module independently.

This approach will allow us to destroy resources temporarily without destroying the entire configuration.

This will be useful if we want to temporarily destroy the EC2 instance and only have it running when we need it.

Deploying the network module

Plan the network resources to check for errors:

terraform plan -target=module.network

Deploy the network resources:

terraform apply -target=module.network -auto-approve

Take note of the mlflow_ec2_eip_public_ip output. We will use this later.
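Assuming the value is exposed as a root-level output, you can re-read it at any time with:

terraform output mlflow_ec2_eip_public_ip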

Deploying the s3_bucket module

Plan the s3_bucket resources to check for errors:

terraform plan -target=module.s3_bucket

Deploy the s3_bucket resources:

terraform apply -target=module.s3_bucket -auto-approve

Deploying the iam module

Plan the iam resources to check for errors:

terraform plan -target=module.iam

Deploy the iam resources:

terraform apply -target=module.iam -auto-approve

Deploy the ec2_storage module

Plan the ec2_storage resources to check for errors:

terraform plan -target=module.ec2_storage

Deploy the ec2_storage resources:

terraform apply -target=module.ec2_storage -auto-approve

Deploy the ec2_instance module

Plan the ec2_instance resources to check for errors:

terraform plan -target=module.ec2_instance

Deploy the ec2_instance resources:

terraform apply -target=module.ec2_instance -auto-approve

Deploy the mlflow_server module

In the code editor of your choice, let's make the following changes within the mlflow_server module:

  • In the terraform/mlflow/mlflow_server/ansible directory, create a hosts.ini file.

  • Copy and paste the hosts.ini.template file contents into the hosts.ini file.

  • Replace xx.xxx.xxx.xxx with your mlflow_ec2_eip_public_ip that you took note of earlier.

  • In the terraform/mlflow/mlflow_server/ansible directory, create a start_mlflow_server.yml file.

  • Copy and paste the start_mlflow_server.yml.template file contents into the start_mlflow_server.yml file.

  • Navigate to the Create MLFlow auth configuration file block.

  • Replace the <some-secure-password> value of the admin_password key with a suitable password.

    • This sets the default admin password of the server.

    • It is good practice not to use the system default (which is password)

    - name: Create MLFlow auth configuration file
      ansible.builtin.copy:
        dest: /home/ec2-user/mount/custom_auth_config.ini
        content: |
          [mlflow]
          default_permission = READ
          database_uri = sqlite:///basic_auth.db
          admin_username = admin
          admin_password = <some-secure-password>
          authorization_function = mlflow.server.auth:authenticate_request_basic_auth
        owner: ec2-user
        group: ec2-user
        mode: '0644'
  • Navigate to the Activate virtual environment and start MLflow server block.

  • Replace the <mlflow_artifacts_bucket_name> value of the --default-artifact-root and --artifacts-destination options.

    • Use the value you chose earlier for your mlflow_artifacts_bucket_name environment variable.

    • This tells MLflow where to store and retrieve its artifacts.

    - name: Activate virtual environment and start MLflow server
      shell: |
        source /home/ec2-user/mount/.venv/bin/activate
        export MLFLOW_TRACKING_URI=http://127.0.0.1:8080
        export MLFLOW_AUTH_CONFIG_PATH=/home/ec2-user/mount/custom_auth_config.ini
        cd /home/ec2-user/mount
        mlflow server --backend-store-uri mlruns \
            --default-artifact-root s3://<mlflow_artifacts_bucket_name> \
            --artifacts-destination s3://<mlflow_artifacts_bucket_name> \
            --host 0.0.0.0 \
            --app-name basic-auth \
            --serve-artifacts \
            --port 8080 &

Plan the mlflow_server resources to check for errors:

terraform plan -target=module.mlflow_server

Deploy the mlflow_server resources:

terraform apply -target=module.mlflow_server -auto-approve

When prompted, type yes to add the server's SSH fingerprint to your known hosts.
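To sanity-check that the server came up, you can SSH to the instance and look for the MLflow process (a hypothetical check; substitute your own key path and the Elastic IP you noted earlier):

ssh -i ~/.ssh/<your-key>.pem ec2-user@<mlflow_ec2_eip_public_ip> 'pgrep -af "mlflow server"'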

Deploy the load_balancer module [optional]

If you want to be able to interact with your server over HTTPS and use a custom domain/subdomain, then follow the instructions below.

If you do not need this (note that the load balancer costs roughly $17 USD/month), you can skip this section and use the Elastic IP's public address instead.

In the code editor of your choice, let's make the following changes within the load_balancer module:

  • In the load_balancer module folder, create an acm_certificate.tf file and a route_53_records.tf file.

  • Copy and paste the acm_certificate.tf.template file contents into the acm_certificate.tf file.

  • Replace the <your.domain.or.subdomain.here> value of the domain_name key with a valid domain/subdomain.

    • This must be a record that can be created within your hosted zone and is not already in use.

  • Copy and paste the route_53_records.tf.template file contents into the route_53_records.tf file.

  • Replace the <your.domain.or.subdomain.here> value of the name key with a valid domain/subdomain.

    • Make this the same as the value you added to acm_certificate.tf above.

Plan the load_balancer resources to check for errors:

terraform plan -target=module.load_balancer

Deploy the load_balancer resources:

terraform apply -target=module.load_balancer -auto-approve
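Once the apply has finished and DNS has propagated, you can do a quick check that the endpoint responds over HTTPS (reusing the domain placeholder from above):

curl -I https://<your.domain.or.subdomain.here>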

Working with the Python API

Now that our server is up and running (assuming the above steps executed successfully), we will want to start adding users and modifying permissions.

I have included some sample Python scripts for carrying out basic tasks like creating and deleting users and updating passwords.

For a comprehensive overview of the Python API for user management, refer to the MLflow authentication documentation.

To get up and running start a virtual environment and install the requirements.

Do this by navigating to the python directory in your terminal and executing the commands below:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

To start using the Python scripts included, you must export some additional environment variables.

Create a .env file in the python directory and copy the contents of python/.env.template (in the same directory) into it.

Adjust the values to match your configuration. The tracking URI, for example, can be either the custom domain you configured for the load balancer or the Elastic IP's public address.

Export these environment variables by either sourcing the file or copying and pasting the export commands into the terminal.

Once the dependencies are installed, modify the Python scripts to start making adjustments to your users.

With the virtual environment active and the scripts modified, you can run them.

  • python update_user_password.py: updates the specified user's password.

  • python create_user.py: creates the specified user and password.

  • python delete_user.py: deletes the specified user.
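A hypothetical session might look like the following (the variable names are assumed to mirror python/.env.template; MLFLOW_TRACKING_USERNAME and MLFLOW_TRACKING_PASSWORD are the standard variables MLflow reads for basic authentication):

export MLFLOW_TRACKING_URI=https://<your.domain.or.subdomain.here>   # or the Elastic IP endpoint if you skipped the load balancer
export MLFLOW_TRACKING_USERNAME=admin                                # the admin user created earlier
export MLFLOW_TRACKING_PASSWORD=<some-secure-password>               # the admin password you set in the auth config

python create_user.py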

Cleaning Up

Temporarily removing the EC2 instance

Use the destroy command with the -target option to take down just the EC2 instance.

terraform destroy -target=module.ec2_instance -auto-approve

Re-deploy the instance again with:

terraform apply -target=module.ec2_instance -auto-approve

This can be useful if you want to make some changes to the EC2 instance.

Alternatively, stop the instance temporarily with the AWS CLI:

aws ec2 stop-instances --instance-ids instance_id

And restart the instance with:

aws ec2 start-instances --instance-ids instance_id

Replace instance_id with the instance ID that is output after you have deployed the ec2_instance module.
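If the instance ID is exposed as a Terraform output, you can combine the two commands (the output name below is a hypothetical example):

aws ec2 stop-instances --instance-ids "$(terraform output -raw <instance_id_output_name>)"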

Permanently destroying the infrastructure

To permanently remove all resources, simply run the destroy command without the -target option.

terraform destroy

This cannot be undone, and you will need to re-configure the IP address in terraform/mlflow/mlflow_server/ansible/hosts.ini when redeploying.

Conclusion

Congrats! If you followed the instructions successfully, you now have an MLflow server running on a custom domain for tracking your ML experiments with other users. This is a fairly basic implementation of the MLflow service on EC2.

We will continue to build on this in future posts, where we will demonstrate features such as custom authentication, central database integration, Dockerized deployments that utilize serverless architectures, and more.
