UCSD Datahub
Overview
Datahub is a resource provided by UCSD that lets us take advantage of UCSD's powerful GPUs to speed up our machine learning and computer vision tasks. We can connect to these machines remotely and access their power no matter where we are.
Setup
Fill out this form to request access to use Datahub for Triton UAS. If you have any questions about the form fields, contact a software lead.
Usage
Connecting and accessing
There are 3 ways of connecting to and accessing Datahub:
- Web Interface
    - Navigate to https://datahub.ucsd.edu and log in with your UCSD Active Directory credentials
    - Select an environment with a GPU
    - Click "New" in the top right to launch a terminal to run your code
    - Create/upload/edit files in the graphical file manager and text editor
- VS Code SSH Extension
    - Install VS Code
    - Install the Remote - SSH VS Code extension
    - Generate an SSH key locally if you haven't already:
        - Use the command `ssh-keygen -t rsa` to generate one
    - Copy the contents of your local public key
        - If you used the `ssh-keygen -t rsa` command then your key will be located at `~/.ssh/id_rsa.pub` on your local machine
        - If you used a different key generation algorithm then your public key will be named something else. Just make sure the file ends with the `.pub` extension
    - Open a terminal and SSH into Datahub with the command `ssh UCSD-USERNAME@dsmlp-login.ucsd.edu`
    - Paste your public key into the file at `~/.ssh/authorized_keys` on Datahub
    - Open the `~/.ssh/config` file on your local machine
    - Add a Host entry to the `~/.ssh/config` file that will tell VS Code to spawn a new container on Datahub and run the VS Code Remote Server inside it. (Note that this config does not request a GPU)

      ```
      Host datahub
          User UCSD-USERNAME
          ProxyCommand ssh -i ~/.ssh/id_rsa UCSD-USERNAME@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch-scipy-ml.sh -P Always -i tritonuas/cv-docker:master -H -N datahub-vscode
      ```

        - Note that the path to your local SSH private key may vary depending on which key generation algorithm you used
    - Add a similar Host entry that will request a GPU. If you're not training a model with a GPU, fall back to the previous host entry.

      ```
      Host datahub-gpu
          User UCSD-USERNAME
          ProxyCommand ssh -i ~/.ssh/id_rsa UCSD-USERNAME@dsmlp-login.ucsd.edu /opt/launch-sh/bin/launch-scipy-ml.sh -P Always -g 1 -i tritonuas/cv-docker:master -H -N datahub-vscode
      ```
    - In the menubar click View -> Command Palette
    - Select "Connect to Host"
    - Select one of the host entries you created
    - Open your project folder and use the integrated VS Code terminal to run commands
        - You should be inside our container and be able to use `pipenv`, for example
- SSH (Terminal)
    - Enter the command: `ssh UCSD-USERNAME@dsmlp-login.ucsd.edu`
    - Use a prebuilt docker container or a custom one you've created (more info here)
Long-running tasks/training (background containers)
Accessing background containers via SSH ProxyCommand currently doesn't work, so you won't be able to use VS Code to work in a background container. To get around this, after finishing development through VS Code, exit and shut down the container, then SSH in directly from the terminal to start and access a background container. You'll run the training commands directly from that terminal session.
Before you train your model properly on the full dataset with a higher epoch count, make sure the code as a whole works on a smaller dataset and fewer epochs, so an exception doesn't crash the run after several hours of training and your metrics and results are generated properly at the end.
Simply add the `-b` flag to the command you used to launch the container. This allows your container to run in the background for up to 6 hours (the default is much lower).
To enter the shell of a running container:
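Datahub containers run as Kubernetes pods, so attaching a shell likely goes through DSMLP's `kubesh` helper on the login node; treat the helper name as an assumption and check the DSMLP docs, and note that `POD_NAME` is a placeholder:

```shell
# From dsmlp-login: attach an interactive shell to your running pod.
# POD_NAME comes from the output of `kubectl get pods`.
kubesh POD_NAME
```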
Note that once you run a training command in the terminal, the shell will lock up. To avoid this (and to avoid accidentally killing the training from the terminal), you can use `nohup` to send the training task to the background and free up the shell. The way to use `nohup` is:
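A minimal sketch of the pattern (`train.py` is an illustrative script name, not necessarily one from our repo):

```shell
# nohup detaches the command from the terminal's hangup signal;
# the trailing "&" backgrounds it, and output goes to nohup.out.
nohup python train.py &
```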
The "&" at the end is crucial, as it actually frees the shell; all output from the command is piped into a file called `nohup.out`. Make sure to gitignore this file!
For example, if you were running it with a pipenv command:
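A sketch of the same pattern under pipenv (the script name is illustrative):

```shell
# pipenv run executes the command inside the project's virtualenv;
# nohup + "&" keep it running after the shell closes.
nohup pipenv run python train.py &
```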
Now the shell is safe to close, and the model will train in the background!
Datahub Containers
While accessing Datahub from the terminal, you will probably need to use a Docker container to run the software you're interested in. By default, Datahub doesn't have things like Python 3, PyTorch, CUDA, etc. enabled. We can use Docker images to access this software and make our own.
Premade Container
There are two ways we recommend to pull and launch premade containers:
1. Use our premade container
Run this command (recommended to place this command in a script):
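Based on the launch script and image referenced in the SSH config entries above, the command is likely along these lines (verify the flags against the DSMLP docs):

```shell
# Launch our cv-docker image on Datahub with one GPU.
#   -i  image to run
#   -g  number of GPUs to request
#   -P Always  always pull the latest image
/opt/launch-sh/bin/launch-scipy-ml.sh -i tritonuas/cv-docker:master -g 1 -P Always
```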
If you're feeling spicy, you can actually specify which GPU model on Datahub you want to use (make sure it's actually available on the status page). For example, if you want to request a 2080 Ti:
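The DSMLP launch scripts accept a GPU-type selector; the flag and value below (`-v 2080ti`) are an assumption, so check the DSMLP documentation for the exact syntax:

```shell
# Request one GPU, pinned to a specific model (flag/value assumed).
/opt/launch-sh/bin/launch-scipy-ml.sh -i tritonuas/cv-docker:master -g 1 -v 2080ti -P Always
```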
2. Select one of the prebuilt UCSD ETS containers
Select a container here and launch the container on datahub using the instructions here
Background Containers
If you would like your processes to not be killed when your host machine loses connection to the remote Datahub server, you can launch Docker containers that stay alive in the background.
Simply add the `-b` flag to the command you used to launch the container. This allows your container to run in the background for up to 6 hours.
If you are using our premade container the command will look something like this:
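A sketch, reusing the launch command from above with `-b` added (verify flags against the DSMLP docs):

```shell
# Launch our premade container in the background (-b) with one GPU.
/opt/launch-sh/bin/launch-scipy-ml.sh -i tritonuas/cv-docker:master -g 1 -P Always -b
```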
Here are a few helpful commands for interfacing with background containers:
Get a list of all running containers:
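Since Datahub containers are Kubernetes pods, listing them is likely done with `kubectl` from the login node:

```shell
# List your running pods; the NAME column is the POD_NAME used below.
kubectl get pods
```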
Enter the shell of a running container (`POD_NAME` is from the output of the previous command):
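Assuming DSMLP's `kubesh` helper is available on the login node:

```shell
# Attach an interactive shell to the named pod.
kubesh POD_NAME
```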
Stop and delete a running container:
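Likely via `kubectl` (note this terminates anything still running inside the pod):

```shell
# Stop and delete the pod; any processes inside are killed.
kubectl delete pod POD_NAME
```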