An attempt to overhaul and modernize the infrastructure used to run `aframe`, unifying multiple threads of research.

Please see the ml4gw quickstart for help setting up your environment on the LIGO Data Grid (LDG) and configuring access to Weights & Biases and the Nautilus hypercluster. That quickstart includes a Makefile and instructions for setting up all of the necessary software, environment variables, and credentials required to run `aframe`.
NOTE: this repository is a WIP. You will encounter bugs, quirks, and undesired behavior. If you have any suggestions on making the development process easier, please open up an issue!
Each sub-task in `aframe` is implemented as a containerized application whose environment and Apptainer definition file live alongside the code they're meant to deploy. These live under the `projects` sub-directory. The projects include:

- `data`: querying strain data and generating waveforms for training and testing
- `train`: PyTorch Lightning code for training neural networks
- `export`: exporting trained networks as accelerated executables for inference
- `infer`: launching Triton inference servers and deploying inference clients to analyze timeslides and injections
- `utils`: general utilities used by all projects (TODO: move this under `libs`)

You can build and execute code inside these containers locally. As an example, let's go through the process of generating data for training `aframe`.
First, you will need to build the `data` project container:

```bash
mkdir -p ~/aframe/images
cd projects/data
apptainer build ~/aframe/images/data.sif apptainer.def
```
Once that is complete, let's query for open science segments containing high-quality data:

```bash
mkdir -p ~/aframe/data
apptainer run ~/aframe/images/data.sif \
    python -m data query \
    --flags='["H1_DATA", "L1_DATA"]' \
    --start 1240579783 \
    --end 1241443783 \
    --output_file ~/aframe/data/segments.txt
```
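If you're curious, this query amounts to asking GWOSC which stretches of open data pass the given quality flags. A `gwpy` sketch of the idea (this is an assumption on my part; see `projects/data` for the actual implementation):

```python
from gwpy.segments import DataQualityFlag

# Ask GWOSC for the active segments of one data-quality flag over the
# same time range used above, then print each segment's endpoints.
flag = DataQualityFlag.fetch_open_data("H1_DATA", 1240579783, 1241443783)
for segment in flag.active:
    print(segment[0], segment[1])
```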
Inspecting the output (`vi ~/aframe/data/segments.txt`), it looks like there are high-quality data segments at `(1240579783, 1240587612)` and `(1240594562, 1240606748)`. Let's fetch strain data during those segments; one will be used for training, the other for validation:
```bash
apptainer run ~/aframe/images/data.sif \
    python -m data fetch \
    --start 1240579783 \
    --end 1240587612 \
    --channels='["H1", "L1"]' \
    --sample_rate 2048 \
    --output_directory ~/aframe/data/background/

apptainer run ~/aframe/images/data.sif \
    python -m data fetch \
    --start 1240594562 \
    --end 1240606748 \
    --channels='["H1", "L1"]' \
    --sample_rate 2048 \
    --output_directory ~/aframe/data/background/
```
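Fetching open strain data boils down to something like the following `gwpy` sketch (again an assumption, not necessarily how the `data` project implements it):

```python
from gwpy.timeseries import TimeSeries

# Download open strain data for one interferometer over the training
# segment, resample it to the target rate, and write it to disk. The
# output path here is illustrative.
strain = TimeSeries.fetch_open_data("H1", 1240579783, 1240587612)
strain = strain.resample(2048)
strain.write("H1_background.hdf5", format="hdf5")
```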
Finally, let's generate some waveforms:

```bash
apptainer run ~/aframe/images/data.sif \
    python -m data waveforms \
    --prior data.priors.priors.end_o3_ratesandpops \
    --num_signals 10000 \
    --waveform_duration 8 \
    --sample_rate 2048 \
    --output_file ~/aframe/data/signals.hdf5
```
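To sanity-check what was produced, you can peek inside the output file with `h5py`. Rather than assume a particular dataset layout, this snippet just prints whatever is there:

```python
import h5py
from pathlib import Path

# Walk the HDF5 file and print the name and shape of every dataset.
with h5py.File(Path("~/aframe/data/signals.hdf5").expanduser(), "r") as f:
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))
```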
Great! We are now ready to train a model. In the same fashion, let's build the training container:

```bash
mkdir -p ~/aframe/images
cd projects/train
apptainer build ~/aframe/images/train.sif apptainer.def
```
and launch a training job!

```bash
mkdir -p ~/aframe/results
APPTAINERENV_CUDA_VISIBLE_DEVICES=<ID of GPU you want to train on> apptainer run --nv ~/aframe/images/train.sif \
    python -m train \
    --config /opt/aframe/projects/train/config.yaml \
    --data.ifos=[H1,L1] \
    --data.data_dir ~/aframe/data/train \
    --trainer.logger=WandbLogger \
    --trainer.logger.project=aframe \
    --trainer.logger.name=my-first-run \
    --trainer.logger.save_dir=~/aframe/results/my-first-run
```
You can inspect the possible arguments for the train project by running:

```bash
apptainer run ~/aframe/images/train.sif python -m train --help
```

The command above will infer most of your training arguments from the YAML config that was copied into the container at build time. If you want to change this config, or if you change any code and want to see those changes reflected inside the container, simply update the start of the command to read `apptainer run --nv --bind .:/opt/aframe`.
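For context, the `--data.*` and `--trainer.*` flags follow the Lightning CLI convention of mapping command-line overrides onto constructor arguments. A minimal sketch of the pattern (the actual entry point in `projects/train` will differ):

```python
from lightning.pytorch.cli import LightningCLI

# LightningCLI builds the model, datamodule, and trainer from a YAML
# config, then applies flags like --data.ifos or --trainer.logger.name
# as overrides of individual constructor arguments.
def main():
    LightningCLI()

if __name__ == "__main__":
    main()
```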
Once your run has started, you can go to wandb.ai and track your loss and validation score. If you don't want to track your run with W&B, just remove the first three `--trainer.logger` arguments above; your training metrics will instead be saved to a local CSV in the `save_dir`.

You can even train using multiple GPUs, simply by specifying a comma-separated list of GPU indices to `APPTAINERENV_CUDA_VISIBLE_DEVICES`.
## `luigi` and `law`
The command above is simple enough, but it might be nice to 1) specify arguments with configs, and 2) incorporate tasks as steps in a larger pipeline. To do this, this repo takes advantage of a library called `luigi` (and a slightly higher-level wrapper, `law`) to construct configurable, modular tasks that can be strung into pipelines. To understand the structure of `luigi` tasks, it is recommended to read the docs.
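For a flavor of what a `luigi` task looks like, here is a toy example (not one of `aframe`'s actual tasks):

```python
import luigi

class MakeGreeting(luigi.Task):
    """Toy task: write a greeting to a file. If the output already
    exists, luigi considers the task complete and skips it."""

    name = luigi.Parameter(default="aframe")

    def output(self):
        return luigi.LocalTarget(f"/tmp/greeting-{self.name}.txt")

    def run(self):
        with self.output().open("w") as f:
            f.write(f"hello {self.name}\n")
```

Tasks declare their outputs (and, via a `requires` method, their dependencies), which is what lets them be strung together into pipelines.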
The top-level `aframev2` repository contains the environment that is used to launch tasks with `luigi` and `law`. To install this environment, simply run

```bash
poetry install
```

in the root of this repository.
To run a local training job, you can now run:

```bash
poetry run law run aframe.TrainLocal \
    --gpus <ID of GPUs to train on> \
    --image ~/aframe/images/train.sif \
    --config /opt/aframe/projects/train/config.yaml \
    --data-dir ~/aframe/data/train \
    --run-dir ~/aframe/results/my-first-luigi-run \
    --use-wandb \
    --wandb-name my-first-luigi-run
```
This takes care of setting some sensible defaults for you and allows for simpler syntax, like the `--gpus` arg and the `--use-wandb` flag, which will configure most of your W&B settings for you.
All tasks also come with a built-in `--dev` arg which will automatically map your current code into the container for super low-friction development. To see all the parameters a task has to offer, you can run e.g.

```bash
poetry run law run aframe.tasks.TrainLocal --help
```
As mentioned, `luigi` and `law` allow for the construction of large-scale pipelines. The `aframe/pipelines/` directory contains common analysis pipelines. Currently, only the `sandbox` pipeline is available. It launches a single end-to-end `aframe` workflow consisting of training/testing data generation, model training, model export, and inference using a Triton server. The easiest way to run the pipeline is via its config file, which is passed to `law` through the `LAW_CONFIG_FILE` environment variable:

```bash
LAW_CONFIG_FILE=aframe/pipelines/sandbox/sandbox.cfg poetry run law run aframe.pipelines.sandbox.Sandbox --gpus <GPU IDs>
```
To search over hyperparameters, you can launch a local hyperparameter tuning job by running:

```bash
APPTAINERENV_CUDA_VISIBLE_DEVICES=<IDs of GPUs to tune on> apptainer run --nv --bind .:/opt/aframe ~/aframe/images/train.sif \
    python -m train.tune \
    --config /opt/aframe/projects/train/config.yaml \
    --data.ifos=[H1,L1] \
    --data.data_dir ~/aframe/data/train \
    --trainer.logger=WandbLogger \
    --trainer.logger.project=aframe \
    --trainer.logger.save_dir=~/aframe/results/my-first-tune \
    --tune.name my-first-tune \
    --tune.storage_dir ~/aframe/results/my-first-tune \
    --tune.temp_dir ~/aframe/results/my-first-tune/ray \
    --tune.num_samples 8 \
    --tune.cpus_per_gpu 6 \
    --tune.gpus_per_worker 1 \
    --tune.num_workers 4
```
This will launch 8 hyperparameter search jobs that will execute on 4 workers using the Asynchronous Successive Halving Algorithm (ASHA). All the runs will be given the same group ID in W&B, and will be assigned random names in that group.
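For reference, this is roughly how ASHA is configured with Ray Tune (illustrative values, not `aframe`'s actual settings):

```python
from ray.tune.schedulers import ASHAScheduler

# ASHA races trials against one another: at each "rung" it promotes only
# the most promising trials and terminates the rest early.
scheduler = ASHAScheduler(
    time_attr="training_iteration",  # what counts as one unit of training
    metric="valid_auroc",            # hypothetical validation metric name
    mode="max",
    grace_period=1,                  # minimum iterations before a trial can be stopped
    reduction_factor=2,              # keep ~1/2 of the trials at each rung
)
```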
NOTE: for some reason, right now this will launch one job at a time that takes up all available GPUs. This needs sorting out.
The cool thing is that if you already have a ray cluster running somewhere, you can distribute your jobs over that cluster by simply adding the `--tune.endpoint <ip address of ray cluster>:10001` command-line argument. This isn't implemented at the `luigi` level yet, but the skeleton of how it will work is in `aframe/tasks/train.py`.
For reference, the W&B arguments used above mean the following (see the logger sketch after these lists):

- `name`: the name the run will be assigned
- `group`: the group to which the run will be assigned. This is useful for runs that are part of the same experiment but execute in different scripts, e.g. a hyperparameter sweep or maybe separate train, inference, and evaluation scripts
- `tags`: comma-separated list of tags to give your run. Makes it easy to filter in the dashboard, e.g. for `autoencoder` runs
- `project`: the workspace consisting of multiple related experiments that your run is a part of, e.g. `aframe`
- `entity`: the group managing the experiments your run is associated with, e.g. `ml4gw`. If left blank, the project and run will be associated with your personal account

Two of the tuning arguments deserve a note as well:

- `temp_dir`: only matters when tuning remotely, and is just a consequence of `ray` trying to write temp files to a root directory, which breaks on LDG
- `storage_dir`: should be a remote S3 bucket that all your workers can access. You'll need to specify an `AWS_ENDPOINT_URL` environment variable for those workers so they know where your bucket lives

Remote tuning also depends on a few other pieces:

- an `ml4gw` branch with the autoencoder library
- code in `aframe/base.py` that uses ray-kube to spin up a cluster in a law `Task`, with a `configure_cluster` method for task-specific cluster configuration before launching, e.g. setting secrets, environment variables, etc.
- setting `AWS_ENDPOINT_URL` to the desired target
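Circling back to the W&B arguments: they map more or less directly onto Lightning's `WandbLogger`, which forwards extra keyword arguments to `wandb.init`. A sketch with made-up values:

```python
from lightning.pytorch.loggers import WandbLogger

logger = WandbLogger(
    name="my-first-run",     # display name of this run
    project="aframe",        # workspace of related experiments
    save_dir="results/",     # where W&B writes its local files
    # everything below is forwarded to wandb.init:
    entity="ml4gw",          # team account; omit to use your personal one
    group="my-first-sweep",  # hypothetical group tying related runs together
    tags=["autoencoder"],    # tags for filtering in the dashboard
)
```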
There are tons of `TODO`s littering the code that cover stuff I'll have missed here. One major one is the ability to log plots of model predictions to W&B during validation; see my comments on it here. Basically, you should be able to define callbacks for various tasks that have an `on_validation_score` method, to which model inputs and outputs get passed so they can be logged to W&B.
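A sketch of what such a callback could look like (the `on_validation_score` hook is the proposal here; it doesn't exist yet, and the scatter plot is just illustrative):

```python
import lightning.pytorch as pl
import wandb

class PredictionPlotter(pl.Callback):
    """Hypothetical callback: receives model inputs and outputs at
    validation time and logs them to W&B as a scatter plot."""

    def on_validation_score(self, trainer, module, inputs, outputs):
        table = wandb.Table(
            data=[[float(x), float(y)] for x, y in zip(inputs, outputs)],
            columns=["input", "output"],
        )
        plot = wandb.plot.scatter(table, "input", "output")
        trainer.logger.experiment.log({"val/predictions": plot})
```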
I think this will be particularly important for the autoencoder work, where visualizing what it’s learning will be instructive.
More broadly, it will be useful to start up-leveling some of the training framework utilities into a library that sits one level above `ml4gw`. I'm thinking of:

- the `Dataset` here
- the `luigi`/`law` stuff, which would probably be its own library even one layer above this