How to handle the machine learning project lifecycle using GitHub branches?



The project can be organized in the following way:

.github/workflows/ — GitHub workflow definitions for the CI.

data/ — contains all input and output data, usually split into subfolders such as raw and processed for clarity. Data files are NOT version controlled (VC) by Git; instead, every file is tracked with Data Version Control (DVC), separately on each branch.

logs/ — all logs from the code, together with model training loss information in the form of a CSV file (e.g. written by TensorFlow's CSVLogger callback). This folder is not VC by Git, but we can optionally use DVC here.
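
To illustrate what such a loss log might look like, here is a minimal, framework-free sketch. It is not the TensorFlow callback itself (in Keras, `tf.keras.callbacks.CSVLogger` does this job); the function name `append_epoch` and the file name in the usage note are assumptions made only for this example.

```python
import csv
from pathlib import Path


def append_epoch(log_file: str, epoch: int, metrics: dict) -> None:
    """Append one row of per-epoch metrics, writing the header on first use.

    Hypothetical helper: the metric names are whatever the caller passes in,
    and every call for a given file is expected to use the same metric keys.
    """
    path = Path(log_file)
    path.parent.mkdir(parents=True, exist_ok=True)  # create logs/ if missing
    fieldnames = ["epoch"] + sorted(metrics)
    is_new = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        if is_new:
            writer.writeheader()
        writer.writerow({"epoch": epoch, **metrics})
```

A training loop could call `append_epoch("logs/training.csv", epoch, {"loss": loss, "val_loss": val_loss})` once per epoch, so that every run on a branch leaves a plain CSV behind in logs/.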

mlruns/ — used by the MLflow package, which automatically logs all TensorFlow-related training runs together with their artifacts and configs. This folder is not VC by Git. It should always be kept up to date, especially for training on remote servers, so that the full project development history across all branches stays visible.

models/ — stores models from all branches. File names typically include the branch name so that models are easy to tell apart (I typically keep only one model per branch). NOT VC by Git or DVC.
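
One possible way to put the branch name into the model file name is sketched below. The helper names `current_branch` and `model_path`, the `model_` prefix, and the `.h5` suffix are assumptions for illustration; the source does not prescribe a naming scheme.

```python
import re
import subprocess
from pathlib import Path


def current_branch(default: str = "unknown") -> str:
    """Return the checked-out Git branch name, or `default` if Git is unavailable."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],
            capture_output=True, text=True, check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        return default  # not a repository, or git is not installed
    return result.stdout.strip() or default


def model_path(branch: str, models_dir: str = "models", suffix: str = ".h5") -> Path:
    """Build a per-branch model path (hypothetical naming scheme)."""
    # Branch names may contain "/" (e.g. feature/lstm), which is unsafe in file names.
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", branch).strip("-")
    return Path(models_dir) / f"model_{safe}{suffix}"
```

Calling `model_path(current_branch())` before saving would then give each branch its own file under models/, which fits the one-model-per-branch habit described above.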

sql/ — all queries/code needed to easily reproduce raw_input_data.csv for the project/branch (VC by Git).

tools/ — the source code for all helper functions/classes used in all steps of the project (training, analysis, evaluation, packaging, monitoring, etc.). This is the general tool set developed for the project. Anything that can be generalized and reused belongs here, rather than as one-off definitions scattered across notebooks.

imgs/ — all the images related to the current branch (model architecture, training losses, etc.) (VC by Git).

[file name missing] — information about the current model, approach and changes related to the current branch (VC by Git).

[file name missing] — configuration of the project for the current branch (VC by Git).

train.ipynb — base minimalist training pipeline (VC by Git or not; to be decided).

eval.ipynb — base minimalist evaluation pipeline (VC by Git or not; to be decided).

requirements.txt — Python package requirements for the project (mainly for the CI).

.env — keeps all secrets and tokens. NOT VC by Git! Keep this one safe.
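
As a rough sketch of how code in the project could read these secrets without hard-coding them, the snippet below parses simple KEY=VALUE lines using only the standard library. The function name `load_env` is an assumption for this example, and the parser is deliberately minimal (no multi-line values, no variable expansion); a dedicated package such as python-dotenv is commonly used for the same purpose.

```python
import os
from pathlib import Path


def load_env(path: str = ".env", override: bool = False) -> dict:
    """Parse KEY=VALUE lines from `path` and export them to os.environ.

    Returns the parsed pairs. Existing environment variables are kept
    unless `override` is True.
    """
    loaded = {}
    env_file = Path(path)
    if not env_file.exists():
        return loaded  # nothing to load, e.g. on CI where secrets are injected
    for raw in env_file.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        key = key.strip()
        value = value.strip().strip("'\"")  # drop surrounding quotes
        loaded[key] = value
        if override or key not in os.environ:
            os.environ[key] = value
    return loaded
```

After `load_env()`, a token would be read with `os.environ["SOME_TOKEN"]` (the variable name here is made up), so the secret itself never appears in notebooks or in anything tracked by Git.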

.gitignore — base Git setup listing what not to track.

.pre-commit-config.yaml — pre-commit hook setup for the project (typically isort, black, etc.), which keeps the Python style coherent and formats the code automatically.
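
The layout above could be bootstrapped with a small script along these lines. This is only a sketch: the function name `scaffold` and the exact .gitignore entries are assumptions derived from the "not VC by Git" notes in the list. data/ is intentionally left out of the ignore list, on the assumption that DVC writes its own .gitignore entries for the files it tracks.

```python
from pathlib import Path

# Folders named in the layout described above.
FOLDERS = [
    ".github/workflows",
    "data/raw",
    "data/processed",
    "logs",
    "mlruns",
    "models",
    "sql",
    "tools",
    "imgs",
]

# Paths the layout marks as not version controlled by Git.
# Data files are left to DVC, which manages its own ignore entries.
GIT_IGNORED = ["logs/", "mlruns/", "models/", ".env"]


def scaffold(root: str = ".") -> Path:
    """Create the folder skeleton and a starter .gitignore under `root`."""
    base = Path(root)
    for folder in FOLDERS:
        (base / folder).mkdir(parents=True, exist_ok=True)
    gitignore = base / ".gitignore"
    if not gitignore.exists():  # never overwrite an existing file
        gitignore.write_text("\n".join(GIT_IGNORED) + "\n", encoding="utf-8")
    return base
```

Running `scaffold("my_project")` once on the main branch would give every later branch the same skeleton to start from; the function is safe to re-run because existing folders and an existing .gitignore are left alone.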

You can find the link to this project: HERE


