A well-organized structure for machine learning projects makes them easier to understand and modify. Furthermore, using a consistent structure across projects minimizes confusion. Since there is no one-size-fits-all solution, we will look at three methods for establishing a machine learning project structure: manual folder and file creation, a custom-made template.py file, and the Cookiecutter package.
... where human hands dance and minds orchestrate, we embark on a journey devoid of automation.
The manual approach, in short:
1. Project Root: This is the main folder that contains your entire machine learning project.
2. Data: This folder is dedicated to storing your datasets and any relevant data files. It can be further divided into subfolders such as:
- Raw: Contains the original, unprocessed data files.
- Processed: Contains preprocessed data that has undergone cleaning, transformation, and feature engineering.
- External: Contains any external data sources that you use for your project.
3. Notebooks: This folder is for Jupyter notebooks or any other interactive notebooks you use for experimentation, analysis, and model development. You can organize it with subfolders like:
- Exploratory: Notebook(s) for data exploration and visualization.
- Modeling: Notebook(s) for model development, training, and evaluation.
- Inference: Notebook(s) for deploying and using trained models for predictions.
4. Scripts: This folder contains reusable code scripts or modules that you use in your project. It may include:
- Preprocessing: Scripts for data cleaning, transformation, and feature engineering.
- Model: Scripts for defining and training machine learning models.
- Evaluation: Scripts for model evaluation, metrics calculation, and validation.
- Utilities: General-purpose utility scripts or helper functions.
5. Models: This folder is dedicated to storing trained models or model checkpoints. It can be further organized into subfolders based on different experiments, versions, or architectures.
6. Documentation: Include any project-related documentation, such as README files, data dictionaries, or project specifications.
7. Results: Store output files, reports, or visualizations generated by your models or experiments.
8. Config: Store configuration files or parameters used in your project, such as hyperparameters, model configurations, or experiment settings.
9. Environment: Include files related to the project environment, such as requirements.txt or environment.yml, specifying the dependencies and packages required to run your project.
10. Tests: If you have unit tests or integration tests for your code, you can create a folder to store them.
11. Logs: Store log files or output logs generated during training or inference.
12. Saved Objects: If your project involves saving intermediate objects or serialized data, such as pickled files or serialized models, you can create a folder to store them.
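After creating the folders by hand, it is easy to miss one. The snippet below is a minimal sketch of a check for that. The folder names in EXPECTED_FOLDERS are assumptions (lowercased versions of the list above, with "docs" standing in for Documentation), and missing_folders is a hypothetical helper, so adjust both to your own naming.

```python
from pathlib import Path

# Assumed folder names, derived from the list above; rename to match your project.
EXPECTED_FOLDERS = [
    "data/raw",
    "data/processed",
    "data/external",
    "notebooks/exploratory",
    "notebooks/modeling",
    "notebooks/inference",
    "scripts/preprocessing",
    "scripts/model",
    "scripts/evaluation",
    "scripts/utilities",
    "models",
    "docs",
    "results",
    "config",
    "tests",
    "logs",
    "saved_objects",
]


def missing_folders(project_root, expected=EXPECTED_FOLDERS):
    """Return the expected folders that do not exist under project_root."""
    root = Path(project_root)
    return [folder for folder in expected if not (root / folder).is_dir()]
```

Calling missing_folders(".") from the project root returns the folders that still need to be created; an empty list means the layout is complete.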
...where machines command and algorithms dictate, we venture into a realm free from human intervention.
Template.py
The template.py file serves as a foundational blueprint for a machine learning project. It is a short Python script that lists the files and folders the project needs and creates any that are missing, leaving empty placeholders that you can customize and expand upon to build specific functionality.
Below is an example that I commonly use. Copy the code, save it as template.py, then run it.
import os
from pathlib import Path
import logging

# Log each action with a timestamp.
logging.basicConfig(level=logging.INFO, format='[%(asctime)s]: %(message)s:')

project_name = "textSummarizer"

# Every file the project skeleton should contain, relative to the project root.
list_of_files = [
    ".github/workflows/.gitkeep",
    f"src/{project_name}/__init__.py",
    f"src/{project_name}/components/__init__.py",
    f"src/{project_name}/utils/__init__.py",
    f"src/{project_name}/utils/common.py",
    f"src/{project_name}/logging/__init__.py",
    f"src/{project_name}/config/__init__.py",
    f"src/{project_name}/config/configuration.py",
    f"src/{project_name}/pipeline/__init__.py",
    f"src/{project_name}/entity/__init__.py",
    f"src/{project_name}/constants/__init__.py",
    "config/config.yaml",
    "params.yaml",
    "app.py",
    "main.py",
    "Dockerfile",
    "requirements.txt",
    "setup.py",
    "research/trials.ipynb",
]

for filepath in list_of_files:
    filepath = Path(filepath)
    filedir, filename = os.path.split(filepath)

    # Create the parent directory, if the file has one.
    if filedir != "":
        os.makedirs(filedir, exist_ok=True)
        logging.info(f"Creating directory: {filedir} for the file {filename}")

    # Create the file only if it is missing or empty, so existing work is not overwritten.
    if (not os.path.exists(filepath)) or (os.path.getsize(filepath) == 0):
        with open(filepath, 'w') as f:
            pass
        logging.info(f"Creating empty file: {filepath}")
    else:
        logging.info(f"{filename} already exists")
Once the script has run, your folder structure should mirror the entries in list_of_files.
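One way to inspect what the script produced is to print the directory tree. The helper below is a minimal sketch using only the standard library; render_tree is a hypothetical name and is not part of template.py.

```python
from pathlib import Path


def render_tree(root, prefix=""):
    """Return the lines of an indented listing of everything under root."""
    lines = []
    # Sort by name so the listing is stable across runs and platforms.
    entries = sorted(Path(root).iterdir(), key=lambda p: p.name)
    for entry in entries:
        # Mark directories with a trailing slash.
        lines.append(f"{prefix}{entry.name}{'/' if entry.is_dir() else ''}")
        if entry.is_dir():
            # Recurse with one extra level of indentation.
            lines.extend(render_tree(entry, prefix + "    "))
    return lines
```

For example, print("\n".join(render_tree("src"))) lists the package folders created under src, assuming you run it from the project root after template.py has finished.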
Cookiecutter
1: Make sure that you have the latest Python and pip installed in your environment.
2: Install cookiecutter
pip install cookiecutter
3: Create a sample repository on github.com (e.g., my-test)
Note: Don't check any options under "Initialize this repository with:" while creating a repository.
4: Create a project structure
Go to a folder where you want to set up the project in your local system and run the following:
cookiecutter -c v1 https://github.com/drivendata/cookiecutter-data-science
If you have used this template before, the command first asks whether to download it again:
You've downloaded \.cookiecutters\cookiecutter-data-science before. Is it okay to delete and re-download it? [yes]: yes
It then prompts for the following options:
project_name [project_name]: my-test
repo_name [my-test]: my-test
author_name [Your name (or your organization/company/team)]: Your name
description [A short description of the project.]: This is a test proj
Select open_source_license:
1 - MIT
2 - BSD-3-Clause
3 - No license file
Choose from 1, 2, 3 [1]: 1
s3_bucket [[OPTIONAL] your-bucket-for-syncing-data (do not include 's3://')]:
aws_profile [default]:
Select python_interpreter:
1 - python3
2 - python
Choose from 1, 2 [1]: 1
You can ignore the "s3_bucket" and "aws_profile" options.
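The same answers can also be supplied from Python rather than typed at the prompts, which helps when you create projects repeatedly. The sketch below assumes the cookiecutter package is installed and uses its cookiecutter() function with the checkout, no_input, extra_context, and output_dir arguments. The helper names build_context and generate_project are hypothetical, and the context keys are taken from the prompts shown above.

```python
TEMPLATE_URL = "https://github.com/drivendata/cookiecutter-data-science"


def build_context(project_name, author_name, description):
    """Collect the answers that would otherwise be typed at the prompts."""
    return {
        "project_name": project_name,
        "repo_name": project_name,  # same value as in the walkthrough above
        "author_name": author_name,
        "description": description,
        "open_source_license": "MIT",
        "python_interpreter": "python3",
        # s3_bucket and aws_profile are left at the template defaults.
    }


def generate_project(context, output_dir="."):
    """Render the v1 template non-interactively and return the project path."""
    # Imported here so this module loads even where cookiecutter is absent.
    from cookiecutter.main import cookiecutter

    return cookiecutter(
        TEMPLATE_URL,
        checkout="v1",          # equivalent of the -c v1 flag
        no_input=True,          # do not prompt; use extra_context and defaults
        extra_context=context,
        output_dir=output_dir,
    )
```

A call such as generate_project(build_context("my-test", "Your name", "This is a test proj")) should be equivalent to answering the prompts as shown above; it needs network access to fetch the template.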
5: Add the project to the git repository
cd my-test
# Initialize the git repository
git init
# Add all the files and folders
git add .
# Commit the files
git commit -m "Initialized the repo with cookiecutter data science structure"
# Set the remote repo URL
git remote add origin https://github.com/your_user_id/my-test.git
git remote -v
# Push the changes from the local repo to GitHub
git push origin master
The final structure should look like below:
The data folder stays in your local project folder and won't appear on GitHub, because it is listed in the .gitignore file.
Remember, these are just suggested structures, and you can modify them according to your specific needs and preferences. The key is to maintain a logical and organized layout that makes it easy to navigate and understand your project.