Step-by-Step Guide: Loading a HuggingFace ControlNet Dataset from a Local Path

Daniel Gattringer - Aug 15 - - Dev Community

Huggingface provides different options for loading a dataset. When loading a local image dataset for your ControlNet, it's important to consider aspects such as dataset structure, file paths, and compatibility with Huggingface's data handling tools.


Let's assume you've already created your conditioning images and you have the following folder structure:

my_dataset/
├── README.md
└──data/
   ├── captions.jsonl
   ├── conditioning_images
   │   ├── 00001.jpg
   │   └── 00002.jpg
   └── images
       ├── 00001.jpg
       └── 00002.jpg
Enter fullscreen mode Exit fullscreen mode

In this structure, the conditioning_images folder stores your conditioning images, while the images folder contains the target images for your ControlNet. The captions.jsonl file holds the captions linked to these images.

{"image": "images/00001.jpg", "text": "This is the caption of the first image."}
{"image": "images/00002.jpg", "text": "This is the caption of the second image."}
Enter fullscreen mode Exit fullscreen mode

Note
The caption file (or the following metadata file) can also be a csv file. However, if you opt for CSV, be careful of the value separator, as the text may contain commas, which could lead to parsing issues.

Create Metadata File

A metadata file is a good way to provide additional information about your dataset. It can include various types of data, such as bounding boxes, categories, text, or in our case, a path to the conditioning image.

Let's create the metadata.jsonl file:

import json
from pathlib import Path

def create_metadata(data_dir, output_file):
    metadata = []
    try:
        with open(f"{data_dir}/captions.jsonl", "r") as f:
            for line in f:
                data = json.loads(line)
                file_name = Path(data["image"]).name
                metadata.append(
                    {
                        "image": data["image"],
                        "conditioning_image": f"conditioning_images/{file_name}",
                        "text": data["text"],
                    }
                )

        with open(f"{data_dir}/metadata.jsonl", "w") as f:
            for line in metadata:
                f.write(json.dumps(line) + "\n")
    except (FileNotFoundError, json.JSONDecodeError) as e:
        print(f"Error processing data: {e}")

# Example usage:
data_dir = "my_dataset/data"
create_metadata(data_dir)
Enter fullscreen mode Exit fullscreen mode

This will create a metadata.jsonl containing all the information we need for our ControlNet. Each line in the file corresponds to an image, a conditioning image, and the associated text caption.

{"image": "images/00001.jpg", "conditioning_image": "conditioning_images/00001.jpg", "text": "This is the caption of the first image."}
{"image": "images/00002.jpg", "conditioning_image": "conditioning_images/00002.jpg", "text": "This is the caption of the second image."}
Enter fullscreen mode Exit fullscreen mode

Once you've created the metadata.jsonl file, your file structure should look like this:

my_dataset/
├── README.md
└──data/
   ├── captions.jsonl
   ├── metadata.jsonl
   ├── conditioning_images
   │   ├── 00001.jpg
   │   └── 00002.jpg
   └── images
       ├── 00001.jpg
       └── 00002.jpg
Enter fullscreen mode Exit fullscreen mode

Create a Loading Script

Finally, we must create a loading script that handles all the data in the metadata.jsonl file. The script should be located in the same directory as the dataset and should have the same name.

Your directory structure should look like this:

my_dataset/
├── README.md
├── my_dataset.py
└──data/
   ├── captions.jsonl
   ├── metadata.jsonl
   ├── conditioning_images
   │   ├── 00001.jpg
   │   └── 00002.jpg
   └── images
       ├── 00001.jpg
       └── 00002.jpg
Enter fullscreen mode Exit fullscreen mode

For the script, we need to implement a class that inherits from GeneratorBasedBuilder and contains these three methods:

  • _info stores information about your dataset.
  • _split_generators defines the splits.
  • _generate_examples generates the images and labels for each split.
import datasets

class MyDataset(datasets.GeneratorBasedBuilder):

    def _info(self):

    def _split_generators(self, dl_manager):

    def _generate_examples(self, metadata_path, images_dir, conditioning_images_dir):
Enter fullscreen mode Exit fullscreen mode

Adding Dataset Metadata

There are many options for specifying information about your dataset, but the most important ones are:

  • features specify the dataset column types.
    • image is an Image feature
    • conditioning_image is an Image feature
    • text is a string value
  • supervised_keys that specify the input features.
# Global variables
_DESCRIPTION = "TODO"
_HOMEPAGE = "TODO"
_LICENSE = "TODO"
_CITATION = "TODO"

_FEATURES = datasets.Features(
    {
        "image": datasets.Image(),
        "conditioning_image": datasets.Image(),
        "text": datasets.Value("string"),
    },
)
Enter fullscreen mode Exit fullscreen mode

As you can see above, I've set some variables to "TODO". These options are for informational purposes only and do not affect loading.

def _info(self):
    return datasets.DatasetInfo(
                description=_DESCRIPTION,
                features=_FEATURES,
                supervised_keys=("conditioning_image", "text"),
                homepage=_HOMEPAGE,
                license=_LICENSE,
                citation=_CITATION,
            )
Enter fullscreen mode Exit fullscreen mode

Define the Dataset Splits

dl_manager is used to download a dataset from a Huggingface repo but here we use it to get the data directory path which is passed in the load_dataset function.

Here we define the local paths to our data

  • metadata_path path of the metadata.jsonl file
  • images_dir path to the images
  • conditioning_images_dir path to the conditioning images

Note
If you've chosen different names for your folder structure, you may need to adjust the metadata_path, images_dir, and conditioning_images_dir variables.

def _split_generators(self, dl_manager):
    base_path = Path(dl_manager._base_path).resolve()
    metadata_path = base_path / "data" / "metadata.jsonl"
    images_dir = base_path / "data"
    conditioning_images_dir = base_path / "data"

    return [
        datasets.SplitGenerator(
            name=datasets.Split.TRAIN,
            # These kwargs will be passed to _generate_examples
            gen_kwargs={
                "metadata_path": str(metadata_path),
                "images_dir": str(images_dir),
                "conditioning_images_dir": str(conditioning_images_dir),
            },
        ),
    ]
Enter fullscreen mode Exit fullscreen mode

The last method loads the matadata.jsonl file and generates the image and its associated conditioning image and text.

@staticmethod
def load_jsonl(path):
    """Generator to load jsonl file."""
    with open(path, "r") as f:
        for line in f:
            yield json.loads(line)

def _generate_examples(self, metadata_path, images_dir, conditioning_images_dir):
    for row in self.load_jsonl(metadata_path):
        text = row["text"]

        image_path = row["image"]
        image_path = os.path.join(images_dir, image_path)
        image = open(image_path, "rb").read()

        conditioning_image_path = row["conditioning_image"]
        conditioning_image_path = os.path.join(
            conditioning_images_dir, row["conditioning_image"]
        )
        conditioning_image = open(conditioning_image_path, "rb").read()
        yield row["image"], {
            "text": text,
            "image": {
                "path": image_path,
                "bytes": image,
            },
            "conditioning_image": {
                "path": conditioning_image_path,
                "bytes": conditioning_image,
            },
        }
Enter fullscreen mode Exit fullscreen mode

Following these steps, you can load a ControlNet dataset from a local path.

# with the loading script, we can load the dataset
ds = load_dataset("my_dataset")

# (optional)
# pass trust_remote_code=True to avoid the warning about custom code
# ds = load_dataset("my_dataset", trust_remote_code=True)
Enter fullscreen mode Exit fullscreen mode

If you have any questions, feel free to leave a comment below.


Full code for the loading script:

import os
import json
import datasets
from pathlib import Path

_VERSION = datasets.Version("0.0.2")

_DESCRIPTION = "TODO"
_HOMEPAGE = "TODO"
_LICENSE = "TODO"
_CITATION = "TODO"

_FEATURES = datasets.Features(
    {
        "image": datasets.Image(),
        "conditioning_image": datasets.Image(),
        "text": datasets.Value("string"),
    },
)

_DEFAULT_CONFIG = datasets.BuilderConfig(name="default", version=_VERSION)

class MyDataset(datasets.GeneratorBasedBuilder):
    BUILDER_CONFIGS = [_DEFAULT_CONFIG]
    DEFAULT_CONFIG_NAME = "default"

    def _info(self):
        return datasets.DatasetInfo(
            description=_DESCRIPTION,
            features=_FEATURES,
            supervised_keys=("conditioning_image", "text"),
            homepage=_HOMEPAGE,
            license=_LICENSE,
            citation=_CITATION,
        )

    def _split_generators(self, dl_manager):
        base_path = Path(dl_manager._base_path)
        metadata_path = base_path / "data" / "metadata.jsonl"
        images_dir = base_path / "data"
        conditioning_images_dir = base_path / "data"

        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                # These kwargs will be passed to _generate_examples
                gen_kwargs={
                    "metadata_path": metadata_path,
                    "images_dir": images_dir,
                    "conditioning_images_dir": conditioning_images_dir,
                },
            ),
        ]

    @staticmethod
    def load_jsonl(path):
        """Generator to load jsonl file."""
        with open(path, "r") as f:
            for line in f:
                yield json.loads(line)

    def _generate_examples(self, metadata_path, images_dir, conditioning_images_dir):
        for row in self.load_jsonl(metadata_path):
            text = row["text"]

            image_path = row["image"]
            image_path = os.path.join(images_dir, image_path)
            image = open(image_path, "rb").read()

            conditioning_image_path = row["conditioning_image"]
            conditioning_image_path = os.path.join(
                conditioning_images_dir, row["conditioning_image"]
            )
            conditioning_image = open(conditioning_image_path, "rb").read()
            yield row["image"], {
                "text": text,
                "image": {
                    "path": image_path,
                    "bytes": image,
                },
                "conditioning_image": {
                    "path": conditioning_image_path,
                    "bytes": conditioning_image,
                },
            }
Enter fullscreen mode Exit fullscreen mode
.
Terabox Video Player