How to analyze document layout with YOLO

ppaanngggg - Jun 13 - - Dev Community

Why we need document layout analysis

Document layout analysis matters because interpreting a document correctly depends on understanding how its content is arranged. It has become more important with the rise of retrieval-augmented generation (RAG), which relies heavily on parsing documents accurately.

RAG systems frequently interact with a variety of documents. Scientific papers, for instance, typically have a complex layout that includes figures, tables, references, and structured sections. Parsing them properly keeps the extracted content from getting scrambled. If parsing goes wrong, the LLM receives bad context and gives bad answers: garbage in, garbage out.

However, layouts vary so much that handcrafted rules are impractical as a general solution. A better approach is to train a machine learning model.

My solution

You can find my solution in yolo-doclaynet. After examining several models and datasets, I've chosen YOLO as the base model and DocLayNet as the training data. Let's delve into more details.

  1. YOLO is one of the most widely used families of object detection models. YOLOv8, the version used here, is maintained by Ultralytics. The model is easy to train, evaluate, and deploy. Plus, the smaller variants are compact enough to run in a browser or on a smartphone.
  2. DocLayNet is a human-annotated document layout segmentation dataset, containing 80,863 pages from a wide variety of document sources. To the best of my knowledge, it is the highest-quality dataset for document layout analysis. You can download and find more information from this link.
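
One practical detail when pairing the two: Ultralytics trains on YOLO-format labels, where each box is a class index followed by center-x, center-y, width, and height normalized to [0, 1], while the DocLayNet core release ships COCO-style annotations, where a box is [x_min, y_min, width, height] in pixels. The sketch below shows what such a conversion could look like. The function names, the class index, and the sample numbers are hypothetical, and this is not necessarily how yolo-doclaynet prepares its data.

```python
def coco_to_yolo(bbox, img_width, img_height):
    """Convert a COCO box [x_min, y_min, w, h] in pixels
    to a YOLO box (cx, cy, w, h) normalized to [0, 1]."""
    x_min, y_min, w, h = bbox
    cx = (x_min + w / 2) / img_width
    cy = (y_min + h / 2) / img_height
    return cx, cy, w / img_width, h / img_height


def yolo_label_line(class_index, bbox, img_width, img_height):
    """Format one annotation as a single line of a YOLO label file."""
    cx, cy, w, h = coco_to_yolo(bbox, img_width, img_height)
    return f"{class_index} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


# Hypothetical annotation: a 200x100 px box whose top-left corner is at
# (100, 50) on a 1000x1000 px page. The class index 9 is arbitrary.
print(yolo_label_line(9, [100, 50, 200, 100], 1000, 1000))
```

Each page image would get one label file with one such line per annotated box.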

Live Demo

  1. Download a pretrained model from Hugging Face. Your options include yolov8n-doclaynet, yolov8s-doclaynet, and yolov8m-doclaynet.
  2. Install Ultralytics by executing pip install ultralytics. If you encounter any issues, please refer to https://docs.ultralytics.com/quickstart/#install-ultralytics.
  3. Copy and modify this code snippet to output the detection result:

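    import cv2
    from ultralytics import YOLO

    # your_image_path and your_model_path are placeholders: point them at a
    # page image and at one of the downloaded yolov8*-doclaynet weight files.
    img = cv2.imread(your_image_path, cv2.IMREAD_COLOR)  # BGR NumPy array
    model = YOLO(your_model_path)
    result = model.predict(img)[0]  # predict() returns one result per image
    print(result)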
    
  4. Debugging the model can be challenging when you can only check the plain-text output. Fortunately, visualizing the results is simple:

    import os

    from ultralytics.utils.plotting import Annotator, Colors

    colors = Colors()
    # Image size in pixels, needed to scale the normalized (xyxyn) boxes.
    height, width = img.shape[:2]
    # None keeps the Annotator defaults; pass numbers to override them.
    annotator = Annotator(img, line_width=None, font_size=None)
    for label, box in zip(result.boxes.cls.tolist(), result.boxes.xyxyn.tolist()):
        label = int(label)
        annotator.box_label(
            [box[0] * width, box[1] * height, box[2] * width, box[3] * height],
            result.names[label],
            color=colors(label, bgr=True),
        )
    # Save next to the input image, with an "annotated-" prefix.
    annotator.save(
        os.path.join(os.path.dirname(your_image_path), "annotated-" + os.path.basename(your_image_path))
    )
    
  5. Examine the annotated image. Here's an example:

yolo-doclaynet output example
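
For a RAG pipeline, the detections usually need to end up as plain data rather than as a picture. The sketch below is one hedged way to do that. It takes the same inputs used in step 4 (the lists from `result.boxes.cls.tolist()` and `result.boxes.xyxyn.tolist()`, plus `result.names`), scales the normalized boxes to pixels, and orders the regions top to bottom, then left to right. The function name `to_regions`, the label mapping, and the sample values are hypothetical, and the simple sort is an assumption that only suits single-column pages.

```python
def to_regions(class_ids, boxes_xyxyn, names, width, height):
    """Turn class ids and normalized [x1, y1, x2, y2] boxes into a list of
    {"label", "box"} dicts with pixel coordinates, sorted top to bottom."""
    regions = []
    for class_id, (x1, y1, x2, y2) in zip(class_ids, boxes_xyxyn):
        regions.append(
            {
                "label": names[int(class_id)],
                "box": [
                    round(x1 * width),
                    round(y1 * height),
                    round(x2 * width),
                    round(y2 * height),
                ],
            }
        )
    # Naive reading order: by top edge first, then by left edge.
    regions.sort(key=lambda r: (r["box"][1], r["box"][0]))
    return regions


# Hypothetical stand-ins for result.names and the two .tolist() outputs.
names = {0: "Text", 1: "Title"}
class_ids = [0.0, 1.0]
boxes = [[0.1, 0.5, 0.9, 0.75], [0.1, 0.125, 0.9, 0.25]]

regions = to_regions(class_ids, boxes, names, width=1000, height=800)
for region in regions:
    print(region["label"], region["box"])
```

With real model output, `width` and `height` would come from the image, as in the visualization snippet.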

Benchmark

I also evaluated the mAP50-95 of the entire YOLOv8 series on the DocLayNet test set.

label           images  boxes  yolov8n  yolov8s  yolov8m  yolov8l  yolov8x
Caption           4983   1542    0.682    0.721    0.746    0.750    0.753
Footnote          4983    387    0.614    0.669    0.696    0.702    0.717
Formula           4983   1966    0.655    0.695    0.723    0.750    0.747
List-item         4983  10521    0.789    0.818    0.836    0.841    0.841
Page-footer       4983   3987    0.588    0.610    0.640    0.641    0.655
Page-header       4983   3365    0.707    0.754    0.769    0.776    0.784
Picture           4983   3497    0.723    0.762    0.789    0.796    0.805
Section-header    4983   8544    0.709    0.727    0.742    0.750    0.748
Table             4983   2394    0.820    0.854    0.880    0.885    0.886
Text              4983  29917    0.845    0.860    0.876    0.878    0.877
Title             4983    334    0.762    0.806    0.830    0.846    0.840
All               4983  66454    0.718    0.752    0.775    0.783    0.787
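
A note on reading the "All" row: for every model it matches the unweighted mean of the 11 per-class scores, so a rare class such as Title (334 boxes) counts as much as Text (29,917 boxes). The check below recomputes it from the table values. That the evaluation aggregates this way is inferred from the numbers rather than taken from the evaluation code.

```python
# Per-class mAP50-95 copied from the table, in row order
# (Caption, Footnote, ..., Title).
per_class = {
    "yolov8n": [0.682, 0.614, 0.655, 0.789, 0.588, 0.707, 0.723, 0.709, 0.82, 0.845, 0.762],
    "yolov8s": [0.721, 0.669, 0.695, 0.818, 0.61, 0.754, 0.762, 0.727, 0.854, 0.86, 0.806],
    "yolov8m": [0.746, 0.696, 0.723, 0.836, 0.64, 0.769, 0.789, 0.742, 0.88, 0.876, 0.83],
    "yolov8l": [0.75, 0.702, 0.75, 0.841, 0.641, 0.776, 0.796, 0.75, 0.885, 0.878, 0.846],
    "yolov8x": [0.753, 0.717, 0.747, 0.841, 0.655, 0.784, 0.805, 0.748, 0.886, 0.877, 0.84],
}
# The "All" row as reported in the table.
reported_all = {
    "yolov8n": 0.718,
    "yolov8s": 0.752,
    "yolov8m": 0.775,
    "yolov8l": 0.783,
    "yolov8x": 0.787,
}

for model, scores in per_class.items():
    mean = sum(scores) / len(scores)  # unweighted mean over the 11 classes
    print(f"{model}: mean={mean:.3f} reported={reported_all[model]:.3f}")
```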

Here is an overview of the mAP50-95 performance with different model sizes.

performance between yolov8 model sizes
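
For readers new to the metric: mAP50-95 averages the average precision over ten IoU (intersection over union) thresholds, from 0.50 to 0.95 in steps of 0.05, so a predicted box only counts at the stricter thresholds if it overlaps the ground truth tightly. The sketch below computes the IoU of two hypothetical boxes and lists the thresholds at which the prediction would still count as a match. It illustrates the idea only and is not the evaluation code.

```python
def iou(a, b):
    """Intersection over union of two boxes given as [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# The ten IoU thresholds behind mAP50-95: 0.50, 0.55, ..., 0.95.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

# Hypothetical boxes: the prediction misses a 25 px strip on the left.
ground_truth = [100, 100, 300, 200]
prediction = [125, 100, 300, 200]

overlap = iou(ground_truth, prediction)
matched = [t for t in thresholds if overlap >= t]
print(f"IoU = {overlap:.3f}")
print("counts as a match at thresholds:", matched)
```

A slightly loose box like this one passes the lenient thresholds but fails the strictest ones, which is why mAP50-95 scores are lower than mAP50 scores for the same model.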
