Why we need document layout analysis

Document layout analysis identifies the regions of a page (text blocks, tables, figures, headers, and so on) so that its content can be interpreted correctly. It has become more important with the rise of retrieval-augmented generation (RAG), which depends on parsing documents accurately.

RAG systems often ingest many kinds of documents. Scientific papers, for instance, typically have a complex layout that includes figures, tables, references, and structured sections. If the parser mixes these elements up, for example by interleaving a table with body text, the LLM receives scrambled input and its answers suffer: garbage in, garbage out.

However, document layouts vary so much that handcrafted rules are impractical to write and maintain. A more robust approach is to train a machine learning model.

My solution

You can find my solution in yolo-doclaynet. After examining several models and datasets, I chose YOLO as the base model and DocLayNet as the training data. Here is why:

YOLO is a widely used family of real-time object detection models. The YOLOv8 version used here is maintained by Ultralytics. It is easy to train, evaluate, and deploy, and its smaller variants are compact enough to run in a browser or on a smartphone.
DocLayNet is a human-annotated document layout segmentation dataset containing 80,863 pages from a wide variety of document sources. To the best of my knowledge, it is the highest-quality dataset for document layout analysis. You can download it and find more information at this link.

Live Demo

Download a pretrained model from Hugging Face. The options include yolov8n-doclaynet, yolov8s-doclaynet, and yolov8m-doclaynet.
Install Ultralytics by running pip install ultralytics. If you encounter any issues, please refer to https://docs.ultralytics.com/quickstart/#install-ultralytics.

Copy and modify this code snippet to output the detection result:

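import cv2
from ultralytics import YOLO

# Placeholders: replace with your own image and the model file you downloaded.
your_image_path = "path/to/your/image.png"
your_model_path = "path/to/your/model.pt"

img = cv2.imread(your_image_path, cv2.IMREAD_COLOR)  # BGR image as a NumPy array
model = YOLO(your_model_path)
result = model.predict(img)[0]  # predict() returns one result per input image
print(result)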
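
For downstream use, such as feeding a RAG pipeline, it usually helps to turn the raw detections into plain records. The following is a minimal sketch, not part of yolo-doclaynet: the Region type, the to_regions helper, and the 0.5 confidence cut-off are hypothetical names and values chosen for illustration, and the demo uses synthetic numbers in place of a real prediction. The top-to-bottom sort is a naive reading order that will not handle multi-column pages correctly.

```python
from typing import Dict, List, NamedTuple, Sequence


class Region(NamedTuple):
    """One detected layout region (hypothetical record type)."""
    label: str
    confidence: float
    box: List[float]  # [x1, y1, x2, y2] in pixels


def to_regions(
    class_ids: Sequence[float],
    confidences: Sequence[float],
    boxes: Sequence[Sequence[float]],
    names: Dict[int, str],
    min_confidence: float = 0.5,  # assumed cut-off; tune it for your documents
) -> List[Region]:
    """Convert parallel detection lists into records sorted top-to-bottom, then left-to-right."""
    regions = [
        Region(names[int(c)], float(p), [float(v) for v in b])
        for c, p, b in zip(class_ids, confidences, boxes)
        if p >= min_confidence
    ]
    return sorted(regions, key=lambda r: (r.box[1], r.box[0]))


# Synthetic values standing in for a real prediction.
demo = to_regions(
    class_ids=[9.0, 10.0, 6.0],
    confidences=[0.91, 0.88, 0.30],
    boxes=[[50, 200, 500, 400], [50, 40, 500, 90], [50, 420, 300, 600]],
    names={6: "Picture", 9: "Text", 10: "Title"},
)
for region in demo:
    print(region.label, region.confidence, region.box)
```

With a real prediction, the same helper could be called as to_regions(result.boxes.cls.tolist(), result.boxes.conf.tolist(), result.boxes.xyxy.tolist(), result.names).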

Debugging the model can be challenging when you can only check the plain-text output. Fortunately, visualizing the results is simple:

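import os

from ultralytics.utils.plotting import Annotator, Colors

# xyxyn boxes are normalized to [0, 1], so scale them back to pixels.
height, width = img.shape[:2]
line_width, font_size = 2, 12  # example values; adjust for your image resolution

colors = Colors()
annotator = Annotator(img, line_width=line_width, font_size=font_size)
for label, box in zip(result.boxes.cls.tolist(), result.boxes.xyxyn.tolist()):
    label = int(label)
    annotator.box_label(
        [box[0] * width, box[1] * height, box[2] * width, box[3] * height],
        result.names[label],
        color=colors(label, bgr=True),
    )
# Save next to the input image with an "annotated-" prefix.
annotator.save(
    os.path.join(os.path.dirname(your_image_path), "annotated-" + os.path.basename(your_image_path))
)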

Examine the annotated image. Here's an example:

Benchmark

I also evaluated the mAP50-95 of the entire YOLOv8 series on the DocLayNet test set.

label	images	boxes	yolov8n	yolov8s	yolov8m	yolov8l	yolov8x
Caption	4983	1542	0.682	0.721	0.746	0.750	0.753
Footnote	4983	387	0.614	0.669	0.696	0.702	0.717
Formula	4983	1966	0.655	0.695	0.723	0.750	0.747
List-item	4983	10521	0.789	0.818	0.836	0.841	0.841
Page-footer	4983	3987	0.588	0.610	0.640	0.641	0.655
Page-header	4983	3365	0.707	0.754	0.769	0.776	0.784
Picture	4983	3497	0.723	0.762	0.789	0.796	0.805
Section-header	4983	8544	0.709	0.727	0.742	0.750	0.748
Table	4983	2394	0.820	0.854	0.880	0.885	0.886
Text	4983	29917	0.845	0.860	0.876	0.878	0.877
Title	4983	334	0.762	0.806	0.830	0.846	0.840
All	4983	66454	0.718	0.752	0.775	0.783	0.787
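
The All row is consistent with an unweighted average over the 11 classes, which is how mAP is usually defined. The short check below recomputes that average from the per-class values in the table. It is a sanity check on the reported numbers, not a description of how the evaluation script computes them, and macro_average is a hypothetical helper name.

```python
# Per-class mAP50-95 copied from the table, in row order (Caption ... Title).
per_class = {
    "yolov8n": [0.682, 0.614, 0.655, 0.789, 0.588, 0.707, 0.723, 0.709, 0.820, 0.845, 0.762],
    "yolov8s": [0.721, 0.669, 0.695, 0.818, 0.610, 0.754, 0.762, 0.727, 0.854, 0.860, 0.806],
    "yolov8m": [0.746, 0.696, 0.723, 0.836, 0.640, 0.769, 0.789, 0.742, 0.880, 0.876, 0.830],
    "yolov8l": [0.750, 0.702, 0.750, 0.841, 0.641, 0.776, 0.796, 0.750, 0.885, 0.878, 0.846],
    "yolov8x": [0.753, 0.717, 0.747, 0.841, 0.655, 0.784, 0.805, 0.748, 0.886, 0.877, 0.840],
}
# Values from the "All" row of the table.
reported_all = {"yolov8n": 0.718, "yolov8s": 0.752, "yolov8m": 0.775, "yolov8l": 0.783, "yolov8x": 0.787}


def macro_average(values):
    """Unweighted mean over classes, rounded to three decimals like the table."""
    return round(sum(values) / len(values), 3)


for model_name, values in per_class.items():
    print(model_name, macro_average(values), reported_all[model_name])
```

Because the average is unweighted, a rare class such as Title (334 boxes) counts as much toward the All score as Text (29917 boxes).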

Here is an overview of the mAP50-95 performance with different model sizes.
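
For readers unfamiliar with the metric: mAP50-95 averages the average precision over ten intersection-over-union (IoU) thresholds, from 0.50 to 0.95 in steps of 0.05, so a prediction has to overlap the ground truth tightly to count at the higher thresholds. The sketch below shows only this matching criterion, not the full AP computation, and the two boxes are made-up values.

```python
def iou(a, b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds used by mAP50-95: 0.50, 0.55, ..., 0.95.
thresholds = [round(0.50 + 0.05 * i, 2) for i in range(10)]

predicted = [100, 100, 300, 200]     # made-up predicted box
ground_truth = [120, 110, 300, 200]  # made-up annotated box

score = iou(predicted, ground_truth)
passed = [t for t in thresholds if score >= t]
print(f"IoU = {score:.2f}, counts as a match at {len(passed)} of {len(thresholds)} thresholds")
```

Here the prediction is a match at the looser thresholds but not at 0.85 and above, which is why mAP50-95 is typically lower than mAP50 for the same model.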

How to analyze document layout by YOLO
