Why we need document layout analysis
Analyzing document layout is critical because the layout determines how the content of a document should be read and interpreted. It has become even more significant with the rise of retrieval-augmented generation (RAG), which relies heavily on the ability to parse documents accurately.
RAG systems frequently interact with a variety of documents. Scientific papers, for instance, typically have a complex layout that includes figures, tables, references, and structured sections. Proper parsing is important to avoid content disarray. If not done correctly, the LLM could fail due to the 'garbage in, garbage out' principle.
However, the problem is complex enough that handcrafted rules are impractical as a general solution. A more robust approach is to train a machine learning model.
My solution
You can find my solution in yolo-doclaynet. After examining several models and datasets, I've chosen `YOLO` as the base model and `DocLayNet` as the training data. Let's delve into more details.
- `YOLO` is one of the most widely used families of real-time object detection models. YOLOv8, the version used here, is maintained by Ultralytics, a leading computer vision team. The model is easy to train, evaluate, and deploy. Plus, its size is compact enough to run in a browser or on a smartphone.
- `DocLayNet` is a human-annotated document layout segmentation dataset containing 80,863 pages from a wide variety of document sources. To the best of my knowledge, it is the highest-quality dataset for document layout analysis. You can download it and find more information from this link.
Live Demo
- Download the pretrained model from huggingface. Your options include `yolov8n-doclaynet`, `yolov8s-doclaynet`, and `yolov8m-doclaynet`.
- Install `Ultralytics` by executing `pip install ultralytics`. If you encounter any issues, please refer to https://docs.ultralytics.com/quickstart/#install-ultralytics.
- Copy and modify this code snippet to output the detection result:

      import cv2
      from ultralytics import YOLO

      # Replace both placeholders with your own paths
      img = cv2.imread(your_image_path, cv2.IMREAD_COLOR)
      model = YOLO(your_model_path)
      result = model.predict(img)[0]  # one Results object per input image
      print(result)

- Debugging the model can be challenging when you can only check the plain-text output. Fortunately, visualizing the results is simple:

      import os
      from ultralytics.utils.plotting import Annotator, Colors

      height, width = img.shape[:2]
      line_width, font_size = 2, 12  # example values; adjust to your image size
      colors = Colors()
      annotator = Annotator(img, line_width=line_width, font_size=font_size)
      for label, box in zip(result.boxes.cls.tolist(), result.boxes.xyxyn.tolist()):
          label = int(label)
          # xyxyn is normalized to [0, 1], so scale it back to pixel coordinates
          annotator.box_label(
              [box[0] * width, box[1] * height, box[2] * width, box[3] * height],
              result.names[label],
              color=colors(label, bgr=True),
          )
      # Save next to the input image with an "annotated-" prefix
      annotator.save(
          os.path.join(os.path.dirname(your_image_path), "annotated-" + os.path.basename(your_image_path))
      )
Examine the annotated image. Here's an example:

[Image: example annotated page]
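Once the boxes look right, the detections usually need to be turned into plain records before they are useful to a downstream RAG pipeline. The following is a minimal sketch of one way to do that, not part of yolo-doclaynet itself. The `Region` class, the `to_regions` helper, the `SKIP_LABELS` set, and the 0.25 confidence cutoff are all assumptions made for illustration, and the top-to-bottom sort is a naive reading order that only suits single-column pages.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str
    confidence: float
    box: tuple  # (x1, y1, x2, y2) in pixels


# Labels treated as page furniture and dropped; this choice is an assumption.
SKIP_LABELS = {"Page-header", "Page-footer"}


def to_regions(labels, confidences, boxes, min_confidence=0.25):
    """Filter detections and sort them top-to-bottom, then left-to-right."""
    regions = [
        Region(label, conf, tuple(box))
        for label, conf, box in zip(labels, confidences, boxes)
        if conf >= min_confidence and label not in SKIP_LABELS
    ]
    # Naive single-column reading order: sort by top edge, then by left edge.
    return sorted(regions, key=lambda r: (r.box[1], r.box[0]))


# With a real Ultralytics result, the three inputs could be built like this:
#   labels = [result.names[int(c)] for c in result.boxes.cls.tolist()]
#   confidences = result.boxes.conf.tolist()
#   boxes = result.boxes.xyxy.tolist()

# Made-up detections, used only to show the behavior.
example = to_regions(
    ["Text", "Page-header", "Title", "Picture"],
    [0.91, 0.88, 0.95, 0.10],
    [[50, 300, 500, 600], [50, 10, 500, 40], [50, 100, 500, 150], [50, 700, 500, 900]],
)
print([r.label for r in example])  # ['Title', 'Text']
```

In this made-up example the page header is dropped by label and the picture by its low confidence, leaving the title followed by the body text. Multi-column layouts would need a smarter ordering than this sort.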
Benchmark
I also evaluate the `mAP50-95` performance of the entire `yolov8` series on the `DocLayNet` test set.
| label | images | boxes | yolov8n | yolov8s | yolov8m | yolov8l | yolov8x |
|---|---|---|---|---|---|---|---|
| Caption | 4983 | 1542 | 0.682 | 0.721 | 0.746 | 0.750 | 0.753 |
| Footnote | 4983 | 387 | 0.614 | 0.669 | 0.696 | 0.702 | 0.717 |
| Formula | 4983 | 1966 | 0.655 | 0.695 | 0.723 | 0.750 | 0.747 |
| List-item | 4983 | 10521 | 0.789 | 0.818 | 0.836 | 0.841 | 0.841 |
| Page-footer | 4983 | 3987 | 0.588 | 0.610 | 0.640 | 0.641 | 0.655 |
| Page-header | 4983 | 3365 | 0.707 | 0.754 | 0.769 | 0.776 | 0.784 |
| Picture | 4983 | 3497 | 0.723 | 0.762 | 0.789 | 0.796 | 0.805 |
| Section-header | 4983 | 8544 | 0.709 | 0.727 | 0.742 | 0.750 | 0.748 |
| Table | 4983 | 2394 | 0.820 | 0.854 | 0.880 | 0.885 | 0.886 |
| Text | 4983 | 29917 | 0.845 | 0.860 | 0.876 | 0.878 | 0.877 |
| Title | 4983 | 334 | 0.762 | 0.806 | 0.830 | 0.846 | 0.840 |
| All | 4983 | 66454 | 0.718 | 0.752 | 0.775 | 0.783 | 0.787 |
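For readers unfamiliar with the metric: mAP50-95 is the average precision averaged over ten IoU (intersection over union) thresholds, from 0.50 to 0.95 in steps of 0.05, so a predicted box has to overlap the annotation tightly to keep counting as a hit at the stricter thresholds. The sketch below illustrates only that matching criterion, not the full precision-recall computation. The helper names `iou` and `thresholds_passed` and the two boxes are made up for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten thresholds behind the "50-95" in mAP50-95.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Return the IoU thresholds at which pred would count as matching truth."""
    score = iou(pred, truth)
    return [t for t in IOU_THRESHOLDS if score >= t]


# Made-up boxes: the prediction is slightly taller than the annotation.
pred = (0, 0, 100, 100)
truth = (0, 0, 100, 82)
print(iou(pred, truth))                     # 0.82
print(len(thresholds_passed(pred, truth)))  # 7 of the 10 thresholds
```

A prediction like this one counts as correct at the loose thresholds but as a miss at 0.85 and above, which is why mAP50-95 scores are lower than scores measured at a single 0.50 threshold.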
Here is an overview of the `mAP50-95` performance with different model sizes.
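The same trend can be read directly off the All row of the table. As a small sketch, the snippet below copies those five values and prints the gain from each model size to the next; the `ALL_ROW` dictionary and the `gains` helper are names introduced here for illustration only.

```python
# mAP50-95 values copied from the "All" row of the benchmark table.
ALL_ROW = {
    "yolov8n": 0.718,
    "yolov8s": 0.752,
    "yolov8m": 0.775,
    "yolov8l": 0.783,
    "yolov8x": 0.787,
}


def gains(scores):
    """Return the score difference between each model and the next larger one."""
    names = list(scores)
    return {
        f"{prev}->{cur}": round(scores[cur] - scores[prev], 3)
        for prev, cur in zip(names, names[1:])
    }


for step, delta in gains(ALL_ROW).items():
    print(f"{step}: +{delta:.3f}")
```

The gains shrink with each step up in size (+0.034, +0.023, +0.008, +0.004), so the largest models buy comparatively little extra accuracy on this test set.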