Use the YOLO v3 (ONNX) model for object detection in C# using ML.Net

Another case study based on this YOLO v3 model is available here.

See here for YOLO v4 use.

YOLO v3 in ML.Net

Use the YOLO v3 algorithm for object detection in C# using ML.Net. We start with a PyTorch model, convert it to ONNX format, and then use it in ML.Net.

This is a case study on a document layout YOLO trained model. The model can be found in the following Medium article: Object Detection — Document Layout Analysis Using Monk AI.

Main differences

  • The ONNX conversion removes one feature, the objectness score (pc). The original model has (5 + classes) features for each bounding box; the ONNX model has (4 + classes) features per bounding box. We use the class probability as a proxy for the objectness score when performing the Non-Maximum Suppression (NMS) step. This is a known issue, more info here.
  • Image resizing is not optimised and always yields a 416x416 image. This is not the case in the original model (see this issue: RECTANGULAR INFERENCE).
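The NMS step with a class-probability score can be sketched in plain Python as below. This is a minimal illustration, not the code used in this repository: the function names `iou` and `nms` are hypothetical, the boxes are assumed to have been converted to corner coordinates (x1, y1, x2, y2), and each score is assumed to be the highest class probability of its box. The default threshold of 0.5 mirrors the `iou_thres` value passed to `Predict` in the Python section.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes get an empty intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thres=0.5):
    """Greedy NMS: return indices of the kept boxes, highest score first.

    `scores` holds the class probability of each box, used here in place
    of the objectness score that the ONNX export drops.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box
        if all(iou(boxes[i], boxes[k]) <= iou_thres for k in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # prints [0, 2]
```

In the example, the second box overlaps the first almost entirely and has a lower score, so it is suppressed, while the third box does not overlap anything and is kept.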

Export to ONNX in Python

This is based on this article Object Detection — Document Layout Analysis Using Monk AI.

Load the model

import os
import sys
from IPython.display import Image
from infer_detector import Infer

gtf = Infer()

# Read the class names (one per line) and close the file afterwards
with open("dla_yolov3/classes.txt") as f:
    class_list = f.readlines()

model_name = "yolov3"
weights = "dla_yolov3/dla_yolov3.pt"
gtf.Model(model_name, class_list, weights, use_gpu=False, input_size=(416, 416))

Test the model

img_path = "test_square.jpg"
gtf.Predict(img_path, conf_thres=0.2, iou_thres=0.5)

Export the model

You need to set ONNX_EXPORT = True in ...\Monk_Object_Detection\7_yolov3\lib\models.py before loading the model.

We name the input layer image and the two output layers classes and bboxes. This is not required, but it makes the exported model easier to read.

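import torch
import torchvision.models as models

dummy_input = torch.randn(1, 3, 416, 416) # Create the right input shape (e.g. for an image)
dummy_input = torch.nn.Sigmoid()(dummy_input) # limit between 0 and 1 (superfluous?)

# Assumption: `model` is the torch.nn.Module loaded by gtf above,
# and the output file name is a placeholder.
torch.onnx.export(model, dummy_input, "dla_yolov3.onnx",
                  input_names=["image"],
                  output_names=["classes", "bboxes"],
                  )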

Check exported model with Netron

The ONNX model can be viewed in Netron. Our model has the following layers:

  • The input layer size is [1 x 3 x 416 x 416]. This corresponds to a batch size of 1 x 3 colors x 416 pixels height x 416 pixels width (more info about fixed batch size here).
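As an illustration of this layout, the sketch below converts an image array into a tensor of that shape with NumPy. It is only a sketch under assumptions: the function name `to_model_input` is hypothetical, the image is assumed to be already resized to 416x416 and stored as height x width x color in 8-bit values, and the scaling to the range 0 to 1 is inferred from the dummy input used for the export rather than confirmed by the model.

```python
import numpy as np


def to_model_input(image_hwc):
    """Convert a uint8 image of shape (416, 416, 3) to float32 (1, 3, 416, 416)."""
    if image_hwc.shape != (416, 416, 3):
        raise ValueError(f"expected shape (416, 416, 3), got {image_hwc.shape}")
    # Assumption: the model expects pixel values scaled to [0, 1]
    scaled = image_hwc.astype(np.float32) / 255.0
    # Move the color axis first: height x width x color -> color x height x width
    chw = np.transpose(scaled, (2, 0, 1))
    # Add the leading batch dimension of size 1
    return chw[np.newaxis, ...]


image = np.zeros((416, 416, 3), dtype=np.uint8)
tensor = to_model_input(image)
print(tensor.shape)  # prints (1, 3, 416, 416)
```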

As per this article:

For an image of size 416 x 416, YOLO predicts ((52 x 52) + (26 x 26) + (13 x 13)) x 3 = 10,647 bounding boxes.

  • The bboxes output layer is of size [10,647 x 4]. This corresponds to 10,647 bounding boxes x 4 bounding box coordinates (x, y, h, w).
  • The classes output layer is of size [10,647 x 18]. This corresponds to 10,647 bounding boxes x 18 classes (this model has only 18 classes).

Hence, each bounding box has (4 + classes) = 22 features. The total number of predictions in this model is 22 x 10,647 = 234,234.
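The arithmetic above can be checked with a few lines of Python. This is a small sketch: the strides of 8, 16 and 32 are the standard YOLO v3 downsampling factors (they give the 52, 26 and 13 grids for a 416 input) and are assumed here rather than read from the exported model, and the name `count_boxes` is hypothetical.

```python
INPUT_SIZE = 416
STRIDES = (8, 16, 32)   # assumed standard YOLO v3 strides: grids of 52, 26, 13
ANCHORS_PER_CELL = 3    # boxes predicted per grid cell at each scale
NUM_CLASSES = 18        # number of classes in this model


def count_boxes(input_size=INPUT_SIZE, strides=STRIDES, anchors=ANCHORS_PER_CELL):
    """Number of bounding boxes predicted over all detection scales."""
    return sum((input_size // s) ** 2 for s in strides) * anchors


boxes = count_boxes()
features = 4 + NUM_CLASSES
print(boxes, features, boxes * features)  # prints 10647 22 234234
```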

NB: The ONNX conversion removes 1 feature which is the objectness score, pc. The original model has (5 + classes) features for each bounding box. We will use the class probability as a proxy for the objectness score.
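One way to apply this proxy is sketched below in plain Python: the score of a box is taken to be its highest class probability, and boxes below a confidence threshold are dropped. The function name `select_detections` and the list-based inputs are hypothetical and only stand in for the classes and bboxes outputs; the default threshold of 0.2 mirrors the `conf_thres` value passed to `Predict` above.

```python
def select_detections(classes, bboxes, conf_thres=0.2):
    """Return (label, score, box) for every box whose best class passes the threshold.

    classes: one list of class probabilities per box (18 values in this model)
    bboxes:  one tuple of 4 box coordinates per box, in the same order
    """
    detections = []
    for probs, box in zip(classes, bboxes):
        # The best class probability stands in for the missing objectness score
        label = max(range(len(probs)), key=probs.__getitem__)
        score = probs[label]
        if score >= conf_thres:
            detections.append((label, score, box))
    return detections
```

The surviving detections would then go through NMS, with the same score used to rank overlapping boxes.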

More information can be found in this article: YOLO v3 theory explained

Load model in C#

Predict in C#


