The top team of this challenge stands to win $500 worth of Luxonis devices.
The aim of this challenge is to develop obstacle segmentation methods suitable for deployment on embedded devices. For this reason, your methods will be run, benchmarked, and evaluated on a real-world device: an upcoming next-gen device from Luxonis based on Robotic Vision Core 4 (RVC4).
Create a semantic segmentation method that classifies the pixels in a given image into one of three classes: sky, water or obstacle. An obstacle is everything that the USV can crash into or that it should avoid (e.g. boats, swimmers, land, buoys). Refer to the main challenge page for more details.
In the evaluation we will consider the same obstacle detection quality metrics as in the main challenge. In addition, we will also evaluate the throughput of the methods in terms of frames-per-second (FPS) processed on the evaluation device.
To be considered for the challenge, the method must run faster than a set threshold of 30 FPS with a 384x768 (height x width) input shape. The throughput will be evaluated in regular (balanced) mode.
To determine the winner of the challenge, the aggregate Q (Quality) metric will be used. In case of a tie, the FPS of the methods will be used to determine the winner (faster is better).
To participate in the challenge follow these steps:
tools/pytorch2onnx.py script in the mmsegmentation-macvi repository as a starting point.
Since hardware might present certain limitations, we present guidelines for model development. This section describes the allowed operations, model requirements, input definitions, good practices, and expected throughput and accuracy drop.
Before the model is run on the device, it is compiled and quantized to the appropriate format. While the quantization to INT8 and part of the compilation are executed automatically, users are still required to perform the first step: conversion of the trained PyTorch model to ONNX.
Below is a short example of PyTorch to ONNX conversion that creates a model.onnx with a fixed input shape, which could be submitted to MaCVi:
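A minimal sketch of such an export is shown here. `TinySegNet` is a hypothetical stand-in for a trained network, and the tensor names and opset version are illustrative assumptions rather than challenge requirements. The wrapper returns argmaxed predictions to match the 1x1x384x768 output shape stated on this page.

```python
import torch
import torch.nn as nn


class TinySegNet(nn.Module):
    """Hypothetical stand-in for a trained 3-class segmentation network."""

    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(8, num_classes, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.body(x)                      # (1, 3, 384, 768)
        return logits.argmax(dim=1, keepdim=True)  # (1, 1, 384, 768)


def export_to_onnx(model: nn.Module, path: str = "model.onnx") -> None:
    """Export with a fixed 1x3x384x768 input (no dynamic axes)."""
    model.eval()
    dummy = torch.randn(1, 3, 384, 768)  # N, C, H, W
    torch.onnx.export(
        model,
        dummy,
        path,
        input_names=["input"],    # assumed name
        output_names=["output"],  # assumed name
        opset_version=13,         # assumed opset; adjust if needed
    )


if __name__ == "__main__":
    export_to_onnx(TinySegNet())
```

Because no `dynamic_axes` argument is passed, the exported graph keeps the shape of the dummy input.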
Some general guidelines to ensure that your model will be exportable:
Feel free to refer to the official PyTorch documentation for more instructions on export, limitations, and good practices. Please note that custom operations are not supported.
Furthermore, the device itself may impose additional limitations, and certain ONNX operations might not be supported. Please consult this spreadsheet for supported and unsupported ONNX operations. If an operation is not listed there, we suggest submitting an untrained model to check that it compiles and executes correctly.
Because models can expect differently processed inputs, and evaluation on the device does not expose the pre-processing options to the participants, it is important that the model expects the following input:
mean = [0.485, 0.456, 0.406]
std = [0.229, 0.224, 0.225]
1x3x384x768 (N, C, H, W), while it can be arbitrary during training.
Since images in the training set can have different sizes, they will be resized (using bilinear interpolation) to fit the above shape with the aspect ratio preserved, centered in the target shape, and padded using mirror padding.
Example of Python code using PIL and Numpy to read and normalize the image:
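A minimal sketch of this step follows. It assumes that pixel values are first scaled to [0, 1] before the mean and standard deviation above are applied (the usual convention for these values), and the function names are hypothetical.

```python
import numpy as np
from PIL import Image

MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_image(img: Image.Image) -> np.ndarray:
    """Convert a PIL image to a normalized float32 array of shape (1, 3, H, W)."""
    # Assumption: values are scaled to [0, 1] before mean/std normalization.
    rgb = np.asarray(img.convert("RGB"), dtype=np.float32) / 255.0  # (H, W, 3)
    rgb = (rgb - MEAN) / STD          # per-channel normalization
    chw = rgb.transpose(2, 0, 1)      # (3, H, W)
    return chw[np.newaxis, ...]       # (1, 3, H, W)


def load_and_normalize(path: str) -> np.ndarray:
    """Read an image file from disk and normalize it."""
    with Image.open(path) as img:
        return normalize_image(img)
```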
Such images will be fed to the model on the device, and the outputs will be automatically evaluated in the same manner as in the classic segmentation track. Note that the example does not include the padding and resizing.
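The resizing and mirror padding described earlier could be sketched roughly as follows. This is an approximation under assumptions: the exact rounding, the split of the padding between the two sides, and the choice of NumPy's `reflect` mode (rather than `symmetric`) for "mirror padding" are guesses, and `resize_and_pad` is a hypothetical helper name.

```python
import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 384, 768


def resize_and_pad(img: Image.Image) -> np.ndarray:
    """Resize with preserved aspect ratio, center, and mirror-pad to 384x768."""
    w, h = img.size
    scale = min(TARGET_W / w, TARGET_H / h)
    new_w = max(1, min(TARGET_W, round(w * scale)))
    new_h = max(1, min(TARGET_H, round(h * scale)))
    resized = np.asarray(
        img.convert("RGB").resize((new_w, new_h), Image.BILINEAR)
    )  # (new_h, new_w, 3)

    # Split the remaining space evenly so the image sits in the center.
    pad_h, pad_w = TARGET_H - new_h, TARGET_W - new_w
    top, left = pad_h // 2, pad_w // 2
    return np.pad(
        resized,
        ((top, pad_h - top), (left, pad_w - left), (0, 0)),
        mode="reflect",  # assumed interpretation of "mirror padding"
    )
```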
The output should be:
1x1x384x768 tensor of predictions (argmaxed logits).
Please note that, as part of the postprocessing step, the predicted logits will be argmaxed to obtain a segmentation mask, which is then cropped and upscaled (using nearest-neighbor interpolation) to the original input shape.
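For local validation, this postprocessing can be approximated as in the sketch below. It assumes the model returns 1x3x384x768 logits and that the crop offsets follow from centered padding; the actual evaluation code may differ in rounding details, and `postprocess` is a hypothetical helper name.

```python
import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 384, 768


def postprocess(logits: np.ndarray, orig_w: int, orig_h: int) -> np.ndarray:
    """Argmax, crop away the padding, and upscale to the original size."""
    mask = logits[0].argmax(axis=0).astype(np.uint8)  # (384, 768) class ids

    # Recompute where the resized image sat inside the padded input
    # (assumes aspect-preserving resize and centered placement).
    scale = min(TARGET_W / orig_w, TARGET_H / orig_h)
    new_w = max(1, min(TARGET_W, round(orig_w * scale)))
    new_h = max(1, min(TARGET_H, round(orig_h * scale)))
    top, left = (TARGET_H - new_h) // 2, (TARGET_W - new_w) // 2
    cropped = mask[top:top + new_h, left:left + new_w]

    # Nearest-neighbor upscaling keeps the class ids intact.
    upscaled = Image.fromarray(cropped).resize((orig_w, orig_h), Image.NEAREST)
    return np.asarray(upscaled)  # (orig_h, orig_w)
```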
While the model will be quantized to INT8, it is recommended to use FP32 during the training, as it is more stable and allows for better convergence. It is also recommended to use the same input shape during the training as it will be during inference.
In the table below, we report the F1 detection score, its drop (in parentheses), and the FPS for several common segmentation methods. These benchmarks indicate what kind of performance drop and throughput to expect from your methods.
The models have been exported using the MMSegmentation toolbox. An input shape of 768x384 (width x height) was used for all deployed models.
Note: Most of the performance drop is not due to quantization, but due to using lower input resolution during inference. To improve the performance we suggest training specifically for the target resolution.
| Method (backbone) | Variant | F1 (change) | FPS |
|---|---|---|---|
| FCN (ResNet-50) | orig (768x384) | 52.8 (-5.2) | - |
| FCN (ResNet-50) | quantized | 54.0 (-3.9) | 19.7 |
| FCN (ResNet-101) | orig (768x384) | 52.7 (-10.6) | - |
| FCN (ResNet-101) | quantized | 53.8 (-9.6) | 16.8 |
| DeepLabv3+ (ResNet-101) | orig (768x384) | 58.0 (-6.0) | - |
| DeepLabv3+ (ResNet-101) | quantized | 57.4 (-6.6) | 16.6 |
| BiSeNetv1 (ResNet-50) | orig (768x384) | 45.1 (+2.3) | - |
| BiSeNetv1 (ResNet-50) | quantized | 45.6 (+2.8) | 28.7 |
| BiSeNetv2 (-) | orig (768x384) | 42.9 (-11.8) | - |
| BiSeNetv2 (-) | quantized | 46.0 (-8.7) | 44.6 |
| STDC1 (-) | orig (768x384) | 48.5 (-13.3) | - |
| STDC1 (-) | quantized | 47.9 (-13.9) | 45.8 |
| STDC2 (-) | orig (768x384) | 50.8 (-13.5) | - |
| STDC2 (-) | quantized | 49.9 (-14.4) | 38.3 |
| SegFormer (MiT-B2) | orig (768x384) | 61.8 (-8.2) | - |
| SegFormer (MiT-B2) | quantized | 58.0 (-12.0) | 15.0 |