USV-based Embedded Obstacle Segmentation
The top team of this challenge stands to win $500 worth of Luxonis devices.
The aim of this challenge is to develop obstacle segmentation methods suitable for deployment on embedded devices. For this reason, your methods will be run, benchmarked and evaluated on a real-world device: an upcoming next-generation Luxonis device based on the Robotic Vision Core 4 (RVC4).
Create a semantic segmentation method that classifies the pixels in a given image into one of three classes: sky, water or obstacle. An obstacle is everything that the USV can crash into or that it should avoid (e.g. boats, swimmers, land, buoys).
LaRS consists of 4000+ USV-centric scenes captured in various aquatic domains. It includes per-pixel panoptic masks for water, sky and different types of obstacles. At a high level, obstacles are divided into (i) dynamic obstacles, which are objects floating in the water (e.g. boats, buoys, swimmers), and (ii) static obstacles, which are all remaining obstacle regions (shoreline, piers). Additionally, dynamic obstacles are categorized into 8 obstacle classes: boat/ship, row boat, buoy, float, paddle board, swimmer, animal and other.
This challenge is based on the semantic segmentation sub-track of LaRS: the annotations include semantic segmentation masks, where all obstacles are assigned into a single "obstacle" class.
The LaRS evaluation protocol is designed to score predictions in a way that is meaningful for practical USV navigation. Methods are evaluated in terms of general segmentation quality (mIoU) and obstacle detection quality (F1 score).
Besides prediction accuracy, we will also evaluate the throughput of the methods in terms of frames-per-second (FPS) processed on the evaluation device.
To be considered for the challenge, a method must run faster than the set threshold of 30 FPS with a 384x768 input shape. The throughput will be evaluated in regular (balanced) mode.
To determine the winner of the challenge, we use the aggregate quality metric Q = mIoU x F1, which combines general segmentation quality, measured by the mIoU, with detection quality, measured by the F1 score.
In case of a tie, the FPS of the methods will be used to determine the winner (faster is better).

To participate in the challenge, train your segmentation model, export it to ONNX, and submit the exported model for evaluation. You can use the tools/pytorch2onnx.py script in the mmsegmentation-macvi repository as a starting point for the export.

Since the hardware imposes certain limitations, we provide guidelines for model development below. This section describes the allowed operations, model requirements, input definition, good practices, and the throughput and accuracy drop to expect.
Before the model is run on the device, it is compiled and quantized to the appropriate format. While the quantization to INT8 and part of the compilation are executed automatically, participants are still required to perform the first step: converting the trained PyTorch model to ONNX.
Below is a short example of PyTorch-to-ONNX conversion that creates a model.onnx with a fixed input shape, which can be submitted to MaCVi:
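A minimal sketch, assuming a trained `torch.nn.Module`; the tiny placeholder network, opset version, and input/output names below are illustrative rather than prescribed by the challenge:

```python
import torch
import torch.nn as nn

# Tiny placeholder standing in for your trained segmentation network
# (3 output channels = sky / water / obstacle logits).
model = nn.Conv2d(3, 3, kernel_size=1)
model.eval()

# Fixed input shape expected on the device: 1 x 3 x 384 x 768 (N, C, H, W).
dummy_input = torch.randn(1, 3, 384, 768)

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=11,          # assumption: adjust to a version the device toolchain accepts
    do_constant_folding=True,  # fold constant subgraphs for a simpler graph
)
```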
To ensure that your model is exportable, refer to the official PyTorch documentation for instructions on export, its limitations, and good practices. Please note that custom operations are not supported.
Furthermore, there may be additional limitations on the device itself, where certain ONNX operations are not supported. Please refer to this spreadsheet for the list of supported and unsupported ONNX operations. If an operation is missing from the list, we suggest checking whether the model compiles and executes correctly by submitting an untrained model.
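As a quick sanity check before submitting, you can list the operation types used in your exported graph and compare them against the spreadsheet; a minimal sketch using the `onnx` Python package (the file name is a placeholder):

```python
import onnx

# Load the exported model and collect the distinct ONNX operation types it uses.
model = onnx.load("model.onnx")
op_types = sorted({node.op_type for node in model.graph.node})

print("Operations used by the exported model:")
for op in op_types:
    print(" -", op)
```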
Because different models can expect differently pre-processed inputs, and the on-device evaluation does not expose pre-processing options to the participants, it is important that your model expects the following input:

- RGB images normalized with ImageNet statistics: mean = [0.485, 0.456, 0.406], std = [0.229, 0.224, 0.225]
- A fixed input shape of 1x3x384x768 (N, C, H, W) at inference time; the shape can be arbitrary during training.

Since images in the dataset can have different sizes, they will be resized (using bilinear interpolation) to fit the above shape with the aspect ratio preserved, centered within the shape, and padded using mirror padding.
Example of Python code using PIL and NumPy to read and normalize an image:
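A minimal sketch along those lines (the image path is a placeholder):

```python
import numpy as np
from PIL import Image

# ImageNet normalization statistics required by the challenge.
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

# Placeholder path to an input image.
img = Image.open("example.jpg").convert("RGB")

# Scale to [0, 1], normalize per channel, and reorder to 1 x 3 x H x W.
x = np.asarray(img, dtype=np.float32) / 255.0
x = (x - MEAN) / STD
x = x.transpose(2, 0, 1)[None, ...]
```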
Such images will be fed to the model on the device, and the outputs will be automatically evaluated in the same manner as in the classic segmentation track. Note that the example above does not include the padding and resizing (see the sketch below).
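For completeness, a minimal sketch of the resize-and-pad step described above, assuming an RGB input and that mirror padding corresponds to NumPy's reflect mode; the exact on-device implementation may differ:

```python
import numpy as np
from PIL import Image

TARGET_H, TARGET_W = 384, 768

def resize_and_pad(img: Image.Image) -> np.ndarray:
    """Aspect-preserving bilinear resize, centering, and mirror padding to 384x768."""
    scale = min(TARGET_H / img.height, TARGET_W / img.width)
    new_w, new_h = round(img.width * scale), round(img.height * scale)
    resized = np.array(img.resize((new_w, new_h), Image.BILINEAR))

    # Distribute the remaining space evenly around the image and fill it by mirroring.
    pad_top, pad_left = (TARGET_H - new_h) // 2, (TARGET_W - new_w) // 2
    return np.pad(
        resized,
        ((pad_top, TARGET_H - new_h - pad_top),
         (pad_left, TARGET_W - new_w - pad_left),
         (0, 0)),
        mode="reflect",  # assumption: mirror padding == NumPy's "reflect" mode
    )
```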
The output should be a 1x1x384x768 tensor of predictions (argmaxed logits). Please note that, as part of the post-processing step, the predicted logits will be argmaxed to obtain the segmentation mask, which is then cropped and upscaled (using nearest interpolation) to the original input shape.
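If your network outputs per-class logits, one way to make the exported graph emit the required 1x1x384x768 prediction map is to wrap it with an argmax before export; a minimal sketch (the placeholder network is illustrative, and it is worth confirming that ArgMax appears in the supported-operations spreadsheet):

```python
import torch
import torch.nn as nn

class ArgmaxWrapper(nn.Module):
    """Wraps a segmentation network so that the exported graph emits class indices."""

    def __init__(self, net: nn.Module):
        super().__init__()
        self.net = net

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        logits = self.net(x)                              # 1 x num_classes x 384 x 768
        return torch.argmax(logits, dim=1, keepdim=True)  # 1 x 1 x 384 x 768

# Tiny placeholder standing in for your trained network (3-class logits).
wrapped = ArgmaxWrapper(nn.Conv2d(3, 3, kernel_size=1)).eval()
torch.onnx.export(wrapped, torch.randn(1, 3, 384, 768), "model.onnx")
```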
While the model will be quantized to INT8, we recommend using FP32 during training, as it is more stable and allows for better convergence. We also recommend training with the same input shape that will be used during inference.
In the table below, we report the F1 detection score and the throughput (FPS) for several common segmentation methods, along with the score drop relative to the original models. These benchmarks indicate what kind of performance drop and throughput to expect from your own methods.
The models were exported using the MMSegmentation toolbox. An input shape of 768x384 was used for all deployed models.
Variants:

- orig: the original FP32 model evaluated at its original input resolution
- orig (768x384): the original FP32 model evaluated with the 768x384 input shape
- quantized: the INT8-quantized model running on the evaluation device with the 768x384 input shape
Note: Most of the performance drop is not due to quantization, but due to using a lower input resolution during inference. To improve performance, we suggest training specifically for the target resolution.
| Method | Variant | F1 (%) | FPS |
|---|---|---|---|
| FCN (ResNet-50) | orig | 57.9 | - |
| FCN (ResNet-50) | orig (768x384) | 52.8 (-5.2) | - |
| FCN (ResNet-50) | quantized | 54.0 (-3.9) | 19.7 |
| FCN (ResNet-101) | orig | 63.4 | - |
| FCN (ResNet-101) | orig (768x384) | 52.7 (-10.6) | - |
| FCN (ResNet-101) | quantized | 53.8 (-9.6) | 16.8 |
| DeepLabv3+ (ResNet-101) | orig | 64.0 | - |
| DeepLabv3+ (ResNet-101) | orig (768x384) | 58.0 (-6.0) | - |
| DeepLabv3+ (ResNet-101) | quantized | 57.4 (-6.6) | 16.6 |
| BiSeNetv1 (ResNet-50) | orig | 42.8 | - |
| BiSeNetv1 (ResNet-50) | orig (768x384) | 45.1 (+2.3) | - |
| BiSeNetv1 (ResNet-50) | quantized | 45.6 (+2.8) | 28.7 |
| BiSeNetv2 (-) | orig | 54.7 | - |
| BiSeNetv2 (-) | orig (768x384) | 42.9 (-11.8) | - |
| BiSeNetv2 (-) | quantized | 46.0 (-8.7) | 44.6 |
| STDC1 (-) | orig | 61.8 | - |
| STDC1 (-) | orig (768x384) | 48.5 (-13.3) | - |
| STDC1 (-) | quantized | 47.9 (-13.9) | 45.8 |
| STDC2 (-) | orig | 64.3 | - |
| STDC2 (-) | orig (768x384) | 50.8 (-13.5) | - |
| STDC2 (-) | quantized | 49.9 (-14.4) | 38.3 |
| SegFormer (MiT-B2) | orig | 70.0 | - |
| SegFormer (MiT-B2) | orig (768x384) | 61.8 (-8.2) | - |
| SegFormer (MiT-B2) | quantized | 58.0 (-12.0) | 15.0 |