Research in Computer Vision

Selected Papers

CSIM: A Copula-based similarity index sensitive to local changes for Image quality assessment (2024)

El Ghazouali, S., Michelucci, U., El Hillali, Y., & Nouira, H. (2024). CSIM: A Copula-based similarity index sensitive to local changes for Image quality assessment. arXiv preprint arXiv:2410.01411.

This paper introduces the Copula-based Similarity Index (CSIM), a novel metric designed for precise image quality assessment. Traditional metrics such as SSIM and PSNR often fail to detect subtle local distortions; CSIM addresses this by using a Gaussian copula to model pixel dependencies within small image patches. This design heightens sensitivity to minor changes, making the metric well suited to fields such as medical imaging, where detecting fine detail is critical.

CSIM's methodology involves dividing images into patches, calculating pixel intensity dependencies, and creating a similarity map by measuring Euclidean distances between copula-based vectors. In experiments, CSIM outperformed existing metrics under various distortion scenarios, such as noise and blurring, showcasing its robustness and adaptability. This metric also demonstrated its effectiveness in video-based assessments, consistently detecting subtle changes across frames, highlighting its potential for dynamic monitoring in applications like surveillance and diagnostics.
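To make the patch-wise copula idea concrete, below is a minimal sketch in Python: each patch's intensities are rank-transformed to pseudo-observations and mapped through the standard normal quantile function (the Gaussian-copula transform), and corresponding patches of the two images are compared by Euclidean distance. The `csim_like_score` helper and its 1/(1 + d) similarity mapping are illustrative simplifications, not the paper's exact formulation.

```python
import numpy as np
from scipy.stats import norm, rankdata

def gaussian_copula_vector(patch: np.ndarray) -> np.ndarray:
    """Rank-transform a patch's pixels into (0, 1) and map them through the
    standard normal quantile function (Gaussian-copula transform)."""
    flat = patch.ravel().astype(np.float64)
    u = rankdata(flat) / (flat.size + 1)  # pseudo-observations in (0, 1)
    return norm.ppf(u)

def csim_like_score(ref: np.ndarray, dist: np.ndarray, patch: int = 8) -> float:
    """Toy CSIM-style score: average Euclidean distance between copula
    vectors of corresponding patches, mapped to (0, 1]; 1.0 = identical."""
    h, w = ref.shape
    dists = [
        np.linalg.norm(
            gaussian_copula_vector(ref[i:i + patch, j:j + patch])
            - gaussian_copula_vector(dist[i:i + patch, j:j + patch])
        )
        for i in range(0, h - patch + 1, patch)
        for j in range(0, w - patch + 1, patch)
    ]
    return 1.0 / (1.0 + float(np.mean(dists)))

rng = np.random.default_rng(0)
img = rng.random((64, 64))
noisy = np.clip(img + 0.1 * rng.standard_normal(img.shape), 0.0, 1.0)
print(csim_like_score(img, img))    # 1.0 for identical images
print(csim_like_score(img, noisy))  # lower under additive noise
```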

The paper concludes with an analysis of CSIM's algorithmic complexity, showing how patch size can be tuned to balance accuracy against computational cost. This innovation marks a significant advance in image quality assessment, with broad potential wherever nuanced similarity detection is essential, and the metric is available as an open-source tool for the research community.

Class-Conditional self-reward mechanism for improved Text-to-Image models (2024)

El Ghazouali, S., Gucciardi, A., & Michelucci, U. (2024). Class-Conditional self-reward mechanism for improved Text-to-Image models. arXiv preprint arXiv:2405.13473.

This paper presents a novel approach to enhancing Text-to-Image (T2I) models using a self-rewarding mechanism adapted from natural language processing. Traditional reinforcement learning from human feedback (RLHF) is replaced by a self-judgment system: the T2I model generates multiple images from a prompt, then filters them for the best match based on a visual assessment.

The methodology involves generating prompts with large language models (LLMs) such as Mistral-7B and using open-vocabulary object detection (e.g., YOLO) to identify the objects present in generated images. A class-conditional approach lets models improve their specificity in following complex instructions. Generated images are scored using image-to-text models, and fine-tuning is performed with LoRA, which enables efficient, targeted model adjustments without overwriting pre-trained knowledge.
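As a concrete illustration of the self-judging step, the sketch below generates several candidate images for one prompt and keeps the highest-scoring one; CLIP serves here as a stand-in visual judge. The model identifiers, the choice of CLIP, and the `best_of_n` helper are illustrative assumptions, not the paper's exact components.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5"
).to(device)
judge = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def best_of_n(prompt: str, n: int = 4):
    """Generate n candidates and return the one the judge scores highest."""
    images = pipe(prompt, num_images_per_prompt=n).images
    inputs = proc(
        text=[prompt], images=images, return_tensors="pt", padding=True
    ).to(device)
    with torch.no_grad():
        scores = judge(**inputs).logits_per_image.squeeze(1)  # one per image
    return images[int(scores.argmax())]

# The selected winner becomes a (prompt, image) pair for LoRA fine-tuning.
winner = best_of_n("a red vintage car parked beside a lighthouse")
winner.save("selected_candidate.png")
```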

This self-rewarding pipeline automates T2I model improvement, yielding an accuracy increase of up to 60% over baseline models in the reported experiments. The framework holds promise for T2I applications where precise, autonomous image generation is essential, reducing dependence on human input while enhancing the model's ability to produce accurate, contextually relevant images.

FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything (2024)

El Ghazouali, S., Mhirit, Y., Oukhrid, A., Michelucci, U., & Nouira, H. (2024). FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything. Sensors, 24(9), 2889.

This paper introduces the FusionVision pipeline, a method for real-time 3D object segmentation and reconstruction. This approach combines YOLO for object detection and FastSAM for segmentation, adapted to RGB-D data. The integration of RGB and depth maps through Intel RealSense cameras enables precise 3D object isolation. By refining 3D point clouds, FusionVision enhances applications like robotics, augmented reality, and autonomous driving, offering an efficient solution for real-time scene understanding.

The FusionVision pipeline comprises several stages: data acquisition, YOLO training, real-time object detection, and FastSAM deployment for segmentation, followed by RGB-depth alignment. The pipeline's RGB-D processing allows the accurate conversion of image data into 3D point clouds, even in complex visual environments. This advanced fusion of RGB-D imaging with point cloud denoising techniques enables high-precision segmentation with minimal computational demand.
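A condensed sketch of that detect, segment, deproject, and denoise chain is shown below, using the `ultralytics` and Open3D libraries. The weight files, the pinhole intrinsics (fx, fy, cx, cy), and the box-prompted FastSAM call are assumptions following those libraries' conventions, not the authors' exact code; the sketch also assumes a depth map aligned to the RGB frame and at least one detection.

```python
import numpy as np
import open3d as o3d
from ultralytics import YOLO, FastSAM

detector = YOLO("yolov8n.pt")        # placeholder detector weights
segmenter = FastSAM("FastSAM-s.pt")  # placeholder FastSAM weights
fx = fy = 600.0                      # assumed pinhole intrinsics
cx, cy = 320.0, 240.0

def object_point_cloud(rgb: np.ndarray, depth_m: np.ndarray) -> o3d.geometry.PointCloud:
    """Isolate the first detected object and return its denoised point cloud."""
    box = detector(rgb)[0].boxes.xyxy[0].tolist()      # [x1, y1, x2, y2]
    result = segmenter(rgb, bboxes=box, retina_masks=True)[0]
    mask = result.masks.data[0].cpu().numpy() > 0      # full-resolution object mask
    v, u = np.nonzero(mask & (depth_m > 0))            # masked pixels with valid depth
    z = depth_m[v, u]
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)
    pcd = o3d.geometry.PointCloud(o3d.utility.Vector3dVector(pts))
    pcd, _ = pcd.remove_statistical_outlier(nb_neighbors=20, std_ratio=2.0)
    return pcd
```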

FusionVision's efficacy lies in its optimized use of hardware and software, demonstrated by its real-time frame rates and precise 3D reconstructions. When applied to practical cases like object localization and SLAM, the FusionVision framework shows promise for improving object segmentation across various industries.

FlightScope: A Deep Comprehensive Assessment of Aircraft Detection Algorithms in Satellite Imagery (2024)

El Ghazouali, S., Gucciardi, A., Venturi, N., Rueegsegger, M., & Michelucci, U. (2024). FlightScope: A Deep Comprehensive Assessment of Aircraft Detection Algorithms in Satellite Imagery. arXiv preprint arXiv:2404.02877.

This paper provides a detailed comparative analysis of deep learning models for detecting aircraft in satellite images. Using the HRPlanesV2 and GDIT datasets, the study benchmarks eight leading object detection models, including YOLOv5, YOLOv8, Faster R-CNN, and DETR. Among these, YOLOv5 emerged as the most effective, achieving high precision, mean average precision, and recall. The findings underscore YOLOv5's robustness and adaptability to satellite data, making it a preferred choice for aerial object detection applications.

The research examines the challenges peculiar to satellite imagery, such as atmospheric interference and scale variation, which complicate small-object detection relative to ground-level images. By training every model from scratch, the study ensures a fair, like-for-like performance comparison, highlighting each algorithm's strengths and limitations. YOLOv5's architecture, optimized for speed and precision, performed consistently best, particularly at detecting small, complex structures under varied aerial conditions.
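For a sense of how such a head-to-head comparison is scored, the sketch below computes COCO-style mean average precision for two hypothetical detectors against the same ground truth with `torchmetrics`; the hard-coded boxes and model names are purely illustrative, not the paper's data.

```python
import torch
from torchmetrics.detection.mean_ap import MeanAveragePrecision

# One ground-truth aircraft box (xyxy format) with class label 0.
target = [{"boxes": torch.tensor([[10.0, 10.0, 60.0, 60.0]]),
           "labels": torch.tensor([0])}]

# Stand-ins for two detectors' predictions on the same image.
candidates = {
    "model_a": [{"boxes": torch.tensor([[12.0, 11.0, 58.0, 59.0]]),
                 "scores": torch.tensor([0.9]), "labels": torch.tensor([0])}],
    "model_b": [{"boxes": torch.tensor([[30.0, 30.0, 90.0, 90.0]]),
                 "scores": torch.tensor([0.8]), "labels": torch.tensor([0])}],
}

for name, preds in candidates.items():
    metric = MeanAveragePrecision()  # mAP averaged over IoU 0.50:0.95
    metric.update(preds, target)
    print(name, float(metric.compute()["map"]))
```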

To support the community and foster innovation, the authors have released a comprehensive benchmarking toolkit on GitHub, enabling researchers and practitioners to replicate and expand upon these results. This work sets a high standard for evaluating object detection in satellite imagery, providing a reliable framework for selecting effective models for applications in surveillance, environmental monitoring, and air traffic management.