Short Title: Int. J. Mech. Eng. Robot. Res.
Frequency: Bimonthly
Impact Factor 2024: 1.0
Manuscript received February 10, 2025; revised March 17, 2025; accepted April 23, 2025; published September 19, 2025
Abstract—Object detection and grasping is one of the critical challenges in robotics research, particularly in complex environments containing objects of diverse shapes and positions. Although methods using RGB images have shown promising results in simpler scenarios, they still face numerous issues in more complex scenes, especially when objects overlap. Furthermore, prior research has focused primarily on object grasping without addressing the interaction between robots and users during the grasping process. Recent advances in vision-language models have opened up significant potential for developing human-robot interaction systems based on multimodal data. This paper presents an integrated model combining computer vision and language models to enhance object detection and grasping in real-world environments. The proposed approach consists of three key steps: (1) identifying object locations and generating segmentation masks using a vision-language model; (2) predicting grasp candidates from the generated masks and bounding boxes via the Grasp Detection Head; and (3) optimizing and refining the candidates using the Grasp Refinement Head. Integrating vision-language models not only enhances the robot's ability to understand the semantics of language, enabling more accurate grasping decisions, but also strengthens the robot's interaction with users. Experimental results demonstrate that the proposed model achieves higher grasping accuracy than existing methods, particularly in complex scenes with multiple objects. The model also demonstrates its ability to understand complex contexts through Interactive Grasp experiments.

Keywords—robot grasping, robot grasping detection, grasp refinement, vision-language integration, image-text integration

Cite: Nguyen Khac Toan and Nguyen Truong Thinh, "Integrating Vision-Language Models for Enhanced Robotic Grasping and Interaction Using RGB Image and Prompt," International Journal of Mechanical Engineering and Robotics Research, Vol. 14, No. 5, pp. 500-510, 2025. doi: 10.18178/ijmerr.14.5.500-510

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
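For readers who want a concrete picture of the data flow, the sketch below outlines in Python the three-step pipeline summarized in the abstract (prompt-conditioned grounding, grasp candidate prediction, grasp refinement). It is an interface sketch only: every name in it (vision_language_grounding, grasp_detection_head, grasp_refinement_head, GraspCandidate) is a placeholder introduced here for illustration, and the stage internals are stubs, not the authors' implementation.

```python
# Minimal sketch of the three-step pipeline summarized in the abstract.
# All names below are illustrative placeholders, not the authors' released
# code; each stage would wrap a trained network in a real system.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np


@dataclass
class GraspCandidate:
    center: Tuple[float, float]  # grasp point (x, y) in image coordinates
    angle: float                 # gripper rotation about the camera axis (rad)
    width: float                 # gripper opening width (pixels)
    score: float                 # confidence assigned by the detection head


def vision_language_grounding(rgb: np.ndarray, prompt: str):
    """Step 1 (assumed interface): a vision-language model localizes the
    objects referred to by the prompt, returning bounding boxes and
    segmentation masks."""
    raise NotImplementedError("wrap the chosen vision-language model here")


def grasp_detection_head(rgb: np.ndarray, masks, boxes) -> List[GraspCandidate]:
    """Step 2 (assumed interface): predict grasp candidates from the masks
    and bounding boxes produced in step 1."""
    raise NotImplementedError("wrap the trained Grasp Detection Head here")


def grasp_refinement_head(rgb: np.ndarray,
                          candidates: List[GraspCandidate]) -> GraspCandidate:
    """Step 3 (assumed interface): optimize and re-score the candidates,
    returning the grasp selected for execution."""
    raise NotImplementedError("wrap the trained Grasp Refinement Head here")


def plan_grasp(rgb: np.ndarray, prompt: str) -> GraspCandidate:
    """End-to-end flow: prompt-conditioned grounding -> grasp candidates
    -> refined grasp ready for execution on the robot."""
    boxes, masks = vision_language_grounding(rgb, prompt)
    candidates = grasp_detection_head(rgb, masks, boxes)
    return grasp_refinement_head(rgb, candidates)
```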