IJMERR 2025 Vol.14(5):500-510
doi: 10.18178/ijmerr.14.5.500-510

Integrating Vision-Language Models for Enhanced Robotic Grasping and Interaction Using RGB Image and Prompt

Nguyen Khac Toan and Nguyen Truong Thinh *
Institute of Intelligent and Interactive Technologies, University of Economics Ho Chi Minh City—UEH, Vietnam
Email: toannk@ueh.edu.vn (T.N.K.); thinhnt@ueh.edu.vn (N.T.T.)
*Corresponding author

Manuscript received February 10, 2025; revised March 17, 2025; accepted April 23, 2025; published September 19, 2025

Abstract—Object detection and grasping is one of the critical challenges in robotic research, particularly in complex environments containing objects of diverse shapes and positions. Although methods using RGB images have shown promising results in simple scenarios, they still face numerous issues in more complex scenes, especially when objects overlap. Furthermore, prior research has focused primarily on object grasping without addressing the interaction between robots and users during the grasping process. Recent advances in vision-language models have opened up significant potential for human-robot interaction systems based on multimodal data. This paper presents an integrated model combining computer vision and language models to enhance object detection and grasping in real-world environments. The proposed approach consists of three key steps: (1) identifying object locations and generating segmentation masks with a vision-language model; (2) predicting grasp candidates from the generated masks and bounding boxes via the Grasp Detection Head; and (3) optimizing and refining the candidates using the Grasp Refinement Head. Integrating vision-language models not only enhances the robot's ability to understand the semantics of language, enabling more accurate grasping decisions, but also strengthens the robot's interaction capabilities with users. Experimental results demonstrate that the proposed model achieves higher grasping accuracy than existing methods, particularly in complex scenes with multiple objects. Additionally, the model shows its ability to understand complex contexts in Interactive Grasp experiments.
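For a procedural view of the three-step approach described in the abstract, the following Python sketch chains the stages together. It is a minimal illustration under assumed interfaces: the names GraspCandidate, locate_and_segment, grasp_detection_head, and grasp_refinement_head are hypothetical stand-ins for the paper's vision-language model, Grasp Detection Head, and Grasp Refinement Head, not the authors' implementation.

from dataclasses import dataclass
from typing import List, Tuple

# All names below are hypothetical placeholders illustrating the pipeline
# described in the abstract; they do not reproduce the authors' code.

@dataclass
class GraspCandidate:
    center: Tuple[float, float]  # grasp center in image coordinates (pixels)
    angle: float                 # gripper rotation (radians)
    width: float                 # gripper opening width (pixels)
    score: float                 # confidence of the candidate

def locate_and_segment(rgb_image, prompt: str) -> List[dict]:
    """Step 1 (assumed interface): a vision-language model takes the RGB image
    and a text prompt and returns per-object bounding boxes and segmentation masks."""
    raise NotImplementedError("stand-in for the vision-language model")

def grasp_detection_head(detections: List[dict]) -> List[GraspCandidate]:
    """Step 2 (assumed interface): predict grasp candidates from masks and boxes."""
    raise NotImplementedError("stand-in for the Grasp Detection Head")

def grasp_refinement_head(candidates: List[GraspCandidate]) -> GraspCandidate:
    """Step 3 (assumed interface): optimize and refine candidates, return the best grasp."""
    raise NotImplementedError("stand-in for the Grasp Refinement Head")

def grasp_from_prompt(rgb_image, prompt: str) -> GraspCandidate:
    # Chain the three stages: segment from image + prompt, detect candidates, refine.
    detections = locate_and_segment(rgb_image, prompt)
    candidates = grasp_detection_head(detections)
    return grasp_refinement_head(candidates)

In use, a caller would pass an RGB frame and a natural-language prompt (e.g., "pick up the red cup") to grasp_from_prompt and execute the returned grasp pose on the robot; the prompt is what couples the language understanding to the grasp selection.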

Keywords—robot grasping, robot grasping detection, grasp refinement, vision-language integration, image-text integration

Cite: Nguyen Khac Toan and Nguyen Truong Thinh, "Integrating Vision-Language Models for Enhanced Robotic Grasping and Interaction Using RGB Image and Prompt," International Journal of Mechanical Engineering and Robotics Research, Vol. 14, No. 5, pp. 500-510, 2025. doi: 10.18178/ijmerr.14.5.500-510

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
