IJMERR 2025 Vol.14(5):500-510
doi: 10.18178/ijmerr.14.5.500-510

Integrating Vision-Language Models for Enhanced Robotic Grasping and Interaction Using RGB Image and Prompt

Nguyen Khac Toan and Nguyen Truong Thinh *
Institute of Intelligent and Interactive Technologies, University of Economics Ho Chi Minh City—UEH, Vietnam
Email: toannk@ueh.edu.vn (T.N.K.); thinhnt@ueh.edu.vn (N.T.T.)
*Corresponding author

Manuscript received February 10, 2025; revised March 17, 2025; accepted April 23, 2025; published September 19, 2025

Abstract—Object detection and grasping is one of the critical challenges in robotic research, particularly in complex environments containing objects of diverse shapes and positions. Although methods using RGB images have shown promising results in simple scenarios, they still face numerous issues in more complex scenes, especially when objects overlap. Furthermore, prior research has focused primarily on object grasping without addressing the interaction between robots and users during the grasping process. Recent advances in vision-language models have opened up significant potential for human-robot interaction systems based on multimodal data. This paper presents an integrated model combining computer vision and language models to enhance object detection and grasping in real-world environments. The proposed approach consists of three key steps: (1) identifying object locations and generating segmentation masks with a vision-language model; (2) predicting grasp candidates from the generated masks and bounding boxes via the Grasp Detection Head; and (3) optimizing and refining the candidates using the Grasp Refinement Head. Integrating vision-language models not only enhances the robot's ability to understand the semantics of language, enabling more accurate grasping decisions, but also strengthens the robot's interaction capabilities with users. Experimental results demonstrate that the proposed model achieves higher grasping accuracy than existing methods, particularly in complex scenes with multiple objects. Additionally, the model shows its ability to understand complex contexts in Interactive Grasp experiments.
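For a procedural view of the three-step approach described in the abstract, the following Python sketch chains the stages together. It is a minimal illustration under assumed interfaces: the names GraspCandidate, locate_and_segment, grasp_detection_head, and grasp_refinement_head are hypothetical stand-ins for the paper's vision-language model, Grasp Detection Head, and Grasp Refinement Head, not the authors' implementation.

from dataclasses import dataclass
from typing import List, Tuple

# All names below are hypothetical placeholders illustrating the pipeline
# described in the abstract; they do not reproduce the authors' code.

@dataclass
class GraspCandidate:
    center: Tuple[float, float]  # grasp center in image coordinates (pixels)
    angle: float                 # gripper rotation (radians)
    width: float                 # gripper opening width (pixels)
    score: float                 # confidence of the candidate

def locate_and_segment(rgb_image, prompt: str) -> List[dict]:
    """Step 1 (assumed interface): a vision-language model takes the RGB image
    and a text prompt and returns per-object bounding boxes and segmentation masks."""
    raise NotImplementedError("stand-in for the vision-language model")

def grasp_detection_head(detections: List[dict]) -> List[GraspCandidate]:
    """Step 2 (assumed interface): predict grasp candidates from masks and boxes."""
    raise NotImplementedError("stand-in for the Grasp Detection Head")

def grasp_refinement_head(candidates: List[GraspCandidate]) -> GraspCandidate:
    """Step 3 (assumed interface): optimize and refine candidates, return the best grasp."""
    raise NotImplementedError("stand-in for the Grasp Refinement Head")

def grasp_from_prompt(rgb_image, prompt: str) -> GraspCandidate:
    # Chain the three stages: segment from image + prompt, detect candidates, refine.
    detections = locate_and_segment(rgb_image, prompt)
    candidates = grasp_detection_head(detections)
    return grasp_refinement_head(candidates)

In use, a caller would pass an RGB frame and a natural-language prompt (e.g., "pick up the red cup") to grasp_from_prompt and execute the returned grasp pose on the robot; the prompt is what couples the language understanding to the grasp selection.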

Keywords—robot grasping, robot grasping detection, grasp refinement, vision-language integration, image-text integration

Cite: Nguyen Khac Toan and Nguyen Truong Thinh, "Integrating Vision-Language Models for Enhanced Robotic Grasping and Interaction Using RGB Image and Prompt," International Journal of Mechanical Engineering and Robotics Research, Vol. 14, No. 5, pp. 500-510, 2025. doi: 10.18178/ijmerr.14.5.500-510

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
