AI-Enhanced Robotics: GPT-4V and myCobot in Action



  • This article covers the mylangrobot project created by neka-nat; the author, Shirokuma, has authorized its editing and reproduction. A big shout-out to neka-nat :)
    https://twitter.com/neka_nat


    Introduction

    This project demonstrates a simple pick-and-place operation with a robotic arm, combining GPT-4V and myCobot. The demonstration uses a method called SoM (Set-of-Mark prompting) to generate robot movements from natural language. In simpler terms, the system receives a natural language instruction, locates the target object, and then commands the robotic arm to grab it. The highlight of this project is the combination of GPT-4V's image understanding with SoM's visual marking, which lets the robotic arm be driven through natural language interaction.

    Software

    SoM

    Set-of-Mark (SoM) is a method designed to enhance the visual understanding capabilities of large language models. After an image is processed with SoM, a set of numbered markers is overlaid on it. These markers can be recognized and referenced by the language model, helping it identify and reason about objects and content within the image more accurately. This allows language models to produce more precise analyses and descriptions of elements in images, improving their performance on visual tasks.
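
    As a rough illustration of the idea, the hypothetical sketch below (not the project's implementation) overlays a numbered marker on each detected region so a vision-language model can later refer to objects by their mark number; the boxes are assumed to come from any detector as (x, y, w, h) tuples.

    # Hypothetical sketch of the Set-of-Mark idea: number every detected region
    # so a vision-language model can refer to objects by mark number.
    import cv2

    def draw_marks(image, boxes):
        for i, (x, y, w, h) in enumerate(boxes):
            cv2.rectangle(image, (x, y), (x + w, y + h), (0, 255, 0), 2)
            cv2.putText(image, str(i), (x + w // 2, y + h // 2),
                        cv2.FONT_HERSHEY_SIMPLEX, 1.0, (0, 0, 255), 2)
        return image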

    GPT-4V

    The commonly discussed GPT is a large language model with which we can have conversational interactions. The newer GPT-4V (GPT-4 with Vision) is a multimodal large language model that processes not only text but also images. It is capable of image understanding and image description. This model combines GPT-4's powerful natural language processing abilities with image analysis, offering more efficient and accurate integrated vision-and-language capabilities.
    Below is an example provided by OpenAI:

    User:
    "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video."
    

    (image: frames from a video of a pack of wolves facing off against bison in the snow)

    GPT-4V (image description):
    "🐺 Survival of the Fittest: An Epic Tale in the Snow ❄️ - Witness the intense drama of nature as a pack of wolves face off against mighty bison in a harsh winter landscape. This raw footage captures the essence of the wild where every creature fights for survival. With each frame, experience the tension, the strategy, and the sheer force exerted in this life-or-death struggle. See nature's true colors in this gripping encounter on the snowy plains. 🦬"
    
    Remember to respect wildlife and nature. This video may contain scenes that some viewers might find intense or distressing, but they depict natural animal behaviors important for ecological studies and understanding the reality of life in the wilderness.
    

    Openai-whisper

    https://openai.com/research/whisper
    Whisper is an automatic speech recognition (ASR) system developed by OpenAI. It uses advanced deep learning techniques to transcribe, translate, and understand human speech. Whisper is characterized by its support for many languages and its strong performance on varied accents and in noisy environments. It also handles different audio qualities, making it suitable for applications such as transcribing meeting recordings, automatically generating subtitles, and assisting with translation.
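
    As a point of reference, transcription with the open-source whisper package can be as short as the sketch below; the model size and file name are assumptions, and the project itself calls Whisper through speech_recognition, as shown later.

    # Minimal sketch: transcribing a local audio file with the open-source
    # whisper package (pip install -U openai-whisper); the file name is an assumption.
    import whisper

    model = whisper.load_model("base")        # smaller models trade accuracy for speed
    result = model.transcribe("command.wav")  # language is auto-detected by default
    print(result["text"])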

    pymycobot

    https://github.com/elephantrobotics/pymycobot/
    pymycobot is a Python library for the myCobot robot series. myCobot is a compact, multifunctional collaborative robotic arm suitable for education, research, and light industrial applications. The pymycobot library provides a simple programming interface, enabling developers to control and program myCobot robots for operations such as movement, grabbing, and sensing. The library supports multiple operating systems and development environments, making it easy to integrate into robotics and automation projects. Because it uses Python, a widely used programming language, pymycobot makes operating and experimenting with myCobot robots more accessible and flexible.
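
    A minimal sketch of basic pymycobot usage is shown below; the serial port name is an assumption and should be adjusted for your machine.

    # Minimal sketch: driving a myCobot over serial with pymycobot.
    from pymycobot.mycobot import MyCobot

    mc = MyCobot("/dev/ttyUSB0", 115200)    # serial port (assumption) and baud rate
    mc.send_angles([0, 0, 0, 0, 0, 0], 50)  # move all six joints to 0 degrees at speed 50
    print(mc.get_angles())                  # read back the current joint angles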

    Hardware

    myCobot 280 M5

    The myCobot 280 M5 is a desktop-level compact six-axis collaborative robot produced by Elephant Robotics. Designed for compactness, it is suitable for education, research, and light industrial applications. The myCobot 280 M5 supports various programming and control methods and is compatible with different operating systems and programming languages. Its key features include:
    ● Main and auxiliary control chips: ESP32
    ● Supports Bluetooth (2.4G/5G) and wireless (2.4G 3D antenna)
    ● Multiple input and output ports
    ● Supports free movement, joint movement, Cartesian movement, trajectory recording, and wireless control
    ● Compatible operating systems: Windows, Linux, macOS
    ● Supported programming languages: Python, C++, C#, JavaScript
    ● Supported programming platforms and tools: RoboFlow, myBlockly, Mind+, UiFlow, Arduino, myStudio
    ● Supported communication protocols: serial port control protocol, TCP/IP, MODBUS
    These features make the myCobot 280 M5 a versatile, user-friendly robot solution suitable for a variety of application scenarios.

    myCobot Vertical Suction Pump V2.0

    The suction pump operates on the principle of vacuum adhesion, is controlled through a 3.3 V IO signal, and can be used widely in the development of various embedded devices.
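
    Below is a minimal sketch of switching the pump through one of the arm's basic output pins, as the grab code later in this article does; the pin number and serial port are assumptions, so check your own wiring.

    # Minimal sketch: toggling the suction pump via a basic output pin.
    import time
    from pymycobot.mycobot import MyCobot

    mc = MyCobot("/dev/ttyUSB0", 115200)
    SUCTION_PIN = 5                      # pin number is an assumption
    mc.set_basic_output(SUCTION_PIN, 0)  # 0 = pump on (pick up)
    time.sleep(2)
    mc.set_basic_output(SUCTION_PIN, 1)  # 1 = pump off (release)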

    Camera

    The camera has standard USB and LEGO interfaces. The USB interface works with a wide range of PC devices, and the LEGO interface makes it easy to mount and fix in place. It is suitable for machine vision, image recognition, and similar applications.
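
    Capturing a frame from the camera is plain OpenCV; a minimal sketch follows, where the device index is an assumption.

    # Minimal sketch: grabbing a single frame from the USB camera with OpenCV.
    import cv2

    cap = cv2.VideoCapture(0)  # device index 0 is an assumption
    ok, frame = cap.read()
    if ok:
        cv2.imwrite("capture.jpg", frame)
    cap.release()
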
    mylangrobot Software Analysis
    The specific workflow of the project described in the introduction is as follows (a minimal end-to-end sketch appears after the list):

    1. Audio Input: Record audio instructions first.
    2. Audio Processing: Use "openai-whisper" to process the audio and convert it into text.
    3. Language Model Interaction: Use the GPT-4 model to process the converted text instructions and understand the user's commands.
    4. Image Processing: Use GPT-4V and the enhanced image capability of SoM to process images and find the target mentioned in the instructions.
    5. Robotic Arm Control: Control the robotic arm to grab the identified target.
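
    The following is a hypothetical sketch of how these five steps fit together; the names audio, camera, detector, and arm stand in for the project's components, and request_gpt4v is the function shown later in this article. The real coordination logic lives in operator.py.

    # Hypothetical end-to-end sketch of the workflow above, not the actual operator.py code.
    import re

    def run_once(audio, camera, detector, arm):
        command = audio.input()                  # steps 1-3: record and transcribe the command
        frame = camera.capture()                 # step 4: photograph the workspace
        annotated, detections = detector.detect_and_annotate(frame)
        reply = request_gpt4v(f"Which mark number should be picked up to '{command}'?", annotated)
        match = re.search(r"\d+", reply)         # pull the mark number out of GPT-4V's answer
        if match is None:
            audio.output("I could not find the target.")
            return
        arm.move_to_object(int(match.group()))   # step 5: move to the target, grab it, and place it
        arm.grab()
        arm.move_to_place("drop")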

    Audio Processing
    This part uses the speech_recognition library to capture audio data from the microphone so that it can be transcribed into text.
    Libraries used:

    import io
    import os
    from enum import Enum
    from typing import Protocol
    
    import openai
    import speech_recognition as sr
    from pydub import AudioSegment
    from pydub.playback import play
    

    Define an interface that captures user input and provides output to the user.

    class Interface(Protocol):
        def input(self, prefix: str = "") -> str:
            return prefix + self._input_impl()
    
        def _input_impl(self) -> str:
            ...
    
        def output(self, message: str) -> None:
            ...
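
    For reference, a minimal, hypothetical text-only implementation of this interface (useful for testing the pipeline without a microphone) could look like this:

    # Hypothetical keyboard-based implementation of the Interface protocol.
    class Terminal(Interface):
        def _input_impl(self) -> str:
            return input("> ")

        def output(self, message: str) -> None:
            print(message)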
    

    Initialize the microphone for audio input and output.

    class Audio(Interface):
        def __init__(self):
            self.r = sr.Recognizer()
            self.mic = sr.Microphone()
            # OpenAI client (the API key is read from the OPENAI_API_KEY environment variable)
            self.client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
    

    Convert the input audio into text format for output.

        def _input_impl(self) -> str:
            print("Please tell me your command.")
            with self.mic as source:
                self.r.adjust_for_ambient_noise(source)
                audio = self.r.listen(source)

            try:
                return self.r.recognize_whisper(audio, language="japanese")
            except sr.UnknownValueError:
                print("Could not understand audio")
                return ""
            except sr.RequestError as e:
                print("Could not request results from the Whisper recognizer; {0}".format(e))
                return ""
    

    The method returns the transcribed text, which can then be used for interaction with the GPT-4 model.

    Image Processing and GPT-4 Language Interaction

    Because the image is sent to the GPT-4 model together with the text, image processing and model interaction are discussed together.
    Libraries used for image processing:

    import cv2
    import numpy as np
    import supervision as sv
    import torch
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
    
    from .utils import download_sam_model_to_cache
    

    The code primarily uses SAM's SamAutomaticMaskGenerator to detect objects and draw numbered markers on them.

        # Convert the image to RGB format for SAM
        image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        # Run automatic mask generation and convert the result into detections
        sam_result = self.mask_generator.generate(image_rgb)
        detections = sv.Detections.from_sam(sam_result=sam_result)
        height, width, _ = image.shape
        image_area = height * width

        # Keep only masks whose area falls between the configured minimum and maximum percentages
        min_area_mask = (detections.area / image_area) > self.MIN_AREA_PERCENTAGE
        max_area_mask = (detections.area / image_area) < self.MAX_AREA_PERCENTAGE
        detections = detections[min_area_mask & max_area_mask]

        # Draw the numbered markers (the "set of marks") and return the annotated image
        # together with the detections; mask_annotator and label_annotator are
        # supervision annotators created elsewhere in the class
        labels = [str(i) for i in range(len(detections))]
        annotated_image = mask_annotator.annotate(scene=image_rgb.copy(), detections=detections)
        annotated_image = label_annotator.annotate(scene=annotated_image, detections=detections, labels=labels)
        return annotated_image, detections
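
    The snippet above assumes that self.mask_generator and the two supervision annotators are created elsewhere in the class; a minimal sketch of that setup, where the checkpoint file and model type are assumptions, might look like this:

    # Minimal sketch of constructing the pieces the snippet above relies on.
    # The checkpoint file and model type are assumptions; in the project,
    # download_sam_model_to_cache handles fetching the weights.
    import supervision as sv
    import torch
    from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

    device = "cuda" if torch.cuda.is_available() else "cpu"
    sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth").to(device)
    mask_generator = SamAutomaticMaskGenerator(sam)
    mask_annotator = sv.MaskAnnotator()
    label_annotator = sv.LabelAnnotator()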
    
    

    (images: the original camera frame and the SoM-annotated result with numbered markers)
    Running this produces the annotated image above, with every detected object numbered.
    Note: the functions below require an OpenAI API key with access to GPT-4V.
    The annotated image is then passed to the GPT-4V model, which returns a description of the image content together with information about the marked objects.

    # Note: `requests`, `headers`, `metaprompt`, and `encode_image_from_cv2` are
    # defined elsewhere in the project; a minimal sketch of them follows this snippet.
    def prepare_inputs(message: str, image: np.ndarray) -> dict:
        # Encode the annotated OpenCV image as a base64 JPEG string
        base64_image = encode_image_from_cv2(image)

        payload = {
            "model": "gpt-4-vision-preview",
            "messages": [
                {"role": "system", "content": [metaprompt]},
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": message,
                        },
                        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{base64_image}"}},
                    ],
                },
            ],
            "max_tokens": 800,
        }

        return payload


    def request_gpt4v(message: str, image: np.ndarray) -> str:
        # Send the instruction and the annotated image to GPT-4V and return its reply
        payload = prepare_inputs(message, image)
        response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
        res = response.json()["choices"][0]["message"]["content"]
        return res
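
    The functions above rely on requests, headers, metaprompt, and encode_image_from_cv2 being defined elsewhere in the module; a minimal sketch of those supporting pieces (the SoM metaprompt itself is project-specific and omitted) could be:

    # Minimal sketch of the supporting pieces assumed by prepare_inputs/request_gpt4v.
    import base64
    import os

    import cv2
    import numpy as np
    import requests

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
    }

    def encode_image_from_cv2(image: np.ndarray) -> str:
        # Encode a BGR OpenCV image as a base64 JPEG string
        ok, buffer = cv2.imencode(".jpg", image)
        return base64.b64encode(buffer.tobytes()).decode("utf-8")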
    

    Robotic Arm Control and Overall Integration
    After image processing and the GPT-4V interaction, the interpreted instruction yields the target's position. This position information is passed to the robotic arm controller, which moves to the corresponding location and performs the grabbing action.
    The key methods involved are shown below.
    Move to the target object:

        def move_to_object(self, object_no: int, speed: Optional[float] = None) -> None:
            object_no = self._check_and_correct_object_no(object_no)
            print("[MyCobotController] Move to Object No. {}".format(object_no))
            # Convert the detected offset (camera frame) into an XY target relative
            # to the arm's capture position
            detection = (
                np.array([-self._detections[object_no][0], -self._detections[object_no][1]]) + self.capture_coord.pos[:2]
            )
            print("[MyCobotController] Object pos:", detection[0], detection[1])
            self.move_to_xy(detection[0], detection[1], speed)
    
    Grab the object:
        def grab(self, speed: Optional[float] = None) -> None:
            print("[MyCobotController] Grab to Object")
            current_pos = self.current_coords().pos
            # Lower the end effector to just above the object
            self.move_to_z(self.object_height + self.end_effector_height, speed)
            # Turn the suction pump on (0 = active) and give it time to hold the object
            self._mycobot.set_basic_output(self._suction_pin, 0)
            time.sleep(2)
            # Lift back up to the original height
            self.move_to_z(current_pos[2], speed)
    
    Drop action (move to a named place):
        def move_to_place(self, place_name: str, speed: Optional[float] = None) -> None:
            print("[MyCobotController] Move to Place {}".format(place_name))
            self._current_position = self.positions[place_name]
            # Send the joint angles for the named place, with gravity compensation applied
            self._mycobot.sync_send_angles(
                np.array(self._current_position) + self.calc_gravity_compensation(self._current_position),
                speed or self._default_speed,
                self._command_timeout,
            )
            print("Current coords: {}".format(self.current_coords()))
    

    With each function implemented, the remaining work is to coordinate the whole process, streamline the workflow logic, and complete the task.
    The specific code can be viewed in the operator.py file.

    Example

    Below is an example run showing the project's outcome: the voice input "pick up the chocolate" is given, and the robotic arm carries out the task.
    https://youtu.be/Eda1m7DnIhQ

    Summary

    This project demonstrates how to leverage advanced artificial intelligence and robotics technologies to accomplish complex automation tasks. By integrating voice recognition, natural language processing, image analysis, and precise robotic arm control, the project has successfully created a robotic system capable of understanding and executing spoken instructions. This not only enhances the naturalness and efficiency of robot-human interaction but also opens up new possibilities for robotic technology in various practical applications, such as automated manufacturing, logistics, assistive robots, and more.
    Finally, thanks again to Shirokuma for sharing this case with us. If you have better examples, feel free to contact us!