Our project aims to explore and analyze the task of a robot arm setting up a checkerboard within the RLBench environment. Our approach involves experimenting with existing algorithms in RLBench, tweaking them to improve their performance, and conducting a comparative study to understand the impact of these modifications. For this project, the input includes task specifications and environmental observations, such as RGB, depth, and segmentation masks, while the output is the successful completion of the checkerboard setup task.
To achieve the goals of our project, we added rewards that encourage the robot to prioritize correctly placed pieces and to move efficiently toward misplaced pieces. We also use an incremental reward system that continually rewards progress even before full task success is achieved.
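The sketch below illustrates the kind of shaped reward described above; the helper names (`piece_positions`, `target_positions`, `gripper_position`) and the specific coefficients are hypothetical placeholders for illustration, not our exact implementation.

```python
import numpy as np

# Illustrative sketch of a shaped reward for the checkerboard task.
# The argument names and weights are placeholders, not the actual RLBench API.
def shaped_reward(piece_positions, target_positions, gripper_position, placed_threshold=0.02):
    """Reward correctly placed pieces and progress toward the nearest misplaced one."""
    piece_positions = np.asarray(piece_positions)
    target_positions = np.asarray(target_positions)
    dists = np.linalg.norm(piece_positions - target_positions, axis=1)

    placed = dists < placed_threshold
    reward = 1.0 * placed.sum()                 # incremental credit for each placed piece

    if not placed.all():
        # Encourage moving the gripper toward the closest misplaced piece.
        misplaced = piece_positions[~placed]
        nearest = np.min(np.linalg.norm(misplaced - gripper_position, axis=1))
        reward += -0.1 * nearest                # small dense penalty that shrinks with distance
    else:
        reward += 10.0                          # bonus for full task success
    return reward
```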
We used Proximal Policy Optimization (PPO) to train our robot arm. By tweaking different training parameters of the PPO model and observing their influence on the trained model, we aimed to improve the accuracy and efficiency of the robot arm. Specifically, we mainly varied parameters such as learning_rate, batch_size, and clip_range and observed the resulting models.
To tackle the checkerboard setup task, we primarily use PPO as our reinforcement learning baseline. PPO is a model-free policy gradient method that improves sample efficiency by constraining policy updates using a clipped objective function. The loss function is formulated as follows:
$L(\theta) = \mathbb{E}_t \left[ \min\left( r_t(\theta) A_t,\ \text{clip}(r_t(\theta),\, 1 - \epsilon,\, 1 + \epsilon)\, A_t \right) \right]$
where $r_t(\theta)$ is the probability ratio between the new and old policies, $A_t$ is the advantage estimate, and $\epsilon$ is a small clipping constant (default 0.2). We set hyperparameters based on OpenAI’s default PPO settings and then adjusted them as necessary.
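As a concrete illustration of the objective above, the following PyTorch snippet computes the clipped surrogate loss from log-probabilities and advantages; it is a minimal sketch rather than the exact code used by our training library.

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate loss: the negative of the PPO objective shown above."""
    ratio = torch.exp(new_log_probs - old_log_probs)               # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()                   # minimize the negative objective
```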
Starting from OpenAI’s default PPO settings, we then trained the model with learning rates of 1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, and 1e-2 in order to find a better learning rate for our model. We also changed the batch size to 128, 256, or 512 to see its effects, and we tweaked the clip range and entropy coefficient for further improvements.
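A sweep of this kind could be run as sketched below, assuming the Stable-Baselines3 implementation of PPO and a hypothetical Gym-style wrapper (`make_checkerboard_env`) around the RLBench task; the value ranges mirror those listed above.

```python
from itertools import product
from stable_baselines3 import PPO

# Hypothetical Gym-style wrapper around the RLBench checkerboard task;
# the module and function names here are placeholders, not the exact code we ran.
from rlbench_checkerboard_env import make_checkerboard_env

learning_rates = [1e-5, 3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
batch_sizes = [128, 256, 512]

for lr, bs in product(learning_rates, batch_sizes):
    env = make_checkerboard_env()
    model = PPO(
        "MultiInputPolicy",      # dict observations: RGB, depth, segmentation masks
        env,
        learning_rate=lr,
        batch_size=bs,
        clip_range=0.2,          # also swept in separate runs
        ent_coef=0.0,
        verbose=0,
    )
    model.learn(total_timesteps=500_000)
    model.save(f"ppo_checkerboard_lr{lr}_bs{bs}")
```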
Policy Architecture: Multi-Input Policy
Why Multi-Input?
Our environment provides diverse inputs: RGB, depth, and segmentation masks. A single-stream policy struggles to efficiently process heterogeneous data. The multi-input policy enables specialized processing for each modality, improving feature extraction.
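For illustration, the dictionary observation space that a multi-input policy consumes might look like the following, assuming the Gymnasium spaces API; the 128×128 resolution is an assumption, not our exact camera setting.

```python
import numpy as np
from gymnasium import spaces

# Illustrative dict observation space for the multi-input policy.
observation_space = spaces.Dict({
    "rgb":   spaces.Box(low=0,   high=255,    shape=(128, 128, 3), dtype=np.uint8),
    "depth": spaces.Box(low=0.0, high=np.inf, shape=(128, 128, 1), dtype=np.float32),
    "mask":  spaces.Box(low=0,   high=255,    shape=(128, 128, 1), dtype=np.uint8),
})
# Stable-Baselines3's "MultiInputPolicy" routes each key through its own
# feature extractor (CNNs for image inputs) and concatenates the resulting features.
```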
Implementation Choice
To impose a hierarchical structure, we employ Hierarchical Reinforcement Learning (HRL), breaking the task down into sub-goals such as:
- Grasping a piece
- Moving it to the correct location
- Placing the piece accurately

Each sub-goal is managed by a lower-level policy, while a high-level policy orchestrates overall execution. To improve learning efficiency, we bootstrap training with Imitation Learning (IL) by collecting expert demonstrations and using Behavior Cloning (BC) to pre-train the agent before transitioning to reinforcement learning.
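A minimal sketch of the behavior-cloning warm start is shown below; `demo_dataset` and the continuous-action regression loss are assumptions made for illustration, not our exact pipeline.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

# Sketch of the behavior-cloning warm start; demo_dataset is a hypothetical
# torch Dataset of (observation_features, expert_action) pairs collected from
# RLBench's scripted demonstrations.
def pretrain_with_bc(policy_net: nn.Module, demo_dataset, epochs=10, lr=1e-4):
    loader = DataLoader(demo_dataset, batch_size=64, shuffle=True)
    optimizer = torch.optim.Adam(policy_net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()                       # regression onto continuous arm actions
    for _ in range(epochs):
        for obs, expert_action in loader:
            pred_action = policy_net(obs)
            loss = loss_fn(pred_action, expert_action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy_net                            # then hand off to PPO fine-tuning
```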
We evaluate our models over 500,000 training steps, analyzing performance in terms of task success rate, execution time, and reward accumulation. Our experiments involve ablations, such as removing hierarchical structures or imitation learning, to measure their individual contributions.
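These metrics could be collected with an evaluation loop along the following lines; the Gymnasium-style step API and the `success` key in `info` are assumptions for illustration.

```python
import time
import numpy as np

# Sketch of an evaluation loop; `env` is assumed to be a Gymnasium-style wrapper
# of the checkerboard task and `model` a trained Stable-Baselines3 PPO agent.
def evaluate(model, env, n_episodes=20):
    successes, durations, returns = [], [], []
    for _ in range(n_episodes):
        obs, _ = env.reset()
        done, ep_return, start = False, 0.0, time.time()
        while not done:
            action, _ = model.predict(obs, deterministic=True)
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            ep_return += reward
        successes.append(bool(info.get("success", False)))   # task-success flag (assumed key)
        durations.append(time.time() - start)
        returns.append(ep_return)
    return {
        "success_rate": np.mean(successes),
        "mean_execution_time": np.mean(durations),
        "mean_return": np.mean(returns),
    }
```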
Our project aims to optimize PPO-based reinforcement learning for a robotic arm setting up a checkerboard using RLBench. We conducted hyperparameter tuning and analyzed training performance through various metrics, including mean episode reward, training stability, and hyperparameter sensitivity.
We experimented with learning rate, clip range, and entropy coefficient to assess their impact on performance. The top 18 configurations ranked by mean reward are summarized in the table below:
From the top 18 hyperparameter configurations, we observed:
Above are the graphs of the training progress at different steps. From the training progress graphs, we observed:
From the above bar charts and heatmaps, we learned:
Above are the meshgrid plots for the different parameter interactions. These plots further highlight the interactions between hyperparameters:
Our project focuses on using RLBench to train a robot arm to set up a checkerboard. Throughout the development process, we encountered several challenges, made key observations, and refined our approach based on qualitative insights.
The first major hurdle was setting up the environment on HPC3. We lacked sudo privileges, so we could not follow the official installation guide. Due to system constraints, we also had to use the Ubuntu 18.04 version instead of 20.04, as the 20.04 dependencies of RLBench required GLIBC 2.29, which was not available on Rocky 8.10 (limited to GLIBC 2.28). Additionally, the remote machine had cloning restrictions, forcing us to manually download and install dependencies. Fortunately, we ultimately succeeded with guidance from the TA.
To assist future users, we documented the setup process in “RLBench Setup for HPC3.md”, which serves as a detailed guide to configuring RLBench on HPC3 efficiently. Once the environment was fully operational, we successfully ran a test video, validating our setup and marking our first tangible success. The following picture is the screenshot of the RLBench environment test.
Screenshot of the output of RLBench environment test
We selected Proximal Policy Optimization (PPO) as our reinforcement learning model. During training, we experimented with various hyperparameters, observing their effects on performance. While we expected PPO to gradually refine its actions, early results were inconsistent, with the robot arm struggling to complete the task reliably.
Key qualitative observations included:
One of the main challenges was achieving stable and meaningful learning progress. Despite multiple training attempts, we have yet to develop a fully functional model that reliably completes the task. However, this process has provided valuable lessons:
Hyperparameter tuning is non-trivial, and small adjustments can lead to drastically different learning behaviors. Reinforcement learning for robotic control is highly sensitive to reward shaping and environment design, emphasizing the need for careful engineering of learning conditions.
To further enhance model performance, we plan to:
The following image is a screenshot from our training video.
Source code:
Related source:
We used ChatGPT to ask questions, debug, and polish documents, and we also used OpenAI’s default PPO settings for our PPO training model.