Reinforcement Learning (RL) has demonstrated remarkable potential in achieving superhuman performance across diverse domains, from agents conquering Atari games to DeepMind's AlphaGo mastering the ancient game of Go, self-driving vehicles, and robots learning complex physical tasks. In these settings, RL allows agents to develop never-before-seen strategies and approach tasks in novel ways.
With RL, agents are not limited to mimicking human behaviour through supervised learning, but are instead able to explore and generate their own data, removing the ceiling of human performance. This was observed again recently when RL was used to finetune DeepSeek's R1 large language model, resulting in the emergence of powerful reasoning behaviour and markedly improved performance.
At AgileRL, we believe that reinforcement learning will not only play a key role in the path towards artificial general intelligence but also that it has the potential to provide immense practical value in solving real-world business challenges. The ability of RL agents to learn optimal behaviours through trial-and-error interaction with their environment, combined with their capacity to generalize and adapt, makes them uniquely suited for tackling complex decision-making problems across industries. Additionally, RL in large language model finetuning will allow businesses to finetune small and relatively basic LLMs into powerful, specialized agents, which will be cheaper than the general models provided by big players in the market.
In this article, we delve into the implementation of GRPO in the AgileRL framework, which combines powerful evolutionary hyperparameter optimization with the DeepSeek algorithm to produce an expert reasoning agent, with models as small as 3B parameters. To demonstrate this capability, we utilize a Countdown-inspired dataset that emulates the mathematical challenge from the popular television show. In this task, participants are presented with 3-4 numbers and must craft an equation using basic arithmetic operations to reach a target number.
GRPO - Group Relative Policy Optimization
GRPO (Group Relative Policy Optimization) is an elegant simplification of PPO (Proximal Policy Optimization) that makes reinforcement learning more computationally efficient, especially for large language models.
The two key innovations are:
- Eliminating the critic network: Instead of training a separate value function to estimate expected rewards (which requires additional compute and memory), GRPO normalizes rewards across a batch of samples. It calculates advantage by subtracting the mean reward from each sample's reward and dividing by the standard deviation.
- Group-based evaluation: GRPO generates multiple outputs using the same policy, evaluates them as a group, and then updates the model. This approach reduces variance in the training signal by smoothing out the randomness inherent in probabilistic environments.
These changes are particularly valuable for LLM training because they reduce computational overhead by removing the need for a separate critic model, provide more stable gradient updates in environments with sparse or noisy rewards, and simplify implementation while maintaining or improving performance.
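To make the advantage calculation concrete, here is a minimal, illustrative sketch of group-relative advantages (not the AgileRL implementation; the function name and epsilon term are our own):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Standardize each completion's reward against its group: no learned
    # critic is needed to provide a baseline.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 completions sampled for the same prompt, each rewarded 0, 1 or 2
rewards = torch.tensor([2.0, 0.0, 1.0, 0.0, 2.0, 1.0, 0.0, 0.0])
advantages = group_relative_advantages(rewards)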
Evolutionary Hyperparameter Optimization
Traditionally, hyperparameter optimization (HPO) for reinforcement learning (RL) is particularly difficult when compared to other types of machine learning. This is for several reasons, including the relative sample inefficiency of RL and its sensitivity to hyperparameters.
Evolutionary algorithms have been shown to allow faster, automatic convergence to optimal hyperparameters than other HPO methods. At regular intervals, after learning from experiences, a population of agents can be evaluated. Through tournament selection, the best agents are selected to survive until the next generation, and their offspring are mutated to further explore the hyperparameter space. Eventually, the optimal hyperparameters for learning in a given environment can be reached in significantly fewer steps than are required using other HPO methods.
Tournament selection
Tournament selection is used to select the agents from a population that will make up the next generation of agents. The best agent from a population is automatically preserved and becomes a member of the next generation. Then, for each tournament, k individuals are randomly chosen, and the agent with the best evaluation fitness is preserved. This is repeated until the population for the next generation is full.
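As a simplified sketch of this procedure (illustrative only; the function below is our own and not the AgileRL TournamentSelection implementation), one generation could be formed like this:

import random

def next_generation(population, fitnesses, k=2):
    # Elitism: the best agent always survives unchanged
    elite_idx = max(range(len(population)), key=lambda i: fitnesses[i])
    new_pop = [population[elite_idx]]
    # Fill the remaining slots with winners of k-way tournaments
    while len(new_pop) < len(population):
        contenders = random.sample(range(len(population)), k)
        winner = max(contenders, key=lambda i: fitnesses[i])
        new_pop.append(population[winner])
    return new_pop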
Mutations
Mutation is periodically used to explore the hyperparameter space, allowing different hyperparameter combinations to be trialled during training. If certain hyperparameters prove relatively beneficial to training, then that agent is more likely to be preserved in the next generation, and so those characteristics are more likely to remain in the population.
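A minimal sketch of what a single RL hyperparameter mutation could look like (illustrative only; the helper below is hypothetical, while the real work is done by the Mutations class used later), scaling a value by a Gaussian factor and clipping it to its search space:

import random

def mutate_hyperparameter(value, min_val, max_val, mutation_sd=0.1):
    # Perturb the value by a factor drawn around 1.0, then clip it back
    # into the allowed search space
    mutated = value * random.gauss(1.0, mutation_sd)
    return min(max(mutated, min_val), max_val)

# e.g. mutate a learning rate of 5e-6 within the bounds used in this article
new_lr = mutate_hyperparameter(5e-6, min_val=1e-7, max_val=1e-5)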
In this article, we train a population of four 3B-parameter Qwen2.5 models using GRPO with evolutionary hyperparameter optimization (HPO) and compare their performance with that of a single 3B Qwen2.5 model trained without HPO.
Code
In this section, we will outline how we used the AgileRL framework to train a population of LLMs using GRPO and evo-HPO, taking you through each step of the process.
Dependencies
import re
from typing import Tuple

import torch
import yaml
from accelerate import Accelerator
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from torch.utils.data import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from agilerl.algorithms.core.registry import HyperparameterConfig, RLParameter
from agilerl.hpo.mutation import Mutations
from agilerl.hpo.tournament import TournamentSelection
from agilerl.training.train_llm import finetune_llm
from agilerl.utils.llm_utils import HuggingFaceGym
from agilerl.utils.utils import create_population
Defining Hyperparameters
Before we commence training, it's easiest to define all of our hyperparameters in one place. Below are the initial hyperparameters and mutation parameters that we used for this experiment. In the mutation parameters dictionary, we determine which mutations we want to happen, to what extent we want them to occur, and which RL hyperparameters we want to tune, along with the upper and lower limits that define their search spaces. It is worth noting that, unlike the rest of the AgileRL framework, we can only tune the RL hyperparameters and not the architecture hyperparameters.
mutation_params = {
    "NO_MUT": 0.1,  # No mutation
    "RL_HP_MUT": 0.6,  # Learning HP mutation
    "MUT_SD": 0.1,  # Mutation strength
    "RAND_SEED": 42,  # Random seed
    "MIN_LR": 0.0000001,  # Minimum learning rate
    "MAX_LR": 0.00001,  # Maximum learning rate
    "MIN_BETA": 0.0001,  # Minimum beta
    "MAX_BETA": 0.01,  # Maximum beta
    "MIN_GROUP_SIZE": 4,  # Minimum group size
    "MAX_GROUP_SIZE": 12,  # Maximum group size
}

initial_hyperparams = {
    "ALGO": "GRPO",  # Algorithm name
    "BATCH_SIZE_PER_GPU": 4,  # Batch size per GPU
    "REDUCE_MEMORY_PEAK": True,  # Flag to reduce memory peak
    "BETA": 0.001,  # KL penalty coefficient
    "LR": 0.000005,  # Learning rate
    "CLIP_COEF": 0.2,  # Clipping coefficient
    "MAX_GRAD_NORM": 0.1,  # Maximum grad norm
    "UPDATE_EPOCHS": 1,  # Update epochs
    "GROUP_SIZE": 8,  # Group size
    "TEMPERATURE": 0.9,  # Temperature
    "MIN_OUTPUT_TOKENS": None,  # Minimum output tokens
    "MAX_OUTPUT_TOKENS": 1024,  # Maximum output tokens
    "TOURN_SIZE": 2,  # Tournament size
    "ELITISM": True,  # Use elitism
    "POP_SIZE": 4,  # Population size
    "EVAL_LOOP": 1,  # Number of evaluation loops
}
Defining our base model and dataset
In this tutorial, we use the open-source transformers and datasets libraries from Hugging Face to download our pretrained model weights and training data. There are a huge number of models and datasets hosted on Hugging Face, and different ones can easily be substituted in.
To keep things simple, we will use a 3-billion-parameter Qwen model and the Countdown dataset, initialised as follows:
def create_model(pretrained_model_name_or_path):
    model = AutoModelForCausalLM.from_pretrained(
        pretrained_model_name_or_path=pretrained_model_name_or_path,
        torch_dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="cpu",
    )
    peft_config = LoraConfig(
        r=16,
        lora_alpha=64,
        target_modules=[
            "q_proj",
            "k_proj",
            "v_proj",
            "o_proj",
            "up_proj",
            "down_proj",
            "gate_proj",
        ],
        task_type="CAUSAL_LM",
        lora_dropout=0.05,
    )
    model = get_peft_model(model, peft_config)
    return model


def make_dataset(dataset_name):
    raw_dataset = (
        load_dataset(dataset_name, split="train").shuffle(seed=42).select(range(50000))
    )
    raw_dataset = raw_dataset.rename_column("target", "answer")
    raw_dataset = raw_dataset.rename_column("nums", "question")
    train_test_split = raw_dataset.train_test_split(test_size=0.2)
    train_dataset = train_test_split["train"]
    test_dataset = train_test_split["test"]
    return train_dataset, test_dataset


model = create_model(pretrained_model_name_or_path="Qwen/Qwen2.5-3B")
train_dataset, test_dataset = make_dataset("Jiayi-Pan/Countdown-Tasks-3to4")

# Create the tokenizer and set the pad token equal to the eos token
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
tokenizer.pad_token = tokenizer.eos_token
Create the Reasoning Environment
From model to agent: In reinforcement learning, models are called agents. This is because they are trained by taking actions, receiving rewards, and learning from this feedback. This enables them to become very good at taking actions to solve tasks - to develop agency. Since we are training our model with reinforcement learning, it becomes an agent through this process.
We must create a reinforcement learning environment in which our agent can explore possible solutions and learn to optimise rewards. AgileRL provides a HuggingFaceGym class that wraps a Hugging Face dataset and converts it into a gymnasium-style reinforcement learning environment.
So, how does the environment know how to reward an agent for its outputs? We must define a reward function that the agent learns to optimise. Following the techniques used in the DeepSeek reasoning paper, we will define our reward function as the sum of two rewards:
- Accuracy rewards: Verifying answers against ground truth. In this tutorial, we will reward the model +1 if the final answer it produces is correct, otherwise 0.
- Format rewards: Encouraging structured reasoning with explicit steps. In this tutorial, we will reward the model +1 if it puts its thinking process between '<think>' and '</think>' tags, otherwise 0.
Therefore, the maximum score an agent can receive is 2, by producing the correct answer in the correct format. The key here is that we never tell the agent which answer it should produce or which format it should use. By giving it rewards for displaying these behaviours, the agent itself discovers the best way to achieve high rewards and learns the behaviour we desire.
def format_reward_func(completions, target, **kwargs):
    rewards = []
    for completion, gt in zip(completions, target):
        try:
            # Prepend a synthetic <think> tag, as it is already part of the prompt
            # and prefilled for the assistant, so the regex matches more easily
            completion = "<think>" + completion
            regex = r"^<think>([^<]*(?:<(?!/?think>)[^<]*)*)<\/think>\n<answer>([\s\S]*?)<\/answer>$"
            match = re.search(regex, completion, re.DOTALL)
            if match is None or len(match.groups()) != 2:
                rewards.append(0.0)
            else:
                rewards.append(1.0)
        except Exception:
            rewards.append(0.0)
    return rewards


def equation_reward_func(completions, target, nums, **kwargs):
    rewards = []
    for completion, gt, numbers in zip(completions, target, nums):
        try:
            # Prepend a synthetic <think> tag, as it is already part of the prompt
            # and prefilled for the assistant, so the regex matches more easily
            completion = "<think>" + completion
            answer_tags = re.findall(r"<answer>([\s\S]*?)<\/answer>", completion)

            if len(answer_tags) != 1:
                rewards.append(0.0)
                continue

            equation = answer_tags[0].strip()
            used_numbers = [int(n) for n in re.findall(r"\d+", equation)]

            # Every provided number must be used exactly once
            if sorted(used_numbers) != sorted(numbers.flatten().tolist()):
                rewards.append(0.0)
                continue

            # Only digits, arithmetic operators, parentheses and whitespace are allowed
            allowed_pattern = r"^[\d+\-*/().\s]+$"
            if not re.match(allowed_pattern, equation):
                rewards.append(0.0)
                continue

            result = eval(equation, {"__builtins__": None}, {})

            if abs(float(result) - float(gt)) < 1e-5:
                rewards.append(1.0)
            else:
                rewards.append(0.0)
        except Exception:
            rewards.append(0.0)
    return rewards


def combined_rewards(completion, solution, prompt):
    reward = (
        equation_reward_func([completion], [solution], [prompt])[0]
        + format_reward_func([completion], [solution])[0]
    )

    # Log fully correct completions for later inspection
    if reward == 2.0:
        with open("countdown_completions.txt", "a") as text_file:
            text_file.write(
                f"Prompt {prompt}" + "\n" + completion + "\n" + "=" * 50 + "\n"
            )

    return reward
Now that we have defined our reward functions, we must also design our prompt. This forms the input given to the agent and provides the context necessary to complete the task. This is a task-specific feature, and different reasoning problems will require different chat templates, although they can follow a similar
format. We must also define a function to collate our questions and answers, and standardise their length. Combining all these components, we can now initialise the HuggingFaceGym object.
def countdown_chat_template(q, a, tokenizer):
    conversation = [
        {
            "role": "system",
            "content": "You are a helpful assistant. You first think about the reasoning process in your mind and then provide the user with the answer.",
        },
        {
            "role": "user",
            "content": f"Using each number in this tensor only once {tuple(i.item() for i in q)}, create an equation that equals {a.item()}. You can use basic arithmetic operations (+, -, *, /) and each number can only be used once. Show your work in <think> </think> tags. And return the final equation and answer in <answer> </answer> tags, for example <answer>(1 + 2) / 3</answer>.",
        },
        {"role": "assistant", "content": "Let me solve this step by step.\n<think>"},
    ]
    updated_prompt = tokenizer.apply_chat_template(
        conversation, tokenize=False, continue_final_message=True
    )
    tokenized_prompt = tokenizer(
        [updated_prompt],
        return_tensors="pt",
        padding=True,
        padding_side="left",
        return_attention_mask=True,
    )
    return tokenized_prompt


def custom_collate_fn(batch):
    # Extract answers and questions
    answers = torch.tensor([item["answer"] for item in batch])

    # For questions of variable length, we need to pad them
    max_len = max(len(item["question"]) for item in batch)
    questions = torch.zeros(len(batch), max_len, dtype=torch.long)
    for i, item in enumerate(batch):
        q_len = len(item["question"])
        questions[i, :q_len] = torch.tensor(item["question"])
    return {"answer": answers, "question": questions}


accelerator = Accelerator()
env = HuggingFaceGym(
    train_dataset=train_dataset,
    test_dataset=test_dataset,
    tokenizer=tokenizer,
    reward_fn=combined_rewards,
    apply_chat_template_fn=countdown_chat_template,
    data_batch_size_per_gpu=2,
    custom_collate_fn=custom_collate_fn,
    accelerator=accelerator,
)
Create a population of GRPO Agents
To allow our model to become an agent and learn through reinforcement learning, we can use the GRPO
class. This class follows the same structure as the other reinforcement learning algorithms in the AgileRL library.
We also define the RL hyperparameters that we want to tune in this training run. For this experiment we selected three: beta (the KL penalty coefficient), learning rate, and group size.
hp_config = HyperparameterConfig(
    beta=RLParameter(min=mutation_params["MIN_BETA"], max=mutation_params["MAX_BETA"]),
    lr=RLParameter(min=mutation_params["MIN_LR"], max=mutation_params["MAX_LR"]),
    group_size=RLParameter(
        min=mutation_params["MIN_GROUP_SIZE"], max=mutation_params["MAX_GROUP_SIZE"], dtype=int
    ),
)

pop = create_population(
    algo=initial_hyperparams["ALGO"],
    observation_space=env.observation_space,
    action_space=env.action_space,
    actor_network=model,
    net_config=None,
    INIT_HP=initial_hyperparams,
    hp_config=hp_config,
    population_size=initial_hyperparams["POP_SIZE"],
    accelerator=accelerator,
)
Creating Tournament Selection and Mutations objects
As discussed in the Tournament Selection section, tournament selection is used to select the agents from a population that will make up the next generation of agents. If elitism is used, the best agent from a population is automatically preserved and becomes a member of the next generation.
The class TournamentSelection
defines the functions required for tournament selection. TournamentSelection.select()
returns the best agent and the new generation of agents.
tournament = TournamentSelection(
    initial_hyperparams["TOURN_SIZE"],
    initial_hyperparams["ELITISM"],
    initial_hyperparams["POP_SIZE"],
    initial_hyperparams["EVAL_LOOP"],
)
The Mutations class is used to mutate agents with pre-set probabilities. The available mutations for GRPO currently implemented are:
- No mutation
- RL algorithm mutation - mutation of learning hyperparameters, such as learning rate or batch size.
Mutations.mutation() returns a mutated population. Tournament selection and mutation should be applied sequentially to fully evolve a population between evaluation and learning cycles.
mutations = Mutations(
    no_mutation=mutation_params["NO_MUT"],
    architecture=0,
    new_layer_prob=0,
    parameters=0,
    activation=0,
    rl_hp=mutation_params["RL_HP_MUT"],
    mutation_sd=mutation_params["MUT_SD"],
    rand_seed=mutation_params["RAND_SEED"],
    accelerator=accelerator,
)
Training and Saving an Agent
The simplest way to train an AgileRL agent is to use the finetune_llm
function.
finetune_llm(
    pop=pop,
    env=env,
    init_hp=initial_hyperparams,
    evaluation_interval=10,
    wb=True,
    save_elite=True,
    elite_path="path/to/model/directory",
    max_reward=2.0,
    evo_steps=10,
    mutation=mutations,
    tournament=tournament,
    accelerator=accelerator,
)
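Once training finishes with save_elite=True, the best agent is written to elite_path. As a hedged sketch (assuming the elite checkpoint is saved in standard Hugging Face format, which may vary between AgileRL versions), it could be loaded back for inference along these lines:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# "path/to/model/directory" is the elite_path used during training
model = AutoModelForCausalLM.from_pretrained(
    "path/to/model/directory", torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B")
tokenizer.pad_token = tokenizer.eos_token

prompt = (
    "Using each number in this tensor only once (26, 63, 96, 47), create an equation "
    "that equals 54. Show your work in <think> </think> tags."
)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))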
Configuring Accelerate and DeepSpeed
An important part of training an LLM to display reasoning behaviour is distributed training. They are called Large Language Models for a reason, and are often too large to train on a single GPU; for larger, more powerful models, single-GPU training becomes even less feasible.
Instead, we can leverage distributed training to share the workload across multiple devices and speed up training. To enable distributed training in this tutorial, we use DeepSpeed and Accelerate.
To generate an accelerate config file, run the command accelerate config in your terminal and follow the on-screen instructions to describe the compute you intend to use for your finetuning, answering yes to the question "Do you want to use DeepSpeed?" and no to the question "Do you want to specify a json file to a DeepSpeed config?" if you want an auto-generated DeepSpeed config file. More information on the DeepSpeed configuration can be found in their docs. The accelerate config handles the details of the distribution, and the GRPO class handles how the accelerator is used during training. You can then launch a training run using accelerate with the following command:
accelerate launch path/to/training_script
Alternatively, you can avoid accelerate config by defining your own accelerate-deepspeed config file and passing it as an argument to accelerate launch:
accelerate launch --config_file path/to/accelerate-deepspeed-config.yaml path/to/training_script
Example config file:
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  gradient_accumulation_steps: 2
  gradient_clipping: 1.0
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: false
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: no
enable_cpu_affinity: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 1
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
Results
For our benchmarks, we did two training runs: one with a population of four GRPO agents with hyperparameter optimization (HPO), and one with a single GRPO agent without HPO.
To compare the performance of these two runs, we used three metrics: reward, accuracy, and completion length. The reward indicates the raw reward obtained from a given response, the accuracy is the percentage of responses that resulted in the maximum reward, and the completion length is the number of tokens used in a given response.
The benchmarks below show that the population-based training with evo-HPO resulted in significantly larger rewards and accuracies compared to the single-agent training run without it. Furthermore, the population-based evo-HPO approach also led to far greater reasoning capabilities, with completion lengths at the end of training converging to 800 tokens compared to 200 tokens for the single-agent case.
Exploration vs exploitation
The HPO graphs indicate that higher learning rates were favoured in the early stages of training, allowing the agents to update parameters more aggressively and explore the solution space more effectively, resulting in higher scores and accuracies. Towards the end of the training run, smaller learning rates were favoured as exploitation was prioritized over exploration.
Evo-HPO as a regularization method
The dramatic difference in completion length (200 tokens vs 800 tokens) is an interesting empirical observation suggesting that evo-HPO acts as a regularizer, preventing overfitting on early reward signals, which, in the single-agent case, led to convergence to a local optimum at the expense of reasoning depth. The population of agents was able to explore the hyperparameter space effectively and avoid falling into the same local optimum as the single agent, allowing agents to reason through problems completely. The population diversity that emerged from evo-HPO prevented both early convergence to sub-optimal solutions and the maximization of reward metrics without satisfying the underlying intent (reward hacking).
Below we can see a comparison in the solutions to the same question of the elite agent from the population vs the single-agent at the end of training. The elite agent demonstrates a systematic approach, showing its step-by-step reasoning process and verifying each attempt, while the single agent provides a minimal explanation before jumping to the answer. This difference in reasoning depth and thoroughness highlights how the population-based approach with evo-HPO encourages more comprehensive problem-solving strategies.
<think>First, let's see if we can get close to 54 by combining two numbers.</think>
<answer>(96 - 63) + 47 - 26</answer>
Without HPO (above) vs with HPO (below).
<think>First, we can try 26 + 63 = 89, then 89 - 96 = -9, and finally -9 + 47 = 38. That's not 54. Let's try 26 + 96 = 122, then 122 - 63 = 59, and finally 59 - 47 = 12. That's not 54. Let's try 26 + 47 = 73, then 73 - 63 = 10, and finally 10 + 96 = 106. That's not 54. Let's try 63 + 96 = 159, then 159 - 26 = 133, and finally 133 - 47 = 86. That's not 54. Let's try 63 + 47 = 110, then 110 - 26 = 84, and finally 84 - 96 = -12. That's not 54. Let's try 96 + 47 = 143, then 143 - 26 = 117, and finally 117 - 63 = 54.</think>
<answer>(96 + 47) - 26 - 63</answer>
Conclusion
Our experiments with GRPO and evolutionary hyperparameter optimization have demonstrated several key findings in the context of LLM finetuning for reasoning tasks:
1. Superior Performance: The population-based approach with HPO consistently outperformed single-agent training, achieving higher rewards and accuracy. This was particularly evident in the early stages of training, where the dynamic adaptation of learning rates enabled more effective exploration of the solution space.
2. Enhanced Reasoning Depth: The population-based approach maintained significantly deeper reasoning processes (800 tokens) compared to the single-agent approach (200 tokens), demonstrating that evo-HPO acts as an effective regularization method. This prevented premature convergence to sub-optimal solutions and encouraged more thorough problem-solving approaches.
3. Adaptive Learning Dynamics: The evolutionary approach showed remarkable adaptability in balancing exploration and exploitation. Higher learning rates early in training facilitated better exploration, while gradually decreasing rates later enabled more focused exploitation of successful strategies.
4. Efficient Model Scaling: The combination of GRPO with evo-HPO demonstrates that smaller language models can achieve sophisticated reasoning capabilities when properly optimized. This suggests a promising path toward creating efficient, specialized language models that can be trained and deployed with fewer computational resources.
To obtain these results, we used an x4 A100 GPU instance. This experiment is a prototype for what we intend to bring to Arena, where we can leverage our reinforcement learning engine, built on top of our open-source framework, to further distribute and speed up the training process. These results validate our approach of using evolutionary hyperparameter optimization to enhance the performance of smaller language models through reinforcement learning.
What's next for AgileRL LLM finetuning?
This experiment has shown that hyperparameters significantly impact the training process when it comes to LLM finetuning with RL, and we believe that HPO could play a role in superior LLM performance. Over the coming weeks and months, we look to introduce LLM finetuning to Arena - our platform for all things RL. Our RL engine will make the process more distributed, allowing us to significantly reduce the time and cost to tune hyperparameters for LLM-based agents. We also look to enable creation of RL datasets from business data, comprehensive LLM finetuning experiment tracking and monitoring for team collaboration, deployment with built-in guardrails and monitoring, and the ability to retrain models on real-world deployment data. These features will streamline the entire LLM finetuning workflow from data preparation through to production deployment, allowing users to create expert LLMs using much smaller models than some of the existing solutions offered.