👉 Paper 👉 Code

🏆 Future Research Award

2025 K-Data Science Conference (Research & Poster Presentation)


📌 Problem Statement

On Math-500, high accuracy is achieved mainly by large-scale LLMs; compact models remain underrepresented among strong performers, and LFM shows limited performance.

Recent advances in large language models (LLMs) have demonstrated strong performance in general language understanding; however, mathematical reasoning remains a challenging domain, particularly under resource-constrained settings.
Most existing approaches rely on large-scale models or supervised fine-tuning (SFT), which limits practical deployment and increases data dependency.

This project addresses the following research question:
Can mathematical reasoning ability be significantly improved using reinforcement learning alone, without supervised fine-tuning, on a compact language model?


💡 Proposed Solution

(1) Curriculum Learning

Training proceeds from easy problems to hard ones, which stabilizes the acquisition of reasoning skills and prevents early-stage training failure.
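As an illustration of easy-to-hard scheduling, the following is a minimal sketch, not the project's actual training code. The `Problem` class, its `difficulty` field, and the `curriculum_stages` helper are hypothetical names, and the sketch assumes each problem carries an integer difficulty label.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Problem:
    question: str
    answer: str
    difficulty: int  # hypothetical label, e.g. 1 (easiest) to 5 (hardest)


def curriculum_stages(problems: List[Problem]) -> Iterator[List[Problem]]:
    """Yield one list of problems per difficulty level, easiest level first."""
    levels = sorted({p.difficulty for p in problems})
    for level in levels:
        yield [p for p in problems if p.difficulty == level]


# Toy data for illustration only.
problems = [
    Problem("Solve x^2 = 4 for x > 0", "2", difficulty=3),
    Problem("1 + 1", "2", difficulty=1),
    Problem("Derivative of x^2 at x = 3", "6", difficulty=2),
]
stages = list(curriculum_stages(problems))
```

An RL loop would then train on `stages[0]` first and move to the next stage afterwards; when to switch stages (fixed step count, reward threshold, or something else) is a design choice this sketch leaves open.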

(2) KL-free Policy Optimization (ZeroGRPO)

Removing the KL-divergence penalty from the GRPO objective lets a compact model explore diverse reasoning trajectories freely, rather than being held close to its reference policy.
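To make the idea concrete, here is a simplified sketch of a GRPO-style loss with the KL term left out. It is an assumption-laden illustration rather than the ZeroGRPO implementation: it works on one log-probability per response instead of per token, the function names are hypothetical, and whether the original method keeps ratio clipping with these exact settings is not stated in the source.

```python
import math
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def kl_free_loss(
    logp_new: List[float],
    logp_old: List[float],
    advantages: List[float],
    clip_eps: float = 0.2,  # assumed PPO-style clipping range
) -> float:
    """Clipped surrogate loss averaged over responses, with no KL penalty.

    Standard GRPO would add beta * KL(policy || reference) to this loss;
    the KL-free variant sketched here simply omits that term.
    """
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio new / old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        total += min(ratio * adv, clipped * adv)
    return -total / len(advantages)


# Four sampled responses to one prompt; two were rewarded.
adv = group_advantages([1.0, 0.0, 0.0, 1.0])
loss = kl_free_loss([-1.0, -2.0, -1.5, -0.5], [-1.0, -2.0, -1.5, -0.5], adv)
```

Because no reference policy appears in the loss, no reference model needs to be kept in memory during training, which is a practical side benefit for resource-constrained setups.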

(3) Simple Reward Design with Reasoning-Length Penalty

The reward is deliberately minimal: it scores answer correctness and format validity, and applies a mild penalty to excessively long reasoning.
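A reward of this shape could look like the sketch below. Everything specific in it is an assumption for illustration: the `\boxed{...}` answer format, the weights (1.0 for correctness, 0.1 for format), the word-count proxy for reasoning length, and the penalty constants are hypothetical and are not taken from the paper.

```python
import re

# Assumed answer format: the final answer appears inside \boxed{...}.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def reward(
    response: str,
    gold: str,
    max_words: int = 512,             # assumed length budget
    penalty_per_word: float = 0.001,  # assumed mild penalty rate
    max_penalty: float = 0.5,         # cap so the penalty stays mild
) -> float:
    """Correctness + format reward minus a capped penalty on excess length."""
    match = BOXED.search(response)
    format_ok = match is not None
    correct = format_ok and match.group(1).strip() == gold.strip()

    score = 1.0 * correct + 0.1 * format_ok

    # Penalize only the words beyond the budget, using whitespace splitting
    # as a rough stand-in for token count.
    excess = max(0, len(response.split()) - max_words)
    score -= min(max_penalty, penalty_per_word * excess)
    return score


short = reward(r"2 + 2 = 4, so the answer is \boxed{4}", "4")
rambling = reward("hmm " * 700 + r"\boxed{4}", "4")
unformatted = reward("the answer is 4", "4")
```

Here `short` scores higher than `rambling` even though both are correct, and `unformatted` earns nothing because the answer cannot be extracted. Exact string matching is a simplification; a real checker would likely need to treat equivalent forms such as `1/2` and `0.5` as equal.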

Together, these three components form a unified reinforcement-learning framework that enables effective mathematical reasoning in compact language models without increasing model size or supervision cost.


🛠️ Technical Overview

  • Base Model:
    • DeepSeek-R1-Distill-Qwen-1.5B (text-only lightweight LLM)
  • Training Method:
    • Reinforcement learning using a modified GRPO framework
    • Removal of KL-divergence regularization to allow unconstrained policy exploration
  • Key Techniques:
    • Zero-KL policy optimization (ZeroGRPO)
    • Simple and stable reward design based on answer correctness and format validity, with a mild reasoning-length penalty
    • Curriculum learning strategy with progressively increasing difficulty
  • Evaluation Benchmark:
    • Math-500 benchmark covering multiple mathematical domains and difficulty levels


🎬 Results and Achievements

  • Substantial Performance Improvement:
    • Achieved up to 3.7× accuracy improvement over the base model on Math-500
    • Outperformed standard GRPO and penalty-based variants under identical settings
  • Efficiency and Practicality:
    • Demonstrated that strong mathematical reasoning can be achieved with a 1.5B-parameter model
    • No supervised fine-tuning required, reducing data and annotation costs
  • Research Contribution:
    • Validated that removing KL constraints is particularly beneficial for small models
    • Showed the effectiveness of curriculum learning in stabilizing RL-based reasoning training
  • Outcome:
    • Research results accepted for poster presentation at the 2025 K-Data Science Conference
    • Selected as a funded research project by the conference committee


SuperSmall-R1: A Lightweight Reinforcement Learning Model for Mathematical Reasoning
Jaegun Lee, Janghoon Choi
Journal of The Korea Society of Computer and Information (KCI-indexed), 2025

  • Introduces a reinforcement-learning-only training strategy for compact reasoning models
  • Proposes ZeroGRPO by removing KL-divergence constraints
  • Demonstrates significant performance gains on Math-500 under strict resource constraints
