We extend the Mistral architecture from 32 to 36 transformer blocks to increase learning capacity. Building on this permissively licensed, openly available design also avoids the restrictive licenses and data-contamination issues associated with other models.
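As a rough sketch of the depth extension (the non-depth values below are Mistral-7B defaults used here as assumptions, not confirmed Moxin hyperparameters):

```python
# Illustrative only: the depth change from 32 to 36 blocks.
# All other values are Mistral-7B defaults used as assumptions,
# not confirmed Moxin hyperparameters.
from dataclasses import dataclass, replace

@dataclass
class TransformerConfig:
    num_hidden_layers: int = 32     # Mistral-7B depth
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    sliding_window: int = 4096

mistral_cfg = TransformerConfig()
moxin_cfg = replace(mistral_cfg, num_hidden_layers=36)   # 32 -> 36 blocks
print(moxin_cfg)
```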
Sliding Window Attention (SWA) combined with a Rolling Buffer Cache lets the model support a 32K context length while reducing KV-cache memory by roughly 8x compared with a standard full-length cache.
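A minimal sketch of the rolling-buffer idea, not the actual implementation: the key/value cache is a fixed-size ring buffer of `window` slots, so decode memory scales with the window (e.g., 4,096 as in Mistral) rather than the full 32K sequence, which is where the ~8x saving comes from.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size ring buffer for one layer's keys/values (illustrative sketch).

    With sliding-window attention, a token only attends to the last `window`
    positions, so the cache never needs more than `window` slots:
    position i simply overwrites slot i % window.
    """

    def __init__(self, window: int, num_kv_heads: int, head_dim: int):
        self.window = window
        self.keys = np.zeros((window, num_kv_heads, head_dim), dtype=np.float32)
        self.values = np.zeros((window, num_kv_heads, head_dim), dtype=np.float32)

    def update(self, pos: int, k: np.ndarray, v: np.ndarray) -> None:
        slot = pos % self.window          # ring-buffer indexing
        self.keys[slot] = k
        self.values[slot] = v

# Memory is proportional to the window, not the 32K sequence: 32768 / 4096 = 8.
cache = RollingKVCache(window=4096, num_kv_heads=8, head_dim=128)
for pos in range(32_768):                 # 32K decode steps reuse the same buffer
    cache.update(pos, np.zeros((8, 128)), np.zeros((8, 128)))
```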
A Mixture-of-Experts (MoE) structure at the tokenizer level provides efficient support for languages beyond Latin script, including Chinese, Japanese, and Korean.
Data & Training
Our text data mixes SlimPajama (a cleaned and deduplicated version of RedPajama) with DCLM-BASELINE, which applies quality filters to retain only the top ~10% of web documents.
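A hedged sketch of that top-10% thresholding step (the random scores stand in for DCLM's learned quality classifier, a fastText-based scorer that is not reproduced here):

```python
import numpy as np

def keep_top_fraction(docs, scores, fraction=0.10):
    """Keep documents whose quality score falls in the top `fraction`.

    `scores` stands in for a learned quality classifier (DCLM uses a
    fastText-based scorer); the actual model is not reproduced here.
    """
    cutoff = np.quantile(scores, 1.0 - fraction)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]

# Toy usage: 1,000 fake documents with random quality scores.
docs = [f"doc-{i}" for i in range(1000)]
scores = np.random.rand(1000)
print(len(keep_top_fraction(docs, scores)))   # roughly 100 documents survive
```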
We incorporate the-stack-dedup dataset, which includes code from 358 programming languages. This not only enables code generation but also improves the model's overall logical reasoning.
The model is trained in three phases: an initial phase with a 2K context length, a second phase extending the context to 4K, and a final capability-enhancement phase that mixes in high-quality data drawn from evaluation benchmarks.
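As an illustration only, the staged schedule can be written as a simple phase list; the per-phase data mixtures are placeholders, with only the 2K-to-4K context growth and the final high-quality capability phase taken from the description above.

```python
# Placeholder schedule for illustration; the per-phase data mixtures are
# hypothetical, not published Moxin training details.
phases = [
    {"name": "phase_1", "context_length": 2048,
     "data": ["slimpajama", "the-stack-dedup"]},
    {"name": "phase_2", "context_length": 4096,
     "data": ["slimpajama", "dclm-baseline", "the-stack-dedup"]},
    {"name": "phase_3_capability", "context_length": 4096,
     "data": ["high-quality benchmark-style sets"]},
]

for phase in phases:
    print(f"{phase['name']}: ctx={phase['context_length']}, mix={phase['data']}")
```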
From Assistant to Reasoner
Step 1: Supervised Fine-Tuning (SFT)
The base model is first fine-tuned using the open-source Tülu 3 framework on a diverse data mixture to create Moxin-Instruct, a helpful and harmless AI assistant.
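Independent of the Tülu 3 tooling, the SFT step optimizes a standard objective; a minimal sketch, assuming prompt tokens are masked out of the loss with the usual -100 label convention:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard SFT objective: next-token cross-entropy on response tokens only.

    `labels` are the input ids with prompt positions set to -100, so the model
    is trained only to reproduce the assistant response.
    """
    shift_logits = logits[:, :-1, :].contiguous()   # position t predicts token t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Toy usage: batch of 2 sequences, length 8, vocabulary of 32 tokens.
logits = torch.randn(2, 8, 32)
labels = torch.randint(0, 32, (2, 8))
labels[:, :4] = -100                                # mask the "prompt" half
print(sft_loss(logits, labels))
```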
Step 2: Direct Preference Optimization (DPO)
The SFT model is further trained with DPO on a preference dataset, aligning it more closely with user intent and preferred response styles.
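The DPO objective is compact enough to sketch directly; this assumes per-sequence log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model, with beta as the usual DPO temperature:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO: prefer the chosen response over the rejected one, measured
    relative to a frozen reference model, with no explicit reward model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the margin difference, averaged over the batch.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up per-sequence log-probabilities for 4 preference pairs.
batch = 4
print(dpo_loss(torch.randn(batch), torch.randn(batch),
               torch.randn(batch), torch.randn(batch)))
```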
Step 3: Reinforcement Learning (GRPO)
To create Moxin-Reasoning, we apply Group Relative Policy Optimization (GRPO), the pure-RL method introduced by DeepSeek, to dramatically enhance Chain-of-Thought capabilities.
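A hedged sketch of GRPO's core idea: sample a group of responses per prompt, score each with a reward, and use the group-normalized reward as the advantage, so no separate critic model is needed. Only the advantage computation is shown; the clipped policy update and KL regularization are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each response's reward against the
    mean and standard deviation of its own group (samples for the same prompt)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled responses each, one scalar reward per response.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(rewards))
```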
Result: SOTA Reasoning in a 7B Model
The strong performance of Moxin-Reasoning demonstrates that advanced RL techniques can be highly effective even at the 7B scale, delivering results previously seen only in much larger models.