We extend the Mistral architecture from 32 to 36 transformer blocks to increase learning capacity. Building on this permissively licensed, openly available design also avoids the restrictive licenses and data-contamination issues associated with other models.
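As a rough sketch of the depth extension (the non-depth values below are Mistral-7B defaults used here as assumptions, not confirmed Moxin hyperparameters):

```python
# Illustrative only: the depth change from 32 to 36 blocks.
# All other values are Mistral-7B defaults used as assumptions,
# not confirmed Moxin hyperparameters.
from dataclasses import dataclass, replace

@dataclass
class TransformerConfig:
    num_hidden_layers: int = 32     # Mistral-7B depth
    hidden_size: int = 4096
    num_attention_heads: int = 32
    num_key_value_heads: int = 8
    sliding_window: int = 4096

mistral_cfg = TransformerConfig()
moxin_cfg = replace(mistral_cfg, num_hidden_layers=36)   # 32 -> 36 blocks
print(moxin_cfg)
```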
Sliding Window Attention (SWA) combined with a Rolling Buffer Cache lets the model support a 32K context length while reducing KV-cache memory by roughly 8x compared with a standard full-length cache.
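A minimal sketch of the rolling-buffer idea, not the actual implementation: the key/value cache is a fixed-size ring buffer of `window` slots, so decode memory scales with the window (e.g., 4,096 as in Mistral) rather than the full 32K sequence, which is where the ~8x saving comes from.

```python
import numpy as np

class RollingKVCache:
    """Fixed-size ring buffer for one layer's keys/values (illustrative sketch).

    With sliding-window attention, a token only attends to the last `window`
    positions, so the cache never needs more than `window` slots:
    position i simply overwrites slot i % window.
    """

    def __init__(self, window: int, num_kv_heads: int, head_dim: int):
        self.window = window
        self.keys = np.zeros((window, num_kv_heads, head_dim), dtype=np.float32)
        self.values = np.zeros((window, num_kv_heads, head_dim), dtype=np.float32)

    def update(self, pos: int, k: np.ndarray, v: np.ndarray) -> None:
        slot = pos % self.window          # ring-buffer indexing
        self.keys[slot] = k
        self.values[slot] = v

# Memory is proportional to the window, not the 32K sequence: 32768 / 4096 = 8.
cache = RollingKVCache(window=4096, num_kv_heads=8, head_dim=128)
for pos in range(32_768):                 # 32K decode steps reuse the same buffer
    cache.update(pos, np.zeros((8, 128)), np.zeros((8, 128)))
```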
A Mixture-of-Experts (MoE) structure at the tokenizer level provides efficient support for languages beyond Latin script, including Chinese, Japanese, and Korean.
Data & Training
Our text data mixes SlimPajama (a cleaned and deduplicated version of RedPajama) with DCLM-BASELINE, which applies quality filters to retain only the top ~10% of web documents.
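A hedged sketch of that top-10% thresholding step (the random scores stand in for DCLM's learned quality classifier, a fastText-based scorer that is not reproduced here):

```python
import numpy as np

def keep_top_fraction(docs, scores, fraction=0.10):
    """Keep documents whose quality score falls in the top `fraction`.

    `scores` stands in for a learned quality classifier (DCLM uses a
    fastText-based scorer); the actual model is not reproduced here.
    """
    cutoff = np.quantile(scores, 1.0 - fraction)
    return [doc for doc, score in zip(docs, scores) if score >= cutoff]

# Toy usage: 1,000 fake documents with random quality scores.
docs = [f"doc-{i}" for i in range(1000)]
scores = np.random.rand(1000)
print(len(keep_top_fraction(docs, scores)))   # roughly 100 documents survive
```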
We incorporate the-stack-dedup dataset, which includes code from 358 programming languages. This not only enables code generation but also improves the model's overall logical reasoning.
The model is trained in three phases: an initial phase with a 2K context length, a second phase extending the context to 4K, and a final capability-enhancement phase that mixes in high-quality data drawn from evaluation benchmarks.
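As an illustration only, the staged schedule can be written as a simple phase list; the per-phase data mixtures are placeholders, with only the 2K-to-4K context growth and the final high-quality capability phase taken from the description above.

```python
# Placeholder schedule for illustration; the per-phase data mixtures are
# hypothetical, not published Moxin training details.
phases = [
    {"name": "phase_1", "context_length": 2048,
     "data": ["slimpajama", "the-stack-dedup"]},
    {"name": "phase_2", "context_length": 4096,
     "data": ["slimpajama", "dclm-baseline", "the-stack-dedup"]},
    {"name": "phase_3_capability", "context_length": 4096,
     "data": ["high-quality benchmark-style sets"]},
]

for phase in phases:
    print(f"{phase['name']}: ctx={phase['context_length']}, mix={phase['data']}")
```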
From Assistant to Reasoner
Step 1: Supervised Fine-Tuning (SFT)
The base model is first fine-tuned using the open-source Tülu 3 framework on a diverse data mixture to create Moxin-Instruct, a helpful and harmless AI assistant.
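Independent of the Tülu 3 tooling, the SFT step optimizes a standard objective; a minimal sketch, assuming prompt tokens are masked out of the loss with the usual -100 label convention:

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Standard SFT objective: next-token cross-entropy on response tokens only.

    `labels` are the input ids with prompt positions set to -100, so the model
    is trained only to reproduce the assistant response.
    """
    shift_logits = logits[:, :-1, :].contiguous()   # position t predicts token t+1
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
    )

# Toy usage: batch of 2 sequences, length 8, vocabulary of 32 tokens.
logits = torch.randn(2, 8, 32)
labels = torch.randint(0, 32, (2, 8))
labels[:, :4] = -100                                # mask the "prompt" half
print(sft_loss(logits, labels))
```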
Step 2: Direct Preference Optimization (DPO)
The SFT model is further trained with DPO on a preference dataset, aligning it more closely with user intent and preferred response styles.
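The DPO objective is compact enough to sketch directly; this assumes per-sequence log-probabilities of the chosen and rejected responses under both the policy and a frozen reference model, with beta as the usual DPO temperature:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta: float = 0.1):
    """DPO: prefer the chosen response over the rejected one, measured
    relative to a frozen reference model, with no explicit reward model."""
    chosen_margin = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (policy_rejected_logps - ref_rejected_logps)
    # Negative log-sigmoid of the margin difference, averaged over the batch.
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()

# Toy usage with made-up per-sequence log-probabilities for 4 preference pairs.
batch = 4
print(dpo_loss(torch.randn(batch), torch.randn(batch),
               torch.randn(batch), torch.randn(batch)))
```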
Step 3: Reinforcement Learning (GRPO)
To create Moxin-Reasoning, we apply Group Relative Policy Optimization (GRPO), the pure-RL method introduced by DeepSeek, to dramatically enhance Chain-of-Thought capabilities.
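A hedged sketch of GRPO's core idea: sample a group of responses per prompt, score each with a reward, and use the group-normalized reward as the advantage, so no separate critic model is needed. Only the advantage computation is shown; the clipped policy update and KL regularization are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each response's reward against the
    mean and standard deviation of its own group (samples for the same prompt)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Toy usage: 2 prompts, 4 sampled responses each, one scalar reward per response.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.4, 0.5]])
print(grpo_advantages(rewards))
```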
Result: SOTA Reasoning in a 7B Model
The strong performance of Moxin-Reasoning demonstrates that advanced RL techniques can be highly effective even at the 7B scale, delivering results previously seen only in much larger models.