SN9: Dataset mixing delivers SOTA results
Last month, we announced that Subnet 9 (SN9), Pre-training, would enable dataset mixing - so what are our preliminary results?
Our goal for SN9, Pre-training, is to push decentralized training competitions to the forefront of the latest generation of models. That’s why we recently updated the subnet to support dataset mixing.
This gives users full control over dataset mixes, custom evaluation tasks, and reward mechanisms. In turn, this delivers more customization, improved models, and a better pre-training service.
With the right initial configuration, models are now trained faster and more efficiently, while still producing SOTA results.
We enabled dataset mixing on one of our 14B competitions, drawing on FineWeb-Edu2 and The Stack V2 Dedup (initially starting with V1) as sources. The results so far are encouraging.
After adding dataset mixing, we found that our 14B mixed models were outperforming our 14B non-mixed ones. To demonstrate this, we ran two otherwise identical 14B competitions, one supporting dataset mixing and one not. On benchmarks such as MMLU, AGIEval, ARC Challenge, and GSM8K-CoT, the mixed models outperformed their non-mixed counterparts.
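As an illustration of how a head-to-head comparison like this can be reproduced, here is a minimal sketch using the open-source lm-evaluation-harness. The checkpoint paths are hypothetical and the task names may vary between harness versions; this is not the exact evaluation code run by SN9 validators.

```python
# Sketch: evaluate a mixed and a non-mixed checkpoint on the same benchmarks
# with EleutherAI's lm-evaluation-harness. Checkpoint paths are placeholders,
# and exact task names depend on the installed harness version.
from lm_eval import simple_evaluate

CHECKPOINTS = {
    "14B-mixed": "path/to/mixed-14b-checkpoint",          # hypothetical path
    "14B-non-mixed": "path/to/non-mixed-14b-checkpoint",  # hypothetical path
}
TASKS = ["mmlu", "agieval", "arc_challenge", "gsm8k_cot"]

for name, path in CHECKPOINTS.items():
    results = simple_evaluate(
        model="hf",                                    # Hugging Face backend
        model_args=f"pretrained={path},dtype=bfloat16",
        tasks=TASKS,
        batch_size=8,
    )
    # results["results"] maps each task name to its metric dictionary.
    for task, metrics in results["results"].items():
        print(name, task, metrics)
```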
We are now seeing the initial signs of success: SN9 models are outperforming SOTA models from companies like DeepSeek AI, Mistral, and Google on prominent benchmarks.
Data-mixed 14B winners beating their non-mixed counterparts is worth celebrating. With just one additional data domain, a blend of code and science (totalling 15%, with the remaining 85% from FineWeb-Edu2), our results are already clear: modern data mixtures make better models.
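For readers who want to experiment with a similar blend, below is a minimal sketch of an 85/15 mixture using the Hugging Face datasets library. The Hub dataset IDs, column names, and streaming setup are assumptions for illustration - they are not the subnet’s actual data pipeline.

```python
# Sketch: build an 85% FineWeb-Edu / 15% code mixture by example-level sampling.
# Dataset IDs and column names are assumptions, not SN9's production pipeline.
from datasets import load_dataset, interleave_datasets

# Stream both corpora so nothing has to be fully downloaded up front.
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
the_stack = load_dataset("bigcode/the-stack-dedup", split="train", streaming=True)

# Interleaving expects a shared schema, so reduce both sources to a single
# "text" column (column names here are assumptions about the Hub datasets).
fineweb_edu = fineweb_edu.select_columns(["text"])
the_stack = the_stack.rename_column("content", "text").select_columns(["text"])

# Sample roughly 85% educational web text and 15% code, matching the
# proportions quoted above.
mixed = interleave_datasets(
    [fineweb_edu, the_stack],
    probabilities=[0.85, 0.15],
    seed=42,
)

for example in mixed.take(3):
    print(example["text"][:80])
```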
This result is not surprising. As our previous article mentioned, researchers at Meta demonstrated the efficacy of this technique by training Llama 3 - currently the leading generation of open-source models - on a blend of general knowledge, math, code, and multilingual content. But this had never been achieved in distributed, incentivised systems like Bittensor - until now.
Our goal for SN9 is to outpace models trained at the most prestigious and well-funded AI labs. Not only do we have good preliminary results, but we now have evidence that we are on the path to truly world-class performance.
That’s why we are now integrating additional data sources, to enrich our models with further domain expertise, so that Bittensor can deliver on its promise to train the best models in the world.
Our results
Our preliminary results are a success: data mixing beats non-mixing in our 14B competitions, and on some benchmarks we produce SOTA or near-SOTA results. Notably, we achieved this improvement with only a 15% code-and-science mix. We expect future domain expansion for SN9 to yield further improvements.
The obstacle is accumulating the right datasets to produce the best results. While SOTA models backed by centralised corporations can exploit a more extensive range of proprietary datasets, publicly available options limit our community and challenge us to find new solutions.
A snapshot of our results from the 14B mixed and non-mixed competitions, on January 8th at 2:45pm UTC, reveals the promise of mixed dataset approaches across multiple benchmarks.
On AGIEval_en, an AGI-focused benchmark which takes a human-centred approach, the winner of our 14B data-mix competition (“coder15”) scores 0.0195 points higher than the winner of our 14B non-mixed competition (“jw-14B-300”). This is shown in figure 1.
On the ARC Challenge benchmark, which focuses on reasoning and abstraction, “coder15” not only scores 0.035 points above the non-mixed winner, “jw-14B-300”, but also surpasses DeepSeek V2-Lite, a well-respected open-source model. This is shown in figure 2.
On the GSM8K-CoT benchmark, which focuses on grade-school mathematics and chain-of-thought reasoning, “coder15” scores 0.0212 points higher than “jw-14B-300”. This is shown in figure 3.
On MMLU, the industry-standard benchmark for broad knowledge and reasoning, “coder15” scores 0.002 points higher than “jw-14B-300” and nearly reaches DeepSeek V2-Lite’s score. This is shown in figure 4.
Across all four benchmarks, these are encouraging results - and they come from only a small mixing proportion. The results are likely to improve once we align those proportions more closely with SOTA models. We’re already closing the gap with DeepSeek V2-Lite, and at some points even surpassing it, and we’re only just getting started.
We’ve proven our ability to implement mixes within Bittensor. Not only does that mean we can pioneer further, it also indicates that we can continue building the perfect environment for pre-training open-source SOTA models.