Organic matters: Bringing more organic inputs to Subnet 1 validation
We’re broadening the range of queries miners receive on SN1, to deliver richer conversational interactions.
By Macrocosmos
On Subnet 1, we’ve been laser-focused on evolving the subnet to deliver more human-like interactions. To achieve that, we’re rolling out updates that improve how well the subnet responds to a wider range of inputs.
Supplementing synthetic data
To date, the validation process on SN1 has been running on synthetic data designed to test miners on a specific task, such as solving a math problem or parsing a Wikipedia page. Relying on synthetic data in this way has given us an endless supply of queries that we can use to train the subnet.
However, synthetic data still struggles to capture the full range and variety of inputs that a human would provide. The risk is that miners overfit their responses to specific synthetic prompts, earning higher scores from validators but struggling when presented with a generic prompt from a human. For our team at Macrocosmos, success for the subnet can’t just be built on synthetic queries. That’s why we’re now integrating a secondary layer of organic queries into the validation process.
How we’re integrating organic queries
Alongside the constant benchmarking process using synthetic prompts, we’ve built a secondary asynchronous task that draws on a database of organic prompts submitted to Chattensor. This flow has a number of steps, sketched in code after the list:
A validator receives an organic query submitted through Chattensor.
The validator then randomly queries five miners from the roughly 1,000 miners active on the subnet.
Simultaneously, the validator runs the same query through Llama3 70B to generate a reference answer.
Once the validator has both the miner responses and the reference answer, it assesses each miner’s performance and assigns rewards accordingly.
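To make this concrete, here is a minimal sketch of what such an asynchronous organic task can look like. The function names (query_miner, generate_reference, score_against_reference) and the toy token-overlap scorer are illustrative placeholders for this post, not the actual SN1 validator code.

```python
import asyncio
import random

NUM_MINERS_SAMPLED = 5  # each organic query is forwarded to five randomly chosen miners


async def query_miner(miner_uid: int, prompt: str) -> str:
    # Placeholder: in the real subnet this sends the prompt to a miner over the
    # network and awaits its completion.
    return f"miner {miner_uid} response to: {prompt}"


async def generate_reference(prompt: str) -> str:
    # Placeholder: in the real subnet this runs the prompt through Llama3 70B
    # to produce the reference answer.
    return f"reference answer to: {prompt}"


def score_against_reference(response: str, reference: str) -> float:
    # Placeholder reward: a crude token-overlap score in [0, 1], standing in
    # for the subnet's actual reward stack.
    resp_tokens = set(response.lower().split())
    ref_tokens = set(reference.lower().split())
    return len(resp_tokens & ref_tokens) / max(len(ref_tokens), 1)


async def handle_organic_query(prompt: str, active_miner_uids: list[int]) -> dict[int, float]:
    # Sample five miners from the ~1,000 active on the subnet.
    sampled = random.sample(active_miner_uids, k=NUM_MINERS_SAMPLED)

    # Query the sampled miners and generate the reference answer concurrently.
    responses, reference = await asyncio.gather(
        asyncio.gather(*(query_miner(uid, prompt) for uid in sampled)),
        generate_reference(prompt),
    )

    # Score each miner against the reference and return the per-miner rewards.
    return {uid: score_against_reference(resp, reference) for uid, resp in zip(sampled, responses)}


if __name__ == "__main__":
    rewards = asyncio.run(handle_organic_query("What is the capital of Texas?", list(range(1000))))
    print(rewards)
```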
Over time, we expect this additional stream of organic inputs to push miners towards handling general queries well, rather than optimizing for specific synthetic prompts. In turn, this will yield better performance and a more natural conversational experience from the subnet.
The future: refining and expanding how we assess miners on SN1
Once this upgrade is in place, we have a few further areas of focus as a team. The first is that the current organic scoring mechanism is better suited to queries with a clear, verifiable answer than to open-ended, subjective, or creative ones. A validator will find it easier to assess a miner’s response to a query like “What is the capital of Texas?” than to one like “Write me a poem about subnet incentive mechanisms”.
Similarly, the current competition is capped by the performance of the reference LLM, currently Llama3-70B. A miner whose response to a prompt about subnet-themed poetry is objectively better than Llama3’s will still be penalized for deviating from the benchmark.
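This ceiling isn’t specific to any one reward function. The toy scorer below (plain difflib string similarity, not the subnet’s actual reward model, and with made-up example poems) shows the general shape of the problem: any reward anchored to a single reference answer favours responses that mirror the reference over responses that improve on it.

```python
from difflib import SequenceMatcher


def reference_similarity_reward(response: str, reference: str) -> float:
    # Toy reward: plain string similarity to the reference, in [0, 1].
    return SequenceMatcher(None, response.lower(), reference.lower()).ratio()


reference = "Roses are red, subnets are blue, miners compete, and validators do too."

mirrors_reference = "Roses are red, subnets are blue, miners compete, and validators score too."
better_but_different = (
    "Incentive flows through weighted trust; each miner's verse is judged, "
    "and emissions follow the strongest rhyme."
)

# The response that closely mirrors the reference scores near 1.0, while the
# more original poem scores far lower, regardless of which one a human would prefer.
print(reference_similarity_reward(mirrors_reference, reference))
print(reference_similarity_reward(better_but_different, reference))
```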
Neither of these challenges is insurmountable. In the future, we’ll be rolling out an agentic system for validator fact-checking, so that validators no longer rely solely on Llama3 but also incorporate information from across the web. The solution here may look similar to LangChain ReAct, for example. We’re also exploring how we classify and then evaluate different tasks more precisely. If we can separate coding tasks from questions about history, we can evaluate each with more precision than the raw responses Llama3 returns allow. Queries about history, for example, may be a prime use case for web search rather than relying solely on Llama3.
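We haven’t settled on a design for this yet, but one hypothetical shape for such a router is sketched below, with a stand-in keyword classifier and placeholder evaluators. A production version would more likely classify queries with an LLM and plug in real code-execution and web-search evaluators.

```python
from enum import Enum


class TaskType(Enum):
    CODING = "coding"
    FACTUAL = "factual"
    CREATIVE = "creative"


def classify_query(prompt: str) -> TaskType:
    # Stand-in classifier: simple keyword matching. A production version would
    # likely use an LLM or a trained classifier instead.
    lowered = prompt.lower()
    if any(kw in lowered for kw in ("code", "function", "bug", "compile", "python")):
        return TaskType.CODING
    if any(kw in lowered for kw in ("who", "when", "where", "capital of", "history")):
        return TaskType.FACTUAL
    return TaskType.CREATIVE


# Placeholder evaluators for the strategies discussed above: execution-style
# checks for code, web-search-grounded fact checking for factual queries, and
# the reference-LLM comparison for everything else. Each returns a reward in [0, 1].
def evaluate_code(prompt: str, response: str) -> float:
    return 0.0  # placeholder


def evaluate_with_web_search(prompt: str, response: str) -> float:
    return 0.0  # placeholder


def evaluate_against_reference(prompt: str, response: str) -> float:
    return 0.0  # placeholder


EVALUATORS = {
    TaskType.CODING: evaluate_code,
    TaskType.FACTUAL: evaluate_with_web_search,
    TaskType.CREATIVE: evaluate_against_reference,
}


def score_response(prompt: str, response: str) -> float:
    # Route the query to the evaluator that matches its task type.
    return EVALUATORS[classify_query(prompt)](prompt, response)
```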
We’ll share more on these improvements as we develop and roll them out on the subnet, moving closer to our goal of delivering state-of-the-art open-source intelligence on Bittensor.