How we’re expanding the Data Universe

1.34bn rows of open-source data are now live on Hugging Face, thanks to Subnet 13. Here’s how we’ll be growing that number in the coming months.

Aug 02, 2024

By Macrocosmos

After a busy few months on Subnet 13, we’re seeing the results: the Data Universe is fast becoming one of the biggest and freshest sources of open-source social media datasets on Hugging Face.

Now, we want to lay out our roadmap for the rest of 2024 and the improvements we’re looking to make in order to deliver on the potential of the Data Universe on Bittensor.

Delivering 1.34bn rows of open-source data

Since launch, the competition on SN13 has focused exclusively on building Reddit and X (formerly Twitter) datasets. The results speak for themselves: the ten largest X and Reddit datasets from Subnet 13 contain 1.34bn rows of anonymized data. The largest individual dataset for each social network already exceeds 300mn rows. And these are just the datasets that have been uploaded to Hugging Face - we expect these numbers to grow further as more miners upload their data.

Because these datasets are the result of a constant competition mechanism, they’re also some of the freshest available on Hugging Face, and they will keep growing as miners continue to append new data to them.

A bigger, better Data Universe

Alongside a number of quality-of-life improvements we’ve rolled out for miners, the biggest recent change to Subnet 13 has been introducing Hugging Face as an optional repository for miner datasets. This decision has allowed us to start delivering additional reporting, such as the Hugging Face dashboard.

Introducing Hugging Face as a repository for datasets generated on SN13 has been a natural evolution for the Data Universe. Today, uploading datasets to Hugging Face isn’t required, but over time we’re looking to make it a critical part of the validation process. To achieve this, we’re building out our processes around Hugging Face uploads. For example, validators can now query Hugging Face metadata directly from miners.

Our roadmap for 2024 and beyond

Right now Hugging Face uploads are optional for miners. The next step for our team is to rework the reward system so we’re gradually increasing the reward for miners that publish their results on Hugging Face (or the penalty for those that don’t). For a decentralized data source to be worthy of the name, it has to be able to provide reliable, trustworthy, anonymized data. Hugging Face integration is a first step. Now we’re building the processes that will let us confirm every miner is uploading to Hugging Face, and make that process central to the subnet’s performance.
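To picture what a gradual phase-in could look like, here is a minimal sketch in Python. It is purely illustrative and is not Subnet 13’s actual scoring code: the function name, the 90-day ramp, and the 0.5 maximum adjustment are all assumptions made up for this example.

```python
def hf_upload_multiplier(
    uploads_to_hf: bool,
    days_since_rollout: int,
    ramp_days: int = 90,      # hypothetical ramp length
    max_adjust: float = 0.5,  # hypothetical maximum bonus or penalty
) -> float:
    """Return a hypothetical score multiplier for a miner.

    The adjustment starts at zero on rollout day and grows linearly
    until it reaches max_adjust after ramp_days, then stays flat.
    """
    # Fraction of the ramp completed, clamped to the range [0, 1].
    progress = min(max(days_since_rollout / ramp_days, 0.0), 1.0)
    adjustment = max_adjust * progress
    # Miners that upload gain the adjustment; miners that don't lose it.
    return 1.0 + adjustment if uploads_to_hf else 1.0 - adjustment


if __name__ == "__main__":
    for day in (0, 45, 90):
        print(
            day,
            hf_upload_multiplier(True, day),
            hf_upload_multiplier(False, day),
        )
```

In this sketch, both groups of miners start at the same multiplier and drift apart over the ramp, which gives everyone time to adopt Hugging Face uploads before the difference becomes significant.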

We’re also exploring how to expand beyond the current X and Reddit data collection into additional datasets. To support a wide range of AI model training applications, we know we will need to expand the number and diversity of datasets we’re generating through the subnet. High on our list of priorities is adding a YouTube transcript dataset, as well as data from other social platforms. With every new data source we look to add, we’ll be weighing how it fits into the subnet design and how to incorporate it in a way that respects the privacy of individuals.

Get these improvements right, and we’ll be well on our way to our immediate goal: becoming the largest open-source dataset provider on Hugging Face.
