Macrocosmos’ approach to weight transparency: Our introduction to SN25, protein folding

Protein folding is one of the biggest, most compute-intensive problems in biology. Here's why we think it's the right problem to showcase the potential of the Bittensor ecosystem.

Jun 17, 2024

By Will Squires and Dr Steffen Cruz, Macrocosmos

Significant attention and focus has been given to the process of weight-setting from the root network in Bittensor in recent months. From challenges over transparency, to differing opinions around process and what matters to the network, the discourse around this has fuelled significant interest within the community and beyond. Validators including OpenTensor Foundation and Foundry are beginning to publish rubrics, and community initiatives like the Raomittee working on assessment frameworks.

As part of our commitment to transparency, moving forward Macrocosmos will be publishing an aggregated version of all weight submission requests to our website and Substack alongside video introductions. We aim for this to become a living, public thesis supporting why we build what we build, and our defence and celebration of the work of the subnets we participate in.

We shall be retroactively releasing weight rubrics and theses for our live subnets, and will continue to embellish these dependent on context - different parts of these may progressively take forms analogous to grant applications, technical architecture seminars, business cases, or investor pitch decks, depending on the contextual element of the innovation we are aiming to achieve. We will ladder “deep dive” articles into particular areas of these theses, including upcoming whitepapers our team are working on.

To start this process, we begin by celebrating protein folding. Protein folding is rightly known as the holy grail of biochemistry. For researchers, it’s the chance to understand how life works at a molecular level. For pharmaceutical and biotech companies, it’s the foundation for a multi-billion dollar opportunity for new therapies and drug development. The biggest blocker? The raw computing power needed to run protein folding simulations. Subnet 25 is our answer to this problem.

Protein folding is also a growing industry, with AlphaFold valued in the 9-figures, and protein engineering is a $2.6bn/year market that’s set to grow significantly by 2030. We believe our subnet presents a path to Bittensor capturing pools of value associated with scientific applications of machine intelligence.

We welcome and relish feedback on this application, and will continue to incorporate suggestions from the community and beyond - for now however, enjoy our discourse on the wonders of biochemistry, and how form defines function.

Thank you for reading Macrocosmos. This post is public so please share it:

Why protein folding?

Protein folding is a notoriously difficult research problem. That’s why we chose it: to demonstrate that Bittensor can tackle the worlds’ hardest research problems, and to motivate academics and universities to begin building research subnets. We wanted to demonstrate that Bittensor can tackle computationally complex problems with significant market value. Protein engineering is a $2.6bn/year market, one that’s set to grow significantly by 2030.

We’ve designed Subnet 25, a subnet dedicated to solving protein folding, as Bittensor’s first venture into academic use cases, demonstrating the network’s efficacy and flexibility. It uses the industry standard GROMACS software to simulate the molecular dynamics of proteins. In short, a molecular dynamics simulation takes an initial 3D protein structure, puts it in a cell-like environment and repeatedly applies the laws of physics to ‘evolve the system’ through time. When run successfully, the simulation is able to accurately predict the final (or native) 3D structure and this is used by researchers to understand the biological function of the protein.

Protein folding is computationally intensive and it can take days or weeks to fully simulate complex structures, which restricts research to large and well-funded institutions (such as universities and big-pharma). As proteins have countless potential shapes, simulations must be repeated many times in order to be confident in the results. Our ambition is to produce an optimized protein folding supercomputer which will enable proteins to be studied more efficiently and in a more democratic way. Researchers and universities are invited to the subnet to solve any protein, on demand, for free. We want this subnet to empower researchers to conduct world-class research and publish in top journals, while demonstrating that decentralized systems are an economic and efficient alternative to traditional approaches.

Communicating on SN25

GROMACS relies upon the use of many input files to configure the many aspects of a simulation. These files specify protein coordinates, the biological environment, particle interactions and the core simulator engine. Our FoldingSynapse (code), based on the Bittensor synapse, is designed to facilitate the exchange of serialized files between validators and miners. These files are a mixture of raw text and binary formats. The binaries are generated by GROMACS and are only executed by the GROMACS interpreter. GROMACS simulation inputs are typically rather small text files (a few MB or less) and are used to define the protein folding job (initial conditions and input parameters etc.). We attach this information to the FoldingSynapse using the md_input attribute (which is a dictionary of filename - serialized content).

GROMACS often produces hundreds of output files which can consume many GBs of disk space, however we have designed the subnet so that only a small subset of the generated output files (those which are essential for verification and rewarding) are sent back by miners by being attached (serialized) to the md_output synapse attribute (also a dictionary). The md_output attribute is usually a few MB in total but in some cases it can be up to a few GB. We have a clear path to further reducing these file sizes by suppressing logging within the files, however, at present the extra logs are providing valuable insights to us as we continue to stabilize the network.

The validators are designed as job schedulers which assign jobs to miners and periodically query them to get progress updates. The query interval is presently 5 minutes, which uses approximately 50% of the validator bandwidth.

Currently, files are not encrypted as we do not believe this is necessary for the present use-case. However, implementing encryption in future would be straightforward given the system design.

How we have designed SN25’s incentive mechanism

Physical systems such as proteins tend to minimize their energy and so this provides a succinct, exploit-resistant and highly sensitive measure of quality. For this reason, miners compete to provide protein configurations that coincide with the lowest energy (analogous to loss). The benefit of this is twofold; the metric that the network is optimizing for is highly aligned with the desired outcome (biologically stable structures) and it is transparent and deterministic (both miners and validators can quickly calculate the energy of a configuration).

The validators select proteins at random from a large public database (RCSB protein data bank) and download the input files for preprocessing and simulation. We further discuss data sources in the next section. The validators use the files that are attached to the md_output attribute of the FoldingSynapse to calculate the energy while performing additional verification steps to ensure the the files match the exact protein configuration, eliminating the remote-code attack surface of the incentive mechanism. Each validator currently maintains a queue of 10 concurrently running protein folding jobs, each of which is assigned to 10 hotkeys (code). With 30 validators active on the subnet, there are 3000 simultaneous simulations running at all times — or around 15 per miner. In the near future we will increase the queue length and the number of hotkeys by another order of magnitude, requiring the miners to be able to sustain hundreds of concurrent protein folding simulations. The miners are oversubscribed to jobs by design, which means that there is an effectively unbounded opportunity for miners that can handle the enormous computational workload. Each miner uses a separate random seed for their simulations, which ensures that each simulation suitably explores the folding space and utilizes the parallelism opportunity of batching jobs. To mitigate biases that arise from the luck aspect of this randomization process, we will be introducing a rebase (or copy) mechanism in an upcoming release, similar to subnet 9. With rebasing, the winning coordinates can be copied by all participants shortly after submission, which intensifies the competition (and thus results). We note that this presents a novel tradeoff between exploration and exploitation; the current winning configuration may be a local minimum rather than a global one and so miners must be strategic in their approaches.

We have opted for an exponential incentive curve design to encourage highly motivated miners to innovate and provide more value to the subnet by procuring more compute. At present, the winning miner gains 80% of the possible rewards in each step (code), but we will be modifying this to winner-takes-all in the near future. The weights are calculated based on the average EMA across the jobs assigned to the miners, which means that they must perform well on all of them. Already we have observed that most miners have innovated beyond our base CPU miner and all serious miners are running on GPUs which has increased productivity by 20-50x. We utilize early stopping to prevent wasted computational efforts. Currently, if no improvements are made to a given proteins’ configuration for an hour the job is terminated (code). Also, we use epsilon-bounded scoring (code) to ensure that configurations meaningfully improve over time. We have developed heuristics which allow us to predict optimal epsilon values (code) as a function of protein complexity, ensuring that the reward landscape is well-calibrated.

Data sources and security

The subnet is based on real data from the RCSB protein data bank, which is updated weekly with additional proteins. The space of proteins is very vast; with around 10 million existing proteins from natural and engineered sources and an infinite space of possible variations (>10130 permutations of existing sequences alone). The current version of the subnet selects from around 65,000 natural proteins (code). There is some preprocessing required in order to prepare the proteins for simulation. At present, we filter out proteins which require any preprocessing, leaving around 35,000 addressable specimens. Ensuring the stability of simulations within such a diverse space is challenging from an engineering point of view. There is no single global set of input parameters which guarantees a stable simulation across the plethora of proteins. We address this by performing a hyperparameter search over a space of possible input parameters before dispatching jobs to miners in order to ensure the folding simulations are stable, which currently increases the dataset size by a factor of around 15x. The current configuration will ensure that our dataset is not exhausted for at least 1,000 days. We will be increasing the dataset size by a further factor of around 1,000 by adding further preprocessing, perturbing input configurations, supporting non-default environmental factors (temperature, pressure, solvent), and increasing the hyperparameters space. This dataset increase is planned for an upcoming version in ~4 weeks from now.

There is a possibility that miners could ‘look up’ solutions to folded proteins, however after our extensive research these types of databases are hard to utilize in practice due to the highly sensitive nature of outputs on the specific environmental factors and initial configurations. With a dataset this large, the only way to have a reasonable grasp on a lookup attack would be to ‘pre-mine’ — computing a vast array of the protein solutions beforehand and store all files in a database. Such a database would require enormous disk space, which we estimate to be hundreds of TB. We also point out that this is productive for the network (just as with data scraping).

Compute requirements for SN25

The base miner is a very lightweight requirement. Miners can increase their rewards and thus emissions by using more powerful hardware. Indeed, even with the nominal emissions that SN25 currently receives, most miners have already upgraded the base miner and are running GPU enabled simulations. It is necessary to use GPU-enabled hardware to be competitive in practice. We are currently working on support a GPU-enabled base miner, which is expected to be released in the coming weeks. Additionally, the number of jobs that a miner can simultaneously process depends on the number of processes it can handle. Since miners are, by design, oversubscribed to jobs, multiprocessing is an essential part of mining our subnet. The base miner, based on python’s ProcessPoolExecutor (code), provides an elastic and scalable framework for miners to utilize multi core machines. For this reason, top miners are running machines with tens of cores already, and with more emissions we expect that there will be miners scaling up to support hundreds of processes.

A simple dashboard will be released next week which displays the total number of proteins folded by the network amongst other key indicators. Development of a more comprehensive and interactive dashboard is also underway at Macrocosmos. Validators log their events to an open weights and biases project (wandb), which can be accessed via the web UI or python API. A notebook which demonstrates API access is provided in the repo (demo).

Ultimately, we want researchers to be able to use our subnet, for free, as an alternative to a traditional HPC cluster (highly optimized for molecular dynamics simulations), while also providing them with key analysis tools to better understand their results. Our UI design is inspired by the AlphaFold3 server (example). The development of the app will begin once we reach the first ‘stable’ version of the subnet, which is expected to be around 1 month from now. It will enable researchers to specify the folding job with full control of the input parameters so that they can use their expertise to achieve SOTA results. We discuss our plans to create a product on this subnet in the subsequent section.

Development Roadmap:

Our present development roadmap consists of the following priorities:

Dataset expansion - Increase the dataset by 1000x (reasons discussed in sec. 4)
Throughput increase - Increase the number of concurrent protein folding jobs by 100x.
Codebase cleanup -
- Expand to multiple synapses for a cleaner design
- Introduce a miner job store to prevent data loss and missed opportunities
- Refactor codebase; improve readability, remove unused code, add consistent code styling
Security improvements -
- Investigate use case for commit hashes to prevent forced tiebreaks (miners should be able to prove exactly when they converged)
- Perform additional simulation steps on the validator side to reduce attack surface
- Add further checks to outputs to ensure that all hyperparameters were used correctly (introduce penalties for poor instruction following)
- Continue to red-team of our incentive mechanism to detect and fix exploits
Simulation optimization -
- Return to winner-takes-all rewarding to incentivize top-performing miners
- Implement rebase mechanism (where miners copy the best result)
- Add further preprocessing to enable more proteins to be simulated
- Refine heuristics such as epsilon and early stopping criteria as a function of protein complexity
- Introduce RMSD-based scoring to promote exploration and discourage repetition in results
Efficiency improvement -
- Enable base miner to cancel running jobs to free up resources
- Reduce unnecessary log files (customizable/ adaptive logging)
- Purge unused data both on the validator and the miner side
- Introducing parallelization on the miner side so that jobs can be handled asynchronously.
Data analysis and introspection improvement -
- Add support for advanced visualization and data introspection
  - Integrate industry protein visualization libraries (vmd, pymol)
- Utilize additional files that are generated by GROMACS to facilitate
  - Energy changes over time (as opposed to improvements over time)
  - RMSD changes over time,
  - Protein trajectories for constituent atoms
  - Radius of gyration over time (not supported by AlphaFold)
Benchmarking -
- Analyze folded protein quality in comparison to other leading methods (AlphaFold) and experimental data (where available) in order to better measure and understand the relative strengths and weaknesses of our system with respect to others. Topics of interest include
  - Accuracy of our results
  - Speed of our simulations
  - User control over simulations
  - User access to results
- Prepare a whitepaper with our analysis results

Our present product roadmap can be articulated as four core pathways:

Identify Academic Collaboration Opportunities - ongoing outreach programme to the DeSci community and research communities to both engage with Protein Folding (and pull in “real” customers), as well as develop awareness and tactical useability for the subnet. We believe working with leading academic institutes is both a clear validation signal for Bittensor, but also a way to ensure the necessary fluctuations of a decentralized system can be designed out with true scientific rigour, and help us guide a path towards tangible value creation.
Publish papers - Our ambition is to enable collaborators to conduct their research on our subnet and publish in high-impact journals.
Visualization - we are developing dashboards and leaderboards to allow users to see the volume, type, and proteins folded, as well as understand miner performance. These will be similar in style to the Subnet 9 board.
Live API access - creation of a RESTful API that enables validators to submit additional folding jobs to the network. This will be the first step towards the creation of our application.
App build and release - Our in-house design and frontend team are already working on schematics for the folding application layer. Primarily, it will be an online portal which enables users to upload / select a protein, configure parameters and add to a managed job queue. Jobs in the queue will be entered into the network as organic queries and will be based on a prioritization mechanism. Organic scoring is both efficient and scalable in this subnet, so there are no firm limits on the number of users that can access our folding platform. Moreover, we intend to provide an open access protein database for users to browse and analyze
Deep blue sky - We are creating a portal for researchers to submit proposals for innovative new research subnets. We will select top proposals and work with those researchers to build new subnets, based on our flexible and general subnet design, to help them to reach their research goals.

Finally, we note that great care was taken when designing this software so that it may be used as a template for the next generation of research subnets (and beyond).

Beyond the technical roadmap articulated above, we have begun the process of outlining the boundaries of commercialisation for decentralized protein research. This could develop as an on-demand, community-driven protein-folding service, sale of the whole subnet’s bandwidth for a period of time, or direction of the subnet’s capacity towards altruistic goals the community believes in.

Testnet Information:

Folding is registered as Testnet 141, where we maintain a validator and several miners. Over the course of developing our incentive mechanism, we have used testnet to produce around 2000 hours of tracked data that was used for analysis and development.

All tests that we run on testnet have a standard set of KPIs, such as validator stability, miner response quality, and overall network communication stability. Once these have been measured in a 2-validator, 2+ miner environment, we can deploy these changes into the production setting.