Artificial Intelligence & Machine Learning

With the rapid rise in both interest (market capitalization) and the quantity of data in biotechnology, deriving meaningful and marketable insights becomes increasingly difficult. While traditional techniques such as statistical modeling have long been used to interpret such data, recent advances in computational power have made resource-intensive machine learning and artificial intelligence techniques more tractable.

In this section we discuss some sample problems, our privacy-preserving data scheme, and a general framework for formulating experiments and training models.

Sample Problems

Problems that will be solved on the NUCLE.AI network will be angled toward our partnerships. Examples include:

Clustering and identifying cells through time to determine effective drug and medicine treatments.

Using computer vision and object recognition to identify malignant tumors.

Providing tailored information based on analysis of personalized data.

Using medical information to identify at-risk individuals for patients, doctors, analysts, and scientists.

While these examples are clear, they represent only a small fraction of the potential problems that may be solved using the vast blockchain of medical and biological data.

Experiment Design and Distributed Training

Given the sheer variety of applications the network will support (cf. §3.4.2), it is impractical to rely on a predetermined set of experimental design procedures. Despite this variety in the analyses the network will demand, NUCLE.AI's distributed computing procedure can be generalized into the following sequential steps:
1. The NUCLE.AI company determines the information relevant to the posed problem. This step is highly specialized; our R&D team will carefully consider and design the parameters of the experiment.
2. The NUCLE.AI company obfuscates the sensitive data while maintaining data integrity.
3. The training data (subproblem) is released to the network.
4. Node operators (primarily data analysts) create a model befitting the provided task. If the currently generating block is secondary (cf. §3.3.3), node operators without data analysis training can instead use other operators' models released in previous block generation cycles.
5. Users train their models on the released training data.
6. Users release encrypted signatures of their models to the network.
7. Users wait for the full-model release buffer time to pass (this prevents model-copy cheating).
8. Users release their full models to the network.
9. The NUCLE.AI company releases a validation set to the network so that users can check the accuracy of their trained models.
10. Users compute accuracy scores for their submitted models and announce them.
11. Models claiming the top accuracy values that were submitted properly in step 7 are cross-validated by other nodes in the network, and the best model is identified.
12. The winning node gains agency over block generation.

Steps 1-12 are repeated until submitted model accuracy plateaus.
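The signature-then-reveal steps above form a commit-and-reveal scheme: publishing a binding signature of the model before the buffer elapses prevents a node from simply copying another operator's revealed model. A minimal sketch of such a commitment using a salted SHA-256 hash (the function names and 16-byte salt length are illustrative assumptions, not part of the NUCLE.AI specification):

```python
import hashlib
import os

def commit(model_bytes: bytes) -> tuple[bytes, bytes]:
    # Commit phase: publish only the digest; keep the model and salt private.
    salt = os.urandom(16)                    # illustrative salt length
    digest = hashlib.sha256(salt + model_bytes).digest()
    return digest, salt

def verify(digest: bytes, model_bytes: bytes, salt: bytes) -> bool:
    # Reveal phase: the network checks the reveal against the commitment.
    return hashlib.sha256(salt + model_bytes).digest() == digest

weights = b"serialized model weights"        # stand-in for a real model
digest, salt = commit(weights)
# ...buffer time passes; the node then reveals (weights, salt)...
assert verify(digest, weights, salt)
assert not verify(digest, b"copied model", salt)
```

Because the salt is random, the digest reveals nothing about the model before the reveal, yet any later substitution of a different model fails verification.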

Differential Privacy

To ensure that data aggregation maintains privacy for individual users, the NUCLE.AI scheme will use differential privacy. When tokens are exchanged for statistical analyses from the models, some amount of inference can be made about specific individuals within the sampled group. To mitigate this, the differential privacy scheme first proposed by Cynthia Dwork obfuscates personally identifying information by adding mathematical noise to collected samples, while still retaining a high degree of statistical accuracy. The scheme ensures that the output provides minimal evidence as to whether any individual submitted their data, mitigating linkage attacks.

While it is not possible for an individual's contribution to be completely decoupled from the result, we can formally bound the privacy loss for a particular user by the relative likelihood that a given result R occurred under a particular dataset.

A mechanism M satisfies ε-differential privacy if, for all datasets D and D′ differing in at most one element, and for every set of possible outputs S:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S]

Next, for the various statistics we will aggregate, we want to ensure low privacy loss while maintaining the statistical validity of our results. Let ƒ be a function mapping our data to a desired statistical output.

We denote by Δƒ the l1 sensitivity of this function, defined by:

Δƒ = max ‖ƒ(D) − ƒ(D′)‖₁

where the maximum is taken over all pairs of datasets D, D′ differing in at most one element.

Low sensitivity corresponds to low privacy loss, so the goal of the scheme is to implement a differentially private algorithm built on low-sensitivity queries.
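To make the sensitivity concrete, the sketch below (with hypothetical helper names and an illustrative dataset, not part of any NUCLE.AI API) brute-forces |ƒ(D) − ƒ(D′)| over the neighbors of one sample dataset. Note that this only lower-bounds the true worst-case Δƒ, which ranges over all neighboring dataset pairs:

```python
def empirical_sensitivity(f, data: list) -> float:
    # Largest |f(D) - f(D')| over neighbors D' formed by dropping one record.
    # This lower-bounds the true worst-case l1 sensitivity Δf, which is taken
    # over *all* neighboring dataset pairs, not just neighbors of this sample.
    base = f(data)
    return max(abs(base - f(data[:i] + data[i + 1:])) for i in range(len(data)))

ages = [34, 45, 29, 61, 50]                       # illustrative patient ages
count_f = lambda d: sum(1 for x in d if x >= 40)  # counting query: Δf = 1
mean_f = lambda d: sum(d) / len(d)                # mean query: data-dependent Δf
```

A counting query ("how many records satisfy a predicate?") always has sensitivity 1, since one individual changes the count by at most 1; queries like means or sums need their sensitivity bounded before noise can be calibrated.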
Now, let M denote our differentially private mechanism, which takes the dataset D as input and generates the aggregated statistics after adding mathematical noise. One method for generating such noise is to sample from the Laplace distribution, X ~ Lap(λ), with density:

p(x) = (1/2λ) exp(−|x|/λ)

Setting λ = Δƒ/ε and releasing M(D) = ƒ(D) + X is provably privacy-preserving to a factor ε, as desired.