Machine Learning and Algorithmic Information

The induction problem

Induction is one of the oldest philosophical problems.
And Algorithmic Information Theory solved it!
As an example, we will show how Kolmogorov complexity can be used to make analogies.
Minimum Description Length

The most common way to perform induction in practice is called MDL: Minimum Description Length.
Prototype-based models and compression

The central topic of this chapter is that machine learning translates into compression.
This claim will be illustrated with prototype-based models.
Conversely, compression provides a criterion that can be used to guide machine learning. It is illustrated in the context of clustering.
Anomaly detection

Anomaly detection is an important and difficult topic of AI. Anomalies will be shown to be events that are abnormally simple.
Universal induction

Are you dreaming of an AI system that would beat us all?
AIT offers precisely this, at least in principle.
First as an ideal way to perform prediction, using what is called the universal distribution.
Then as an ideal way to perform action: we will briefly consider AIXI, an ideal version of reinforcement learning based on algorithmic information.
What you will learn in this chapter:
a criterion to make optimal hypotheses in learning tasks,
a method to solve analogies,
a new way to detect anomalies,
a new understanding of machine learning as a way to achieve compression,
a method to define an unbeatable AI.
→ All the programs used here can be retrieved from there.
The induction problem
Problem:
What can we learn from (sometimes very few) examples?
Induction is a fundamental component of intelligence. Thanks to induction, humans and robots can
learn
recognize situations
detect new structures
make appropriate decisions
and much more.
Induction differs from deduction. Deduction generates knowledge through a proof:
All humans are mortal.
Socrates is human.
Therefore Socrates is mortal.
It is sometimes considered that deduction doesn’t produce new knowledge, as the conclusion was in some way already present, in a potential form, within the hypotheses.
Will it rain tonight?
Any intelligent being faces the problem of adapting to changing conditions, of deciding what to do next or what to believe next.
For this, deduction proves insufficient. Induction is a process through which one can produce new knowledge from past experiences.

Induction is one of the oldest philosophical problems:
How can we extract general rules from limited data?
To be intelligent, induction must be guided.
Observe the following sequence.
12345
What symbol should come next?
Possible solutions:
6: because it is the sequence of positive integers
7: because these are the numbers k such that (7 × 10^{k} + 71)/3 is prime
1: because it is a repetition of the sequence 12345
42: because these are the roots of the polynomial x^{6} – 57x^{5} + 715x^{4} – 3795x^{3} + 9724x^{2} – 11628x + 5040
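The polynomial rule can be checked numerically. Here is a small Python sketch (ours, for illustration only) verifying that 1 to 5 and 42 are indeed the roots of the polynomial above:

```python
# Check that 1, 2, 3, 4, 5 and 42 are the roots of
# x^6 - 57x^5 + 715x^4 - 3795x^3 + 9724x^2 - 11628x + 5040
coeffs = [1, -57, 715, -3795, 9724, -11628, 5040]

def poly(x, coeffs):
    """Evaluate the polynomial with Horner's scheme."""
    value = 0
    for c in coeffs:
        value = value * x + c
    return value

for root in [1, 2, 3, 4, 5, 42]:
    assert poly(root, coeffs) == 0
```

Each of the four rules legitimately continues the sequence; they differ only in how complex they are.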
Guessing what comes next is a fundamental issue. For instance:

Prediction:
Knowing past weather, can I predict if it will rain?
Or in other words: continue the sequence
(rain, rain, sun, rain, rain, sun, snow, mist, rain, rain)

Classification:
Having observed a sequence of pairs (object, label),
can I predict the label of a new object?
Or in other words: continue the sequence
(object 1, label 1, object 2, label 2, ..., object n, label n, object n+1)

What is the best strategy for induction?

Epicurus (341–270 BC), Principle of multiple explanations:
If more than one theory is consistent with the observations, keep all theories.

Sextus Empiricus (160–210), Validity of inductive reasoning:
‟When proposing to establish the universal
from the particulars [...]
induction will be insecure, since some of the
particulars omitted in the induction may
contravene the universal; while if one is to
review all, one will be toiling at the
impossible, since the particulars are infinite and
indefinite.”
Do birds fly?
"some of the particulars omitted in the
induction may contravene the universal"
William of Ockham (1287–1347), Occam's Razor principle:
Entities should not be multiplied beyond necessity.

Thomas Bayes (1701–1761), Bayes' Rule:
The probability of hypothesis H being true given some data D is proportional to the learner's prior belief in H multiplied by the conditional probability of D given H: P(H|D) ∝ P(H) × P(D|H).

What is the best strategy for induction?
Principle of multiple explanations
Validity of inductive reasoning
Occam’s Razor Principle
Bayes’ Rule
These principles are often insufficient for determining the best induction.
For ages, the problem of induction has been considered to have no general solution...
until 1964, when Ray Solomonoff solved it by creating
the Algorithmic Information Theory framework.
Most IQ tests are based on induction. The problem is to find a relevant rule based on a few examples only.
IQ test (1)
Which is the odd one out in this list: cow, hen, pig, sheep?
This is a genuine induction problem: form a rule that covers three of the four items (but excludes the fourth one).
IQ test (2)
Why are some solutions less relevant than others in the previous exercise?
If you are interested in tempting yet false inductions, have a look at Borwein integrals!
An illustration: Analogy making
Question:
Can we say that Analogy is like building bridges between neighbouring islands?
Answer:
🙂
Analogy is a form of reasoning and mental representation that works by connecting similar situations and transferring knowledge between them. This way of reasoning is commonly found in IQ tests, which means that it is thought to be central to human intelligence. It is also manifest in language use, as when we draw comparisons or speak in metaphors.
Understanding Artificial Intelligence
through Algorithmic Information Theory
Why is analogical reasoning so fundamental?
- In mathematics and science: discover new concepts, or generalize notions to other domains
- Justice: use of relevant past cases
- Art: metaphors, parody, pastiche
- Advertising: exploit similarity of products to influence people
- Humor: jokes are often based on faulty analogies

Examples of analogies
Gills are to fish as lungs are to humans.
Donald Trump is to the US as Vladimir Putin is to Russia
Donald Trump is to Barack Obama as Barack Obama is to George Bush
37 is to 74 as 21 is to 42
The sun is to Earth as the nucleus is to the electron
Formalization
A is to B as C is to D
is also written:
A : B :: C : D
If R is a relation that holds between A and B:
R(A,B)
then the analogy says that the same relation holds between the two pairs:
R(A,B) and R(C,D).
Understanding an analogy means finding the most appropriate relation R.

Formalization
A : B :: C : D
Following (Cornuejols, 1996; Cornuejols & Ales-Bianchetti, 1998), we may say that "the most appropriate relation R"
is the one that makes the quadruplet (A,B,C,D) simplest
or, in other words,
the one that achieves the best compression.
This is a special case of the induction principle.

Formalization
A : B :: C : D
Such an analogy 𝒜 is often considered to obey the following axioms:
identity: 𝒜(A,B,A,B)
symmetry: 𝒜(A,B,C,D) ⇒ 𝒜(C,D,A,B)
central permutation: 𝒜(A,B,C,D) ⇒ 𝒜(A,C,B,D)
Complexity minimization explains why these properties generally hold.

Analogical equation
We call analogical equation the problem of finding the missing term x in
an analogy:
A is to B as C is to x
or:
A : B :: C : x
Examples:
Paris is to France as Helsinki is to ... ?
ABC is to ABD as IJK is to ... ?
Analogical equation
A : B :: C : x
Following the induction principle, solving the analogical equation amounts to finding \(\mathbf{x}\) such that:
\(\mathbf{x} = \arg\min_{\mathbf{x}} K(A, B, C, \mathbf{x})\)
where \(K\) designates Kolmogorov complexity.

Analogical equation
A : B :: C : x
Doug Hofstadter is well-known for his original study of analogy. Let's consider one of his favourite examples.

abc : abd :: ppqqrr : x

Have you come up with a solution?

Analogical equation
abc : abd :: ppqqrr : x
Most people would instantly answer
x = ppqqss.
Why should we consider it a "better" solution than, say, x = ppqqrd?

Analogical equation

abc : abd :: ppqqrr : x

x = ppqqss.
x = ppqqrd.
We have to compare two rules:
increment the last item
replace the last letter by d
The second rule has two drawbacks in terms of complexity:
- it introduces a constant, d.
- if we do not want to lose the structure that led to the compression of ppqqrr, the rule should read: "break the last item back into letters, and replace the last letter by d"
This may explain why x = ppqqss would appear smarter.
Computing analogies
Question:
Can we process analogies automatically, knowing that complexity is not computable?
Answer:
Yes, and we show it in the case of morphological analogies.
When you learn a new language, you often infer grammar rules long before discovering their explicit form in a grammar book. This process is typical of analogical reasoning.
Let’s say you are studying Finnish and you learn the verb puhua.
You discover that the first person in the present tense of this verb is puhun. Now, you want to conjugate the verb nauraa in the same tense and in the same person. What do you suggest? The similarity of the structure lets you think that the answer is probably nauran. Congratulations! You are now able to conjugate the Finnish first group of verbs in the first person of the present tense! You can do it because you intuitively know how to process morphological analogies.
Morphological analogies

Morphological analogies concern the morphology of words.
semantic analogy: "king is to queen as man is to woman"
morphological analogy: "king is to kings as queen is to queens"
Morphological analogies match different grammatical forms of the same word.
They concern the symbols themselves, not their semantics.

Example of morphological analogy: Latin's accusative case
rosa : rosam :: vita : x
Source: transformation of "rosa" into "rosam".
Possible interpretation: add the letter "m" at the end of the stem.

Target: the same transformation may apply to the target ("vita"). The result would be x = "vitam".

BUT other transformations are also possible (for instance: always answer "rosam").

Can we compute the ‘correct’ transformation?

Computing morphological analogies

In (Murena et al., 2020), we introduced a simple language to implement morphological transformations.
let, ?0 : ?0, m, let mem, ?0 = rosa :: mem, ?0 = vita
line 1: define a function, based on a variable ?0, that appends m to it
line 2: call this function with variable ?0 = rosa
line 3: move to the second half of the analogy
line 4: call this function with variable ?0 = vita

Computing morphological analogies
rosa : rosam :: vita : x
1. We use this simple language to try transformations that can solve the analogical equation.
2. We only keep transformations that match the first two terms (source)
(rosa → rosam) and can also be applied to the third term (in the target) (vita).
3. The algorithm is almost brute force, as there are few such candidate transformations.
4. We keep the simplest transformation.

Computing morphological analogies

And it works! The algorithm has been used to guess the fourth term of morphological analogies. Our algorithm based on complexity minimization (here named NLG_COMP) compares favourably with a state-of-the-art method, NLG_PROP (Fam & Lepage, 2018), on the SIGMORPHON'16 dataset.

This example illustrates the fact that Kolmogorov complexity can inspire powerful learning algorithms.
🙂
Computing morphological analogies
Let’s show that we can solve analogies on words using a minimal complexity principle. To do so, we consider a specialized programming language.
A program in this language corresponds to a sequence of comma-delimited instructions.
A sequence describes a series of concatenations. For instance, the program 'ab','cd' outputs the string abcd.
The main characteristic of our language is the implementation of a memory as a push-down stack. At any time, an element can be stored into memory, using two let instructions (an opening one and a closing one). For instance, let, 'abc', let stores the string abc in memory.
The operator mem is used to retrieve the last elements stored in memory. The i^{th} last element stored in memory can be retrieved with the instruction mem, i. By convention, the index starts at 0 (which means that mem,0 retrieves the very last element saved in memory).
Code for strings (1)
Which would be the output of the program let,'ab',let,let,'cd',let,mem,1?
In addition, we allow the program to store operations in memory, in which case the parameters are denoted by ?n, where n is the index of the parameter. For instance, the instruction let, ?0, ?1, let stores an operation that concatenates its two operands.
Code for strings (2)
Which of the following programs defines an operation which duplicates a string and introduces a hyphen to separate the duplicates?
Once an instruction is stored in memory, it can be retrieved using the mem codeword, followed by the correct number of arguments. For instance, the code let, ?0, ?1, let, mem, 0, 'abc', 'ijk' defines a concatenation operator, then retrieves it from memory, applies it to two strings and finally outputs abcijk. Note that if the number of arguments is incorrect, the code will raise a syntax error.
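To make the semantics concrete, here is a minimal interpreter sketch in Python. It is our own simplified reconstruction, not the course's compiler: it skips arity checking and nested let blocks.

```python
def run(program):
    """Interpret a comma-separated program in the toy string language.
    Literals are concatenated to the output; `let ... let` pushes the
    enclosed token sequence onto a memory stack; `mem, i` retrieves the
    i-th most recently stored sequence and, if it contains placeholders
    ?0, ?1, ..., consumes the following tokens as arguments."""
    tokens = [t.strip().strip("'") for t in program.split(",")]
    memory = []          # stack of stored token sequences
    output = []
    pos = 0
    while pos < len(tokens):
        tok = tokens[pos]
        if tok == "let":                      # store until the closing let
            end = tokens.index("let", pos + 1)
            memory.append(tokens[pos + 1:end])
            pos = end + 1
        elif tok == "mem":                    # retrieve a stored sequence
            index = int(tokens[pos + 1])
            stored = memory[-1 - index]       # 0 = most recently stored
            params = sorted({t for t in stored if t.startswith("?")})
            args = tokens[pos + 2: pos + 2 + len(params)]
            pos += 2 + len(params)
            for t in stored:                  # substitute ?n by its argument
                output.append(args[int(t[1:])] if t.startswith("?") else t)
        else:                                 # literal string
            output.append(tok)
            pos += 1
    return "".join(output)
```

For instance, run("let,?0,?1,let,mem,0,'abc','ijk'") returns abcijk, as described above.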
Code for strings (3)
Which of the following programs will not raise a syntax error?
A compiler for this language is given in the Python script. The compilation is called by the method generate_string, which takes the instruction as a string argument. For instance:
How can such a program be used to evaluate good analogies? A solution consists of assuming that the quality of an analogy A:B::C:D is measured by the complexity of the string 'A:B::C:D'.
The program estimates this complexity through the function instruction_complexity(instruction_string) which takes as argument the instruction as a string.
String generation (1)
Select all programs that generate the string: rosa:rosam.
String generation (2)
Based on the complexity estimates, how complex is the least complex of these programs?
Solving analogies in this way is very time-consuming for two reasons. First, it requires the exploration of all possible solutions. Second, we have to explore all possible programs that generate each solution. A simplified solver is proposed in the function restricted_search. To execute it, just provide the desired analogy. The method will suggest a result based on complexity minimization.
For instance, you can check the analogy rosa:rosam::vita:vitam by calling restricted_search("rosa:rosam::vita:vitam").
$ python
>>> import analogy
>>> analogy.restricted_search('rosa:rosam::vita:vitam')
Best instruction: let,?0,a,:,?0,a,m,let,mem,0,ros,::,mem,0,vit
Expected result: vitam
Obtained result: vitam
Be careful: though the exploration space is restricted, the exploration can still take some time, especially if you ask for difficult analogies!
Analogy in Finnish
Execute the code to solve an analogy in Finnish:
puhua:puhun::katsoa:katson.
Which instruction do you get as a result?
Analogy Solving algorithm
Does the restricted_search method work in all cases?
Does it work on analogies such as "abc:abd::ijk:ijl" (known as Hofstadter’s analogies)?
Can you explain why?
Forum: Analogies
Can you think of analogies in your mother tongue (or in any language you know) that won’t work with this method?
I saw a funny thing on Twitter.
What does it say?
It asks what comes after 5, 7, 11, 13, 17, 19, 23.
There are two options: 25 and 29.
I can see that people hesitate:
It’s almost half-half! I don’t understand why!
Why what?
There’s an obvious answer.
Which is?
You have to select the next prime, which is 29.
Ha ha! You are reacting exactly as "MC Pablo"!
It’s in French.
So you didn’t see the pattern +2 +4 +2 +4 +2 +4.
Ah! OK, I understand why 25 makes sense, now.
So it’s clear which is the "correct" answer.
Yes. You just choose the less complex one.
And so you will favor 25, as the Turing machine that computes
+2, +4, +2, +4... is smaller than the one that computes prime numbers.
No, I was about to say that 29 was simpler!
You can’t say that!
Yes I can! If you already know what prime numbers are, it’s simpler
to say "retrieve the next prime".
Ok, I agree that the best induction depends on what you already know.
Anyway, I had a look at what oeis.org says about this series.
"Numbers congruent to 1 or 5 mod 6",
"Numbers n such that (2^{n}+1)/3 is prime",
"n^{th} irreducible polynomial over Q (with coefficients 0 or 1) evaluated at x=2".
The first one is the same as +2, +4, +2...
Yes, no... Not exactly. But anyway, the other ones are really crazy!
You say this because they are way too complex, as compared with merely saying 5, 7, 11, 13, 17, 19, 23.
Yes, I guess so.
Maybe complexity offers us a way to measure "craziness" 🤪!
Minimum Description Length (MDL)
Problem:
Can we find a computable version of the induction principle?
Complexity theory had a big impact on Artificial Intelligence, especially in the domain of machine learning. Ray Solomonoff used it to solve the problem of induction in practice. His method, which merges Bayesianism and Kolmogorov complexity, mainly consists of complexity minimization. Though it led to strong theoretical results, it remains incomputable.

As an attempt to come closer to computability, one may restrict the set of ‘models’ that are considered to account for the data. This idea has been implemented in two principles: MDL (minimum description length, developed by Jorma Rissanen) and MML (minimum message length, developed by Chris Wallace).
In what follows, we present a version of MDL that we will call "crude MDL" or "ideal MDL" (which is very close to MML).
Suppose you want to classify a set D = {x_{k}} of N data points into two classes, "blue"/"red". Your toolbox contains different computable models M_{i} that can be used to label the data as blue or red.
According to the (crude) Minimum Description Length principle, you should choose the model that achieves the best compromise between simplicity and accuracy: the model \(M_i\) that minimizes the total description length \(C(M_i) + C(D \mid M_i)\).
Suppose you don’t know the true category (blue or red) of the \(N\) objects.
Among how many different classifications should we search for the correct one?
Algorithmic Information Theory describes the learning task as a compression operation. The "null model" M_{0} consists of remembering the class of each object. It requires N bits (1 bit for each element to be labelled as red or blue). We may take this as a baseline.
The different models M_{i} are attempts to classify the data in a more economical way. In other words, the MDL problem consists of choosing among the {M_{i}} the model that leads to maximal overall compression.

Maybe M_{i} is not perfect, which means that M_{i} classifies only n_{i} items among the {x_{k}} correctly. We want to decide which is the "best" M_{i}.
Minimum description length (2)
What is the compression (in number of bits saved by comparison with the null model) if \(M_i\) is perfect (\(n_i = N)\)?
Suppose now that M_{i} isn’t perfect (n_{i} < N). To reach a correct classification, one needs to designate all misclassified data, for instance by providing their indices.
Minimum description length (3)
How much worse (in number of bits) is the compression of an imperfect model, as compared with the perfect case?
Note that the complexity of M_{i} includes the complexity of designating it among all models (this may amount to the complexity of i, i.e. log_{2}(i), plus the complexity of all parameter values that may have to be determined within M_{i}).
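The bit counting above can be summarized in a short Python sketch (ours; we assume the model's own complexity is given in bits and that each misclassified item is designated by its index, costing log2 N bits):

```python
from math import log2

def mdl_cost(model_bits, n_correct, N):
    """Crude two-part code length (in bits): the model itself,
    plus log2(N) bits to designate each misclassified item by
    its index (its label can then be corrected)."""
    n_errors = N - n_correct
    return model_bits + n_errors * log2(N)

# Baseline: the null model M0 costs N bits (one per item).
# A model M_i is worth keeping only if its mdl_cost is below N.
```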
Minimum description length (4)
Can you provide a formal characterization of the best model in the \({M_i}\) family?
Minimum description length (5)
Can you describe how the worst model behaves (if we ignore its own complexity)?
The MDL principle illustrates Gregory Chaitin’s famous aphorism:
〝Comprehension is compression〞
(Gregory Chaitin, 2004)
This means that the best theory is the one that achieves the best compression by reaching a compromise between its own complexity and the complexity of describing the facts that are left unexplained.
This principle has far-reaching consequences in epistemology (read for instance The symmetry and simplicity of the
laws of physics and the Higgs boson).
Question:
Does learning really amount to a compression task?
Answer:
Yes, definitely!
Prototype-based models are one of the most popular machine learning techniques.
They offer a nice illustration of the idea that learning amounts to achieving compression.
Prototype-based models can be used in unsupervised learning to group similar data points together (e.g. in marketing or recommender systems).
Prototype-based models

A prototype is a point of the input space used to provide information about the distribution of data points. Intuitively, data are distributed "around" the prototypes.

One of the most famous prototype-based methods in machine learning is the K-Means algorithm. Several improvements have been proposed (Learning Vector Quantization, Online Learning Vector Quantization, Self-Organizing Maps, Neural Gas algorithms, etc.).
Compression

Prototype-based models can be seen as compression methods: instead of describing data points by specifying their absolute coordinates, one considers their coordinates relative to the closest prototype. We get smaller numbers, hence the compression.
The complexity of a point \(x_j\) in the dataset \(X\) is bounded by the sum of the complexities of its coordinates \((x_{j_1}, \dotsc, x_{j_d})\):
$$ C(x_j) \leq C(x_{j_1}) + \dotsc + C(x_{j_d})$$
where the complexity of each coordinate depends on the precision \(\delta\) used to convert it into an integer.
We get a "naive" upper bound of the complexity of the whole dataset \(X\):
$$C(X) \leq \sum_j C(x_j).$$
Compression
If prototypes \(\{p_k\}\) are introduced,
the complexity of \(X\) is now:
$$ C(X) \leq \sum_k C(p_k) + \sum_j C(x_j|\{p_k\})$$
where:
\(C(x_j|\{p_k\}) = \min_k C(x_j-p_k).\)
If the dataset exhibits some structure, suitable prototypes can be found and this estimate of \(C(X)\) will turn out to be significantly smaller than the "naive" one (hence the compression).
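As an illustration, the following Python sketch (function names and the integer bit-cost are our own simplifications) compares the naive coding cost of integer coordinates with their cost relative to the nearest prototype:

```python
import math

def bits(n):
    """Rough cost in bits of coding the non-negative integer n."""
    return 1 + math.floor(math.log2(n + 1))

def naive_cost(points):
    """Sum of the coordinate costs (the "naive" upper bound on C(X))."""
    return sum(bits(abs(c)) for p in points for c in p)

def prototype_cost(points, prototypes):
    """Cost of the prototypes plus each point's residual
    with respect to its cheapest (i.e. nearest) prototype."""
    cost = naive_cost(prototypes)
    for p in points:
        residuals = [[abs(a - b) for a, b in zip(p, q)] for q in prototypes]
        cost += min(sum(bits(r) for r in res) for res in residuals)
    return cost
```

On data clustered around the prototypes, prototype_cost comes out markedly smaller than naive_cost.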
Conversely, and quite remarkably, this expression can be used to find the best prototype set \(\{p_k\}\)
(Cilibrasi & Vitányi, 2005).

Classification
Now consider a classification task as an example of supervised learning. The problem is to associate a label (e.g. blue/red/dark) with each data point.
Prototype-based models are routinely used for this. In doing so, they offer a compressed version of the classification.

Classification
In a classification task, we consider a dataset \(X\) of \(|X|\) points \(x_j\). Each \(x_j\) is supposed to belong to a class labelled as \(y_j\). The problem is to learn which label \(y_j\) taken from the set of labels \(Y\) corresponds to \(x_j\).
The complexity of the classification is bounded by the amount of information required by rote learning:
$$C(\{(x_j, y_j)\} | X) \leq \sum_j C(y_j) \approx |X| \times \log(|Y|)$$

Classification
A prototype-based model \(\mathcal{M}\) would deduce the label of all points from those of its \(K\) prototypes \(\{p_k\}\), in the hope that:
$$\beta(x_j) = y_j$$
The decision function \(\beta\) just returns the label of the closest prototype. For instance, if a point is closer to prototype 3 (labeled "blue") than to any other prototype, the decision function will return "blue".

Classification
The complexity of the model is given by the complexity of its prototypes and their labels:
$$C(\mathcal{M}) \leq \sum_k C(p_k) + K \times \log (|Y|).$$
Now:
$$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) + C(\{(x_j, y_j)\} | X, \mathcal{M})$$
Since the decision function \(\beta\) is just a short loop over the prototypes, \(C(\{(x_j, y_j)\} | X, \mathcal{M}) \approx 0\), and so:
$$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) \ll |X| \times \log(|Y|)$$
We just have to remember the prototypes and their labels to retrieve all labels, hence the compression: \(C(\{(x_j, y_j)\} | X)\) is much smaller when \(K \ll |X|\) than if we only rely on rote learning.
Classification

This is only true, however, if the model \(\mathcal{M}\) is perfect. If not:
\(C(\{(x_j, y_j)\} | X, \mathcal{M}) \neq 0\).
For each misclassified point, we may just specify its index and the correct value of its label:
$$C(\{(x_j, y_j)\} | X, \mathcal{M}) \leq \sum_{i=1}^{|X|} (C(i) + C(y_i)) \times \mathbb{I}(y_i \neq \beta(x_i))$$
where \(\mathbb{I}\) is a function that converts a Boolean value into 0 or 1.

Classification
More simply, if there are \(n\) misclassified points, we get:
$$C(\{(x_j, y_j)\} | X, \mathcal{M}) \leq n \times (\log(|X|) + \log(|Y|)).$$

Classification

To sum up:
$$C(\{(x_j, y_j)\} | X) \leq \sum_k C(p_k) + K \times \log (|Y|) + n \times (\log(|X|) +\log(|Y|))$$
If \(n\) is small, this may still represent a huge compression as compared to rote learning.
The right-hand side of this inequality can be used as a loss function to determine the best prototype-based classification.
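This loss function can be written down directly. Here is a Python sketch (ours, with a crude integer bit-cost standing in for \(C(p_k)\)):

```python
import math

def bits(n):
    """Rough cost in bits of coding the non-negative integer n."""
    return 1 + math.floor(math.log2(n + 1))

def classifier_description_length(prototypes, n_points, n_labels, n_errors):
    """Upper bound on C({(x_j, y_j)} | X) for a prototype-based classifier:
    the prototypes, their labels, and the list of misclassified points
    with their corrected labels."""
    proto_bits = sum(bits(abs(c)) for p in prototypes for c in p)
    label_bits = len(prototypes) * math.log2(n_labels)
    error_bits = n_errors * (math.log2(n_points) + math.log2(n_labels))
    return proto_bits + label_bits + error_bits
```

Comparing this value with the rote-learning cost \(|X| \times \log(|Y|)\) tells us whether the model compresses at all.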
Prototype-based models (1)
In what way do unsupervised prototype-based models achieve compression?
Prototype-based models (2)
In what way does the supervised learning of classification achieve compression?
Clustering and compression
Problem:
Could we use the compression criterion to get parameter-free learning techniques?
Unsupervised Learning is a branch of machine learning in which data arrive with no labels. The purpose of unsupervised learning is to attribute labels to unlabeled and unstructured data in order to give them a structure. A clustering system receives a set of input data \(X\) and aims at grouping related points together. A group of related points is called a cluster. In basic clustering, a point can belong to only one cluster.
The most popular clustering algorithm, called K-Means, is a prototype-based algorithm. The algorithm initializes random prototype positions and alternates two phases:
it associates the data points to their closest prototype
it moves the prototype position to the centroid of its associated data points.
Implementation
We will be using the program . To run this script, you will need the packages numpy and scikit-learn installed on your machine. Numpy is a standard package for scientific computing in Python; Scikit-learn is a comprehensive library implementing basic machine learning algorithms (if necessary, execute the command pip install --user scikit-learn to install it).
Run the program. You can observe the complexity reduction due to the use of prototypes.
$ python Prototype.py
Iris dataset with 3 prototypes:
-------------------------------
>> prototypes found by K-Means:
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.428      1.462      0.246     ]
 [6.85       3.07368421 5.74210526 2.07105263]]
>> Complexity reduction when using K-Means:
Dataset complexity: 1191
Complexity after K-Means: 219
The program computes complexity by adding the prototypes’ complexity and the complexity of the data points’ distances to the closest prototype (see the "basic complexities" section of the program). Choosing the right number of clusters is a well-known issue. If we apply the MDL Principle, the ideal number \(K\) of clusters should minimize the previous complexity. With too few clusters, the model is simple, but the complexity of the data knowing the model may be large; conversely, too many clusters would make the model unnecessarily complex and would produce overfitting.
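The selection of \(K\) by complexity minimization can be sketched as follows. This is our own simplified version: the bit-cost function below is a stand-in for the script's "basic complexities" section, not the course's code.

```python
import numpy as np
from sklearn.cluster import KMeans

def dataset_complexity(X, prototypes, delta=0.01):
    """Bits needed to code the prototypes plus each point's offset to
    its nearest prototype (coordinates discretized at precision delta).
    A crude stand-in for the script's complexity estimate."""
    def bits(values):
        ints = np.abs(np.round(np.asarray(values) / delta)).astype(int)
        return float(np.sum(np.floor(np.log2(ints + 1)) + 1))
    dist = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
    nearest = prototypes[np.argmin(dist, axis=1)]
    return bits(prototypes) + bits(X - nearest)

def best_K(X, K_max=10):
    """Pick the number of clusters that minimizes total complexity (MDL)."""
    costs = {}
    for K in range(1, K_max + 1):
        km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X)
        costs[K] = dataset_complexity(X, km.cluster_centers_)
    return min(costs, key=costs.get)
```

With too few clusters the residuals are expensive; with too many, the prototypes themselves are: the minimum lies in between.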
KMeans and Compression
Run the algorithm on the ‘Iris’ data set.
For which value of the number \(K\) of prototypes do we reach a minimum of complexity?
Number of clusters
Choose the ‘Diabetes’ data set. Data now have 10 dimensions (instead of 4 for ‘Iris').
What is the optimal value of \(K\) with respect to the MDL principle?
(you may loop over increasing values of \(K\) up to 100)
We just saw how the MDL principle was used to determine the ideal number of prototypes in a clustering task. Extending the idea, maximal compression is ideally the ultimate criterion that could lead to parameter-free learning techniques (read for instance Cilibrasi & Vitányi, 2005).
Cilibrasi, R. & Vitányi, P. (2005). Clustering by compression. IEEE Transactions on Information Theory, 51(4), 1523-1545.
Kleinberg, J. M. (2003). An impossibility theorem for clustering. In S. Becker, S. Thrun & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (NIPS 2002), 463-470.
I wonder how neural networks fit with this idea of
"learning = compression".
They do more than rote learning. So they must achieve compression.
Consider GPT-3, for instance. Its training resulted in the fitting of 175 BILLION parameters. You call it compression!
I guess it all depends on what you consider is compressed.
Suppose you want to tell cats from dogs in a set of images.
Yes. There is one bit to be learned per image.
If you must fit 60 million parameters to decide on 1000 images, I don’t see the point.
One single relevant statement is worth a great deal of information, not just 10 bits.
Why?
Because you need a certain amount of information to select that statement among all the possible silly ones in the current context.
And so, according to you, it justifies the 175 billion parameters?
Indeed! It still achieves huge compression.
Anomaly Detection
What is an anomaly?
Question:
Can we characterize anomalies just as low-probability events?
Answer:
No. The probability of any event, described with sufficient precision within its context of occurrence, is zero! Maybe anomaly has to do with complexity...
Anomaly Detection is another typical problem in Unsupervised Learning. Anomalies of interest in Machine Learning are outliers. The problem of detecting outliers is close to the clustering task, in the sense that the learner tries to infer a model that describes "normal" data. Contrary to clustering, however, the point of anomaly detection is not to group similar data but to discard data that do not fit the model. We use the same program as previously mentioned. The complexity criterion, however, will be used for each data point to decide whether it is well described by the model. The compression rate is defined as \(\frac{\text{complexity after}}{\text{complexity before}}\). If it is close to 1, the point does not fit the model and may be regarded as an anomaly.
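As a rough illustration (our own sketch, not the course's anomalyDetection function; the bit-cost and the threshold value are simplifications), the compression-rate criterion can be coded as:

```python
import numpy as np

def compression_rates(X, prototypes, delta=0.01):
    """Per-point ratio: bits of the offset to the nearest prototype
    over bits of the raw coordinates (close to 1 = hardly compressed)."""
    def bits(values):
        ints = np.abs(np.round(values / delta)).astype(int)
        return np.sum(np.floor(np.log2(ints + 1)) + 1, axis=-1)
    dist = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
    nearest = prototypes[np.argmin(dist, axis=1)]
    return bits(X - nearest) / bits(X)

def abnormal_points(X, prototypes, threshold=0.8):
    """Indices of points whose compression rate exceeds the threshold."""
    return np.where(compression_rates(X, prototypes) > threshold)[0]
```

Points whose offset to the nearest prototype costs almost as much to encode as their raw coordinates get a rate near 1 and are flagged.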
Anomaly detection (1)
Modify the anomalyDetection function in order to select abnormal points. A point is considered abnormal if the compression rate exceeds the threshold value. The function should return a list of indices. Use the numpy method where. Copy the ‘return’ line here.
Anomaly detection (2)
Choose the dataset ‘diabetes’, and set the number of prototypes to the best value previously found. Uncomment the last lines to call anomalyDetection. Determine the threshold value that will select the 10 most abnormal points. Which value did you find?
Anomaly detection (3)
The preceding exercises aimed at detecting outliers.
What do you think...?
Suggestion:
The threshold method is a little bit difficult to use in practice (which value of the threshold?). You could try to think of other methods to select abnormal points from the list of compression rates.
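As a sketch, the compression-rate criterion described above might be implemented as follows. The function names, the bit-cost model (direct coordinate coding before, prototype index plus a crude log-scale residual code after), and the precision are all illustrative assumptions, not the course's actual `anomalyDetection` code.

```python
import numpy as np

def anomaly_scores(points, prototypes, precision_bits=16):
    # "Complexity before": coding each of the d coordinates directly.
    # "Complexity after": naming the nearest prototype, plus a crude
    # log-scale code for the residual distance, per coordinate.
    n, d = points.shape
    before = d * precision_bits
    dists = np.linalg.norm(points[:, None, :] - prototypes[None, :, :], axis=2)
    nearest = dists.min(axis=1)
    after = np.log2(len(prototypes)) + d * np.log2(2 * nearest + 2)
    return after / before  # compression rate: close to 1 => barely compressed

def detect_anomalies(points, prototypes, threshold=0.8):
    # Points whose compression rate exceeds the threshold are flagged.
    return np.where(anomaly_scores(points, prototypes) > threshold)[0]
```

A point sitting far from every prototype barely compresses at all, so its rate approaches 1 and it is reported as an outlier.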
Lossless and lossy compression
Algorithmic Information Theory is primarily about lossless compression. The point is to summarize objects in the most concise way, while keeping all the information needed to recover them. However, anomaly detection points to the possibility of getting rid of abnormal points, considered as mere noise.
More generally, lossy compression consists of discarding noisy or contingent data in the hope of capturing the gist of a phenomenon.
An extreme form of lossy prototype-based compression would be to keep the prototypes and to forget all data. This amounts to keeping only the model part in the expression of MDL. Simplification and sketching (as in comic strips) may be modelled as resulting from lossy compression.
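A minimal sketch of this extreme case (the function names are illustrative): each point is replaced by the index of its nearest prototype, and the residuals are simply thrown away.

```python
import numpy as np

def lossy_compress(points, prototypes):
    # Keep only the model part: one prototype index per point.
    dists = np.linalg.norm(points[:, None, :] - prototypes[None, :, :], axis=2)
    return dists.argmin(axis=1)

def reconstruct(indices, prototypes):
    # All fine-grained information is lost: every point comes back
    # as its prototype, like a sketch keeping only the broad strokes.
    return prototypes[indices]
```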
Note that such a simplification can only occur after the MDL optimization has been performed. In the next chapter, we will adopt the exact opposite perspective: situations in which only anomalies matter.
Forum: Lossy compression
Please mention cases in which lossy compression is mostly appropriate.
Universal Induction
Problem:
Is there a link between Bayes Rule and Solomonoff’s induction principle?
Given an observation \(x = x_1, x_2, ... x_k\)
about an ongoing phenomenon,
which hypothesis should we choose
among a set of hypotheses \(\{H_i\}\) to explain \(x\)
and predict its future?
Merging Epicurus, Occam, Bayes and others
Let’s proceed step by step:
According to Bayes’ rule, we need to attribute a prior \(p(H_i)\) to every \(H_i\) before observing any data.
According to Epicurus’ principle, \(p(H_i) = 0\) if and only if \(H_i\) is not consistent with the data.
According to Occam’s razor, \(p(H_i)\) must be small for complex hypotheses.
How do you quantify the complexity of a hypothesis? Answer: by seeing it as a computer program and by measuring its length!
Universal distribution for finite sequences
The expression of algorithmic probability reveals the respective contributions
of all consistent hypotheses \(p\):
$$\mathbf m(x) = \sum_{p:U(p)=x} 2^{-l(p)} \approx 2^{-K(x)}$$
where \(U\) is a prefix Turing machine.
Any computable hypothesis for \(x\) (i.e. which can be represented as a program \(p\) that generates \(x\)) receives a non-zero probability, and thus the distribution \(\mathbf m\) respects Epicurus’ principle.
Occam’s razor principle is respected as well, since longer hypotheses get lower probabilities under distribution \(\mathbf m\).
If we were to keep only one hypothesis, it would be the shortest program that generates \(x\).
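To make the formula concrete, here is a toy (and deliberately non-universal) machine on which \(\mathbf m\) becomes exactly computable: a "program" is a bitstring whose first three bits encode the length of a unit that is then repeated forever. The whole encoding is an assumption made for illustration only.

```python
from itertools import product

def run(program, out_len):
    # Toy machine: bits 0-2 give the unit length; the rest is the unit,
    # repeated forever and truncated to out_len. NOT a universal machine!
    unit_len = int(program[:3], 2) if len(program) >= 3 else 0
    if unit_len == 0 or len(program) != 3 + unit_len:
        return None  # invalid program
    unit = program[3:]
    return (unit * (out_len // unit_len + 1))[:out_len]

def m(x, max_len=12):
    # Approximate m(x) from below: sum 2^-l(p) over all programs
    # (up to max_len bits) whose output starts with x.
    total = 0.0
    for l in range(1, max_len + 1):
        for bits in product("01", repeat=l):
            p = "".join(bits)
            if run(p, len(x)) == x:
                total += 2.0 ** -l
    return total
```

On this machine, a string with a short repeating unit, such as 0101, receives a higher probability than the slightly less regular 0110: short programs dominate the sum, exactly as Occam’s razor demands.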
Universal distribution for finite sequences
\(\mathbf m(x)\) is a semi-computable distribution^{*}, which means that it can be approximated from below.
\(\mathbf m(x)\) is universal, in the following sense:
For any lower semi-computable distribution \(\rho\) and for all \(x\),
\(\mathbf m(x) \geq 2^{-K(\rho)} \rho(x)\).
This means that \(\mathbf m(x)\) cannot grossly underestimate
the probability of objects.
^{*} Strictly speaking, \(\mathbf m(x)\) is a semi-measure, as it does not sum up to 1.
Sequence prediction
A good way to check a hypothesis for a sequence is to predict what comes next.
12345...
Sequence prediction
12345...
We suppose we are in the deterministic case, which means that the sequence is produced by a program p.
What is the probability that the next symbol will be 6?
The probability is the sum of the probabilities of "picking" a program generating 123456 among all programs that generate 12345.
What is the probability of "picking" such a program?
If we flip a fair coin to build a program \(p\) of length \(l(p)\) bit by bit, the probability of that program is \(2^{-l(p)}\).
Sequence prediction
\(x = 12345...\)
$$\textrm{Pr}(x_6 = 6 \mid x_{1:5} = 12345) = \frac{\sum_{p:\,U(p)=123456...} 2^{-l(p)}}{\sum_{p:\,U(p)=12345...} 2^{-l(p)}}$$
where the sums are over all programs generating sequences
that start with 123456 (numerator) and 12345 (denominator), respectively.
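Here is a tiny predictor in this spirit, restricted to a toy hypothesis class ("the sequence repeats some unit forever") with a hand-made prior that decays exponentially with the unit's description length. The class, the prior, and the function name are all illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def predict_next(x, alphabet="01", max_unit=4):
    # Weight every hypothesis "repeat unit u forever" that is consistent
    # with the observed prefix x; shorter units get exponentially more prior.
    weights = {a: Fraction(0) for a in alphabet}
    for l in range(1, max_unit + 1):
        for unit in product(alphabet, repeat=l):
            u = "".join(unit)
            gen = u * (len(x) // l + 2)  # covers x plus at least one symbol
            if gen.startswith(x):
                weights[gen[len(x)]] += Fraction(1, (2 * len(alphabet)) ** l)
    total = sum(weights.values())
    return {a: w / total for a, w in weights.items()} if total else None
```

For x = "010", the short unit "01" dominates the mixture, so the predictor bets on "1" as the next symbol, while longer consistent units keep the alternative at a small but non-zero probability.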
Note that we deal now with programs that generate possibly infinite sequences.
Universal distribution for infinite sequences
The notion of universal distribution can be extended to infinite sequences.
In the binary case:
$$\mathbf M(x) = \sum_{p:U(p)=x*} 2^{-l(p)}$$
where \(x*\) means any continuation of \(x\).
\(\mathbf M\) is also called the Solomonoff distribution.
\(\mathbf M(x)\) is a complexity-weighted combination of all hypotheses explaining the infinite sequence, given the evidence \(x\).
Universal distribution for infinite sequences
As in the finite case, \(\mathbf M\) multiplicatively dominates any lower semi-computable distribution \(\mu(x)\), which means that it cannot grossly underestimate probabilities.
Conclusion
Among all hypotheses consistent with the data, the one with the least Kolmogorov complexity is the most likely one.
The universal distribution provides the best combination of all hypotheses by weighting them by their complexity.
This holds for finite objects, using universal distribution \(\mathbf m\).
It also holds for prediction tasks in which one wants to explain infinite sequences (using universal distribution \(\mathbf M\)).
Prediction
What would be a good way to bet on \(x_n\) knowing the sequence \(x\) up to \(x_{n-1}\), among the following?
Suggested reading:
Li, M. & Vitányi, P. (1997). An Introduction to Kolmogorov Complexity and Its Applications (2nd ed.). New York: Springer-Verlag, Chapters 4.3 and 5.1.
Universal Artificial Intelligence
Problem:
Can we define, at least in theory, the "most intelligent" action to take?
In some way, the "most intelligent" program one may think of would perform optimal decision making. Some scientists, such as Marcus Hutter, have built on Solomonoff’s ideas about learning and induction to define what they call "universal artificial intelligence". Such an AI would make the best possible decision at any time. Really?
Hutter’s AIXI instantiates an ideal version of reinforcement learning, based on algorithmic information. AIXI performs actions that maximize expected future rewards. At any moment, the expected reward results from an inductive process based on past observations and past actions. Future rewards are weighted by their probability, which depends on the complexity of the programs that generate them. Hutter’s equation for "ultimate intelligence" reads:
$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \ldots \max_{a_m} \sum_{o_m r_m} [r_k + \ldots + r_m] \sum_{q\,:\,U(q,a_1..a_m) = o_1 r_1..o_m r_m} 2^{-l(q)}$$
As Hutter puts it:
The expression shows that AIXI tries to maximize its total future reward \(r_k + \ldots + r_m\). If the environment is modeled by a deterministic program \(q\), then the future perceptions \(\ldots o_k r_k \ldots o_m r_m = U(q, a_1..a_m)\) can be computed, where \(U\) is a universal [...] Turing machine executing \(q\) given \(a_1..a_m\). Since \(q\) is unknown, AIXI has to maximize its expected reward, i.e. average \(r_k + \ldots + r_m\) over all possible future perceptions created by all possible environments \(q\) that are consistent with past perceptions. The simpler an environment, the higher is its a-priori contribution \(2^{-l(q)}\), where simplicity is measured by the length \(l\) of program \(q\).
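AIXI itself is incomputable, but the structure of the equation can be mimicked over an explicit, finite class of environments. Everything below (the environment encodings, their description lengths, the one-step horizon) is a didactic assumption, not Hutter’s construction.

```python
# Each toy "environment" q is deterministic: it maps an action to a reward.
# desc_len stands in for l(q); the prior contribution of q is 2^-desc_len.
ENVS = [
    {"desc_len": 2, "reward": lambda a: 1 if a == "left" else 0},
    {"desc_len": 3, "reward": lambda a: 1 if a == "right" else 0},
    {"desc_len": 5, "reward": lambda a: 1},  # always rewards, but complex
]

def best_action(history, actions=("left", "right")):
    # Keep environments consistent with past (action, reward) pairs, then
    # pick the action maximizing the prior-weighted expected reward --
    # a one-step, finite-class caricature of the AIXI expression.
    consistent = [e for e in ENVS
                  if all(e["reward"](a) == r for a, r in history)]
    def expected(a):
        return sum(2 ** -e["desc_len"] * e["reward"](a) for e in consistent)
    return max(actions, key=expected)
```

Initially the simplest environment dominates the mixture, so the agent tries "left"; once "left" is observed to yield no reward, only the second environment remains consistent and the agent switches to "right".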
We hope you enjoyed this chapter on AIT and Machine Learning.
Let’s leave the last word to Angel@ and Alexandr@.
This chapter on AIT and machine learning was nice, don’t you think?
Sure! I liked the idea that analogy is just a matter of minimizing complexity.
Oh, it’s just a special case of induction.
No, I disagree. Analogy is a form of reasoning.
Yes, but, induction is more general. I liked the last thing about universal induction and universal AI.
Yes, but it’s not computable.
It helps me understand that efficient reinforcement learning depends on your ability to compute the future.
To me, analogy and MDL are closer to practical use. I like the idea of optimal trade-off between model simplicity and accuracy.
And so you must be excited to see that any form of learning amounts to performing data compression.
Yes, we talked about that!
Yeah. As for me, my favorite remains the idea that anomalies must be simple to characterize to be genuine anomalies.
Yes, that one also!
Test (1)
The solution to the analogical equation: \(A\) is to \(B\) as \(C\) is to \(X\)
is the value of \(X\) that has minimal complexity.
Test (2)
Unsupervised prototype-based clustering achieves compression because distances to prototypes require less information.
Test (3)
Outliers in clustering are data points with low-complexity coordinates.
Test (4)
Trained neural networks involve millions, sometimes billions, of parameter values. Yet the point of using neural networks is still to achieve data compression.
Test (5)
MDL achieves lossless compression.
Forum: Chapter 4
Do you have feedback or suggestions about the topics addressed in this chapter?
Cornuejols, A. & Ales-Bianchetti, J. (1998). Analogy and induction: which (missing) link? Advances in Analogy Research: Integration of Theory and Data from the Cognitive, Computational, and Neural Sciences, 365-372. Sofia, Bulgaria: New Bulgarian University.
Hofstadter, D. R. (1995). Fluid concepts and creative analogies. New York: Basic Books.
Kleinberg, J. M. (2003). An impossibility theorem for clustering. In S. Becker, S. Thrun & K. Obermayer (Eds.), Advances in Neural Information Processing Systems 15 (NIPS 2002), 463-470.