IA325/IA703: Algorithmic Information and Artificial Intelligence

Lecturers:
Pierre-Alexandre Murena and Etienne Houzé

# Machine Learning and Algorithmic Information

#### What you will learn in this chapter

• a criterion to make optimal hypotheses in learning tasks,
• a method to solve analogies,
• a new way to detect anomalies,
• a new understanding of machine learning as a way to achieve compression,
• a method to define an unbeatable AI.

## The induction problem

Problem: What can we learn from (sometimes very few) examples?
You may have a look at the "Sheep in Scotland" joke to judge the difficulty of the task.

Understanding Artificial Intelligence
through Algorithmic Information Theory

Induction:
A fundamental component of Intelligence

aiai.telecom-paris.fr

Induction is a fundamental component of intelligence.

Thanks to induction, humans and robots can
• learn
• recognize situations
• detect new structures
• make appropriate decisions
• and much more.
Induction differs from Deduction.

Deduction generates knowledge through a proof.

All humans are mortal.
Socrates is human.
Therefore Socrates is mortal.

It is sometimes considered that deduction doesn’t produce new knowledge, as the conclusion was in some way already present, in a potential form, within the hypotheses.

Will it rain tonight?

Any intelligent being faces the problem of adapting to changing conditions, of deciding what to do next or what to believe next.
For this, deduction proves insufficient.
Induction is a process through which one can produce new knowledge from past experiences.

Induction is one of the oldest philosophical problems:
How can we extract general rules from limited data?
To be intelligent, induction must be guided.

Observe the following sequence.     12345
What symbol should come next?

Possible solutions:
1. 6: because sequence of integers
2. 7: because numbers k such that (7 × 10^k + 71)/3 is prime.
3. 1: because repetition of the sequence 12345
4. 42: because roots of the polynomial
$$x^6 - 57x^5 + 715x^4 - 3795x^3 + 9724x^2 - 11628x + 5040$$
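Option 4 can be checked directly. Assuming the intended polynomial is the monic degree-6 polynomial whose roots are 1, 2, 3, 4, 5 and 42, a short sketch confirms it:

```python
def p(x):
    # x^6 - 57x^5 + 715x^4 - 3795x^3 + 9724x^2 - 11628x + 5040
    # = (x - 1)(x - 2)(x - 3)(x - 4)(x - 5)(x - 42)
    return x**6 - 57*x**5 + 715*x**4 - 3795*x**3 + 9724*x**2 - 11628*x + 5040

print([p(r) for r in (1, 2, 3, 4, 5, 42)])  # prints [0, 0, 0, 0, 0, 0]
```

This illustrates the point of the exercise: any finite sequence admits infinitely many continuations, each backed by some rule.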
Guessing what comes next is a fundamental issue. For instance:

Prediction: Knowing past weather, can I predict if it will rain?
Or in other words: continue the sequence
(rain, rain, sun, rain, rain, sun, snow, mist, rain, rain)

Classification: Having observed a sequence of pairs (object, label), can I predict the label of a new object?
Or in other words: continue the sequence
(object 1, label 1, object 2, label 2, ... , object n, label n, object n+1)
What is the best strategy for induction?

Epicurus    341–270 BC

Principle of multiple explanations:
If more than one theory is consistent with the observations, keep all theories.
What is the best strategy for induction?

Sextus Empiricus    160-210

Validity of inductive reasoning:
‟When proposing to establish the universal from the particulars [...] induction will be insecure, since some of the particulars omitted in the induction may contravene the universal; while if one is to review all, one will be toiling at the impossible, since the particulars are infinite and indefinite.”
What is the best strategy for induction?

William of Ockham 1287-1347

Occam’s Razor Principle:
Entities should not be multiplied beyond necessity
What is the best strategy for induction?

Thomas Bayes    1701-1761

Bayes’ Rule:
The probability of hypothesis H being true given some data D is proportional to the learner’s prior belief in H multiplied by the conditional probability of D given H.
What is the best strategy for induction?

• Principle of multiple explanations
• Validity of inductive reasoning
• Occam’s Razor Principle
• Bayes’ Rule

These principles are often insufficient for determining the best induction.
For ages, the problem of induction has been considered to have no general solution...
until 1964, when Ray Solomonoff solved it by creating
the Algorithmic Information Theory framework.

Most IQ tests are based on induction. The problem is to find a relevant rule based on a few examples only.
 What is the odd object out in this list?        cow, hen, pig, sheep. This is a genuine induction problem: form a rule that covers three of the four items (but excludes the fourth one).

Choose valid solutions.
hen, because of biological classification
hen, because of the number of legs
sheep, because of the number of letters in the word
pig, because of some culinary restrictions
sheep, because of its long hair
cow, because of its weight

 Why are some solutions less relevant than others in the previous exercise?

Less relevant rules are more complex because they require determining a threshold value
Less relevant rules are more complex because they invoke new concepts
Less relevant rules are too simple because they are based on one single numerical value
Less relevant rules are too simple because they are based on one single concept

If you are interested in tempting and yet false inductions, have a look at Borwein integrals!

### An illustration: Analogy making

Question: Can we say that Analogy is like building bridges between neighbouring islands?

Analogy is a form of reasoning and mental representation that works by connecting similar situations and transferring knowledge between them. This way of reasoning is commonly found in IQ tests, which means that it is thought to be central to human intelligence. It is also manifest in language use, as when we draw comparisons or speak in metaphors.


Why is analogical reasoning so fundamental?

- In mathematics and science: discover new concepts, or generalize notions to other domains
- Justice: use of relevant past cases
- Art: metaphors, parody, pastiche
- Advertising: exploit similarity of products to influence people
- Humor: jokes are often based on faulty analogies
Examples of analogies

• Gills are to fish as lungs are to humans.
• Donald Trump is to the US as Vladimir Putin is to Russia
• Donald Trump is to Barack Obama as Barack Obama is to George Bush
• 37 is to 74 as 21 is to 42
• The sun is to Earth as the nucleus is to the electron
Formalization

A is to B as C is to D

is also noted:                  A : B :: C : D

If R is a relation that holds between A and B: R(A,B)
then the analogy says that the same relation holds between the two pairs: R(A,B) and R(C,D).

Understanding an analogy means:
finding the most appropriate relation R.
Formalization

A : B :: C : D

Following (Cornuejols, 1996; Cornuejols & Ales-Bianchetti, 1998), we may say that "the most appropriate relation R" is the one that makes the quadruplet (A,B,C,D) simplest
or, in other words,
the one that achieves the best compression.

This is a special case of the induction principle.
Formalization

A : B :: C : D

Such an analogy 𝒜 is often considered to obey the following axioms:
• identity:    𝒜(A,B,A,B)
• symmetry:    𝒜(A,B,C,D) ⇒ 𝒜(C,D,A,B)
• central permutation:    𝒜(A,B,C,D) ⇒ 𝒜(A,C,B,D)

Complexity minimization explains why these properties generally hold.
Analogical equation

We call analogical equation the problem of finding the missing term x in an analogy:

A is to B as C is to x

or:                     A : B :: C : x

Examples:
• Paris is to France as Helsinki is to ... ?
• ABC is to ABD as IJK is to ... ?
Analogical equation

: B :: C : x

Following the induction principle, solving the analogical equation amounts to finding $$\mathbf{x}$$ such that:

$$\mathbf{x} = \operatorname{argmin}_{\mathbf{x}} K(A, B, C, \mathbf{x})$$

where K designates Kolmogorov complexity.
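Kolmogorov complexity itself is not computable, but any compressor provides an upper bound on it. The following sketch illustrates the idea, using zlib as a crude stand-in for K and a hand-picked candidate set (both are assumptions of this illustration, not part of the course material):

```python
import zlib

def K(s: str) -> int:
    # crude, computable upper bound on Kolmogorov complexity:
    # the compressed length of the string, in bytes
    return len(zlib.compress(s.encode(), 9))

def solve_equation(A, B, C, candidates):
    # pick the x that minimizes the complexity of the whole quadruplet
    return min(candidates, key=lambda x: K(f"{A} : {B} :: {C} : {x}"))
```

On strings this short, a general-purpose compressor is a poor proxy for K; this is why the dedicated language presented later for morphological analogies gives much better results.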
Analogical equation

A : B :: C : x

Doug Hofstadter is well-known for his original study of analogy. Let’s consider one of his favourite examples.

abc : abd :: ppqqrr : x

Have you come up with a solution?
Analogical equation

abc : abd :: ppqqrr : x

x = ppqqss.

Why should we consider it a "better" solution than, say, x = ppqqrd?
Analogical equation

abc : abd :: ppqqrr : x

x = ppqqss.                  x = ppqqrd.

We have to compare two rules:
• increment the last item
• replace the last letter by d

The second rule has two drawbacks in terms of complexity:
1. it introduces a constant, d;
2. if we do not want to lose the structure that led to the compression of ppqqrr, the rule should read:
"break the last item back into letters, and replace the last letter by d".

This may explain why x = ppqqss would appear smarter.

#### Computing analogies

Question: Can we process analogies automatically, knowing that complexity is not computable?

Answer:   Yes, and we show it in the case of morphological analogies.

When you learn a new language, you often infer grammar rules long before discovering their explicit form in a grammar book. This process is typical of analogical reasoning.
Let’s say you are studying Finnish and that you learn the verb puhua. You discover that the first person in the present tense of this verb is puhun. Now, you want to conjugate the verb nauraa in the same tense and in the same person. What do you suggest? The similarity of the structure lets you think that the answer is probably nauran. Congratulations! You are now able to conjugate the Finnish first group of verbs in the first person of the present tense! You can do it because you intuitively know how to process morphological analogies.


Morphological analogies

Morphological analogies concern the morphology of words.

• semantic analogy: "king is to queen as man is to woman"
• morphological analogy:    "king is to kings as queen is to queens"

Morphological analogies match different grammatical forms of the same word. They concern the symbols themselves, not their semantics.
Example of morphological analogy

Latin’s accusative case

rosa : rosam :: vita : x

Source: transformation of "rosa" into "rosam".

Target: the same transformation may apply to the target ("vita"). The result would be x = "vitam".

BUT other transformations are also possible (for instance: always answer "rosam")
Can we compute the ‘correct’ transformation?
Computing morphological analogies

In (Murena et al., 2020), we introduced a simple language to implement morphological transformations.

let, ?0 : ?0, m, let
mem, ?0 = rosa
::
mem, ?0 = vita

line 1: define an operation that, given a variable ?0, produces "?0 : ?0m"
line 2: call this operation with variable ?0 = rosa
line 3: move to the second half of the analogy
line 4: call this operation with variable ?0 = vita
Computing morphological analogies

rosa : rosam :: vita : x

1. We use this simple language to try transformations that can solve the analogical equation.

2. We only keep transformations that match the first two terms (the source: rosa, rosam) and can also be applied to the third term (in the target: vita).

3. The algorithm is almost brute force, as there are few such candidate transformations.

4. We keep the simplest transformation.
Computing morphological analogies

And it works!

The algorithm has been used to guess the fourth term of morphological analogies.

Our algorithm based on complexity minimization (here named NLG_COMP) compares favourably with a state-of-the-art method, NLG_PROP (Fam & Lepage, 2018), on the SIGMORPHON’16 dataset.

This example illustrates the fact that Kolmogorov complexity can inspire powerful learning algorithms. 🙂

##### Computing morphological analogies
Let’s show that we can solve analogies on words using a minimal complexity principle. To do so, we consider a specialized programming language.
• A program in this language corresponds to a sequence of comma-delimited instructions.
• A sequence describes a series of concatenations. For instance, the program 'ab','cd' outputs the string abcd.
• The main characteristic of our language is the implementation of a memory as a push-down stack. At any time, an element can be stored into memory, using two let instructions (an opening one and a closing one). For instance, let, 'abc', let stores the string abc in memory.
• The operator mem is used to retrieve the last elements stored in memory. The ith last element stored in memory can be retrieved with the instruction mem, i. By convention, the index starts with 0 (which means that mem,0 retrieves the very last element saved in memory).
 Which would be the output of the program let,'ab',let,let,'cd',let,mem,1?

ab
cd
abcd
abcdcd

• In addition, we allow the program to store operations in memory, in which case the parameters are denoted by ?n, where n is the index of the parameter. For instance, the instruction let, ?0, ?1, let stores an operation that concatenates its two operands.
 Which of the following programs defines an operation which duplicates a string and introduces a hyphen to separate the duplicates?

let, ?0, '-', ?1, let
let, '-', let
let, ?0, ?0, let
let, ?0, '-', ?0, let

Once an instruction is stored in memory, it can be retrieved using the mem codeword, followed by the correct number of arguments. For instance, the code let, ?0, ?1, let, mem, 0, 'abc', 'ijk' defines a concatenation operator, then retrieves it from memory, applies it to two strings and finally outputs abcijk. Note that if the number of arguments is incorrect, the code will raise a syntax error.
 Which of the following programs will not raise a syntax error?

let, ?0, '-', ?1, let, mem, 0
let, ?0, '-', ?1, let, mem, 0, let, 'abc', let
let, ?0, '-', ?1, let, mem, 0, 'abc', 'abd'
let, ?0, '-', ?1, let, mem, 0, 'abc'
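The rules above are enough to write a toy interpreter. The sketch below is our own reconstruction, not the course's actual compiler: it assumes comma-separated tokens, handles string and operation storage, and does no error checking.

```python
def run(program: str) -> str:
    """Toy interpreter: comma-separated tokens, a let...let memory stack,
    and mem,i retrieval with ?n argument substitution."""
    tokens = [t.strip().strip("'") for t in program.split(",")]
    memory, out, i = [], [], 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "let":                      # store everything up to the closing let
            close = tokens.index("let", i + 1)
            memory.append(tokens[i + 1:close])
            i = close + 1
        elif tok == "mem":                    # retrieve the i-th last stored item
            body = memory[-1 - int(tokens[i + 1])]
            nargs = len({t for t in body if t.startswith("?")})
            args = tokens[i + 2:i + 2 + nargs]
            out += [args[int(t[1:])] if t.startswith("?") else t for t in body]
            i += 2 + nargs
        else:                                 # plain token: concatenate to the output
            out.append(tok)
            i += 1
    return "".join(out)

print(run("let, ?0, a, :, ?0, a, m, let, mem, 0, ros, ::, mem, 0, vit"))
# prints rosa:rosam::vita:vitam
```

Under these conventions, a let block stores its body without outputting it, which is why the quiz program above outputs only what mem retrieves.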

A compiler for this language is given in the accompanying Python script. The compilation is called by the method generate_string, which takes as argument the instruction as a string. For instance:

$ python
>>> import analogy
>>> analogy.generate_string("abc, ijk")
'abcijk'
>>> analogy.generate_string("let, ?0, -, ?0, let, mem, 0, orang")
'orang-orang'
>>> analogy.instruction_complexity("let, ?0, -, ?0, let, mem, 0, orang")
59

How can such a program be used to evaluate good analogies? A solution consists of assuming that the quality of an analogy A:B::C:D is measured by the complexity of the string 'A:B::C:D'. The program estimates this complexity through the function instruction_complexity(instruction_string), which takes as argument the instruction as a string.

 Select all programs that generate the string: rosa:rosam.

let, ?0, :, ?0, ?1, let, mem, 0, rosa, m
let, ?1, :, ?0, let, mem, 0, m, rosa
let, ?0, a, :, ?0, a, m, let, mem, 0, ros
let, ?0, :, ?0, m, let, mem, 0, rosa
rosa:rosam

 Based on the estimates of instruction_complexity, how complex is the least complex among these programs?

Solving analogies in this way is very time-consuming for two reasons. First, it requires the exploration of all possible solutions. Second, we have to explore all possible programs that generate each solution. A simplified solver is proposed in the function restricted_search. To execute it, just provide the desired analogy. The method will suggest a result based on complexity minimization. For instance, you can check the analogy rosa:rosam::vita:vitam by calling restricted_search("rosa:rosam::vita:vitam").

$ python
>>> import analogy
>>> analogy.restricted_search('rosa:rosam::vita:vitam')
Best instruction: let,?0,a,:,?0,a,m,let,mem,0,ros,::,mem,0,vit
Expected result: vitam
Obtained result: vitam

Be careful: though the exploration space is restricted, the exploration can still take some time, especially if you ask for difficult analogies!
 Execute the code to solve an analogy in Finnish: puhua:puhun::katsoa:katson. Which instruction do you get as a result?

 Does the restricted_search method work in all cases? Does it work on analogies such as "abc:abd::ijk:ijl" (known as Hofstadter’s analogies)? Can you explain why?

It does not work in this case, because there are no vowels in the substitution.
It does not work in this case, because there are no common letters between ‘abc’ and ‘ijk’.
It does not work in this case, because the order of letters is not implemented.

 Can you think of analogies in your mother tongue (or in any language you know) that won’t work with this method?

I saw a funny thing on Twitter. What does it say?

It asks what comes after 5, 7, 11, 13, 17, 19, 23. There are two options: 25 and 29. I can see that people hesitate: It’s almost half-half! I don’t understand why!

Why what? There’s an obvious answer.

Which is? You have to select the next prime, which is 29.

Ha ha! You are reacting exactly like "MC Pablo"!
It’s in French.

So you didn’t see the pattern +2 +4 +2 +4 +2 +4. Ah! OK, I understand why 25 makes sense, now.

So it’s clear which is the "correct" answer. Yes. You just choose the less complex one.

And so you will favor 25, as the Turing machine that computes +2, +4, +2, +4... is smaller than the one that computes prime numbers. No, I was about to say that 29 was simpler!

You can’t say that! Yes I can! If you already know what prime numbers are, it’s simpler to say "retrieve the next prime".

"Numbers congruent to 1 or 5 mod 6",
"Numbers n such that (2n + 1)/3 is prime",
"nth irreducible polynomial over Q (with coefficients 0 or 1) evaluated at x=2".

The first one is the same as +2, +4, +2... Yes, no... Not exactly. But anyway, the other ones are really crazy!

You say this because they are way too complex, as compared with merely saying 5, 7, 11, 13, 17, 19, 23. Yes, I guess so.
Maybe complexity offers us a way to measure "craziness" 🤪!

## Minimum Description Length (MDL)

Problem: Can we find a computable version of the induction principle?
Complexity theory had a big impact on Artificial Intelligence, especially in the domain of machine learning. Ray Solomonoff used it to solve the problem of induction in practice. His method, which merges Bayesianism and Kolmogorov complexity, mainly consists of complexity minimization. Though it led to strong theoretical results, it remains incomputable.

As an attempt to come closer to computability, one may restrict the set of ‘models’ that are considered to account for the data. This idea has been implemented in two principles: MDL (minimum description length, developed by Jorma Rissanen) and MML (minimum message length, developed by Chris Wallace).
In what follows, we present a version of MDL that we will call "crude MDL" or "ideal MDL" (which is very close to MML).

Suppose you want to classify a set of N data points D = {xk} into two classes, "blue"/"red". Your toolbox contains different computable models Mi that can be used to label the data as blue or red.

According to the (crude) Minimum Description Length principle, you should choose the model that achieves the best compromise between simplicity and accuracy:

(crude) MDL principle

Choose Mi such that Mi = argminj { C(Mj) + C(D|Mj) }
 Suppose you don’t know the true category (blue or red) of the $$N$$ objects. Among how many different classifications should we search for the correct one?

a.    $$2 \times N$$             b.    $$N^2$$             c.    $$2^N$$

Algorithmic Information Theory describes the learning task as a compression operation. The "null model" M0 consists of remembering the class of each object. It requires N bits (1 bit for each element to be labelled as red or blue). We may take this as a baseline. The different models Mi are attempts to classify the data in a more economic way. In other words, the MDL problem consists of choosing among the {Mi} the model that leads to maximal overall compression.

Maybe Mi is not perfect, which means that Mi classifies only ni items among the {xk} correctly. We want to decide which is the "best" Mi.
 What is the compression (in number of bits saved by comparison with the null model) if $$M_i$$ is perfect ($$n_i = N)$$?

$$N$$         $$C(M_i)$$         $$N - C(M_i)$$         $$C(M_i) - N$$

Suppose now that Mi isn’t perfect (ni < N). To reach a correct classification, one needs to designate all misclassified data, for instance by providing their indices.
 How much worse (in number of bits) is the compression of an imperfect model, as compared with the perfect case?

a:    $$n_i \times log_2(N)$$ bits             d:    $$(N - n_i) \times log_2(N)$$ bits
b:    $$(N - n_i)$$ bits                       e:    $$(N - n_i \times log_2(N))$$ bits
c:    $$N \times log_2(n_i)$$ bits             f:    $$N \times log_2(N - n_i)$$ bits

Note that the complexity of Mi includes the complexity of designating it among all models (this may amount to the complexity of i, i.e. log2(i), plus the complexity of all parameter values that may have to be determined within Mi).
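Numerically, the trade-off looks like this (the model complexities and accuracies below are invented for illustration):

```python
from math import log2

def two_part_codelength(model_bits: float, n_correct: int, N: int) -> float:
    # crude MDL: bits for the model itself, plus log2(N) bits
    # to point at each misclassified item
    return model_bits + (N - n_correct) * log2(N)

N = 1000
models = {                     # name: (C(M) in bits, items classified correctly)
    "tiny":     (5,   700),    # very simple, but many errors
    "balanced": (50,  990),    # moderate complexity, few errors
    "huge":     (500, 1000),   # perfect, but very complex
}
best = min(models, key=lambda m: two_part_codelength(*models[m], N))
print(best)  # prints balanced: moderate complexity plus a few corrections wins
```

The "huge" model overfits (its description costs more than the errors it avoids), while the "tiny" one underfits: the MDL criterion arbitrates between the two.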
 Can you provide a formal characterization of the best model in the $${M_i}$$ family?

a.    $$argmin(C(N) + C(M_i | N) + n_i \times log_2(N))$$
b.    $$argmin(C(i) + C(M_i | i) + (N-n_i) \times log_2(N))$$
c.    $$argmin(C(N) + C(M_i | i) + N \times log((N-n_i)))$$

 Can you describe how the worst model behaves (if we ignore its own complexity)?

The worst model makes a random decision on each data point.
The worst model assigns the wrong class to each data point.
The worst model assigns the wrong class to half of the data.

The MDL principle illustrates Gregory Chaitin’s famous aphorism:

Comprehension is compression

(Gregory Chaitin, 2004)

This means that the best theory is the one that achieves the best compression by reaching a compromise between its own complexity and the complexity of describing the facts that are left unexplained. This principle has far-reaching consequences in epistemology (read for instance The symmetry and simplicity of the laws of physics and the Higgs boson).

## Prototype-based models and compression

Question: Does learning really amount to a compression task?

Prototype-based models are one of the most popular machine learning techniques. They offer a nice illustration of the idea that learning amounts to achieving compression.

Understanding Artificial Intelligence

Prototype-based models

Prototype-based models can be used in unsupervised learning to group similar data points together (e.g. in marketing or recommender systems).

A prototype is a point of the input space used to provide information about the distribution of data points.

Intuitively, data are distributed "around" the prototypes.
Prototype-based models

One of the most famous prototype-based methods in machine learning is the K-Means algorithm.

Several improvements have been proposed (Learning Vector Quantization, Online Learning Vector Quantization, Self-Organizing Maps, Neural Gas algorithms, etc.).

You can watch nice Neural Gas demos here and there.
Compression

Prototype-based models can be seen as compression methods:

Instead of describing data points by specifying their absolute coordinates, one considers their coordinates relative to the closest prototype.

We get smaller numbers, hence the compression.
Compression

The complexity of a point $$x_j$$ in the dataset $$X$$ is bounded by the complexity of its coordinates $$(x_{j_1}, \dotsc, x_{j_d})$$: $$C(x_j) \leq C(x_{j_1}) + \dotsc + C(x_{j_d}).$$ where the complexity of coordinates depends on the precision $$\delta$$ used to convert them into integers.
We get a "naive" upper bound of the complexity of the whole dataset $$X$$: $$C(X) \leq \sum_j C(x_j).$$
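The two bounds can be compared on a toy dataset. In the sketch below, the bit length of an integer serves as a crude stand-in for the complexity of a coordinate (an assumption of this illustration):

```python
def bits(n: int) -> int:
    # rough code length of a signed integer coordinate
    return max(1, abs(n)).bit_length() + 1

def naive_cost(X):
    # C(X) <= sum_j C(x_j), coordinate by coordinate
    return sum(bits(c) for x in X for c in x)

def prototype_cost(X, prototypes):
    # C(X) <= sum_k C(p_k) + sum_j min_k C(x_j - p_k)
    proto = sum(bits(c) for p in prototypes for c in p)
    residuals = sum(min(sum(bits(a - b) for a, b in zip(x, p))
                        for p in prototypes) for x in X)
    return proto + residuals

# two tight clusters far from the origin
X = [(1000 + i, 1000 - i) for i in range(-5, 6)] + \
    [(2000 + i, 2000 - i) for i in range(-5, 6)]
print(naive_cost(X), prototype_cost(X, [(1000, 1000), (2000, 2000)]))
```

Because each residual is a small offset around a prototype, the second bound is much smaller than the naive one.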

Compression

If prototypes $$\{p_k\}$$ are introduced, the complexity of $$X$$ is now: $$C(X) \leq \sum_k C(p_k) + \sum_j C(x_j|\{p_k\})$$ where: $$C(x_j|\{p_k\}) = \min_k C(x_j-p_k).$$

If the dataset exhibits some structure, suitable prototypes can be found and this estimate of $$C(X)$$ will turn out to be significantly smaller than the "naive" one (hence the compression).

Conversely, and quite remarkably, this expression can be used to find the best prototype set $$\{p_k\}$$ (Cilibrasi & Vitányi, 2005).
Classification

Now consider a classification task as an example of supervised learning.
The problem is to associate a label (e.g. blue/red/dark) to each data point.

Prototype-based models are routinely used for this.

In doing so, they offer a compressed version of the classification.
Classification

In a classification task, we consider a dataset $$X$$ of $$|X|$$ points $$x_j$$. Each $$x_j$$ is supposed to belong to a class labelled as $$y_j$$. The problem is to learn which label $$y_j$$ taken from the set of labels $$Y$$ corresponds to $$x_j$$.

The complexity of the classification is bounded by the amount of information required by rote learning: $$C(\{(x_j, y_j)\} | X) \leq \sum_j C(y_j) \approx |X| \times \log(|Y|)$$
Classification

A prototype-based model $$\mathcal{M}$$ would deduce the label of all points from those of its $$K$$ prototypes $$\{p_k\}$$, in the hope that: $$\beta(x_j) = y_j$$ The decision function $$\beta$$ just returns the label of the closest prototype.

For instance, if the point is closer to prototype 3 (labeled "blue") than to any other prototype, the decision function will return "blue".
Classification

The complexity of the model is given by the complexity of its prototypes and their labels: $$C(\mathcal{M}) \leq \sum_k C(p_k) + K \times \log (|Y|).$$ Now: $$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) + C(\{(x_j, y_j)\} | X, \mathcal{M})$$ If the model classifies perfectly, the decision function $$\beta$$ is just a loop over the prototypes, so $$C(\{(x_j, y_j)\} | X, \mathcal{M}) = 0$$ and: $$C(\{(x_j, y_j)\} | X) \leq C(\mathcal{M}) \ll |X| \times \log(|Y|)$$ whenever $$K \ll |X|$$. We just have to remember the prototypes and their labels to retrieve all labels, hence the compression as compared with rote learning.

Classification

This is only true, however, if the model $$\mathcal{M}$$ is perfect.

If not: $$C(\{(x_j, y_j)\} | X, \mathcal{M}) \neq 0$$.

For each misclassified point, we may just specify its index and the correct value of its label. $$C(\{(x_j, y_j)\} | X, \mathcal{M}) \leq \sum_{j=1}^{|X|} (C(j) + C(y_j)) \times \mathbb{I}(y_j \neq \beta(x_j))$$ where $$\mathbb{I}$$ is a function that converts a Boolean value into 0 or 1.
Classification


More simply, if there are $$n$$ misclassified points, we get: $$C(\{(x_j, y_j)\} | X, \mathcal{M}) \leq n \times (\log(|X|) +\log(|Y|)).$$
Classification

To sum up: $$C(\{(x_j, y_j)\} | X) \leq \sum_k C(p_k) + K \times \log (|Y|) + n \times (\log(|X|) +\log(|Y|))$$ If $$n$$ is small, this may still represent a huge compression as compared to rote learning. The right-hand side of this inequality can be used as a loss function to determine the best prototype-based classification.
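As a sanity check, the right-hand side can be evaluated on invented numbers (every value below is illustrative, not taken from the course):

```python
from math import log2

def classification_bound(proto_bits, K, n_labels, n_errors, N):
    # sum_k C(p_k) + K*log|Y| + n*(log|X| + log|Y|)
    return proto_bits + K * log2(n_labels) + n_errors * (log2(N) + log2(n_labels))

N, n_labels = 10_000, 2
rote = N * log2(n_labels)        # rote learning: one bit per point, 10000 bits
model = classification_bound(proto_bits=320, K=10, n_labels=n_labels,
                             n_errors=50, N=N)
# a 10-prototype model making 50 errors needs about 1044 bits: ~10x compression
```

Even an imperfect model compresses hugely, as long as the cost of naming its errors stays small.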

 In what way do unsupervised prototype-based models achieve compression?

Because distances to prototypes are small.
Because one gets rid of data points and only keeps prototypes.
Because prototypes are kept far from each other.

 In what way does the supervised learning of classification achieve compression?

Because distances to prototypes are small.
Because one gets rid of data points and only keeps prototypes.
Because prototypes are kept far from each other.

### Clustering and compression

Problem: Could we use the compression criterion to get parameter-free learning techniques?

Unsupervised Learning is a branch of machine learning in which data arrive with no labels. The purpose of unsupervised learning is to attribute labels to unlabeled and unstructured data in order to give them a structure.

A clustering system receives a set of input data $$X$$ and aims at grouping related points together. A group of related points is called a cluster. In basic clustering, a point can belong to only one cluster.
The most popular clustering algorithm, called K-Means, is a prototype-based algorithm. The algorithm initializes random prototype positions and alternates two phases:
• it associates the data points to their closest prototype
• it moves the prototype position to the centroid of its associated data points.

#### Implementation

We will be using the program Prototype.py. To run this script, you will need the packages numpy and scikit-learn installed on your machine. Numpy is a standard package for scientific computing in Python; Scikit-learn is a comprehensive library implementing basic machine learning algorithms (if necessary, execute the command pip install --user scikit-learn to install Scikit-learn).

Run the program. You can observe the complexity reduction due to the use of prototypes.

$ python Prototype.py

Iris dataset with 3 prototypes:
-------------------------------
>> prototypes found by K-Means:
[[5.9016129 2.7483871 4.39354839 1.43387097]
[5.006 3.428 1.462 0.246 ]
[6.85 3.07368421 5.74210526 2.07105263]]

>> Complexity reduction when using K-Means:
Dataset complexity: 1191
Complexity after K-Means: 219

The program computes complexity by adding the prototypes’ complexity and the complexity of the data points’ distances to the closest prototype (see the "basic complexities" section of the program).

Choosing the right number of clusters is a well-known issue. If we apply the MDL Principle, the ideal number $$K$$ of clusters should minimize the previous complexity. With too few clusters, the model is simple, but the complexity of the data knowing the model may be large; conversely, too many clusters would make the model unnecessarily complex and would produce overfitting.
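Prototype.py is not reproduced here, but the selection loop can be sketched independently. In the sketch below, the mini K-Means, the integer quantization and the bit-counting proxy for complexity are all our own assumptions:

```python
import random

def kmeans(X, K, iters=30, seed=0):
    # bare-bones Lloyd's algorithm on lists of coordinate tuples
    protos = random.Random(seed).sample(X, K)
    for _ in range(iters):
        groups = [[] for _ in range(K)]
        for x in X:
            nearest = min(range(K),
                          key=lambda k: sum((a - b) ** 2
                                            for a, b in zip(x, protos[k])))
            groups[nearest].append(x)
        protos = [tuple(sum(c) / len(g) for c in zip(*g)) if g else protos[k]
                  for k, g in enumerate(groups)]
    return protos

def bits(v):
    # code length of a value quantized to integer precision (crude proxy)
    return max(1, abs(round(v))).bit_length() + 1

def codelength(X, K):
    # MDL criterion: complexity of the prototypes + complexity of the residuals
    protos = kmeans(X, K)
    cost = sum(bits(c) for p in protos for c in p)
    for x in X:
        cost += min(sum(bits(a - b) for a, b in zip(x, p)) for p in protos)
    return cost

# two tight, well-separated clusters: the criterion should prefer K = 2 to K = 1
X = [(i, 0) for i in range(10)] + [(1000 + i, 1000) for i in range(10)]
best_K = min(range(1, 6), key=lambda K: codelength(X, K))
```

The same loop, run over increasing K, is what the exercises below ask you to do on the Iris and Diabetes datasets.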
 Run the algorithm on the ‘Iris’ data set. For which value of the number $$K$$ of prototypes do we reach a minimum of complexity?

 Choose the ‘Diabetes’ data set. Data now have 10 dimensions (instead of 4 for ‘Iris'). What is the optimal value of $$K$$ with respect to the MDL principle? (you may loop over increasing values of $$K$$ up to 100)

We just saw how the MDL principle can be used to determine the ideal number of prototypes in a clustering task. Extending the idea, maximal compression is ideally the ultimate criterion that could lead to parameter-free learning techniques (read for instance Cilibrasi & Vitányi, 2005).

I wonder how neural networks fit with this idea of
"learning = compression".
They do more than rote learning. So they must achieve compression.

Consider GPT-3, for instance. Its training resulted in the fitting of 175 BILLION parameters. You call it compression! I guess it all depends on what you consider is being compressed.

Suppose you want to tell cats from dogs in a set of images. Yes. There is one bit to be learned per image.

If you must fit 60 million parameters to decide on 1000 images, I don’t see the point. In the paper that made deep learning famous, there were 1000 categories, not just two.

Maybe. That makes 10 bits per image. If you have 1000 images to categorize, that makes only $$10^4$$ bits to learn. The network was trained on 1.2 million images.

Even so! 12 million bits against 60 million real-valued parameters, I don’t see the compression. But that’s not the point. You want to decide on unseen images.

So what? The point is that what you’ve learned with 60M parameters is potentially unbounded. The network can now categorize an infinity of different images.

So, you mean that you compress an infinity down to "only" 60M parameters? Yes.
And in the case of GPT-3, ...

What about GPT-3? Did you see that it is able to make up relevant answers to questions?

Yes, that’s impressive! One single relevant statement is worth a great deal of information, not just 10 bits.

Why? Because you need a certain amount of information to select that statement among all the possible silly ones in the current context.

And so, according to you, it justifies the 175 billion parameters? Indeed! It still achieves huge compression.

## Anomaly Detection

### What is an anomaly?

Question: Can we characterize anomalies just as low-probability events?

Answer:   No. The probability of any event, described with sufficient precision within its context of occurrence, is zero! Maybe anomaly has to do with complexity...

Anomaly Detection is another typical problem in Unsupervised Learning. Anomalies of interest in Machine Learning are outliers. The problem of detecting outliers is close to the clustering task, in the sense that the learner tries to infer a model that describes "normal" data. Contrary to clustering, however, the point of anomaly detection is not to group similar data but to discard data that do not fit the model.

We use the same program Prototype.py as before. The complexity criterion, however, will now be used for each data point to decide whether it is well described by the model. The compression rate is defined as $$\frac{\text{complexity after}}{\text{complexity before}}$$. If it is close to 1, the point does not fit the model and may be regarded as an anomaly.
 Modify the anomalyDetection function in order to select abnormal points. A point is considered abnormal if the compression rate exceeds the threshold value. The function should return a list of indices. Use the numpy method where. Copy the ‘return’ line here.

 Choose the dataset ‘diabetes’, and set the number of prototypes to the best value previously found. Uncomment the last lines to call anomalyDetection. Determine the threshold value that will select the 10 most abnormal points. Which value did you find?

 The preceding exercises aimed at detecting outliers. Which of the following do you think is true?

An outlier is an anomaly because it is too simple.
An outlier is an anomaly because it is too complex.

Suggestion:
• The threshold method is a little bit difficult to use in practice (which value of the threshold?). You could try to think of other methods to select abnormal points from the list of compression rates.
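One such alternative is to rank the points and keep a fixed number of the worst-compressed ones, which avoids choosing a threshold altogether. A minimal sketch, assuming the compression rates are available as a NumPy array (the function name top_k_anomalies is ours, not part of the script):

```python
import numpy as np

def top_k_anomalies(rates, k=10):
    # rates[i]: compression rate of point i (close to 1 = poorly compressed)
    # return the indices of the k points with the highest rates,
    # most anomalous first
    return np.argsort(rates)[-k:][::-1]

rates = np.array([0.3, 0.95, 0.4, 0.99, 0.5])
print(top_k_anomalies(rates, k=2))  # → [3 1]
```

Other options in the same spirit include flagging points whose rate exceeds the mean by a few standard deviations, which adapts the cut-off to the data.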

#### Lossless and lossy compression

Algorithmic Information Theory is primarily about lossless compression. The point is to summarize objects in the most concise way, while keeping all the information needed to recover them. However, anomaly detection points to the possibility of getting rid of abnormal points, considered as mere noise.
More generally, lossy compression consists of discarding noisy or contingent data in the hope of capturing the gist of a phenomenon. An extreme form of lossy prototype-based compression would be to keep the prototypes and to forget all data. This amounts to keeping only the model part in the expression of MDL. Simplification and sketching (as in comic strips) may be modelled as resulting from lossy compression.
Note that such a simplification can only occur after the MDL optimization has been performed.

In the next chapter, we will adopt the exact opposite perspective: situations in which only anomalies matter.
 Please mention cases in which lossy compression is mostly appropriate.

## Universal Induction

Problem: Is there a link between Bayes Rule and Solomonoff’s induction principle?

Universal Distribution and Prediction

What is the best way to predict the future?

Given an observation $$x = x_1, x_2, \ldots, x_k$$,
which hypothesis should we choose
among a set of hypotheses $$\{H_i\}$$ to explain $$x$$
and predict its future?

Merging Epicurus, Occam, Bayes and others

Let’s proceed step by step:

• According to Bayes’ rule, we need to attribute a prior $$p(H_i)$$ to each $$H_i$$ before observing any data.

• According to Epicurus’ principle, $$p(H_i) = 0$$ if and only if $$H_i$$ is not consistent with the data.

• According to Occam’s razor, $$p(H_i)$$ must be small for complex hypotheses.

• How do you quantify the complexity of a hypothesis?
Answer: by seeing it as a computer program and by measuring its length!
Universal distribution for finite sequences

The expression of algorithmic probability reveals the respective contributions
of all consistent hypotheses $$p$$: $$\mathbf m(x) = \sum_{p:U(p)=x} 2^{-l(p)} \approx 2^{-K(x)}$$

where $$U$$ is a prefix Turing machine.

• Any computable hypothesis for $$x$$ (i.e. which can be represented as a program $$p$$ that generates $$x$$) receives a non-zero probability, and thus the distribution $$\mathbf m$$ respects Epicurus’ principle.
• Occam’s razor principle is respected as well, since longer hypotheses get lower probabilities under distribution $$\mathbf m$$.
• If we were to keep only one hypothesis, it would be
the shortest program that generates $$x$$.
• $$\mathbf m(x)$$ is a semi-computable distribution*, which means that it can be approximated from below.

• $$\mathbf m(x)$$ is universal, in the following sense:

For any lower semi-computable distribution $$\rho$$ and for all $$x$$,
$$\mathbf m(x) \geq 2^{-K(\rho)} \rho(x)$$.
This means that $$\mathbf m(x)$$ cannot grossly underestimate
the probability of objects.

* Strictly speaking, $$\mathbf m(x)$$ is a semi-measure, as it does not sum up to 1.
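Taking negative logarithms turns the domination inequality into a statement about code lengths, which makes the "no gross underestimation" claim concrete:

```latex
% logarithmic form of universality: the code length assigned by m
% exceeds that of any lower semi-computable rho by at most the
% constant K(rho), which does not depend on x
-\log_2 \mathbf m(x) \;\le\; -\log_2 \rho(x) + K(\rho)
```

In coding terms, predicting with $$\mathbf m$$ costs at most $$K(\rho)$$ extra bits compared with the best computable model $$\rho$$, whatever $$x$$ is.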

Sequence prediction

A good way to check a hypothesis for a sequence is to predict what comes next.

12345...

We suppose we are in the deterministic case, which means that the sequence is produced by a program $$p$$.

What is the probability that the next symbol will be 6?

• The probability is the sum of the probabilities of "picking" a program generating 123456 among all programs that generate 12345.

• What is the probability of "picking" such a program?

• If we flip a fair coin to build a program $$p$$ of length $$l(p)$$ bit by bit,
the probability of that program is $$2^{-l(p)}$$.
$$x = 12345...$$

$$\textrm{Pr}(x_6 = 6 \mid x_{1:5} = 12345) = \frac{\sum_{p:\, U(p)=123456...} 2^{-l(p)}}{\sum_{p:\, U(p)=12345...} 2^{-l(p)}}$$

where each sum runs over all programs generating sequences that start with the given prefix.

Note that we deal now with programs that generate
possibly infinite sequences.
Universal distribution for infinite sequences

The notion of universal distribution can be extended to infinite sequences. In the binary case: $$\mathbf M(x) = \sum_{p:U(p)=x*} 2^{-l(p)}$$

where $$x*$$ means any continuation of $$x$$.

$$\mathbf M$$ is also called the Solomonoff distribution.

$$\mathbf M(x)$$ is a complexity-weighted combination of all hypotheses explaining the infinite sequence, given the evidence $$x$$.
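As an illustration, here is a toy version of this weighting, under a drastic simplification of our own: a "program" is a plain bitstring $$p$$, and $$U(p)$$ outputs $$p$$ repeated forever, whereas the real $$\mathbf M$$ sums over all programs of a universal prefix machine.

```python
from itertools import product
from fractions import Fraction

# Toy Solomonoff-style prediction. Simplifying assumption (ours, not
# the chapter's): a "program" is a bitstring p and U(p) outputs p
# repeated forever; each p receives weight 2^-l(p).

def predict(prefix, max_len=8):
    weights = {"0": Fraction(0), "1": Fraction(0)}
    n = len(prefix)
    for L in range(1, max_len + 1):
        for bits in product("01", repeat=L):
            p = "".join(bits)
            out = p * (n // L + 2)        # enough output to cover prefix + 1 symbol
            if out.startswith(prefix):    # Epicurus: keep consistent programs only
                weights[out[n]] += Fraction(1, 2 ** L)  # Occam: weight 2^-l(p)
    total = weights["0"] + weights["1"]
    return {b: w / total for b, w in weights.items()}

probs = predict("0101")
print(probs)  # short repetitive continuations dominate: P(0) = 7/9, P(1) = 2/9
```

Even in this caricature, the two principles are visible: every consistent program contributes (Epicurus), and shorter programs contribute more (Occam), so the prediction strongly favors continuing the pattern.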
As in the finite case, $$\mathbf M$$ multiplicatively dominates any semi-computable distribution $$\mu(x)$$, which means that it cannot grossly underestimate probabilities.
Conclusion

• Among all hypotheses consistent with the data, the one with the least Kolmogorov complexity is the most likely one.
• The universal distribution provides the best combination of all hypotheses by weighting them by their complexity.
• This holds for finite objects,
using universal distribution $$\mathbf m$$.
• It also holds for prediction tasks in which one wants to explain infinite sequences (using universal distribution $$\mathbf M$$).

 What would be a good way to bet on $$x_n$$ knowing the sequence $$x$$ up to $$x_{n-1}$$, among the following?

(knowing $$x_{1:n-1}$$)
a- Choose the hypothesis that minimizes the complexity of $$x_n$$.
b- Choose the hypothesis that minimizes the complexity of $$x_{1:n}$$.
c- Choose the hypothesis that minimizes the complexity of $$x_{1:n-1}$$.
d- Choose the hypothesis that minimizes the complexity of $$x_{1:n+1}$$.
e- Choose the hypothesis that minimizes the complexity of $$x_{1:\infty}$$.

• Li, M. & Vitányi, P. (1993). An introduction to Kolmogorov complexity and its applications (3rd ed.). New York: Springer Verlag, ed. 1997, Chapters 4.3 and 5.1.

### Universal Artificial Intelligence

Problem: Can we define, at least in theory, the "most intelligent" action to take?
In some way, the "most intelligent" program one may think of would perform optimal decision making. Some scientists, such as Marcus Hutter, built on Solomonoff’s ideas about learning and induction to define what they call "universal artificial intelligence". Such an AI would take the best possible decisions at any time. Really?

Hutter’s AIXI instantiates an ideal version of reinforcement learning, based on algorithmic information. AIXI performs actions that maximize expected future rewards. At any moment, the expected reward results from an inductive process based on past observations and past actions. Future rewards are weighted by their probability, which depends on the complexity of the programs that generate them. Hutter’s equation for "Ultimate intelligence" reads:

$$a_k := \arg\max_{a_k} \sum_{o_k r_k} \ldots \max_{a_m} \sum_{o_m r_m} (r_k + \cdots + r_m) \sum_{q:\, U(q, a_1..a_m) = o_1 r_1..o_m r_m} 2^{-l(q)}$$
As Hutter puts it:
The expression shows that AIXI tries to maximize its total future reward $$r_k + \cdots + r_m$$. If the environment is modeled by a deterministic program $$q$$, then the future perceptions $$...o_k r_k...o_m r_m = U(q, a_1..a_m)$$ can be computed, where $$U$$ is a universal [...] Turing machine executing $$q$$ given $$a_1..a_m$$. Since $$q$$ is unknown, AIXI has to maximize its expected reward, i.e. average $$r_k + \cdots + r_m$$ over all possible future perceptions created by all possible environments $$q$$ that are consistent with past perceptions. The simpler an environment, the higher is its a-priori contribution $$2^{-l(q)}$$, where simplicity is measured by the length $$l$$ of program $$q$$.
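The decision rule can be caricatured in a few lines, under heavy simplifications of our own: a finite, hand-written set of candidate environments $$q$$ (here, mere reward tables) with assumed code lengths $$l(q)$$, instead of all programs of a universal machine, and a one-step horizon instead of a full planning tree.

```python
from fractions import Fraction

# Toy AIXI-flavored decision. Each candidate environment is a
# hypothetical reward table paired with an assumed code length l(q);
# its a-priori weight is 2^-l(q). The real AIXI instead sums over all
# programs of a universal Turing machine and plans over many steps.
envs = [
    ({"left": 1, "right": 0}, 3),   # simple environment: l(q) = 3 bits
    ({"left": 0, "right": 1}, 5),   # more complex one: l(q) = 5 bits
]

def best_action(actions=("left", "right")):
    # pick the action maximizing the complexity-weighted expected reward
    def expected(a):
        return sum(Fraction(1, 2 ** l) * rewards[a] for rewards, l in envs)
    return max(actions, key=expected)

print(best_action())  # → left  (the simpler environment dominates the mixture)
```

The example makes Hutter’s point concrete: when candidate environments disagree, the agent sides with the simpler one, exactly because $$2^{-l(q)}$$ gives it more weight in the mixture.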

## Recap

### Take-home message

We hope you enjoyed this chapter on AIT and Machine Learning.
Let’s leave the last word to Angel@ and Alexandr@.

This chapter on AIT and machine learning was nice, don’t you think? Sure! I liked the idea that analogy is just a matter of minimizing complexity.

Oh, it’s just a special case of induction. No, I disagree. Analogy is a form of reasoning.

Yes, but induction is more general. I liked the last thing about universal induction and universal AI. Yes, but it’s not computable.

It helps me understand that efficient reinforcement learning depends on your ability to compute the future. To me, analogy and MDL are closer to practical use. I like the idea of optimal trade-off between model simplicity and accuracy.

And so you must be excited to see that any form of learning amounts to performing data compression. Yes, we talked about that!

Yeah. As for me, my favorite remains the idea that anomalies must be simple to characterize to be genuine anomalies. Yes, that one also!

 The solution to the analogical equation: $$A$$ is to $$B$$ as $$C$$ is to $$X$$ is the value of $$X$$ that has minimal complexity.

True
False

 Unsupervised prototype-based clustering achieves compression because distances to prototypes require less information.

True
False

 Outliers in clustering are data points with low-complexity coordinates.

True
False

 Trained neural networks involve millions, sometimes billions of parameter values. Yet, the point of using neural networks is still to achieve data compression.

True
False

 MDL achieves lossless compression.

True
False

 Do you have feedback or suggestions about the topics addressed in this chapter?