Telecom Paris
Dep. Informatique & Réseaux

J-L. Dessalles (Home page)

July 2021

IA325/IA703: Algorithmic Information and Artificial Intelligence

Lecturers: Jean-Louis Dessalles
           Pierre-Alexandre Murena
           Etienne Houzé

Algorithmic information applied to mathematics


What you will learn in this chapter:
  • That objective complexity cannot be computed, only approximated.
  • That probability can be defined algorithmically.
  • That randomness can only be defined algorithmically.
  • That algorithmic information sets limits to the power of AI (Gödel’s theorem).

Forum: Random by hand
For later use, we need random binary sequences generated by your (conscious) brain. Try to slowly write 30 bits (0 or 1) so as to get a binary string that looks random. Please think about each bit you are writing down. The experiment won’t work if you write them too fast.

        








Kolmogorov complexity can’t be computed! But...

Problem: Complexity is not computable. Does this mean that it is useless?

[Video: Why Kolmogorov complexity can't be computed (Understanding Artificial Intelligence through Algorithmic Information Theory, aiai.telecom-paris.fr)]

Let’s prove the incomputability of Kolmogorov complexity by contradiction.
Suppose that there is a program cc that, for any object s, computes its complexity C(s):

cc(s) = C(s).

The central idea of the proof is to use cc to compute some complex string s.

The contradiction will come from the fact that, since s has been computed, it cannot be complex!
Let's go...

If cc exists, then for any n, one can compute a binary string s that is more complex than n:

cc(s) = C(s) > n.

The following Python program uses cc to compute such an s.


def MoreComplex(n):
    i = 1
    while True:
        for m in range(2**i):
            s = bin(m)[2:]
            if cc(s) > n: return s
        i += 1

bin(m)[2:] is just a Python trick to convert m into a binary string s.

The program is not very efficient: it starts the same computations over and over again until it reaches its goal: finding an s that is more complex than n. But the point is that MoreComplex always terminates and outputs some complex string s.

The program MoreComplex can be used to define s from n.

As a consequence, C(s) can't be much larger than C(n).

More precisely:    C(s) < C(MoreComplex) + C(n).

But wait...

By definition of s, n < C(s) = cc(s).

We get n < C(MoreComplex) + C(n) for all integers n.

    
This is absurd, since C(n) ~ log(n)
(one needs log(n) bits to define n)
(all logarithms are in base 2).

Conclusion: the hypothesis that cc exists
was wrong in the first place.

In other words, C cannot be computed.


Incomputability of C
Can we compute the complexity \(C(s)\) of an object \(s\) on a specific universal Turing machine
if one is sure that \(C(s) \lt 20\) bits?

        Yes.
        Possibly.
        No.

    

Four reasons to live well with the incomputability of complexity

Problem: The incomputability of C(s) seems to ruin the concept. For superficial observers, Kolmogorov complexity may have a whiff of mysticism. Compressing programs would be more tangible.

Question: Is complexity just a fancy name to mean efficient compression?

Answer:   No! Complexity is a fundamental mathematical notion, independently of its approximation through compressors.

Algorithmic Information Theory is an essential theoretical framework, regardless of its implementations.
Let’s mention four reasons for this.
  1. C(s) as it stands is essential to establish important theoretical results. We will see, for instance, that it can be used to prove Gödel's first incompleteness theorem in no more than a couple of lines! It also offers a conceptual framework with far-reaching consequences, in mathematics, in information science, in artificial intelligence, in epistemology, in philosophy and in cognitive science.
  2. Restricted definitions of C(s) are computable. It is obviously the case when we consider resource-bounded complexity, where C(s) is the length of the shortest description obtained so far. It is also true for approximations of C(s) based on compressors or on algorithmic probability. The ideal (non computable) definition of complexity provides a lower bound for practical complexity computations.
  3. Results from complexity theory can be applied to specific "machines" of special interest. This is true, for instance, of computational (and computable) models of human performance.
  4. Few people have noticed that other prominent theories such as probability theory can be regarded as incomputable as well. For instance, probability theory refers to sets (the set of people, the set of windows) that are not guaranteed to be computable (i.e. there may be no program to decide membership).








Algorithmic information and algorithmic probability

Intuitive approach to algorithmic probability

"Information theory must precede probability theory, and not be based on it. By the very essence of this discipline, the foundations of information theory have a finite combinatorial character."

Kolmogorov, A. N. (1983). Combinatorial foundations of information theory and the calculus of probabilities.
Russian mathematical surveys, 38 (4), 29-40.

Problem: Probability has to do with chance. How can it be defined through algorithms?
For Algorithmic Information Theory (AIT), information depends on complexity.
For traditional definitions of information, information depends on probability.

Question: Is this collision between complexity and probability more than anecdotal?

Answer:   Yes. We can take advantage of it to do nothing less than redefine probability!

The axiomatic theory of probability (also due to Andrei Kolmogorov!) puts constraints on what a probability measure can be, but does not construct it. Kolmogorov complexity can be used to define a new notion of probability, called algorithmic probability.

The link between complexity and probability has ancient roots. The French mathematician Émile Borel wrote in 1909:
         "il est clair que les éléments pour lesquels le nombre de mots nécessaires à la définition est extrêmement grand devront être regardés comme ayant une probabilité extrêmement petite."      "it is clear that elements requiring an extremely large number of words for their definition should be considered as having an extremely low probability."
(Borel E., 1909 p. 272).

In other words, numbers that have high program-size complexity are less probable.

To formalize this idea, Ray Solomonoff and his followers considered the probability that an object will be generated by chance using a universal Turing machine when the machine is given random programs as input.



Imagine a universal Turing machine fed with random programs (one may flip a coin to determine the successive bits of each program). Let's say that the algorithmic probability of s is the probability of getting s as output this way. If s can be generated by one of these random programs p of length |p|, then p contributes \(2^{-|p|}\) to the probability of getting s. So it's tempting to say that the algorithmic probability of s is the sum of \(2^{-|p|}\) over all programs p that generate s.

This way of founding algorithmic probability is illuminating. However, as explained in the video, there is a difficulty. If s can be generated using a program p, then it can be generated by any program pqi where qi is a continuation of p that does anything except erasing s. If we add up the probabilities of all the pqi, the sum will not converge.

Instead of considering all programs that generate \(s\) when computing the probability of \(s\), why not keep the only shortest one? Since the length of the shortest program is \(C(s)\), we get an alternative definition of algorithmic probability: $$ P_a(s) = 2^{-C(s)} $$ By considering only one term, one solves the problem of the diverging sum. However, we still do not have a probability, as the sum of all \(P_a(s)\) for different \(s\) is infinite instead of being equal to 1.
Alternative Definition of Algorithmic Probability
Let’s only consider integers.
\( P_a \) cannot be a probability, because:

\( P_a(n) \) is close to \( \frac{1}{n} \)
\( P_a(n) \) is close to \( \frac{1}{\log_2(n)} \)
\( P_a(n) \) is close to \( \log_2(n) \)
\( P_a(n) \) is close to \( \frac{1}{1 + \log_2(n)} \)

    

The solution found in 1974 by Leonid Levin consists of introducing "prefix machines" and "prefix complexity".

Prefix complexity

Question: Why do we need a new version of Kolmogorov complexity?

Answer:   Because "plain" Kolmogorov complexity (as defined in Chapter 1) is not sufficient to define algorithmic probability and randomness.

In a computer language such as Python, a program that generates x can be extended by merely adding useless instructions (such as pass). As we will see, this possibility of extension makes x "too" probable (in fact, probability defined on extensible programs is meaningless). One natural way to prevent program extension is to forbid it!

Prefix Turing machines are Turing machines with a restriction. No program interpreted by a prefix machine U is the continuation of another program that U can interpret. This means that if a program p can be interpreted by U, then U must recognize that p has come to an end and must unconditionally stop reading its input after reaching the end of p. This also means that if a program q is the concatenation of two programs p and p1 (q = pp1), then U does not read anything beyond the limit between p and p1.
Prefix Machine (1)
Consider a Universal Turing Machine \(U\) which accepts the following programs: 01000, 01001, 101 and 010.
Can \(U\) be a prefix machine?

Yes No

    

Prefix Machine (2)
If \(U’\) halts when executing 1, 01, 001, 0001, 00001, and for all programs of the form 0...01, can \(U’\) be a prefix machine?

Yes No

    

The restriction to prefix machines leads to a redefinition of complexity, introduced by Leonid Levin in 1974.
The prefix complexity of s is the size of the shortest program that outputs s when run on a given universal prefix Turing machine Upr.

Prefix complexity

$$ K(s) = \min_p \{|p| : U_{pr}(p) = s\} $$

By contrast, our previous definition of complexity, noted C(s) (without the restriction to prefix machines) will be called plain complexity.

For any prefix machine \(U_{pr}\), there is an ordinary Turing machine \(U\) that can execute all of \(U_{pr}\)'s programs, plus many other (non prefix) programs. For instance, \(U\) may use a fixed binary flag to indicate that what comes next is a program in prefix form.
K and C
How do plain complexity and prefix complexity compare on \(U\)?
(here, \(s\) is any object and \(c\) is some constant that does not depend on \(s\))

    \(K(s) \leq C(s) + c\)
    it depends
    \(C(s) \leq K(s) + c\)        

    

Algorithmic Probability

Question: Is the restriction to prefix programs sufficient to define algorithmic probability?

Answer:   Yes. Prefix programs are exclusive of each other, so their probabilities sum up to less than 1.

We will see below that prefix programs can be represented as leaves in a binary tree. The deeper the leaf, the less probable the program. More importantly, all programs together span at most the whole tree, so their cumulative probability is at most 1. Prefix complexity can therefore be used to define algorithmic probability, by making rigorous the initial suggestion made by Ray Solomonoff in 1964. The point is to add up the probabilities \(2^{-|p|}\) of all prefix programs p that generate object s.

Algorithmic probability

$$ P(s) = \sum\limits_{p:\ U_{pr}(p)=s} 2^{-|p|} $$

[Figure: a binary tree whose nodes carry symbols]

In the next exercise, we consider programs that explore a binary tree. Such programs choose the next branch to the left if their next digit in the input is 0, and the next branch to the right if their next digit is 1. A program’s outcome is the concatenation of all visited nodes’ content.

    
Binary Tree
In the binary tree example, let’s consider that program bits are randomly drawn. Now let’s add up the probabilities of all possible outcomes. What is a condition for this sum to be less than 1?

The tree is finite.
The average branch length is finite.
The program asks for more input bits until it reaches a leaf.
The program stops when encountering the same \(n\)-bit pattern twice, with \(n\) given in advance.

    

Since the shortest program generating s makes the greatest contribution to the sum, we may write:

$$ P(s) \approx 2^{-K(s)} $$
Shannon’s definition of information for a repeated event s is I(s) = log(1/p(s)), where p(s) is the probability of occurrence of s. If we consider that the occurrences of s can be modelled as the output of a machine fed with random input, then P(s) ≈ p(s). This means that for such a repeated event, I(s) ≈ K(s). Algorithmic information and Shannon’s information make contact here, the former being more general than the latter.
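For instance (a quick check of this correspondence): take \(s\) to be one particular outcome of \(n\) fair coin tosses. Then \(p(s) = 2^{-n}\), so \(I(s) = \log(1/p(s)) = n\) bits; and since a typical such outcome is incompressible, \(K(s) \approx n\) as well.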

Experimenting with algorithmic probability

Problem: How can we measure algorithmic probability?
The main idea underlying algorithmic probability is that objects are constructed from scratch, and that the probability of their emergence depends on their complexity.

Question: Can we measure algorithmic probability by running random programs on a computer?

Answer:   In a strict sense, no, as some of these programs may loop indefinitely. But in practice, yes, for simple machines.

Not only can algorithmic probability be approximated by trying random programs on simple machines, but this constitutes a direct way of approximating Kolmogorov complexity, since K(s) ≈ log2(1/P(s)). This is the idea underlying The Online Algorithmic Complexity Calculator, developed by the Algorithmic Nature Group.
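Before turning to the OACC, here is a toy version of the same idea (a minimal sketch, not the OACC's methodology: the "machine" is simply the binary-tree interpreter described above, running on a small hypothetical labelled tree). Feed it random bits, tally the outputs, and estimate K(s) ≈ log2(1/P(s)):

import random
from collections import Counter
from math import log2

# Toy "machine": a finite binary tree (hypothetical labels). Each node carries a
# symbol; reading one input bit moves to the left (0) or right (1) child. A program
# is the sequence of bits consumed before reaching a leaf, so programs are prefix-free.
TREE = {
    '':   ('a', ('0', '1')),
    '0':  ('b', ('00', '01')),
    '1':  ('c', None),          # leaf
    '00': ('a', None),          # leaf
    '01': ('b', None),          # leaf
}

def run(bits):
    """Walk down the tree following input bits; the output is the concatenation of labels."""
    node, out = '', ''
    while True:
        label, children = TREE[node]
        out += label
        if children is None:                 # leaf reached: the program ends here
            return out
        node = children[int(next(bits))]

def random_bits():
    while True:
        yield random.choice('01')

N = 100_000                                  # Monte Carlo estimate of P(s) and of K(s)
counts = Counter(run(random_bits()) for _ in range(N))
for s, c in counts.most_common():
    p = c / N
    print(f'{s}: P = {p:.3f}, K = {log2(1/p):.2f} bits')

On this toy machine, the output reached by the shortest path is the most probable one and gets the lowest complexity estimate (about 1 bit for 'ac', 2 bits for 'aba' and 'abb').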
[Video: Estimating Complexity using Algorithmic Probability (the OACC)]
In the OACC experiment, binary strings are generated by simple Turing machines (with five states only; see the methodology section). By sampling halting programs (within a maximum runtime) according to the universal distribution (each additional bit makes a program half as probable), one can assess the probability of occurrence of the output strings.
Connect to The Online Algorithmic Complexity Calculator.
(warning: does not seem to work properly with all Web browsers (we had problems with Firefox). You may need to try another one.)
Click on the tab ("For Short Strings").
Forum: Algorithmic Probability
Try different 11-bit sequences (choose ‘binary'). Compare their "Kolmogorov complexity", as deduced from their probability of occurrence. Does the hierarchy of these complexity values make sense?

        

Prefix codes

Problem: What do prefix Turing machines look like?
A prefix Turing machine is any Turing machine that only halts for prefix programs (i.e. programs with an unambiguous end).

The binary codes of these programs constitute a prefix code. This means that no code word in this set is the beginning of another code word. Prefix codes are uniquely decodable, which means that there is no ambiguity when interpreting the code (note that the converse is not true: {01, 011, 0111} is perfectly decodable, though it is not prefix).
Prefix codes
A code with the codewords {01, 001, 0001, 10, 110, 1110} is:

Not prefix but uniquely decodable
Not prefix and not uniquely decodable
Prefix and uniquely decodable
Prefix but not uniquely decodable

    

Interestingly, Huffman codes are designed to be prefix.
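To check the prefix property mechanically, here is a small sketch (illustrative only; the second codeword set is the decodable but non-prefix example mentioned above):

def is_prefix_free(codewords):
    """Return True if no codeword is a proper prefix of another one."""
    for w1 in codewords:
        for w2 in codewords:
            if w1 != w2 and w2.startswith(w1):
                return False
    return True

print(is_prefix_free({'0', '10', '110', '1110'}))    # True: a prefix code
print(is_prefix_free({'01', '011', '0111'}))         # False: decodable, but not prefix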
One way for programs to be prefix is to be self-delimited.

[Video: Self-delimited codes (Understanding Artificial Intelligence through Algorithmic Information Theory, aiai.telecom-paris.fr)]

The problem with many codes is that one must delimit coded words.

A code like the Morse code used for telegraphy makes use of pauses as explicit delimiters between letters and words.

The phrase "Algorithmic Information" is coded in Morse code as:


⸳– ⸳–⸳⸳ ––⸳ ––– ⸳–⸳ ⸳⸳ – ⸳⸳⸳⸳ –– ⸳⸳ –⸳–⸳    ⸳⸳ –⸳ ⸳⸳–⸳ ––– ⸳–⸳ –– ⸳– – ⸳⸳ ––– –⸳


The Morse code is not self-delimited: one needs a delimiter to delimit words.

Can we get rid of delimiters?

If we omit blank delimiters, codes such as the Morse code are no longer decodable. One can’t say whether ⸳–⸳⸳ means "AI" or the letter "L".

Prefix codes are a convenient way of avoiding the use of delimiters.
With self-delimited codes, also called Prefix codes, no word is the beginning of another one. The code below on the left is a prefix code for decimal digits. The code on the right is perfectly decodable, but is not a prefix code.

Prefix code          Non-prefix code
0: 0                 0: 0
1: 10                1: 01
2: 110               2: 011
3: 1110              3: 0111
4: 11110             4: 01111
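As a quick illustration (a minimal sketch), the left-hand prefix code can be decoded in a single left-to-right pass, with no delimiter: read 1s until a 0, and the number of 1s read is the digit.

def decode_left_code(stream):
    """Decode the left-hand prefix code above: digit d is coded as d ones followed by a 0."""
    digits, ones = [], 0
    for bit in stream:
        if bit == '1':
            ones += 1
        else:                    # the 0 closes the current codeword
            digits.append(ones)
            ones = 0
    return digits

print(decode_left_code('0101100'))   # [0, 1, 2, 0]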
Doubling codes

A crude way of coding a string s in a self-delimiting way consists of duplicating each bit of s. The end of the coded string is marked by violating the doubling rule, writing 01 to designate a 0 or 10 to designate a 1.

’Crude’ doubling code of 1268:    

10011110100 ---> 1100001111111100110001

Note that this method is expensive in code length.

To spare bits, let’s code just the length of the string with the duplicating method.
number: 1268
standard binary representation: 10011110100
length of binary representation: 11
binary representation of the length: 1011
doubling representation of the length: 11001110
doubling length representation of 1268:
1100111010011110100
Doubling length code of 1268:    
10011110100 ---> 1100111010011110100

    
This new code is 2 × log(|s|) + |s| bits long (instead of 2 × |s|)
(all logarithms are in base 2).

If s represents an integer n, then the length of the doubling code for n is about 2 × log(log(n)) + log(n).
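Here is a minimal sketch of both codes (written independently of the PrefixInt module used in the exercises below); running it on 1268 reproduces the strings computed above:

def doubling_code(bits):
    """'Crude' doubling code: each bit is written twice, except the last one,
    whose pair is inverted (01 for a final 0, 10 for a final 1) to mark the end."""
    return ''.join(2 * b for b in bits[:-1]) + bits[-1] + ('0' if bits[-1] == '1' else '1')

def doubling_length_code(n):
    """Doubling-length code: the length of bin(n) is sent with the doubling code,
    then bin(n) itself follows in plain binary."""
    s = bin(n)[2:]                               # standard binary representation
    return doubling_code(bin(len(s))[2:]) + s

print(doubling_code('10011110100'))              # 1100001111111100110001
print(doubling_length_code(1268))                # 1100111010011110100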

    


Doubling length Code (1)
What is the "doubling length" code for 117 ?

11001010001
1111101110101
11111100110010
1100111010001011101

    

Doubling length Code (2)
Open and locate the function DoublingLengthCode that is meant to code integers using the doubling length code. Paste the correct "return" line into the answer box.

    

Decoding Doubling Length Code (1)
In , locate the function DoublingLengthDecode. Fill-in the blanks in the line
StrN = BinN[.... : ....]
and paste this line only in the answer box.

    

Decoding Doubling Length Code (2)
You may now use the DoublingLengthDecode function.

$ python
>>> import PrefixInt as PI
>>> PI.DoublingLengthDecode('11000000101111000100100000001')
Can you decode: 11000000101111000100100000001?

11117
482
10000
4242
123456
493825

    

Plain complexity vs. prefix complexity

Problem: How do plain complexity and prefix complexity compare?
Thanks to prefix codes, we can establish the following relation between plain complexity C and prefix complexity K
(O(1) designates some constant):

C(s) < K(s) + O(1)     and     K(s) < C(s) + 2 × log(C(s)) + O(1)

The right-hand bound is obtained by turning the shortest plain program for s (of length C(s)) into a prefix program: prefix it with the doubling code of its length, which costs about 2 × log(C(s)) extra bits.


[Don’t forget to watch the debate "prefix complexity vs. plain complexity" at the end of this chapter.]

Chain rule for prefix complexity

Question: Does the "chain rule" hold for Prefix complexity?

Answer:   Yes, it does, and without restrictions (contrary to plain complexity).

The Chain Rule is a fundamental tool for artificial intelligence. Thanks to it, we can perform complexity assessments step by step. Fortunately, it is obviously valid for prefix complexity K.

Chain Rule for Prefix complexity

K(s1) < K(s2) + K(s1 | s2) + O(1).
The inequality comes from the fact that computing s2 first may lead to a suboptimal way of computing s1.

The Chain Rule for plain complexity required special care: the program defining s2 had to be uniquely decodable. This constraint is built into K, but not into C. As a consequence, a general chain rule for plain complexity C would be:

C(s1) < C(s2) + C(s1 | s2) + O(log(C(s2))).

Chain Rule for C
Why does the chain rule for plain complexity include a logarithmic term?

    Because \(C(s)\) is always larger than \(K(s)\)
    Because \(C(s)\) is always smaller than \(K(s)\)
    Because it is the price to pay to delimit programs

    


I don’t understand why we bother to use plain complexity. Why not keep only the other one, Prefix complexity? Plain complexity seems much more natural to me.

Oh, really? Yes. You don’t need to put all bits together into a continuous string.

But blank delimiters are a kind of third symbol. You have 0, 1 and blank space. I don’t agree. In the Morse code, you consider that you have only two symbols, dots and dashes. You just ignore blanks.

You mean that the delimitation of letters is external to the code itself? Yes. To me, it belongs to the physical implementation of the code.

You can’t say this. Otherwise, you would invent a kind of "longer" space to delimit programs, and another non-symbolic delimiter to signal numbers, and so on. No. I didn’t say that. What you mention here is more like syntax.

Yes. You’re hiding syntax in blank delimiters. No, I’m not. I’m just talking about using spaces to delimit code words, as we did in the compact coding of integers.

Because if you need syntax, you have to "pay" for it. Yes, I totally agree. For instance if you want to represent a closing parenthesis, you need bits to code for it.

That’s why requiring self-delimited programs is more general. You mean more general than, for instance, imposing balanced parentheses?

Yes. I agree with you that allowing for blank delimiters is not sufficient to delimit programs.

So you agree with me that plain complexity is dispensable! Noooo! I just told you that codes with blank delimiters can be very useful and much more natural than strict binary coding.

I agree that the Morse code would be ugly if it were self-delimited, with no blanks. And I suspect that plain complexity has a crucial role to play in the definition of randomness.

Oh, really? Yes. I had a quick glance at the next topic 😉.








Randomness

Randomness and statistics



Spock : Random chance seems to have operated in our favor.
Dr. McCoy : In plain non-Vulcan English, we've been lucky.
Spock : I believe I said that, Doctor.                    



            Star Trek original series, The Doomsday Machine

Problem: Is randomness a statistical notion, or rather an algorithmic notion?

Question: Can we define randomness based on frequency measures?

Answer:   No. Some non-random numbers pass frequency tests.

One of the first achievements of Kolmogorov complexity, which we owe to Andrei Kolmogorov himself, is a definition of randomness. Previous attempts to define randomness were based on ideas such as predictability and statistical symmetry. A binary sequence is predictable if you can win money by betting on the next bit, knowing the past of the sequence. Basic statistical symmetry requires that all thinkable patterns with n instantiated bits occur with the same frequency \(2^{-n}\). There should be as many 0s as 1s in the sequence, with frequency 0.5. There should be as many 00 as 11, with frequency 0.25. And the pattern 0**11*1 should occur with frequency 1/16.
Forum: Random by hand
You may retrieve random binary strings previously produced by hand. Just press the button below. Save them into a file named RandomByHand.csv (one sequence per line).

You may use the program to measure the frequency of patterns in the set of sequences.
The program reads the file RandomByHand.csv that is supposed to contain hand-generated "random" sequences. You can compare with pseudo-random sequences generated by the program. (you may run RandomByHand.py several times to vary results for rare patterns). If the sequences contained in RandomByHand.csv were really generated by people trying to make them look random, you should see a difference.
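If you want a quick check of your own (a minimal sketch, independent of the course's program; it assumes RandomByHand.csv is in the current directory), the following counts the frequency of every 3-bit pattern and compares it with the expected value \(2^{-3} = 0.125\):

from collections import Counter
from itertools import product

with open('RandomByHand.csv') as f:
    sequences = [line.strip() for line in f if line.strip()]

counts, total = Counter(), 0
for seq in sequences:
    for i in range(len(seq) - 2):        # sliding window over 3-bit patterns
        counts[seq[i:i+3]] += 1
        total += 1

for pattern in map(''.join, product('01', repeat=3)):
    print(pattern, round(counts[pattern] / total, 3), '(expected 0.125)')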

Ming Li and Paul Vitányi showed in 1994 that a random binary sequence of length n should not contain runs of zeros longer than about log2(n) + log(log(n)). However, and contrary to intuition, a random binary sequence is still expected to contain long runs of zeros, of length up to about log2(n) - log(log(n)) if n is large enough.
What is random
Select all of the statements that are true.

    A random binary sequence of length 1000 should contain the sequence 101010.
    A 1Mb random binary sequence should contain the first 10 binary digits of π.
    A 1Mb random binary sequence should contain the first 13 binary digits of π.
    A black-and-white random image with 4000x4000 pixels should contain vertical white segments of 32 pixels.

    

In 1919, Richard von Mises proposed that a binary sequence should be regarded as random if and only if any sub-sequence extracted by a selection rule that does not depend on the values of the selected bits has bit frequency ½ in the limit. This idea seems to make sense.
Unfortunately, these frequency tests are obviously inadequate to capture the idea of randomness. The following number, known as the Champernowne constant, is "normal", which means that it passes all frequency tests.

\(c_{10}\) = 0.123456789101112131415161718192021222324252627282930313233343536...

The Champernowne constant is obtained by concatenating successive integers (you may generate \(c_b\) for several bases \(b\) by using the function Champernowne from ).
It is easy to see that \(c_b\) would pass all basic frequency tests.
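If you prefer to experiment without the course file, here is a minimal sketch (independent of the Champernowne function mentioned above):

def champernowne(base=10, n_digits=60):
    """Concatenate the base-b representations of 1, 2, 3, ..."""
    digits, i = '', 1
    while len(digits) < n_digits:
        m, s = i, ''
        while m:                         # write i in the chosen base
            s = '0123456789abcdefghijklmnopqrstuvwxyz'[m % base] + s
            m //= base
        digits += s
        i += 1
    return '0.' + digits[:n_digits]

print(champernowne(10))   # 0.123456789101112...
print(champernowne(2))    # 0.110111001011101111000...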

Randomness of finite sequences

Problem: Is there a link between randomness and compressibility?
Compressible objects are definitely not random. For instance, π’s decimals look random, but once one knows where they come from, they can be retrieved deterministically.

We could verify that biased binary sequences, in which one of the digits is sparse, are compressible. To get an explicit code, we may represent the relative positions of the minority digits. The following binary string:

10111111111111111111111111110111111111111111011110111

can be represented by the numbers 1 27 16 5, which indicates that there are zeros at locations 1, 1+27, 1+27+16 and 1+27+16+5. This binary string of length 53 would require 65 bits to be represented using the doubling-length code previously considered, as we should prefix it with 111100110010 to signal its length.

If we code the numbers 1 27 16 5 instead, we get (spaces inserted just for legibility):

101 11001011011 11001010000 1110101 00 110111

The two bits 00 are used to signal the end of the code; the next number indicates how many trailing ones there are.
This makes a total of 40 bits instead of 65 (more concise coding methods can of course be found).
Sparse coding
Consider a binary string of length 10,000 that contains only 1% ones.
What is the compression that the above method would achieve?

    about 63%
    about 67%
    about 87%
    about 98%

    

Question: If non-random sequences are compressible, what about the converse?

Answer:   Yes. Random sequences are incompressible.

This answer may seem surprising at first glance. As we saw, talking about incompressibility for a finite sequence without mentioning the compressing machine is impossible, as you can always find a machine for which the sequence has complexity 1 (or 3 for a prefix machine).
One may nevertheless call ‘random’ those sequences that your favourite machine U (or your favourite programming language) is unable to compress. A great achievement of Andrei Kolmogorov is to have understood this.


A binary sequence of length n can be considered random for a machine U if it cannot be compressed by more than c bits, where the tolerance value c is small compared to n.

Randomness of finite sequences

    \(C_U(x) > n - c\)

Randomness = incompressibility
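As a rough illustration (a sketch only: a general-purpose compressor gives an upper bound on complexity relative to that compressor, not the value of \(C_U\)), one can compare how well zlib compresses a regular string and a random-looking one of the same length:

import os, zlib

def compressed_size_bits(bits):
    """Pack a '01' string into bytes and return its zlib-compressed size in bits."""
    data = int(bits, 2).to_bytes((len(bits) + 7) // 8, 'big')
    return 8 * len(zlib.compress(data, 9))

regular = '01' * 4000                                               # 8000 highly regular bits
random_like = ''.join(format(b, '08b') for b in os.urandom(1000))   # 8000 random-looking bits
print(compressed_size_bits(regular), 'bits for the regular string')
print(compressed_size_bits(random_like), 'bits for the random-looking string')

The regular string shrinks to a small fraction of its original size, while the random-looking one stays close to (or slightly above) 8000 bits.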

Non random number
Find a good reason to consider that the number 19683 is not random.

    19683 = (3**3)**3
    19681 is a prime number, just below 19683
    19683 occurs at location 71434 in \(\pi\)’s decimal development
    19683 is the end of Maria Martinez’s phone number

    

Problem: How does the algorithmic definition of randomness escape the Berry paradox?
The Berry paradox, when applied to randomness, seems to ruin any possibility of defining it.
The Berry paradox sounds like the paradox of interesting numbers.

The paradox of interesting numbers
All integers are "interesting!"
You don’t buy it?
So consider the smallest "non interesting" integer.
This number turns out to be terrifically interesting!    

You probably found a solution to this paradox. It has to do with the false hypothesis that ‘interesting’ can be a predicate (i.e. a Boolean function, either true or false).

Here is another version of the paradox.

The Berry paradox
Consider "the smallest positive integer not definable in fewer than 100 words".
But... this defining phrase has fewer than one hundred words!


Can you solve the Berry paradox? This time, the fuzziness of the property cannot be invoked. The solution to the paradox is the same as in the case of Borel’s undefinability-of-randomness paradox.
According to Gregory Chaitin (2005, p. 106), Émile Borel stated a version of the Berry paradox that applies to the definition of randomness.

Borel’s undefinability-of-randomness paradox
The smallest integer that satisfies the criterion of randomness is unique and remarkable and therefore, not random at all!

Borel’s paradox
Borel’s "undefinability of randomness" paradox does not apply to the algorithmic definition of randomness, because...

    No integer is really random.
    Compressibility is a fuzzy, gradual notion.
    Compressibility is based on complexity, which is not computable.
    If a number \(n\) is compressible, then \(n-1\) is compressible as well.

    

Randomness of infinite sequences

Problem: Can we extend the algorithmic definition of randomness to infinite sequences?
The link established by Andrei Kolmogorov between randomness and incompressibility could have put an end to a half-century quest for a definition of randomness. Extending the equation ‘randomness = incompressibility’ to infinite sequences didn’t prove straightforward, though.

An infinite series of fair coin tosses produces a random sequence. Conversely, how can we be sure that an infinite binary sequence was not produced by flipping a coin?

Per Martin-Löf (1966) defined tests to capture non-randomness. A random sequence must escape all of them. For instance, a sequence cannot be considered random if it has too many 0s at the beginning, or many more 0s than 1s, or a 1 at each even position, or either 01 or 10 at even positions.


Kolmogorov had the intuition that any failed test would make the sequence compressible. Generalizing this property is not immediate. Building on Kolmogorov’s work, Claus-Peter Schnorr and Leonid Levin found in 1973 that defining randomness for infinite sequences requires referring to prefix complexity K instead of plain complexity C.

If we call \(x_{1:m}\) the beginning of the sequence \(x\) up to index \(m\), then \(x\) is random if it is incompressible, which means that any finite chunk of it is roughly as complex as its length.

Randomness of infinite binary sequences

    $$ (\exists c)(\forall m)\ K(x_{1:m}) > m - c $$
Randomness of Pi
Should \(\pi\)’s decimal development, as an infinite sequence, be regarded as random?

Yes         No        

    

Randomness generation
Can we generate a random (= incompressible) infinite sequence by keeping a computer running for ever?

Yes         No        

    

Rational numbers (1)
A rational number \(r\) is the quotient of two integers \(r = p/q\). Can some rational numbers be considered random?

Yes         No

    

Rational numbers (2)
Consider the binary expansion of a rational number. If infinite, can it be regarded as a random sequence?

Yes         No

    


Using randomness

Question: Does the definition of randomness prove useful beyond mathematics?

Answer:   Yes. For instance for random number generation and in cryptography.

A successful theory of randomness has many applications. The most immediate one is the generation of pseudo-random sequences.
Pseudo-random numbers can be generated using the following program.

$ python
>>> import random
>>> print(random.randint(0, 10**32))
53217645600250283088552254822697
>>> print(random.randint(0, 10**32))
42682293524798380206208898923066    
Pseudo random numbers
Why are pseudo-random numbers generated by a computer considered "pseudo" random?
(several answers are acceptable)

    Because the computer’s memory is limited, so the pseudo-random number generation will eventually loop.
    Because the pseudo-random generation uses the computer’s clock, which is not random.
    Because the pseudo-random generation program has a small size.
    Because the pseudo-random generation program itself is deterministic.

    


Randomness theory has immediate applications in cryptography. Consider for instance the design of robust passwords. A good password should be easy to remember but hard to guess (see Chapter 1). At face value, the two requirements are contradictory. To reconcile them, you must target the difference between you (human cognitive equipment + cultural knowledge + specific knowledge) and the machine (general Turing-computability + statistics over other humans’ cognitive equipment + cultural knowledge + knowledge extracted from your actions on the Web).

[Comic: "Password Strength", from xkcd.com]
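The comic's point can be checked with a back-of-the-envelope calculation (illustrative figures; the 2048-word list is an assumption): guessing difficulty is roughly \(\log_2\) of the number of equally likely possibilities.

from math import log2

# Guessing difficulty ~ log2(number of equally likely choices), in bits.
print(round(4 * log2(2048)))   # four words drawn from a 2048-word list: ~44 bits
print(round(8 * log2(26)))     # eight random lowercase letters: ~38 bits
print(round(8 * log2(95)))     # eight random printable ASCII characters: ~53 bits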

RSA
The RSA encryption system is based on the difficulty of factoring the product \(m\) of two large numbers \(p\) and \(q\), i.e. the difficulty of retrieving \(p\) and \(q\) such that \(m = p \times q\). One of the numbers \(p\) and \(q\) is necessary to decode messages. What is the best estimate of the complexity of a binary message \(M\) of length \(k\), when only \(m\) and the encrypted version of \(M\) based on \(m\) are available?

    0
    \(min(\log p, \log q)\)
    \(\log \log m + 1\)
    \((\log m)/2\)
    \(\log k\)
    \(\log k + \log m\)

    

I don’t fully understand how a finite object, such as an integer, can be regarded as "random". Why not?

If you take an 8-digit number, then it may be the end of someone’s phone number, and... You mean, for its owner, it’s not random at all!

Precisely. But for you, it’s still a random number.

No. I just need to refer to that person. But she is equally random to you!

What do you mean? You need as much information to designate her as you would have needed to describe her phone number.

You call her "random"? Yes. Unless you know her somewhat. In which case she is compressible!

"compressible??" Oh, you know what I mean!

Just kidding! 😌 I mean that you need less information to designate her than what you would need for most people.

Yes, in that case, she is definitely compressible! 😊 So you agree with me that numbers can be really random.

Even if they are stored in my computer’s memory? Yeap!

This is nonsense! Once the number is stored, I know it. I know its address in memory. It’s no longer random. Unless its address is as complex as the number itself.

What do you mean? It’s exactly like the phone number example.

Oh! I see. I’ve got one terabyte of memory. So I would need 40 bits to retrieve any stored item if I have no clue where it is located. Yes. And an 8-digit decimal number can be described with only log2(10^8) ≈ 27 bits.

And so, much information stored in my computer is random to me. Yes. As long as you have not retrieved it.

Even if that information turns out to be meaningful afterwards? Yes. I’ve been told that we will learn more about this in the last chapter.

I’m curious about it. 🤨 Me too.









Using Complexity in mathematical proofs

Problem: Can we use Complexity to prove theorems, though it is not computable?
Kolmogorov complexity can be used as a tool to discover mathematical proofs, by implementing a form of reasoning called the Incompressibility Method. The key idea of this method is to acknowledge the existence of incompressible strings.

The infinitude of primes proved by the incompressibility method

[Video: Algorithmic Information and the infinitude of primes (Understanding Artificial Intelligence through Algorithmic Information Theory, aiai.telecom-paris.fr)]

2 3 5 7 11 13 17 19 23 29 31 37 41 43 47 53 ...

    

We know since Euclid that there are infinitely many prime numbers.

If there were only a finite number of them and \(K\) were the largest, then \(1+K!\) would be divisible by no prime up to \(K\), hence would itself be prime, and larger than \(K\).
Contradiction!
Gregory Chaitin found another illuminating proof, showing not only that there are infinitely many primes, but also that their number must grow at a certain pace.

Chaitin’s proof is based on an AIT argument. It has to do with the quantity of information contained in integers.

Let’s take a closer look
(see Chaitin, 2005 p.18).
Let’s call \(m\) the number of prime numbers that are smaller than \(n\) (or equal to \(n\) if \(n\) is itself prime).
Let’s call them \(p_1, \dotsc, p_m\).

It is well-known that \(n\) can be written as the product of powers of these numbers: $$n = p_1^{e_1} p_2^{e_2} \dotsc p_m^{e_m}.$$

Chaitin observed that if there were a finite number \(m = m_0\) of primes, this expression of \(n\) would be too compressed:

‟Of course, some numbers can be expressed extremely concisely: For example, \(2^{99999}\) is a very small expression for a very large number. And \(2^{2^{99999}}\) is an even more dramatic example.”

‟But these are exceptions. These are atypical.”

‟In general, \(n\) requires order of \(\log n\) characters, but a prime factorization [like the one above] with a fixed number of primes would only require order of \(\log \log n\) characters, and this isn’t enough characters to give that many [...] of different \(n\).”

Let’s look at Chaitin’s argument in more detail
(this presentation is borrowed from Mateo Espinosa.)

Each \(e_i\) can be expressed in about \(\log \log n\) bits.
Why?

Because by construction, the exponents are necessarily smaller than \(\log n\).
Why?

Because the largest possible exponent is when \(n\) is a power of 2: \(n = 2^k\) and \(k = \log n\).

So using the above description, we can express \(n\) in about \(m\times\log \log n\) bits.
Can we?

There is a problem, though. This only works if we can delimit the binary "words" that code for the \(e_i\). In other words, we must know their size \(\log \log n\) in advance.

We can easily fix this problem by prefixing the string with 111...10, with \(\log \log n\) ones and one zero.

So, with our method, \((m+1)\times\log \log n + 1\) bits are sufficient to represent \(n\).

The length of this representation is therefore an upper bound of the complexity \(K(n)\) of \(n\).     $$K(n) \leq (m+1)\times \log \log n + 1$$

We know from the invariance theorem that plain complexity \(C\) is smaller than prefix complexity \(K\) (up to a constant): $$C(n) \leq K(n) + O(1)$$ We also know from Chapter 2 that the complexity of most integers \(n\) is about \(\log n\).

Hence, for infinitely many values of \(n\), we have: $$\log n \leq (m+1)\times\log \log n + O(1).$$

If there were a finite number \(m = m_0\) of primes, this would be a contradiction: \(\log n\) grows faster than \(\log \log n\).

If we rename \(m\), which is the number of prime numbers that are smaller than \(n\), as \(m = \Pi(n)\), then for infinitely many \(n\):     $$\Pi(n) \geq \frac{\log n}{\log \log n} - 1.$$
This proves that there are infinitely many primes, and that their number grows at least as fast as the above expression.
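As a small numeric illustration (not part of the proof), one can compare \(\Pi(n)\) with the very loose bound \(\log n / \log \log n - 1\) derived above:

from math import log2

def primes_below(n):
    """Count the primes up to n with a simple sieve of Eratosthenes."""
    sieve = [True] * (n + 1)
    sieve[0:2] = [False, False]
    for p in range(2, int(n ** 0.5) + 1):
        if sieve[p]:
            sieve[p * p::p] = [False] * len(sieve[p * p::p])
    return sum(sieve)

for n in (10**3, 10**4, 10**5, 10**6):
    bound = log2(n) / log2(log2(n)) - 1
    print(n, primes_below(n), round(bound, 1))

The actual prime counts are far above the bound, which only guarantees infinitude and a minimal growth rate.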

Infinitude of primes
What is the main argument on which Chaitin’s proof of the infinitude of primes is based?

    Prime numbers that divide a number \(n\) are close to \(\log n\).
    Prime numbers that divide a number \(n\) are smaller than \(\sqrt n\).
    Prime factors can be described with \(\log n\) bits, where \(n\) is the factorized number.
    Prime factorization exponents can be described with \(\log q\) bits, where \(q\) is the number of bits to describe \(n\).
    Prime factors can be described with \(\log q\) bits, where \(q\) is the number of bits to describe \(n\).
    Prime factorization is a compressing method.

    

Gödel’s theorem revisited

Problem: Is it true that AIT puts limits on what mathematics can do?
In 1931, the mathematician Kurt Gödel published theorems that are often considered the greatest revolution in mathematics ever. He showed that there are true mathematical statements that cannot be proved within a given axiomatic system.

In 1974, Gregory Chaitin used AIT to revisit Gödel’s results and make them appear almost obvious. A mathematical theory is nothing more than a compressed version of its theorems, and this has far-reaching consequences.


Here we consider K as an integer function, which means that we restrict it to finite binary sequences that we interpret as integers.

Chaitin in 1974 established a result that entails Gödel’s first incompleteness theorem.

Chaitin’s theorem

Statements such as K(n) > m
cannot be proved above a certain value of m,
though they are true for infinitely many integers n.

The proof of Chaitin’s theorem is amazingly simple (contrary to Gödel’s historic proof).
Chaitin’s incompleteness theorem
Select the main argument used by Chaitin in his theorem.

    Proving \(K(n) \gt m\) makes \(n\) too simple.
    \(K(n) \gt m\) cannot be true if \(m\) is large enough.
    \(K(n)\) is smaller than most integers \(m\).
    One can always increase \(n\) until \(K(n)\) is larger than \(m\).

    


Chaitin’s number Omega

Problem: How can a number be truly "random"?
In 1974, Gregory Chaitin succeeded in defining a random number Ω. This sounds like a contradiction. How can we define something that remains random?

Omega
Why should we consider Ω as random?

    Because Omega is a probability.
    Because no digit of Omega is computable.
    Because although Turing machines are deterministic, their halting is not.
    Because the computation of some of Omega’s digits may loop indefinitely.

    

Question: Does it make sense to "randomly pick any particular real number between 0 and 1"?

Answer:   No! If ever you can "pick" it, then it must be simple. But virtually all real numbers are complex.

In mathematics, one is often asked to choose "any" real number, for instance in [0,1]. Real numbers between 0 and 1 are in correspondence with binary sequences (finite or infinite) (just consider the binary expansion of a number such as 0.1011000010... to see this). We can therefore talk about the complexity of real numbers.

Suppose you can "pick" a real number r in [0,1]. This means that you have a computable procedure p that outputs r. If so, p can be used to define r; therefore K(r) ≤ |p| + O(1), where |p| is the length in bits of p. As a result, r (considered as a binary sequence) is compressible down to a finite quantity and is therefore non-random.

We know, however, that most finite binary sequences are incompressible. This result can easily be extended to infinite sequences: all of them are incompressible, except for a subset of measure zero. In other words, the subset of real numbers that are compressible forms a set of measure zero.
This means that you can’t pick any real number, just the simple ones, which means... virtually nothing.
Definable numbers
Now consider all numbers that can be defined in mathematics (i.e. in a theory such as set theory). It includes rational numbers, \(\sqrt {42}, \pi\), etc. What can we say about the set of definable numbers?

    Mathematical formulas are like programs, so definable numbers are computable real numbers.
    There are only countably many mathematical formulas, so most real numbers cannot be captured by formulas.
    This set includes all real numbers, even an incompressible number such as Omega.

    


Interview










Recap

Take-home message

We hope you enjoyed this chapter on AIT and maths.
Let’s leave the last word to Angel@ and Alexandr@.

I learned a lot in this chapter. Aren’t you "disturbed" by the incomputability of Kolmogorov complexity?

No. Why would I feel "disturbed"?
I understand that it’s a limit, and that you never know how far you are from it.
It reminds me of Chaitin’s number Omega. Knowing that you can approximate it is useless.

Oh yes, I remember. It’s because you can’t know how good your approximation is. My favorite topic in this chapter is algorithmic probability.

Surprising, isn’t it? Now, probability is a computer science notion! And what about the need to use prefix programs exclusively?

So what? Did you fully grasp why it’s essential?

Oh yes. It’s just to ensure that programs are exclusive of each other. And you? Which topic was your favorite in this chapter?

Oh, there were so many exciting things. Maybe that we get a proper definition of randomness. Yes, and that statistics have nothing to say about it. It’s a purely algorithmic definition.

And also Gödel’s theorem. Yes.
I had heard about it, but as a very tricky theorem to prove.

And now, it seems almost obvious to me. Any axiomatic system is limited by its own complexity. Yes! It makes me feel intelligent now that I have this broader view on mathematics, thanks to AIT.

And at the same time, it’s a little bit disturbing to know that our mind must be limited as well... Ha! You see? I knew you would feel disturbed at some point! 😊


Test (1)
For a given Universal Turing Machine \(U\), there exists a specific Universal Turing Machine \(V\) which can compute the complexity \(C_U\) relative to machine \(U\).

    True
    False

    

Test (2)
Prefix complexity is defined on Turing machines on which all programs terminate.

    True
    False

    

Test (3)
The programs for which a prefix Turing machine halts correspond to the code-words of a prefix code.

    True
    False

    

Test (4)
Can a decimal sequence of length 100 be random if it appears in the decimal development of \(\pi\)?

    Yes
    No

    

Test (5)
Chaitin’s version of Gödel’s (first) incompleteness theorem proves that the complexity of some integers is so large that they cannot be proven to be that complex.

    True
    False

    

Test (6)
Chaitin’s number Omega is the probability that programs will stop when executed. Since proofs are programs, we may imagine that Omega can be used to decide whether Goldbach’s conjecture is true or not, by looking at a specific location in its development. If so, would it constitute a proof of the conjecture?

    Yes
    No

    

    
Forum: Chapter3
Do you have feedback or suggestions about the topics addressed in this chapter?

        










Bibliography
