IA325/IA703: Algorithmic Information and Artificial Intelligence

Lecturers:
Pierre-Alexandre Murena and Etienne Houzé

Acknowledgments

This course was initially taught at Telecom Paris, a French "Grande Ecole", now part of Institut Polytechnique de Paris (and also the birthplace of the word "Telecommunication" some 120 years ago). It has been significantly improved and augmented to become a MOOC on the EdX platform.
This upgrade into a MOOC was made possible under the lead of Institut Mines-Telecom, thanks to a grant of the Patrick and Lina Drahi Foundation.

# Describing data

What you will learn in this chapter:
• You will discover that there is a hidden link between information and complexity.
• You will discover how computer science defines information, based exclusively on coding and computation.
• You will know why three mathematicians discovered the same notion of complexity at about the same time.
• You will learn how to estimate the complexity of passwords, of π, of lottery draws and of the cups in your grandmother’s cupboard.

Marvin Minsky, one of the founding fathers of Artificial Intelligence, said about Algorithmic Information Theory:

"Everybody should learn all about it
and spend the rest of their lives working on it."

Algorithmic information is used in various sciences:
• data science: see Chapter 2
• mathematics: see Chapter 3
• machine learning: see Chapter 4
• cognitive sciences: see Chapter 5
• biology: see for instance Vitanyi, P. & Cilibrasi, R. (2020). Phylogeny of the COVID-19 Virus SARS-CoV-2 by Compression. bioRxiv, 10.1101/2020.07.22.216242.
• physics: see for instance Bennett, C. H. (1988). Logical depth and physical complexity. In R. Herken (Ed.), The universal Turing machine: A half-century survey, 227-257. Oxford: Oxford University Press.
• neurosciences: see for instance Szczepański, J., Amigó, J. M., Wajnryb, E. & Sanchez-Vives, M. V. (2003). Application of Lempel-Ziv complexity to the analysis of neural discharges. Network: Computation in Neural Systems, 14 (2), 335-350.
• linguistics: see for instance Brighton, H. & Kirby, S. (2001). The survival of the smallest: stability conditions for the cultural evolution of compositional language. LNAI Vol. 2159, 592-601. Berlin: Springer Verlag.
• sociology: see for instance Butts, C. T. (2001). The complexity of social networks: theoretical and empirical findings. Social Networks, 23 (1), 31-72.
• ecology: see for instance Dakos, V. & Soler-Toscano, F. (2017). Measuring complexity to infer changes in the dynamics of ecological systems under stress. Ecological Complexity, 32, 144-155.
• ethology: see for instance Ryabko, B. & Reznikova, Z. (2009). The use of ideas of information theory for studying "language" and intelligence in ants. Entropy, 11, 836-853.

Conversely, it is hard to find a science that would be immune to its influence.
Algorithmic Information’s ambition is to describe what it means for an intelligent entity, be it artificial or natural, to make sense of the world.

Understanding Artificial Intelligence
through Algorithmic Information Theory

Algorithmic Information Theory
and Complexity

aiai.telecom-paris.fr

Let’s consider these two pictures, for instance. The one on the right shows something that is easily recognizable, a "T".

It makes sense to us, because of the structure we recognize. The picture on the left, however, doesn’t make much sense.

According to Algorithmic Information Theory (AIT), the "T" picture "makes more sense" to us because we get a simpler description if the pixels are grouped to form the letter "T" than if the pixels are scattered anywhere.
AIT’s notions such as complexity and simplicity are central to artificial intelligence. One of the main purposes of Artificial intelligence is to make sense of data.

What does it mean?

'abababbababaabababbababaabababbababaabababbababa'

Consider the above string.
You may describe it in Python as...
• a string of 48 characters:
→    S = ''.join(['a', 'b', 'a', 'b', 'a', ...])
• an alternation of two strings:
→    S = ('ababab' + 'bababa') * 4

• a pattern of patterns:
→    S = ('ab' * 3 + 'ba' * 3) * 4

• repeating a mirrored pattern:
→    T = 'ab' * 3; S = (T + T[::-1]) * 4

• repeating 'ab' and deleting every 7th character:
→    T = list('ab' * 28); [T.pop(6 * n) for n in range(1, 9)]; S = ''.join(T)

S = ''.join(['a', 'b', 'a', 'b', 'a', ...])
S = ('ababab' + 'bababa') * 4
T = 'ab' * 3; S = (T + T[::-1]) * 4
T = list('ab' * 28); [T.pop(6 * n) for n in range(1,9)]; S = ''.join(T)

Which would be considered the "most intelligent" description?

According to Algorithmic Information Theory,
the "most intelligent" description is the least complex one,
that is, the most concise one.

The most concise solution might not be the one we think it is at first glance. We must convert these representations into bits to decide.
Ray Solomonoff, the first of the three mathematicians who invented the notion of complexity in the mid-1960s, knew that he was paving the way for artificial intelligence.

He understood that intelligent processing has to do with simplicity, and that simplicity has to do with coding. But not any coding.
The complexity of an object is assessed by finding
the shortest among the available binary descriptions
of that object.

So we will start by talking about description length.

## Description length

The information contained in an object is estimated by the length of its binary representation.
Intuitively, if an object contains a lot of information, then it cannot be described in a concise manner and should be considered complex.
This definition applies to any object or situation, as long as this object or situation can be coded to be represented in a computer.

In this chapter, we will talk a lot about coding, as it is the most basic way to measure information content.
Subsequent chapters will introduce alternative ways to assess complexity.

### Complexity of objects within sets

Problem: In a collection of objects, do the elements have the same complexity?
On many occasions, the complexity of objects is easy to estimate, just because these objects are already stored in the computer’s memory. Consider, for instance, the alphabet: a, b, c,... , z. It can be retrieved as a list. In Python as in most computer languages, letters can be addressed through their ASCII code:

>>> Alphabet = [chr(c) for c in range(97,123)]
>>> print(Alphabet)
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']

Once the alphabet is made available as a list, each letter can be retrieved by its rank on the list. So its complexity is estimated by the length of the binary representation of its rank.
For instance, one can designate the letter k by mentioning its position: 11 within the alphabet. An estimate of its complexity is 4, as we can code 11 with 4 bits (1011).
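This estimate is easy to reproduce (a small sketch; the list Alphabet and the rank convention come from the text above):

```python
# Build the alphabet as before, then estimate the complexity of 'k'
# by the length of the binary representation of its rank.
Alphabet = [chr(c) for c in range(97, 123)]
rank = Alphabet.index('k') + 1           # 'k' is the 11th letter
binary_rank = bin(rank)[2:]              # '1011'
complexity_estimate = len(binary_rank)   # 4 bits
```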

Question: Isn’t it easier just to say 'k' instead of 1011?

Answer:   No. 'k' is just a convenient representation of an ASCII code, 107, which is 1101011 in binary form.

Here is what we are able to do so far:

When an object is stored in a list,
its complexity within the list can be estimated
by the length of the binary representation of its rank.
 About 57 000 people took part in the Paris marathon. You remember a runner and want to check whether you know her. All you know is that she arrived 37th on the finish line. Once the ranking with the names of participants is made available, how many bits do the organizers need to retrieve her name?

16 bits     11 bits     6 bits     5 bits

The letter ‘k’ is a simple character in Python.
What about the character ‘♖’ (white chess rook)?
Its Unicode rank is 9814 (10011001010110 in binary form).
So we may estimate the complexity of ‘♖’ as being about 14 (the length of 10011001010110).

Question: Can we assess the complexity of a character if we don’t know its exact Unicode value?

Answer:   Yes. Since there are a total of 1 114 112 different Unicode characters,
we can be sure that the length of its binary representation is at most 21 bits
(you can check: int('100010000000000000000', 2) equals 1 114 112, and len('100010000000000000000') equals 21).

More generally:

When an object belongs to a set of size N,
its complexity in that set cannot exceed ⌈log2(N)⌉.
This is because ⌈log2(N)⌉ bits are sufficient to represent any rank in a set of size N, ordered by whatever means.
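This bound is easy to check numerically (a small sketch; the Unicode count comes from the question above):

```python
from math import ceil, log2

def max_bits(N):
    # No element of a set of size N needs more than ceil(log2(N)) bits
    # to be designated by its rank, whatever ordering is chosen.
    return ceil(log2(N))

# Example: any of the 1 114 112 Unicode characters fits in 21 bits.
```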

Estimating the complexity of an object this way is useful whenever it belongs to a set that has no obvious order.
Consider for instance the cups in your grandmother’s cupboard. They may be ordered from left to right and bottom to top (if stacked up), or by colour and size if they are different, and so on. We suppose that you wish to designate one of the cups using such an ordering method.
 How many bits would you need to designate one specific cup in your grandmother’s cupboard, if there is a total of 62 cups in it?

16 bits     11 bits     6 bits     5 bits

For now, what we know about measuring the complexity of an object is this:
• we must find a code to represent the object in the computer
(if the object is stored in a list, the code can be the rank in the list),
• we express the code in binary form,
• we measure the length of this binary form.
But this is not sufficient. The notion of complexity is more demanding: one should find the shortest possible binary code. So any code that saves bits brings us closer to the true complexity of the entity.
 Marie Martin is a French citizen you do not know. Her social security number is 2 36 07 24 322 986. Representing citizens by their social security number does not offer the maximally compact representation, though. Can you provide a good estimate of the complexity of Marie Martin, knowing that she is a living French citizen?

43 bits     35 bits     27 bits
22 bits     11 bits     8 bits

 Suppose the only emojis available on your phone app are just these: 👺 🤖 ❣ 🕳 ㊙ ⚪ 🛫. How complex is the most complex emoji among them?

3 bits     8 bits     26 bits     127 bits     1 264 bits

Consider these 12 objects.

 Which one is the simplest object in this set?

One of the squares
The odd shape

### Description complexity of integers

Problem: Can numbers be simpler than the length of their binary representation?
Integers are central to an algorithmic information approach to artificial intelligence.
| number | binary representation | length |
|--------|-----------------------|--------|
| 1 | 1 | 1 |
| 2 | 10 | 2 |
| 3 | 11 | 2 |
| 4 | 100 | 3 |
| 5 | 101 | 3 |
| 6 | 110 | 3 |
| ... | ... | ... |
| 10 | 1010 | 4 |
| 11 | 1011 | 4 |
• Integers are used to retrieve objects by their rank in lists. Even seemingly complex objects such as convoluted emojis are no more complex than their address in the set of emojis. And this address is nothing but an integer.
• Integers are much more important than that. Integers are involved in the formal description of most objects AI is dealing with.
• Integers are objects of interest in themselves as well, because of their role in mathematics (note that "real" numbers with limited precision are, in fact, pairs of integers).

So let’s find a way to measure the complexity of integers.

Mathematics (since Leibniz and, before him, ancient Chinese scientists) and Computer science like to represent numbers with bits. A first estimate of complexity consists of counting the number of bits in the standard binary representation (see table).

One can see that with this method, the complexity of n, as measured by the length of its binary representation, is:

⌈log2(1+n)⌉

where ⌈x⌉ (with upper brackets) designates the smallest integer not smaller than x.

This method, however, totally ignores the existence of "round" numbers. This is unfortunate. A round number like one billion (1 000 000 000 in base 10) looks much simpler than the 30 bits of its binary representation (111011100110101100101000000000).

Question: Why do we ignore space separators when counting the length of round numbers' compact codes?

Answer:   Separators are not symbols. You can’t have two separators in a row. They may result from physical signal processing, as for the Morse code, or from low-level word segmentation (e.g. lists of strings in Python: '0010 11010 111' ↔ ['0010','11010','111']).
(Note: In Chapter 3, we will consider the possibility of eliminating all space separators altogether.)

The above video proposed a compact coding method for integers:

compact code for integers

To code n, we consider the standard binary code of (n+2) and we drop the initial 1 (which is redundant).
We then reintroduce a separated heading bit: 1 for "normal" numbers, and 0 for "round" numbers.
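A sketch of this two-part code in Python (our own implementation of the convention just described; a "round" number is taken here to be any multiple of 10, written m × 10^z with m not divisible by 10):

```python
def drop_leading_one(n):
    # standard binary code of (n + 2) with its redundant initial 1 removed
    return bin(n + 2)[3:]

def compact_code(n):
    """Compact integer code, space-delimited as in the text:
    heading bit 1 + code(n) for "normal" numbers,
    heading bit 0 + code(m) + code(z) for round numbers m * 10**z."""
    if n > 0 and n % 10 == 0:
        m, z = n, 0
        while m % 10 == 0:
            m, z = m // 10, z + 1
        return '0 ' + drop_leading_one(m) + ' ' + drop_leading_one(z)
    return '1 ' + drop_leading_one(n)
```

For instance, compact_code(37) yields '1 00111', while a round number such as 5000 gets the much shorter '0 11 01'.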
 Using the coding convention that was just presented in the video, what would be a good coding for 500?

1 11110100        0 01 11
0 11 00         1 01 11
1 1101         1 11110110

 Which number does the following code represent: 0 110 11?

5000000000000     500000     1200000     120000


Our coding system acknowledges the fact that round numbers are simple. What about their immediate neighbours? Intuitively, 5000000011 isn’t much more complex than 5000000000, as it can be expressed as 5000000000 + 11.

The coding method illustrated in the figures below captures this idea. We prefix the code by 00 for a positive shift from the round number, and by 01 for a negative shift. Then we indicate the two-part code of the round number, and finally the compact code of the shift.

Caveat: These codes use space delimiters, as in the Morse code.

 Based on the suggested coding, provide a short binary representation of 1000008?

### Complexity is “continuous”

Problem: Does the complexity of integers vary erratically or smoothly?
Thanks to the previous coding method, round numbers and their neighbours receive shorter codes than other numbers of the same magnitude. The two plots below show the complexity of each number n on the x-axis. They have been drawn using the above compact integer coding method, implemented in . The purple curve indicates log2(n) for reference. On a large scale (first image), we can see that the compact code length roughly evolves like log(n). Note that the complexity cannot exceed log2(n) by more than one bit. We can observe the round-number effect: complexity drops for multiples of 10, and even more so for multiples of 100 and of 1000. A zoom around 50000 (second image) shows that C(n) is roughly "continuous".

To capture the fact that the complexity of integers is a "quasi-continuous" function, we may write that for any integers n and h:

|C(n+h) – C(n)| < f(|h|) + O(1)

where f is an increasing function of |h| such that f(0) = 0 and O(1) is a constant.
 Which function $$f$$ among the following is the most appropriate to capture the "quasi-continuity" of complexity?

a.    $$|h|$$
b.    $$2 \times |h|$$
c.    $$log_2(1 + |h|)$$
d.    $$log_2(1 + log_2(1 + |h|))$$
e.    $$exp(|h|) - 1$$

• Suggestion: Improve , which computes compact codes for integers. In its current state, it takes advantage of the proximity of round numbers. Imagine additional ways of compacting integers (e.g. multiples or divisors, repetitive sequences, famous numbers based on Google hits, etc.) and integrate them into the code.

### Complexity and structure

Problem: Can we define the complexity of new objects, based on their structure?


What is the complexity of this figure?

We can describe it
• pixel by pixel, or observe that there is
• a blue disk on top of a red (mmm... not really red) square.

The second description (disk on square) is more likely to convey the idea that it is a really simple image. How can we be sure that this representation is simpler than the other one?

Answer:    by coding these representations in binary form and by comparing their lengths in bits.
Ok, but there can be a long way from the object to its binary representation.

Objects or situations manipulated by computers are usually coded into formal representations, such as:

disc(X), blue(X), size(4),
square(Y), red(Y), size(4),
on(X,Y)

This representation can then be converted (through whatever means) into binary form. The length of this binary form gives us an idea of the complexity of the object.

To sum up, one way of estimating algorithmic information consists of
1. finding a formal representation for this object,
2. coding this representation in binary form,
3. measuring its description length as the length of this binary representation.

The intermediate, formal representation represents the structure of the object.

Finding structures for objects is a big challenge of AI.

The most intelligent structure, according to AIT, is the one that turns out to be the most concise in number of bits.

For now, we are able to estimate the complexity of integers and of objects that are stored in memory (through the complexity of their address). What about new objects?

Consider, for instance, the case of passwords. When looking for a good password, we try to outsmart machines. This situation offers a nice illustration of the importance of complexity.

Question: Complex passwords are impossible to remember. Can we define what a good password should be?

Answer:   A good password should be easy to remember, but hard to guess. In AIT’s terms, it should be complex to anyone but you.

 How would you rank the complexity of the following passwords (from 1=simplest to 6=most complex)?

efghijkl         aabbccdd         abcdefgh
amoijdfs         aaaaaaaa         abcdabcd

Passwords are character strings. Some strings are obviously simpler than others. How can we quantify this? A first step consists of finding a compact formal representation of the string, such as repetition(a, 8), then translating this formal representation into bits using some appropriate code, and finally counting the bits.

 Using the coding scheme presented in the video, determine which string is represented by: mapping(increment(1,a,4), increment(1,X,2)).

cdef        aabbccdd
aaaaaaaa    abbccdde
abcdabcd    ee

 Provide an expression that represents abcdefgh.

| letter | position | code |
|--------|----------|------|
| a | 1 | 0 |
| b | 2 | 1 |
| c | 3 | 00 |
| d | 4 | 01 |
| e | 5 | 10 |
| f | 6 | 11 |
| g | 7 | 000 |
| h | 8 | 001 |

Suppose we want to measure the complexity of a password such as abcdefgh. We can code for the letters one by one. The codes that appear in the video assume standard binary code. We can get a slightly more compact code using our compact integer coding scheme (see table).
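The letter codes in the table can be generated mechanically (a sketch; we infer from the table that the rank p of a letter is coded as the binary form of p + 1 stripped of its leading 1):

```python
def letter_code(ch):
    # rank of the letter in the alphabet (a = 1, b = 2, ...)
    p = ord(ch) - ord('a') + 1
    # binary code of (p + 1), without its redundant leading 1
    return bin(p + 1)[3:]

codes = {ch: letter_code(ch) for ch in 'abcdefgh'}
# {'a': '0', 'b': '1', 'c': '00', ..., 'h': '001'}
total = sum(len(c) for c in codes.values())  # 16 bits for abcdefgh
```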

If we add up all the bits used in codes, we get a total of 16 bits for abcdefgh (not counting space delimiters). But we can do better than that, if we consider that abcdefgh can be represented as increment(1, a, 8). To do so, suppose that we limit ourselves to the following list of available operators, with the corresponding codes.

| operator | repetition | increment | mapping |
|----------|------------|-----------|---------|
| code | 0 | 1 | 00 |

Suppose now that we consider three kinds of entities, coded accordingly:

| entity | operator | integer | letter |
|--------|----------|---------|--------|
| code | 0 | 1 | 00 |

Using this coding scheme, an object like letter e will be coded as 00 10, where the first two bits designate letters and the last two bits designate the rank of e among letters.
 What would be the correct code to designate number 6 in our small language?

000
110
1 000
1 110
00 000
11 110

 What would be the correct code to designate operator mapping in our small language?

0 00
0 0 00
000
00

 What would be the correct code to designate letter b in our small language?

10
00 11
00 1
00

Since we are able to represent abcdefgh as increment(1, a, 8), we can now translate it into a binary representation. We still use our compact integer coding system. Note that when the type of the object is known, we can avoid using a complete code and thus get an overall shorter representation.

| element | code | comment |
|---------|------|---------|
| increment | 0 1 | 1 = designation of increment as operator |
| 1 | 1 | an integer is expected |
| a | 00 0 | complete code is expected (here: letter + a) |
| 8 | 010 | an integer is expected |

We end up with the following code to represent abcdefgh: 0 1 1 00 0 010. It is 9 bits long (ignoring space delimiters), which is more compact than the 16 bits of the letter-by-letter code (note: the code is decodable only if we keep the spaces between code words). Our coding scheme is thus sufficient to provide an estimate of the description complexity of alphabetic strings.
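Putting the three tables together, the 9-bit code can be reproduced programmatically (a sketch; the helper names are ours, the codes come from the tables above):

```python
ENTITY = {'operator': '0', 'integer': '1', 'letter': '00'}
OPERATOR = {'repetition': '0', 'increment': '1', 'mapping': '00'}

def int_code(n):
    # compact integer code: binary(n + 2) without its leading 1
    return bin(n + 2)[3:]

def letter_code(ch):
    # rank code: binary(rank + 1) without its leading 1
    return bin(ord(ch) - ord('a') + 2)[3:]

def encode_increment(step, start, length):
    """Code of increment(step, start, length). Where the argument type is
    known, the entity prefix is dropped; the letter keeps its complete code."""
    return ' '.join([
        ENTITY['operator'] + ' ' + OPERATOR['increment'],  # 0 1
        int_code(step),                                    # an integer is expected
        ENTITY['letter'] + ' ' + letter_code(start),       # complete code: letter + rank
        int_code(length),                                  # an integer is expected
    ])

# encode_increment(1, 'a', 8) -> '0 1 1 00 0 010': 9 bits without spaces
```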
 Using the same coding method, provide a binary representation of abcdabcd?

Note that the resulting description length of abcdabcd is larger than the description length of abcdefgh. This might explain why abcdabcd is perceived as more complex than abcdefgh.

 What do you think are the qualities of a good password (w.r.t. complexity)?

- I feel that I can compute the informational content of any object, now.
- Compute? Or maybe just assess?
- Yes. Take this playing card. I just need a description of it, and then I measure the length of that description.
- I don’t think you have to describe the structure of the card, element by element or pixel by pixel!
- No! Since it is a playing card of my deck, I just have to distinguish it from the other cards.
- You mean, by saying that it is the ten of spades?
- Yes. Or since there are 32 cards in the deck, I’m sure that the complexity can’t exceed log2(32) = 5 bits.
- And did you get the point about the continuity of complexity?
- Sure. It was about the complexity of integers.
- So complex integers must be far away from simple ones?
- Yes, it’s obvious to me. If a number is simple, a round number for instance, you take it as a proxy to describe its neighbors.
- And so the neighbors can’t be complex?
- No, because you just need to code the distance from the proxy number, and if this distance is small, it is simple as well.
- I see. So we are able to measure the information content of any object, including numbers.
- Yes. And we can compare all objects in the world, based on how much they "weigh" in bits.
- I wonder what is left to be learned about algorithmic information.

## History of the concept of complexity

Problem: The notion of Complexity was invented three times simultaneously and independently. How?
AIT’s fundamental idea is that the informational content of an object can be measured by its shortest binary representation.
This is a profound idea, that did not occur to thinkers until the mid-twentieth century.
And then, quite remarkably, three mathematicians got the idea at the same time independently!

## Defining description complexity

Problem: Complexity is an intuitive notion. Can we define it formally?


Informally, the description complexity of an object s can be estimated:

• by finding a short binary code p(s) that represents s
• and then by measuring the length of that binary representation of s.

To be a faithful image of s,
p(s) must be decodable into s.

For p(s) to be decodable into s, we should be able to feed p(s) into a decoding machine that will then reconstruct s.

We can regard p(s) as a program.

More generally, we are interested in any program p that, once decoded, outputs s on the decoding machine.


The description complexity of s is the size of the shortest of these programs.

To make things more formal, we need to consider an ideal decoding machine that is able to perform any computation.

Such a machine came into existence for the first time in 1936... in the mind of the great mathematician Alan Turing.

The machine that Turing imagined can be seen as an ideal computer with no memory limitations. Those idealized computers are called universal Turing machines.

The definition of algorithmic information is based on the idea of computation. Calculus, algorithms and proofs are forms of computations that have always been central to mathematical activity. However, a proper definition of computation had to wait until 1936 when Alan Turing defined what is now known as Turing machines.

Examples of Turing machines running on two simple problems

The description complexity of s, also called Kolmogorov complexity, is the size of the shortest program that outputs s when run on a given universal Turing machine U.

Kolmogorov Complexity

CU(s) = minp { |p| : U(p) = s }.

Expression |p| means the length of program p.
Intuitively, CU(s) measures the length of s when all thinkable redundancy has been removed.

Question: Can we compute CU(s)?

Answer:   No! One can never be sure to have found the shortest program that generates s. However, one can compute approximations of CU(s) in limited time, and this is what really matters for AI. The issue of the non-computability of C will be addressed in Chapter 3.

 Suppose you are about to sort $$n$$ data items. Your computer uses a library of 1000 basic programs; sort is one of them. How can we estimate the description complexity of sort as a program?

500 bits                     10 bits
$$n \times log(n)$$ bits             $$n^2$$ bits
1 kbit

## The invariance theorem

Problem: Complexity depends on who is measuring it!
At first sight, description complexity seems to be totally subjective. What is simple for you might be complex for me (think of your phone number). More formally, let’s consider all binary sequences of size N. There are 2^N of them. Some look fairly simple:

0000000000000000000000000000000000000000000000000000
1111111111111111111111111111111111111111111111111111
0101010101010101010101010101010101010101010101010101

Some look more complex:

0010100101110110101101000101101011011010001011001011
 We may sort the $$2^N$$ different sequences of length $$N$$ in any order we wish and store them in that order (e.g. into a Python list). If $$N = 52$$, how simple can the following sequence become after such an operation? 0010100101110110101101000101101011011010001011001011

1 bit     26 bits     52 bits     53 bits

Should we conclude that description complexity is a totally subjective notion? Could it be that some remote mathematical culture would legitimately regard π as complex while other cultures would find it simple?
 On March 14th (note the date, 3/14 in US notation), 2019, Emma Haruka Iwao broke the record of the number of decimal digits computed for $$\pi$$. She determined the first 31 415 926 535 897 digits (over 31 trillion! Note the chosen number) using 96 CPUs over 121 days. In what way can we say that $$\pi$$ is less complex now than it was in the pre-computer era?

in no way.
those digits are still complex, but less than before as they were nearly impossible to compute.
those digits were complex, but now their complexity is zero.

The Invariance theorem puts this idea of "perspective" into perspective.

The invariance theorem states that discrepancies in the evaluation of Kolmogorov complexity are bounded.

Invariance theorem

∀s: |CU(s) – CV(s)| < cU,V

This means that by changing machine U for another universal Turing machine V, the complexity of an object s may be affected at most by a constant cU,V, which depends on both machines, but not on s.
 Does the invariance theorem hold if $$U$$ and $$V$$ designate the same universal Turing machine equipped with two different programming languages $$L_1$$ and $$L_2$$ (such as Java and Python)?

No, because the invariance theorem holds for different machines.
Yes, because we may translate programs from one language into the other.
No, because translation from one language into the other may make some objects much more complex.

The Invariance theorem is good news. It introduces some objectivity in complexity computations, irrespective of the machine or the programming language used. The complexity of π, for instance, will remain finite on any universal Turing machine you consider. π can be computed using a simple series.

π = 4 * (1 − 1/3 + 1/5 − 1/7 + 1/9 − 1/11 + . . .)

In Python, you may approach this series with the function below, where N controls the number of iterations:

def pi(N):
    PiDividedByFour = 0
    for i in range(N):
        PiDividedByFour += (-1)**i / float(2*i + 1)
    return 4 * PiDividedByFour
 Execute the program for N = 60000 and N = 60001 ($ python Pi.py 60000, then $ python Pi.py 60001). What is the precision we get for $$\pi$$?

$$10^{-3}$$         $$10^{-4}$$         $$10^{-5}$$

This code provides the Nth approximation of π. You may embed this code into an infinite loop, or, if you are very patient, you may prefer the recursive version, which you use by calling pi(0):

def pi(N): return pi(N+1) + 4*(-1)**N / float(2*N+1)

You need to wait for the end of eternity to get the result (provided your computer has infinite memory)! (By the way, our current Python interpreter limits us to 800 recursive calls; you may test yours by adding the exit line: if N >=800: return 0.) Does this mean that only Godly Beings may use these algorithmic definitions of π? In a way yes, but this is not the point. The point is that we do have a finite definition (i.e. without ". . .") of π. This definition provides an upper bound of its complexity C(π).

### Choosing the machine

In a previous exercise, we changed the code in order to turn a complex sequence into a simple one. We can do this for any object, a book for instance. There is a machine (or a code) V in which the content of a book such as the famous An introduction to Kolmogorov complexity and its applications (IKC) has a complexity of only one bit. How can we compress a whole book down to a single bit? This is possible if the book is stored in machine V’s memory. V performs the same computations as a standard universal Turing machine U, but interprets the first bit of any program p in a specific way.

# procedure defining V
if p[0] == 0:
    print(IKC_content)
else:
    return U(p[1:])

The machine V defined by the above procedure is a universal Turing machine, exactly like U. Its particularity is to pay attention to the first bit of any program, just in case its value happens to be 0, in which case it prints IKC. For this strange machine, CV(IKC) = 1, since 1 bit is sufficient to retrieve the whole book. On another machine U that does not hold IKC in its memory, however, CU(IKC) is the length of the shortest available identification of the book. If the book can be retrieved from an external resource such as the publisher’s Website, complexity might remain limited (e.g. equal to the complexity of the ISBN number). In the worst case in which no external resource is available, the description of the book would have to contain some compressed version of its content. The consequence is that the cU,V constant may be of the same order of magnitude as the complexities we are interested in.

Note that V holds IKC’s content in its memory, whereas a representation of its content must be present in the program fed to U. More generally, a machine can use the address of an object in its memory as a code for that object.
 Some Web services such as TinyUrl offer to compress long URLs. You may try to compress two URLs, a short one and a long one. What kind of compression method do you think the service is using? (several answers are admissible, choose one.)

The system uses a compressor to downsize address length.
The system uses a hash table to limit address length.
The system assigns growing alphanumeric shortcuts to addresses, depending on their order of arrival.

Isn’t is a problem that complexity would completely depend on the machine that measures it? It doesn’t! You have the invariance theorem.

Yes, I know. The divergence between two machines is bounded. Indeed.

But a whole book may be compressed down to one bit only! Yes, I agree that for finite objects, the divergence might be as large as the complexity itself.

Don’t you find it disturbing? If the book were on my desk, I’d just need to point to it.

But since the book is actually on my desk, you may point to my desk to make it simple. This only works because I know you well. You’re simple to me (no offense 😀).

Sure. It would be useless to mention that the book is on the desk of someone you don’t know. This makes me think that machine-dependence is a good thing after all!

What do you mean?

I mean that objects that are relevant to you, such as that book, should have a short description for you.

And not for you? By the way, I like the book too. But yes, it’s normal to diverge on what we consider relevant.

And so it’s normal to get different measures of complexity? Yes. I even begin to consider that it may be fundamental!

## Conditional complexity

Problem: Complexity does not only depend on the machine, but it also varies with the context!

Understanding Artificial Intelligence
through Algorithmic Information Theory

Conditional Complexity

aiai.telecom-paris.fr

Can Complexity be a well-defined notion,
knowing that it depends on context?

It seems that its value may change at any time!
For instance, the complexity of 638 is about log2(638) ≈ 10 bits,
but in a context in which 637 has just been mentioned,
one may simply write

638 = 637 + 1

and find that 638 can be coded in far fewer than 10 bits.

So the complexity of 638 varies through time!
Conditional complexity does indeed depend on context.

However, far from being a problem, it is essential for AI applications.
It allows background knowledge to be taken into account.
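A quick numeric check of the 638 example, using plain arithmetic that is independent of any particular coding scheme:

```python
# Direct description of 638: about log2(638) bits.
# Description relative to 637: just the code of the shift 638 - 637 = 1,
# plus a short fixed prefix -- a handful of bits instead of ten.
from math import log2

direct = log2(638)   # ~9.3 bits to write 638 in binary
shift = 638 - 637    # = 1: the shift itself fits in a single bit
print(round(direct, 1), shift)
```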

What is the description complexity of Bill Clinton?

Bill Clinton is much simpler to describe if one knows that he is a former US president.
The purpose of Conditional complexity is to express some piece of available knowledge explicitly and use it.
To unambiguously describe Bill Clinton
knowing that he was a US president,
one just needs to say that he appears in the kth position in the list of former US presidents, read in reverse order.

Using this piece of knowledge, Bill Clinton’s complexity is much less than if he were any US citizen.
We can say that the complexity of Bill Clinton
knowing that he is a former president is the complexity of k (which is about log2(k) bits),

plus one bit to tell the scan direction of the list (latest first).

We write:

C(’Bill Clinton’ | ’US president’) = log2(k) + 1

The vertical bar means ‘knowing that’.

This notation reminds us of conditional probability.
The meaning is roughly the same. However...
Conditional complexity is much more natural than conditional probability
(the conditional probability p(s1 | s2) of s1 "knowing that" s2 requires computing p(s1 & s2) and dividing it by p(s2); this definition of "knowing that" is not as intuitive as what comes next).
C(s1 | s2) just means that s2 is available for free when it comes to describing s1.

More precisely, the conditional Kolmogorov complexity of s1 relative to s2 is the size of the shortest program p that can produce s1 on a universal Turing machine
that gets s2 and p as input.

CU(s1 | s2) = minp {|p| : U(s2, p) = s1}.


This definition of Conditional Complexity will prove essential to apply Algorithmic Information Theory to Artificial Intelligence.

Conditional complexity

CU(s1 | s2) = minp{|p| : U(s2, p) = s1}.
Caution: in the definition of conditional complexity,
the notation U(s2, p) supposes that the machine U is given a way
(such as a delimiter) to distinguish s2 from p.
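To make the definition concrete, here is a brute-force sketch on a deliberately tiny (non-universal) toy machine of my own devising: the empty program copies s2, and a program starting with '1' outputs its own remaining bits. Minimizing over programs then reproduces C(x|x) = 0:

```python
from itertools import product

def U(s2: str, p: str):
    """Toy machine: '' copies the auxiliary input; '1'+bits outputs bits; else undefined."""
    if p == "":
        return s2
    if p.startswith("1"):
        return p[1:]
    return None

def C(s1: str, s2: str) -> int:
    """Brute-force C_U(s1 | s2): length of the shortest p such that U(s2, p) = s1."""
    for n in range(len(s1) + 2):          # try programs in order of increasing length
        for bits in product("01", repeat=n):
            if U(s2, "".join(bits)) == s1:
                return n
    raise ValueError("no program found")

print(C("1011", "1011"))   # 0: the empty 'copy' program suffices, so C(x|x) = 0
print(C("1011", "0000"))   # 5: the '1' flag plus the four literal bits
```

On a genuine universal machine this minimization is not computable; the toy machine only illustrates what is being minimized.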

Conditional complexity offers a way of taking context into account. In a context in which the number 637 has just been mentioned, 637 itself and numbers close to it become automatically simple. The following images show C(n) and C(n | 637) using our coding scheme. We can verify that C(637 | 637) = 0, and that numbers around 637 have low complexity (hence the trough around 637 in the conditional complexity curve).

[Figures: C(n) and C(n | 637) plotted against n]

Now consider the number 2811. Using our coding scheme for integers, one can see that the description of 2811 requires no more than 11 bits. The coding program can be used to verify that 2811 is coded using 2800 as a reference. Let’s first compute absolute complexities.

$ python
>>> import CompactIntegerCoding as CC
>>> CC.Coding(2811)
'00 1110 00 101'    # == '00' + <compact code of 2800> + <compact code of 11>
>>> CC.Coding(2800)
'0 1110 00'    # == <compact code of 2800> (with heading '0')
>>> CC.Coding(11)
'1 101'    # == <compact code of 11> (with heading '1')

Conditional complexity can be used to verify that the description complexity of 2811 can be even smaller if 2800 is already in the context. In the coding program, conditional complexity is implemented by using a 3-bit prefix: '100' for a positive shift, '101' for a negative shift.

$ python
>>> import CompactIntegerCoding as CC
>>> CC.ConditionalCoding(2811, 2800)
'100 1 101'    # note the heading '100' that marks conditional computation; '1 101' = code for 11
>>> CC.Coding(2800)        # to put 2800 into memory (in coded form)
'0 1110 00'
>>> CC.ConditionalDecoding('100 1 101')    # uses memory content to retrieve the actual code for 2811
2811

For this program, 2811 is represented by 100 1 101 when 2800 is in short-term memory, instead of the longer code 00 1110 00 101 that contains the code for 2800 in explicit form (1110 00, which means ‘28 followed by 2 (= 00) zeros').
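The same idea can be reproduced with a toy difference coder. This uses standard Elias gamma codes, not the actual CompactIntegerCoding code words, so the exact bit strings differ from those shown above, but the saving is of the same kind:

```python
def gamma(n: int) -> str:
    """Elias gamma code of n >= 1: (len-1) zeros, then n in binary (self-delimiting)."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

def conditional_coding(n: int, m: int) -> str:
    """Code n relative to m: '100'/'101' prefix for the sign of the shift, then gamma(|shift|)."""
    d = n - m
    return ("100" if d >= 0 else "101") + gamma(abs(d))

print(len(gamma(2811)))                     # 23 bits for the absolute description
print(len(conditional_coding(2811, 2800)))  # 10 bits once 2800 is in the context
```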
 Compute the conditional description length $$C(2789 | 2800)$$ (not counting space delimiters) given by the coding program. What do you get?

7 bits     8 bits     11 bits     12 bits

 How does conditional complexity C(x|y) compare to non-conditional complexity C(x)?

Conditional complexity is always smaller (up to a constant)
Conditional complexity is always larger (up to a constant)

### The Chain rule

Problem: Can we use Conditional complexity to compute complexity step-by-step?
Conditional complexity can be used to capitalize on what an agent (represented as a machine) may already know. It is also a great tool to compute estimates of description complexity, step-by-step, using the chain rule.

If we already know an object (think for instance of a number) s2, conditional complexity allows us to compute the complexity C(s1|s2) of another object s1 knowing s2. We would like to say that the complexity C(s1) of s1 is simply the complexity C(s2) of s2 plus C(s1|s2). This is almost what we have:

Chain rule 1 for "unambiguous" program codes

C(s1) < C(s2) + C(s1 | s2) + O(1).
Here O(1) designates some constant.
This is almost what we wanted! There is an inequality instead of an equality. It comes from the fact that the detour through s2 might not be optimal.

Caveat: The chain rule is true only if the machine can delimit programs (i.e. it can tell where the program that generates s2 ends and the program that generates s1 from s2 begins). This property is guaranteed if program codes are "unambiguous" (or uniquely decodable). A formal definition of this property will be given in Chapter 3.
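A quick numeric sanity check of the inequality, with Elias gamma code lengths standing in for complexities (a toy approximation of my own, not the course's CompactIntegerCoding, so the bit counts differ from those of the exercise):

```python
def gamma_len(n: int) -> int:
    """Length of the Elias gamma code of n >= 1: 2*|bin(n)| - 1 bits."""
    return 2 * len(bin(n)[2:]) - 1

n, m = 2811, 2700
C_n = gamma_len(n)                   # direct description of n
C_m = gamma_len(m)                   # description of the intermediate object m
C_n_given_m = 1 + gamma_len(n - m)   # a sign bit plus the code of the shift 111

# Chain rule 1: C(n) <= C(m) + C(n | m) + O(1)
print(C_n, "<=", C_m + C_n_given_m)
```

The detour through m = 2700 is far from optimal here (the shift 111 is large), which is exactly why the chain rule is an inequality rather than an equality.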

 Using the coding program, compute C(m) + C(n | m) for n = 2811 and m = 2700 (do not count space delimiters). What do you get?

7 bits         11 bits         17 bits         18 bits         21 bits

The chain rule can also be written as follows:

Chain rule 2 for "unambiguous" program codes

C(s1, s2) < C(s1) + C(s2 | s1) + O(1).
It comes from the fact that a good way to compute s1 and s2 consists in computing one of them first, and then the other. That "good way" might not be the shortest way, though (hence the inequality).

### Chain rule and coincidences

Problem: Are there practical applications of (conditional) complexity?
Algorithmic Information Theory has many practical applications. Some of them will be explored in the following chapters. Let’s mention a situation of everyday life in which Conditional complexity plays a central role: Coincidences.

The Bulgarian Lottery story

On September 10th, 2009, the numbers 4, 15, 23, 24, 35, 42 were drawn by a machine live on the Bulgarian television. The event would have gone unnoticed, were it not that the exact same numbers had come up in the preceding round, a few days before.

As we will discover in a later chapter, such an event raises people’s interest because it is much simpler than expected.

We intuitively feel that the last draw is incredibly simple. Why?
We can compute the complexity of the last lottery round r0 by referring to the preceding one r-1.
Let’s apply the chain rule:

C(r0) < C(r-1) + C(r0 | r-1) + O(1)

Since all lottery draws are recorded, r-1 is entirely determined by its rank, 1, in the list of past draws.

In the lottery context, C(r-1) ≈ C(1) is thus about 1 bit (see section on integer complexity). If the two draws contain the same numbers, C(r0 | r-1) = 0.

C(r0) < C(r-1) + C(r0 | r-1) + O(1)
C(r0) <    1   +      0      + O(1)

r0 appears utterly simple!

This explains why the event is so impressive that it was mentioned in the international news.
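The order of magnitude is easy to check (assuming a 6-out-of-42 lottery, as in the Bulgarian draw):

```python
from math import comb, log2

# A typical draw: one combination among C(42, 6), i.e. about 22.4 bits.
typical = log2(comb(42, 6))
# The repeated draw: ~1 bit to point at the previous draw, 0 bits for the difference.
repeated = 1 + 0
print(round(typical, 1), "bits vs", repeated, "bit")
```

A description more than twenty bits shorter than expected is what makes the event newsworthy.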

 The same winning combination 13, 14, 26, 32, 33, 36 was produced by Israel’s national biweekly (= twice a week) lottery on September 21st and on October 16th, 2010. Can you quantify the simplicity of the second event at the time it occurred?

1 bit     3 bits     12 bits     30 bits

## Recap

### Take-home message

We hope you enjoyed this chapter on description complexity.
Let’s leave the last word to Angel@ and Alexandr@.

I didn’t know that information was a computer science notion. It’s quite natural, if we think about it. Formally, a description is a program.

So we do not consider the object itself. Only its description? Its shortest description. Complexity is the size of the shortest description.
I find it brilliant.

Yes, I guess that to make sense of an object, you have to describe it to yourself. That’s true for AI as well. The machine must find a way to describe the object to be able to process it.

You mean, a neural network that recognizes a cat is in fact building a program that generates the cat?? By recognizing a cat, it gets a much more concise representation of the pixel array. Then you just need to represent how the actual cat differs from the stored prototype.

I liked this idea of coding for the difference. As in the case of integers. When you represent a number by comparing it to the closest round number?

Yes. It was when complexity was shown to be "continuous". It can’t make abrupt jumps. Oh, now I can make the link with the lottery story.

What link? Ah, when we learned about conditional complexity? Yes. This was also a way of coding the difference.

The difference? You mean, by comparing the lottery draw with the previous one? Yes, using the chain rule. And so that’s how the second draw turned out to be abnormally simple.

I had a problem with the chain rule. Did you understand it? It seems quite natural to me. You compute s2 first as a proxy for s1.

I mean, the restriction to "uniquely decodable" programs. Oh, it was written that we would get a full explanation in Chapter 3!

Meanwhile, I read that the next chapter, Chapter 2, is about AIT and data. Yes, I’m curious about how algorithmic information can be applied to the real world.

 The complexity difference $$C(n+1) - C(n)$$ can become arbitrarily large when the integer $$n$$ grows.

True
False

 The complexity of $$x$$ is defined as the average length over all programs that generate $$x$$.

True
False

 If we knew a program $$p$$ such that $$C_U(x_0) = |p|$$ for a given object $$x_0$$, could we compute the value of $$C_V(x_0)$$ for another universal machine $$V$$ different from $$U$$?

Yes
No

 Suppose that only two programs $$p_1$$ and $$p_2$$ can generate string $$x$$ on the reference machine. $$p_1$$ requires $$n^2$$ milliseconds to run while $$p_2$$ requires $$n \log n$$ milliseconds. Then $$C(x) = |p_2|$$.

True
False

 For any finite object $$x$$, there exists a machine on which the complexity of $$x$$ is equal to 1.

True
False

 $$C(x|x) = 0$$.

True
False

 Do you have feedback or suggestions about the topics addressed in this chapter?