Welcome!
Welcome to this course on Algorithmic Information.
We hope you will enjoy it. It includes:
videos,
short texts,
slideshows,
small dialogs
and exercises.
Most exercises are meant to help you to understand and to think about the new notions presented in the course. They are ungraded.
There are a few graded exercises at the end of each of the five chapters. They are used to evaluate your performance in this course.
We did our best to make this course interesting. If you follow it to the end, it will bring you a deeper understanding of artificial intelligence, and hopefully much more.
Describing data
Description length
We introduce the notion of description length, which is the length of a program used to generate an object.
The easiest way of describing objects is to distinguish them within a set.
Description complexity of integers
Integers can easily be distinguished by their rank in ℕ. But this method fails to capture why round numbers are simple.
Complexity and structure
Can we determine the complexity of an object by its structure? We illustrate this possibility by considering passwords: a good password should be simple to remember and complex to guess.
Defining complexity
We define Kolmogorov complexity, which is the core of this course. It corresponds to the minimal description length of an object. This quantity depends on a reference machine, but the "invariance theorem" makes it objective.
Conditional complexity
Sometimes, having access to another object can provide an efficient description. This idea is involved in the definition of conditional complexity. A link between non-conditional complexity and conditional complexity is given by the "chain rule", which plays a prominent role, especially in AI.
What you will learn in this chapter:
You will discover that there is a hidden link between information and complexity.
You will discover how computer science defines information, based exclusively on coding and computation.
You will learn why three mathematicians discovered the same notion of complexity at about the same time.
You will learn how to estimate the complexity of passwords, of π, of lottery draws and of the cups in your grandmother’s cupboard.
Marvin Minsky, one of the founding fathers of Artificial Intelligence,
gave this great advice about Algorithmic Information Theory:
"Everybody should learn all about it
and spend the rest of their lives working on it."
Algorithmic information is used in various sciences:
see for instance: Bennett, C. H. (1988). Logical depth and physical complexity. In R. Herken (Ed.), The universal Turing machine - A half-century survey, 227-257. Oxford: Oxford University Press.
Conversely, it is hard to find a science that would be immune to its influence.
Algorithmic Information’s ambition is to describe what it means for an intelligent entity, be it artificial or natural, to make sense of the world.
Understanding Artificial Intelligence
through Algorithmic Information Theory
Let’s consider these two pictures, for instance. The one on the right shows something that is easily recognizable, a "T".
It makes sense to us, because of the structure we recognize. The picture on the left, however, doesn’t make much sense.
According to Algorithmic Information Theory (AIT), the "T" picture "makes more sense" to us because we get a simpler description if the pixels are grouped to form the letter "T" than if the pixels are scattered anywhere.
AIT’s notions such as complexity and simplicity are central to artificial intelligence.
One of the main purposes of Artificial intelligence is to make sense of data. What does it mean?
These lectures on Algorithmic Information Theory will help you answer this question.
Let’s start with a simple example.
Consider the following string:

'abababbababaabababbababaabababbababaabababbababa'
You may describe it in Python as...
• a string of 48 characters:
→ S = ''.join(['a', 'b', 'a', 'b', 'a', ...])
• an alternation of two strings:
→ S = ('ababab' + 'bababa') * 4
• a pattern of patterns:
→ S = ('ab' * 3 + 'ba' * 3) * 4
• repeating a mirrored pattern:
→ T = 'ab' * 3; S = (T + T[::-1]) * 4
• repeating 'ab' 28 times and deleting every seventh character:
→ T = list('ab' * 28); [T.pop(6 * n) for n in range(1, 9)]; S = ''.join(T)

Which would be considered the "most intelligent" description?
According to Algorithmic Information Theory,
the "most intelligent" description is the least complex one,
that is, the most concise one.
The most concise solution might not be the one we think it is at first glance. We must convert these representations into bits to decide.
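Before counting bits, one can at least check that the descriptions above all generate the same string. A quick sketch of this verification, using the variants listed above:

```python
# Check that every description generates the same 48-character string.
S0 = 'abababbababaabababbababaabababbababaabababbababa'

S1 = ('ababab' + 'bababa') * 4        # an alternation of two strings
S2 = ('ab' * 3 + 'ba' * 3) * 4        # a pattern of patterns
T = 'ab' * 3
S3 = (T + T[::-1]) * 4                # repeating a mirrored pattern
T = list('ab' * 28)
[T.pop(6 * n) for n in range(1, 9)]   # repeating 'ab', deleting characters
S4 = ''.join(T)

assert S1 == S2 == S3 == S4 == S0 and len(S0) == 48
```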
Ray Solomonoff, the first mathematician to invent the notion of complexity in the mid-1960s, knew that he was paving the way for artificial intelligence. He understood that intelligent processing has to do with simplicity, and that simplicity has to do with coding. But not just any coding.
The complexity of an object is assessed by finding
the shortest among the available binary descriptions
of that object. So we will start by talking about description length.
Description length
The information contained in an object is estimated by the length of its binary representation.
Intuitively, if an object contains a lot of information, then it cannot be described in a concise manner and should be considered complex.
This definition applies to any object or situation, as long as this object or situation can be coded to be represented in a computer.
In this chapter, we will talk a lot about coding, as it is the most basic way to measure information content.
Subsequent chapters will introduce alternative ways to assess complexity.
Complexity of objects within sets
Problem:
In a collection of objects, do the elements have the same complexity?
On many occasions, the complexity of objects is easy to estimate, just because these objects are already stored in the computer’s memory. Consider, for instance, the alphabet: a, b, c,... , z. It can be retrieved as a list. In Python as in most computer languages, letters can be addressed through their ASCII code:
>>> Alphabet = [chr(c) for c in range(97, 123)]
>>> print(Alphabet)
['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z']

Once the alphabet is made available as a list, each letter can be retrieved by its rank in the list. So its complexity is estimated by the length of the binary representation of its rank.
For instance, one can designate the letter k by mentioning its position: 11 within the alphabet. An estimate of its complexity is 4, as we can code 11 with 4 bits (1011).
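This estimate is easy to compute. A minimal sketch, using the 1-based rank convention of the example above:

```python
# Estimate the complexity of a letter by the bit length of its 1-based rank.
def rank_complexity(letter):
    rank = ord(letter) - ord('a') + 1   # 'a' -> 1, ..., 'z' -> 26
    return len(bin(rank)[2:])           # bits in the standard binary form

# 'k' has rank 11, written 1011 in binary: 4 bits.
assert bin(11)[2:] == '1011'
assert rank_complexity('k') == 4
```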
Question:
Isn’t it easier just to say 'k' instead of 1011?
Answer:
No. 'k' is just a convenient representation of an ASCII code, 107, which is 1101011 in binary form.
Here is what we are able to do so far:
When an object can be retrieved from a list,
its complexity within the list can be estimated
by the length of the binary representation of its rank.
Ordered list
About 57 000 people took part in the Paris marathon. You remember a runner and want to check whether you know her. All you know is that she arrived 37^{th} on the finish line. Once the ranking with the names of participants is made available, how many bits do the organizers need to retrieve her name?
The letter ‘k’ is a simple character in Python.
What about the character ‘♖’ (white chess rook)?
Its Unicode rank is 9814 (10011001010110 in binary form).
So we may estimate the complexity of
‘♖’ as being about 14 (the length of 10011001010110).
Question:
Can we assess the complexity of a character if we don’t know its exact Unicode value?
Answer:
Yes. Since there are a total of 1 114 112 different Unicode characters,
we can be sure that the length of its binary representation cannot exceed 21 bits
(you can check: int('100010000000000000000',2) is 1114112, and len('100010000000000000000') is 21).
More generally:
When an object belongs to a set of size N,
its complexity in that set cannot exceed ⌈log_{2}(N)⌉.
This is because ⌈log_{2}(N)⌉ bits are sufficient to represent any rank in a set of size N, ordered by whatever means.
Estimating the complexity of an object this way is useful whenever it belongs to a set that has no obvious order.
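This bound can be sketched directly in Python:

```python
import math

# At most ceil(log2(N)) bits are needed to write any rank in a set of size N.
def max_bits(N):
    return math.ceil(math.log2(N))

# The Unicode example from above: 1 114 112 characters fit in 21 bits.
assert max_bits(1114112) == 21
```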
Consider for instance the cups in your grandmother’s cupboard. They may be ordered from left to right and bottom to top (if stacked up), or by colour and size if they are different, and so on. We suppose that you wish to designate one of the cups using such an ordering method.
Unordered set
How many bits would you need to designate one specific cup in your grandmother’s cupboard, if there is a total of 62 cups in it?
For now, what we know about measuring the complexity of an object is this:
we must find a code to represent the object in the computer (if the object is stored in a list, the code can be the rank in the list),
we express the code in binary form,
we measure the length of this binary form.
But this is not sufficient. The notion of complexity is more demanding: one must find the shortest possible binary code. So, any code that spares bits gets closer to the complexity of the entity.
Description Length within a set
Marie Martin is a French citizen you do not know. Her social security number is 2 36 07 24 322 986. Representing citizens by their social security number does not offer the maximally compact representation, though. Can you provide a good estimate of the complexity of Marie Martin, knowing that she is a living French citizen?
Emoji complexity
Suppose the only emojis that are available on your phone app are just these:
👺 🤖 ❣ 🕳 ㊙ ⚪ 🛫
How complex is the most complex emoji among them?
The Ugly Duckling paradox
Consider these 12 objects.
Which one is the simplest object in this set?
Description complexity of integers
Problem:
Can numbers be simpler than the length of their binary representation?
Integers are central to an algorithmic information approach to artificial intelligence.
integer complexity

number   binary representation   length
1        1                       1
2        10                      2
3        11                      2
4        100                     3
5        101                     3
6        110                     3
...      ...                     ...
10       1010                    4
11       1011                    4
Integers are used to retrieve objects by their rank in lists. Even seemingly complex objects such as convoluted emojis are no more complex than their address in the set of emojis. And this address is nothing but an integer.
Integers are much more important than that. Integers are involved in the formal description of most objects AI is dealing with.
Integers are objects of interest in themselves as well, because of their role in mathematics (note that "real" numbers with limited precision are, in fact, pairs of integers).
So let’s find a way to measure the complexity of integers.
Mathematics (since Leibniz and, before him, ancient Chinese scientists) and computer science like to represent numbers with bits. A first estimate of complexity consists of counting the number of bits in the standard binary representation (see table).
One can see that with this method, the complexity of n, as measured by the length of its binary representation, is:
⌈log_{2}(1+n)⌉
where ⌈x⌉ (with upper brackets) designates the smallest integer greater than or equal to x.
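This formula can be checked against the table above. A quick sketch:

```python
import math

# The length of the standard binary form of n equals ceil(log2(1 + n)).
for n in range(1, 1000):
    assert len(bin(n)[2:]) == math.ceil(math.log2(1 + n))
```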
This method, however, totally ignores the existence of "round" numbers. This is unfortunate. A round number like one billion (1 000 000 000 in base 10) looks much simpler than the 30 bits of its binary representation (111011100110101100101000000000).
Complexity of Integers in a nutshell
The video explains how integers can be coded in a compact way.
Since the standard binary code of n always starts with 1 (except for n = 0), let’s code n as the standard binary code of n+2 stripped of its leading 1. In Python: bin(n+2)[3:]
To distinguish round numbers (in base 10), we reintroduce a bit at the front: 1 for "normal" integers, 0 for round numbers.
A round number n is coded as 0 followed by the compact code of n stripped of its trailing zeros, followed by the compact code of the number of zeros.
Space separators are used to make the code unambiguous.
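The scheme described above can be sketched in a few lines of Python. This is our own rendering; in particular, treating any positive multiple of 10 as "round" is an assumption made here for illustration:

```python
def compact(n):
    # Standard binary code of n+2, stripped of its leading 1.
    return bin(n + 2)[3:]

def code_integer(n):
    # Round numbers (assumed here: positive multiples of 10): 0, then the
    # number without its trailing zeros, then the number of zeros.
    # Normal numbers: 1, then the compact code.
    if n > 0 and n % 10 == 0:
        stripped = int(str(n).rstrip('0'))
        zeros = len(str(n)) - len(str(stripped))
        return '0 ' + compact(stripped) + ' ' + compact(zeros)
    return '1 ' + compact(n)

# One billion: far shorter than its 30-bit standard binary form.
assert code_integer(1000000000) == '0 1 011'
assert code_integer(5) == '1 11'
```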
Question:
Why do we ignore space separators when counting the length of round numbers’ compact codes?
Answer:
Separators are not symbols. You can’t have two separators in a row. They may result from physical signal processing, as for the Morse code, or from low-level word segmentation (e.g. lists of strings in Python: '0010 11010 111' ↔ ['0010','11010','111']).
(Note: In Chapter 3, we will consider the possibility of eliminating all space separators altogether.)
The above video proposed a compact coding method for integers:
compact code for integers
To code n, we consider the standard binary code of (n+2)
and we drop the initial 1 (which is redundant).
We then reintroduce a separate leading bit: 1 for "normal" numbers,
and 0 for "round" numbers.
Coding Integers (1)
Using the coding convention that was just presented in the video, what would be a good coding for 500?
Coding Integers (2)
Which number does the following code represent: 0 110 11?
Our coding system acknowledges the fact that round numbers are simple. What about their immediate neighbours? Intuitively, 5000000011 isn’t much more complex than 5000000000, as it can be expressed as 5000000000+11.
The coding method illustrated in the figures below captures this idea. We prefix the code by 00 for a positive shift from the round number, and by 01 for a negative shift. Then we indicate the two-part code of the round number, and finally the compact code of the shift.
Caveat: These codes use space delimiters, as in the Morse code.
Complexity of 1000008
Based on the suggested coding, provide a short binary representation of 1000008.
Complexity is “continuous”
Problem:
Does the complexity of integers vary erratically or smoothly?
Thanks to the previous coding method, round numbers and their neighbours receive shorter codes than other numbers of the same magnitude.
The two plots below show the complexity of each number n on the x-axis. They have been drawn using the above compact integer coding method, implemented in .
The purple curve indicates log_{2}(n) for reference. On a large scale (first image), we can see that the compact code length roughly evolves like log(n). Note that the complexity cannot exceed log_{2}(n) by more than one bit.
We can observe the round-number effect: complexity drops for multiples of 10, and even more so for multiples of 100 and of 1000.
A zoom around 50000 (second image) shows that C(n) is roughly "continuous".
To capture the fact that the complexity of integers is a "quasi-continuous" function, we may write that for any integers n and h:
|C(n+h) – C(n)|< f(|h|)+ O(1)
where f is an increasing function of |h| such that f(0)=0 and O(1) is a constant.
Continuity of complexity
Which function \(f\) among the following is the most appropriate to capture the "quasi-continuity" of complexity?
Suggestion: Improve , which computes compact codes for integers. In its current state, it takes advantage of the proximity of round numbers. Imagine additional ways of compacting integers (e.g. multiples or divisors, repetitive sequences, famous numbers based on Google hits, etc.) and integrate them into the code.
Complexity and structure
Problem:
Can we define the complexity of new objects, based on their structure?
What is the complexity of this figure? We can describe it
pixel by pixel, or observe that there is
a blue disk on top of a red (mmm... not really red) square.
The second description (disk on square) is more likely to convey the idea that it is a really simple image. How can we be sure that this representation is simpler than the other one?
Answer: by coding these representations in binary form and by comparing their lengths in bits.
Ok, but there can be a long way from the object to its binary representation.
Objects or situations manipulated by computers are usually coded into a formal representation,
such as:
disc(X), blue(X), size(4), square(Y), red(Y), size(4), on(X,Y)
This representation can then be converted (through whatever means) into binary form. The length of this binary form gives us an idea of the complexity of the object.
To sum up, one way of estimating algorithmic information consists of
finding a formal representation for this object,
coding this representation in binary form,
measuring its description length as the length of this binary representation.
The intermediate, formal representation represents the structure of the object.
Finding structures for objects is a big challenge of AI.
The most intelligent structure, according to AIT, is the one that turns out to be the most concise in number of bits.
For now, we are able to estimate the complexity of integers and of objects that are stored in memory (through the complexity of their address). What about new objects? Consider, for instance, the case of passwords.
When looking for a good password, we try to outsmart machines.
This situation offers a nice illustration of the importance of complexity.
Question:
Complex passwords are impossible to remember. Can we define what a good password should be?
Answer:
A good password should be easy to remember, but hard to guess. In AIT’s terms, it should be complex to anyone but you.
Password complexity
How would you rank the complexity of the following passwords
(from 1=simplest to 6=most complex)?
Passwords are character strings. Some strings are obviously simpler than others. How can we quantify this?
A first step consists of finding a compact formal representation of the string, such as repetition(a, 8), then translating this formal representation into bits using some appropriate code, and finally counting the bits.
String Structures in a nutshell
Alphabetic strings can be represented using operators: repetition, increment, mapping.
These operators can be combined. For instance:
mapping(increment(1, a, 4), repetition(X, 2))
first generates abcd, and then maps the repetition operation on it to produce aabbccdd.
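These operators can be sketched in Python. This is our own rendering for illustration; in particular, representing the mapped operation as a function of one letter is an assumption of this sketch:

```python
def repetition(x, k):
    # Repeat the string x, k times.
    return x * k

def increment(step, start, length):
    # `length` letters starting at `start`, each increased by `step`.
    return ''.join(chr(ord(start) + step * i) for i in range(length))

def mapping(s, op):
    # Apply op to each letter of s and concatenate the results.
    return ''.join(op(c) for c in s)

assert increment(1, 'a', 4) == 'abcd'
assert mapping(increment(1, 'a', 4), lambda X: repetition(X, 2)) == 'aabbccdd'
```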
Simple Structure (1)
Using the coding scheme presented in the video, determine which string is represented by:
mapping(increment(1,a,4), increment(1,X,2)).
Simple Structure (2)
Provide an expression that represents abcdefgh.
letter   position   code
a        1          0
b        2          1
c        3          00
d        4          01
e        5          10
f        6          11
g        7          000
h        8          001
Suppose we want to measure the complexity of a password such as abcdefgh.
We can code for the letters one by one. The codes that appear in the video assume standard binary code. We can get a slightly more compact code using our compact integer coding scheme (see table).
If we add up all the bits used in codes, we get a total of 16 bits for abcdefgh (not counting space delimiters).
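The 16-bit total can be verified by coding each letter with the compact code of its position minus one, as the table suggests:

```python
def compact(n):
    # Compact integer code: binary of n+2 without the leading 1.
    return bin(n + 2)[3:]

codes = [compact(ord(c) - ord('a')) for c in 'abcdefgh']
assert codes == ['0', '1', '00', '01', '10', '11', '000', '001']
assert sum(len(c) for c in codes) == 16
```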
But we can do better than that, if we consider that abcdefgh can be represented as increment(1, a, 8). To do so, suppose that we limit ourselves to the following list of available operators, with the corresponding codes.
operator:   repetition   increment   mapping
code:       0            1           00
Suppose now that we consider three kinds of entities, coded accordingly:
entity:   operator   integer   letter
code:     0          1         00
Using this coding scheme, an object like letter e will be coded as 00 10, where the first two bits designate letters and the last two bits designate the rank of e among letters.
Using the code (1)
What would be the correct code to designate number 6 in our small language?
Using the code (2)
What would be the correct code to designate operator mapping in our small language?
Using the code (3)
What would be the correct code to designate letter b in our small language?
Since we are able to represent abcdefgh as increment(1, a, 8), we can now translate it into a binary representation. We still use our compact integer coding system. Note that when the type of the object is known, we can avoid using a complete code and thus get an overall shorter representation.
element     code    comment
increment   0 1     0 = an operator is coming, 1 = increment
1           1       an integer is expected
a           00 0    complete code is expected (here: letter + a)
8           010     an integer is expected
We end up with the following code: 0 1 1 00 0 010 to represent abcdefgh. It is 9 bits long (ignoring space delimiters), which is more compact than the 16 bits of the letter-by-letter code (note: the code is decodable only if we keep the spaces between code words). Our coding scheme is sufficient to provide an estimate of the description complexity of alphabetic strings.
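The assembly of this 9-bit code can be sketched as follows, with the entity and operator codes taken from the tables above:

```python
def compact(n):
    # Compact integer code: binary of n+2 without the leading 1.
    return bin(n + 2)[3:]

ENTITY = {'operator': '0', 'integer': '1', 'letter': '00'}
OPERATOR = {'repetition': '0', 'increment': '1', 'mapping': '00'}

# increment(1, a, 8): the letter argument carries its full type prefix;
# the integer arguments do not (an integer is expected at those positions).
code = ' '.join([
    ENTITY['operator'], OPERATOR['increment'],       # increment
    compact(1),                                      # step: 1
    ENTITY['letter'], compact(ord('a') - ord('a')),  # start: letter a
    compact(8),                                      # length: 8
])

assert code == '0 1 1 00 0 010'
assert len(code.replace(' ', '')) == 9
```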
Description length of abcdabcd
Using the same coding method, provide a binary representation of abcdabcd.
Note that the resulting description length of abcdabcd is larger than the description length of abcdefgh. This might explain why abcdabcd is perceived as more complex than abcdefgh.
Forum: Good password
What do you think are the qualities of a good password (w.r.t. complexity)?
I feel that I can compute the informational content of any object, now.
Compute? Or maybe just assess?
Yes. Take this playing card. I just need a description of it, and then I measure the length of that description.
I don’t think you have to describe the structure of the card, element by element or pixel by pixel!
No! Since it is a playing card of my deck, I just have to distinguish it from the other cards.
You mean, by saying that it is the ten of spades?
Yes. Or since there are 32 cards in the deck, I’m sure that the complexity can’t exceed log_{2}(32) = 5 bits.
And did you get the point about the continuity of complexity?
Sure. It was about the complexity of integers.
So complex integers must be far away from simple ones?
Yes, it’s obvious to me. If a number is simple, a round number for instance, you take it as a proxy to describe its neighbors.
And so the neighbors can’t be complex?
No, because you just need to code the distance from the proxy number, and if this distance is small, it is simple as well.
I see. So we are able to measure the information content of any object, including numbers.
Yes. And we can compare all objects in the world, based on how much they "weigh" in bits.
I wonder what is left to be learned about algorithmic information.
History of the concept of complexity
Problem:
The notion of Complexity was invented three times simultaneously and independently. How?
AIT’s fundamental idea is that the informational content of an object can be measured by its shortest binary representation.
This is a profound idea, that did not occur to thinkers until the mid-twentieth century.
And then, quite remarkably, three mathematicians got the idea at the same time independently!
Informally, the description complexity of an object s can be estimated:
by finding a short binary code p(s) that represents s
and then by measuring the length of that binary representation of s.
To be a faithful image of s,
p(s) must be decodable into s.
For p(s) to be decodable into s,
we should be able to feed p(s) into a decoding machine that will then reconstruct s.
We can regard p(s) as a program.
More generally, we are interested in any program p that, once decoded, outputs s on the decoding machine. The description complexity of s is the size of the shortest of these programs.
To make things more formal, we need to consider an ideal decoding machine that is able to perform any computation. Such a machine came into existence for the first time in 1936...
in the mind of the great mathematician Alan Turing.
The machine that Turing imagined can be seen as an ideal computer with no memory limitations. Those idealized computers are called universal Turing machines.
The definition of algorithmic information is based on the idea of computation. Calculus, algorithms and proofs are forms of computations that have always been central to mathematical activity. However, a proper definition of computation had to wait until 1936 when Alan Turing defined what is now known as Turing machines.
Examples of Turing machines running on two simple problems
The description complexity of s, also called Kolmogorov complexity, is the size of the shortest program that outputs s when run on a given universal Turing machine U.
Kolmogorov Complexity
C_{U}(s) = min_{p} { |p| : U(p) = s }
Expression |p| means the length of program p.
Intuitively, C_{U}(s) measures the length of s when all thinkable redundancy has been removed.
Question:
Can we compute C_{U}(s)?
Answer:
No! One can never be sure to have found the shortest program that generates s. However, one can compute approximations of C_{U}(s) in limited time, and this is what really matters for AI. The issue of the non-computability of C will be addressed in Chapter 3.
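Although C_{U}(s) itself is uncomputable, any off-the-shelf compressor gives a computable upper bound on description length. A rough sketch using Python's zlib (our own illustration; a general-purpose compressor is far weaker than a universal Turing machine, so this only conveys the flavour of an upper bound):

```python
import random
import zlib

def upper_bound_bits(s):
    # Bits in a compressed description of s: an upper bound (up to a
    # machine-dependent constant) on its description complexity.
    return 8 * len(zlib.compress(s.encode()))

regular = 'ab' * 1000                  # highly redundant string
random.seed(0)                         # deterministic "complex-looking" string
irregular = ''.join(random.choice('01') for _ in range(2000))

# The redundant string admits a much shorter description.
assert upper_bound_bits(regular) < upper_bound_bits(irregular)
```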
Complexity of a program
Suppose you are about to sort \(n\) data items. Your computer is using a library of 1000 basic programs. sort is one of them. How can we estimate the description complexity of sort as a program?
The invariance theorem
Problem:
Complexity depends on who is measuring it!
At first sight, description complexity seems to be totally subjective. What is simple for you might be complex for me (think of your phone number). More formally, let’s consider all binary sequences of size N. There are 2^{N} of them. Some look fairly simple:

0000000000000000000000000000000000000000000000000000
1111111111111111111111111111111111111111111111111111
0101010101010101010101010101010101010101010101010101

Some look more complex:
0010100101110110101101000101101011011010001011001011
Perspective
We may sort the \(2^N\) different sequences of length \(N\) in any order we wish and store them in that order (e.g. into a Python list).
If \(N = 52\), how simple can the following sequence become after such an operation?
0010100101110110101101000101101011011010001011001011
Should we conclude that description complexity is a totally subjective notion? Could it be that some remote mathematical culture would legitimately regard π as complex while other cultures would find it simple?
Is Pi Complex
On March 14^{th} (note the date, 3/14 in US notation), 2019, Emma Haruka Iwao broke the record for the number of decimal digits computed for \(\pi\). She determined the first 31 415 926 535 897 digits (over 31 trillion! Note the chosen number) using 96 CPUs over 121 days.
In what way can we say that \(\pi\) is less complex now than it was in the pre-computer era?
The Invariance theorem puts this idea of "perspective" into perspective.
The invariance theorem states that discrepancies in the evaluation of Kolmogorov complexity are bounded.
Invariance theorem
∀s: |C_{U}(s) – C_{V}(s)| < c_{U,V}
This means that by changing machine U for another universal Turing machine V, the complexity of an object s may be affected at most by a constant c_{U,V}, which depends on both machines, but not on s.
Invariance theorem
Does the invariance theorem hold if \(U\) and \(V\) designate the same universal Turing machine equipped with two different programming languages \(L_1\) and \(L_2\) (such as Java and Python)?
The Invariance theorem is good news. It introduces some objectivity in complexity computations, irrespective of the machine or the programming language used.
The complexity of π, for instance, will remain finite on any universal Turing machine you consider.
π can be computed using a simple series.
π=4*(1–1/3+1/5–1/7+1/9–1/11 . . .)
In Python, you may approximate this series with the following function, where N controls the number of iterations:
def pi(N):
    PiDividedByFour = 0
    for i in range(N):
        PiDividedByFour += (-1)**i / float(2*i+1)
    return 4 * PiDividedByFour
Computing Pi
Execute the program for N = 60000 and N = 60001.
$ python Pi.py 60000
. . .
$ python Pi.py 60001
. . .

What is the precision we get for \(\pi\)?
This code provides the N^{th} approximation of π. You may embed this code into an infinite loop, or, if you are very patient, you may prefer the recursive version, which you use by calling pi(0):
You need to wait for the end of eternity to get the result (provided your computer has infinite memory)! (By the way, our current Python interpreter limits us to 800 recursive calls; you may test yours by adding the exit line: if N >=800: return 0.) Does this mean that only Godly Beings may use these algorithmic definitions of π? In a way yes, but this is not the point. The point is that we do have a finite definition (i.e. without ". . .") of π. This definition provides an upper bound of its complexity C(π).
Choosing the machine
In a previous exercise, we changed the code in order to turn a complex sequence into a simple one. We can do this for any object, a book for instance. There is a machine (or a code) V in which the content of a book such as the famous An introduction to Kolmogorov complexity and its applications (IKC)
has a complexity of only one bit. How can we compress a whole book down to a single bit? This is possible if the book is stored in machine V’s memory. V performs the same computations as a standard universal Turing machine U, but interprets the first bit of any program p in a specific way.
# procedure defining V
if p[0] == 0:
    print(IKC_content)
else:
    return U(p[1:])
The machine V defined by the above procedure is a universal Turing machine, exactly as U. Its particularity is to pay attention to the first bit of any program, just in case its value happens to be 0, in which case it prints IKC. For this strange machine, C_{V}(IKC) = 1, since 1 bit is sufficient to retrieve the whole book. On another machine U that does not hold IKC in its memory, however, C_{U}(IKC) is the length of the shortest available identification of the book. If the book can be retrieved from an external resource such as the publisher’s Website, complexity might remain limited (e.g. equal to the complexity of the ISBN number).
In the worst case in which no external resource is available, the description of the book would contain some compressed version of its content.
The consequence is that the c_{U,V} constant may be of the same order of magnitude as the complexities we are interested in.
Note that V holds IKC’s content in its memory, whereas a representation of its content must be present in the program fed to U.
More generally, a machine can use the address of an object in its memory as a code for that object.
TinyUrl
Some Web services such as TinyUrl offer to compress long URLs.
You may try to compress two URLs, a short one and a long one.
What kind of compression method do you think the service is using?
(several answers are admissible, choose one.)
Isn’t it a problem that complexity would completely depend on the machine that measures it?
It doesn’t! You have the invariance theorem.
Yes, I know. The divergence between two machines is bounded.
Indeed.
But a whole book may be compressed down to one bit only!
Yes, I agree that for finite objects, the divergence might be as large as the complexity itself.
Don’t you find it disturbing?
If the book were on my desk, I’d just need to point to it.
But since the book is actually on my desk, you may point to my desk to make it simple.
This only works because I know you well. You’re simple to me (no offense 😀).
Sure. It would be useless to mention that the book is on the desk of someone you don’t know.
This makes me think that machine-dependence is a good thing after all!
What do you mean?
I mean that objects that are relevant to you, as that book, should have a short description for you.
And not for you?
By the way, I like the book too. But yes, it’s normal to diverge on what we consider relevant.
And so it’s normal to get different measures of complexity?
Yes. I even begin to consider that it may be fundamental!
Conditional complexity
Problem:
Complexity does not only depend on the machine, but it also varies with the context!
Understanding Artificial Intelligence
through Algorithmic Information Theory
Can Complexity be a well-defined notion,
knowing that it depends on context? It seems that its value may change at any time!
For instance, the complexity of 638 is about log_{2}(638) ≈ 10 bits,
but in a context in which 637 has just been mentioned,
we may adopt a relative code instead of an absolute one
638 = 637 + 1
and find that 638 can be coded in much less than 10 bits.
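This comparison can be sketched numerically (hypothetical code: plain binary lengths are used here, not the course's coding scheme):

```python
from math import ceil, log2

n, context = 638, 637

# Absolute code: about log2(n) bits for a plain binary representation.
absolute_bits = ceil(log2(n + 1))                 # 10 bits

# Relative code: code the shift n - context, plus one sign bit.
delta = n - context                               # 638 = 637 + 1
relative_bits = ceil(log2(abs(delta) + 1)) + 1    # 2 bits

print(absolute_bits, relative_bits)
```

With 637 in the context, the relative code is far shorter than the absolute one.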
So the complexity of 638 varies through time!
Conditional complexity does indeed depend on context. However,
far from being a problem, it is essential for AI applications.
It allows us to take background knowledge into account.
Let’s start with an example.
What is the description complexity of Bill Clinton?
The answer will differ, depending on what one knows about him.
Bill Clinton is much simpler to describe if one knows that he is a former US president.
The purpose of conditional complexity is to express some piece of available knowledge explicitly and use it.
To unambiguously describe Bill Clinton
knowing that he was a US president,
one just needs to say that he appears in the k^{th} position in the list of former US presidents
in reverse order.
Using this piece of knowledge, Bill Clinton’s complexity is much less than if he were any US citizen.
We can say that the complexity of Bill Clinton
knowing that he is a former president is the complexity of k (which is about log_{2}(k) bits), plus one bit to tell the scan direction of the list (latest first).
We write:
C(’Bill Clinton’ | ’US president’)=log_{2}(k)+1
The vertical bar means ‘knowing that’.
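As a toy numeric illustration (the list below is hypothetical and deliberately short), the length log_{2}(k) + 1 can be computed directly:

```python
from math import ceil, log2

# Hypothetical short list of former US presidents, latest first
# (illustrative only; a real list would be longer).
former_presidents_latest_first = [
    "Obama", "G. W. Bush", "Clinton", "G. H. W. Bush", "Reagan",
]

k = former_presidents_latest_first.index("Clinton") + 1   # rank k = 3
conditional_bits = ceil(log2(k)) + 1                      # code for k, + 1 direction bit
print(k, conditional_bits)
```

A handful of bits, instead of the many needed to single out one person among all US citizens.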
This notation reminds us of conditional probability.
The meaning is roughly the same. However...
Conditional complexity is much more natural than conditional probability
(the conditional probability p(s_{1}| s_{2}) of s_{1} "knowing that" s_{2} requires computing p(s_{1} & s_{2}) and dividing it by p(s_{2}); this definition of "knowing that" is not as intuitive as what comes next.)
C(s_{1}| s_{2}) just means that s_{2} is available for free when it comes to describing s_{1}.
More precisely, the conditional Kolmogorov complexity of s_{1} relative to s_{2} is the size of the shortest program p that can produce s_{1} on a universal Turing machine
that gets s_{2} and p as input.
Caution: in the definition of conditional complexity,
the notation U(s_{2}, p) supposes that the machine U is given a way
(such as a delimitation) to distinguish s_{2} from p.
Conditional complexity offers a way of taking context into account. In a context in which the number 637 has just been mentioned, 637 itself and numbers close to it automatically become simple.
The following images show C(n) and C(n |637) using our coding scheme. We can verify that C(637|637)=0, and that numbers around 637 have low complexity (hence the trough around 637 in the conditional complexity curve).
C(n)
C(n |637)
Now consider the number 2811. Using our coding scheme for integers, one can see that the description of 2811 requires no more than 11 bits. The program can be used to verify that 2811 is coded using 2800 as a reference. Let’s first compute absolute complexities.
$ python
>>> import CompactIntegerCoding as CC
>>> CC.Coding(2811)
'00 1110 00 101'  # == '00' + <compact code of 2800> + <compact code of 11>
>>> CC.Coding(2800)
'0 1110 00'  # == <compact code of 2800> (with heading '0')
>>> CC.Coding(11)
'1 101'  # == <compact code of 11> (with heading '1')
Conditional complexity can be used to verify that the description complexity of 2811 can be even smaller if 2800 is already in the context. In the coding program, conditional complexity is implemented by using a 3-bit long prefix: '100' for a positive shift, '101' for a negative shift.
$ python
>>> import CompactIntegerCoding as CC
>>> CC.ConditionalCoding(2811, 2800)
'100 1 101'  # note the heading '100' that marks conditional computation; '1 101' = code for 11
>>> CC.Coding(2800)  # to put 2800 into memory (in coded form)
'0 1110 00'
>>> CC.ConditionalDecoding('100 1 101')  # uses memory content to retrieve the actual code for 2811
2811
For this program, 2811 is represented by 100 1 101 when 2800 is in short-term memory, instead of the longer code 00 1110 00 101 that contains the code for 2800 in explicit form (1110 00, which means ‘28 followed by 2 (= 00) zeros').
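The behaviour above can be mimicked by a simplified re-implementation (hypothetical code: it uses plain binary instead of the course's compact integer code, so the exact codewords differ, but the prefix scheme is the same):

```python
def plain_code(n):
    """Plain binary code for an integer (simplification, NOT CompactIntegerCoding)."""
    return format(n, "b")

def conditional_plain_code(n, context):
    """Delta coding with a 3-bit prefix, as in the course's scheme:
    '100' for a positive shift, '101' for a negative shift,
    followed by the code of |n - context|."""
    delta = n - context
    prefix = "100" if delta >= 0 else "101"
    return prefix + plain_code(abs(delta))

print(plain_code(2811))                    # 12 bits in plain binary
print(conditional_plain_code(2811, 2800))  # 3 + 4 = 7 bits: '100' + '1011'
```

Even in this crude version, coding relative to 2800 beats the absolute code.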
Conditional Complexity
Compute the conditional description length \(C(2789 | 2800)\) (not counting space delimiters) given by the coding program. What do you get?
Conditional vs Non Conditional
How does conditional complexity C(x|y) compare to non-conditional complexity C(x)?
The Chain rule
Problem:
Can we use Conditional complexity to compute complexity step-by-step?
Conditional complexity can be used to capitalize on what an agent (represented as a machine) may already know. It is also a great tool to compute estimates of description complexity, step by step, using the chain rule.
If we already know an object (think for instance of a number) s_{2}, conditional complexity allows us to compute the complexity C(s_{1}|s_{2}) of another object s_{1} knowing s_{2}. We would like to say that the complexity C(s_{1}) of s_{1} is simply the complexity C(s_{2}) of s_{2} plus C(s_{1}|s_{2}). This is almost what we have:
Chain rule 1 for "unambiguous" program codes
C(s_{1})< C(s_{2})+ C(s_{1}| s_{2})+ O(1).
Here O(1) designates some constant.
This is almost what we wanted! There is an inequality instead of an equality. It comes from the fact that the detour through s_{2} might not be optimal.
Caveat: The chain rule is true only if the machine can delimit programs (i.e. it can tell where the program that generates s_{2} ends and the program that generates s_{1} from s_{2} begins). This property is guaranteed if program codes are "unambiguous" (or uniquely decodable). A formal definition of this property will be given in Chapter 3.
Proof of the chain rule
Proof of the Chain rule
C(s_{1})< C(s_{2})+ C(s_{1}| s_{2})+ O(1).
Suppose we can find the shortest program p_{2} that generates s_{2} on our machine. Suppose also that we can find the shortest program p_{21} that generates s_{1} from s_{2}. The length of p_{2} is C(s_{2}), and the length of p_{21} is C(s_{1}| s_{2}). The concatenation of p_{2} and p_{21} is a program that generates s_{1}, though not necessarily the shortest one (hence the inequality in the chain rule).
This only holds for "unambiguous" program codes (so that the machine knows where p_{2} ends and where p_{21} begins).
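The inequality can be illustrated numerically with toy binary codes (hypothetical code: plain binary lengths stand in for true Kolmogorov complexity). Describing 2811 directly takes fewer bits than the detour through 2800, which is exactly why the chain rule is an inequality rather than an equality:

```python
def bits(n):
    """Length of the plain binary code of n (toy stand-in for complexity)."""
    return len(format(n, "b"))

s1, s2 = 2811, 2800
direct = bits(s1)                    # "C(s1)": 12 bits
detour = bits(s2) + bits(s1 - s2)    # "C(s2) + C(s1 | s2)": 12 + 4 = 16 bits

assert direct <= detour              # consistent with the chain rule (up to O(1))
print(direct, detour)
```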
Chain Rule
Using the coding program,
compute C(m) + C(n | m) for n = 2811 and m = 2700 (do not count space delimiters).
What do you get?
The chain rule can also be written as follows:
Chain rule 2 for "unambiguous" program codes
C(s_{1}, s_{2})< C(s_{1})+ C(s_{2}| s_{1})+ O(1).
It comes from the fact that a good way to compute s_{1} and s_{2} consists in computing one of them first, and then computing the other one from it. That "good way" might not be the shortest way, though (hence the inequality).
Chain rule and coincidences
Problem:
Are there practical applications of (conditional) complexity?
Algorithmic Information Theory has many practical applications. Some of them will be explored in the following chapters. Let’s mention a situation of everyday life in which Conditional complexity plays a central role: Coincidences.
Understanding Artificial Intelligence
through Algorithmic Information Theory
On September 10^{th}, 2009, the numbers 4, 15, 23, 24, 35, 42 were drawn by a machine live on the Bulgarian television. The event would have gone unnoticed, were it not that the exact same numbers had come up in the preceding round, a few days before.
As we will discover in a later chapter, such an event raises people’s interest because it is much simpler than expected. We intuitively feel that the last draw is incredibly simple. Why?
We can compute the complexity of the last lottery round r_{0} by referring to the preceding one r_{-1}.
Let’s apply the chain rule:
C(r_{0})< C(r_{-1})+ C(r_{0}| r_{-1})+ O(1).
Since all lottery draws are recorded, r_{-1} is entirely determined by its rank, 1, in the list of past draws.
In the lottery context, C(r_{-1}) ≈ C(1) is thus about 1 bit (see section on integer complexity).
If the two draws contain the same numbers, C(r_{0}| r_{-1})=0.
Substituting these values into the chain rule:
C(r_{0}) < 1 + 0 + O(1).
r_{0} appears utterly simple!
This explains why the event is so impressive that it was mentioned in the international news.
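For comparison, here is a rough estimate (hypothetical computation, assuming a 6-out-of-49 lottery for illustration) of the description length one would expect for a typical draw:

```python
from math import comb, log2

# A typical 6-out-of-49 draw is one combination among comb(49, 6) equally
# likely possibilities, so its expected description length is about:
expected_bits = log2(comb(49, 6))   # roughly 24 bits

# Via the chain rule, the repeated draw needs only about
# C(r_-1) + C(r_0 | r_-1) ~ 1 + 0 bits (plus a constant).
print(round(expected_bits, 1))
```

The gap between those ~24 expected bits and the ~1 bit of the repeated draw is what makes the coincidence so striking.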
Lottery
The same winning combination 13, 14, 26, 32, 33, 36 was produced by Israel’s national biweekly (= twice a week) lottery on September 21^{st} and on October 16^{th}, 2010. Can you quantify the simplicity of the second event by the time it occurred?
Interview
Recap
Take-home message
We hope you enjoyed this chapter on description complexity.
Let’s leave the last word to Angel@ and Alexandr@.
I didn’t know that information was a computer science notion.
It’s quite natural, if we think about it. Formally, a description is a program.
So we do not consider the object itself. Only its description?
Its shortest description. Complexity is the size of the shortest description.
I find it brilliant.
Yes, I guess that to make sense of an object, you have to describe it to yourself.
That’s true for AI as well. The machine must find a way to describe the object to be able to process it.
You mean, a neural network that recognizes a cat is in fact building a program that generates the cat??
By recognizing a cat, it gets a much more concise representation of the pixel array. Then you just need to represent how the actual cat differs from the stored prototype.
I liked this idea of coding for the difference. As in the case of integers.
When you represent a number by comparing it to the closest round number?
Yes. It was when complexity was shown to be "continuous". It can’t make abrupt jumps.
Oh, now I can make the link with the lottery story.
What link? Ah, when we learned about conditional complexity?
Yes. This was also a way of coding the difference.
The difference? You mean, by comparing the lottery draw with the previous one?
Yes, using the chain rule.
And so that’s how the second draw turned out to be abnormally simple.
I had a problem with the chain rule. Did you understand it?
It seems quite natural to me. You compute s_{2} first as a proxy for s_{1}.
I mean, the restriction to "uniquely decodable" programs.
Oh, it was written that we would get a full explanation in Chapter 3!
Meanwhile, I read that the next chapter, Chapter 2, is about AIT and data.
Yes, I’m curious about how algorithmic information can be applied to the real world.
Test (1)
The complexity difference \(C(n+1) - C(n)\) can become arbitrarily large when the integer \(n\) grows.
Test (2)
The complexity of \(x\) is defined as the average length over all programs that generate \(x\).
Test (3)
If we knew a program \(p\) such that \(C_U(x_0) = |p|\) for a given object \(x_0\), could we compute the value of \(C_V(x_0)\) for another universal machine \(V\) different from \(U\)?
Test (4)
Suppose that only two programs \(p_1\) and \(p_2\) can generate string \(x\) on the reference machine. \(p_1\) requires \(n^2\) milliseconds to run, while \(p_2\) requires \(n \times \log(n)\) milliseconds. Then \(C(x) = |p_2|\).
Test (5)
For any finite object \(x\), there exists a machine on which the complexity of \(x\) is equal to 1.
Test (6)
\(C(x|x) = 0\).
Forum: Chapter1
Do you have feedback or suggestions about the topics addressed in this chapter?