Telecom Paris
Dep. Informatique & Réseaux


July 2021
IA325/IA703: Algorithmic Information and Artificial Intelligence

Lecturers:     Jean-Louis Dessalles, Pierre-Alexandre Murena and Etienne Houzé


Subjective Information and Simplicity


What you will learn in this chapter:
  • that unexpected means abnormally simple,
  • why coincidences are so unexpected,
  • how to define subjective information and interest,
  • when an event or an action is relevant,
  • what is common to aesthetics, emotion, humour and coding.

AI systems will not be regarded as intelligent as long as they remain unable to understand human intelligence and to adapt to it. Algorithmic Information offers us a way to model key components of human intellect, such as surprise, relevance, aesthetics or emotional intensity. Why does AIT have something to say on these matters? Could it be because our mind does process algorithmic information?








Intuitive approach to subjective probability

Standard vs. algorithmic probability

Question: Can we use probability to characterize interesting events?

Answer:   No, it doesn’t work. We must find something else.

[Video] Understanding Artificial Intelligence through Algorithmic Information Theory: Introduction to Subjective Probability (aiai.telecom-paris.fr)

251346.png You’re listening to the radio. It announces the last National Lottery draw: 1-2-3-4-5-6. You didn’t play.

What are you likely to do?


Think about it. You’ll probably see it as big news*.
Will you refrain from tweeting it? Or looking for someone to tell?


* On the 1st of December, 2020, South Africa’s lottery saw the numbers 5, 6, 7, 8, 9 and 10.
It made news around the world.

123456.png You know you shouldn’t.
After all, you know that all combinations have the same probability of occurrence.

    
8-10-12-23-31-45.png Since all such events have the same probability, why is 1-2-3-4-5-6 big news while 8-10-12-23-31-45 isn’t?

    
Based on probability, the latter should be as narratable as the former!

But it’s not!!
123456.png This lottery example offers striking proof that probability is of little help in predicting whether an event will be interesting or boring.



Sure! But still, it’s hard to convince ourselves that
1-2-3-4-5-6 had the same probability of occurring
as "normal" outcomes.

GrilleLoto.png

In 2006, we offered pre-filled lottery tickets to about fifty customers in a bar.

They could choose among a dozen predefined combinations.

No-one chose 1-2-3-4-5-6.

One customer laughed:
"if you want to lose, choose that one!"

GrilleLoto.png

So it seems that the interest of events
has to do with their probability after all!

One just needs to change the concept of probability 🤨.
As one may suspect by looking at the lottery example, subjective probability has to do with simplicity.

Whiteboard_with_markers.jpg

Maybe not with absolute simplicity, though.
We are not amazed by the white surface of a whiteboard!

Simplicity is only perceived as improbable if it is unexpected.

Whiteboard_written.png

Unexpectedness is the difference between expected complexity and observed complexity. This definition of subjective probability, based on Algorithmic Information Theory, is found to match people’s intuitive probability judgments.

    



Lottery
Which lottery draw will be perceived as the least likely?
2-4-6-8-10-12
6-12-18-24-30-36
7-14-21-28-35-42

    2-4-6-8-10-12, as the list of the first six even numbers
    6-12-18-24-30-36, as the list of the first six multiples of 6
    7-14-21-28-35-42, as an evenly spaced list within the lottery range (1 to 49)

    

Problem: Algorithmic probability and subjective probability provide inconsistent values!
There is indeed a radical divergence between our two definitions of probability: algorithmic probability and subjective probability. The former declares that simpler events are more probable, while the latter states the exact opposite! The next video provides the beginning of an answer, in a somewhat unexpected form.








Cognitive complexity

Question: What is the difference between Kolmogorov complexity and cognitive complexity?

Answer:   Cognitive complexity is resource-bounded.

Several phenomena point to the fact that the human brain is sensitive to Kolmogorov complexity. Nick Chater suggested that simplicity should be used as a fundamental principle in Cognitive Science. Here is an example.

Hidden.png

The preferred hidden pattern is not the complex one, but the closing curve that appears simplest (in the usual Kolmogorov sense).
Note that the straight line, which would be the simplest closing curve in absolute terms, is not the simplest one in this context.
Cognitive complexity is a resource-bounded version of description complexity. It is defined as the shortest description (in bits) of the situation \(s\) that a human individual could reach at time \(t\). In a semi-formal way:

Cognitive complexity

\(C_d(s) = \min_p \{\, \mathrm{length}(p) \mid H_t(p) = s \,\}\)

Here, \(H\) represents the cognitive model that simulates the individual’s behaviour. \(H_t\) represents the same model when allowed to run for \(t\) units of time. \(p\) is any characterization that leads to the unambiguous determination of \(s\) and that is available to \(H_t\). Note that \(C_d(s)\) may vary through time.
From now on, the name description complexity will be used to refer to its resource-bounded version: cognitive complexity.
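To make the role of the time bound concrete, here is a toy Python sketch. The miniature description language, the step budget and the cost measure are all our own assumptions for illustration; they are not the course’s definitions.

from itertools import product

def descriptions(alphabet="ab", max_block=6, max_repeat=8):
    # Candidate descriptions: a (block, repeat) pair, read as
    # "write `block`, repeated `repeat` times".
    for n in range(1, max_block + 1):
        for block in map("".join, product(alphabet, repeat=n)):
            for repeat in range(1, max_repeat + 1):
                yield block, repeat

def bounded_complexity(s, budget):
    # Shortest description of s found within `budget` evaluation steps,
    # mimicking H_t: the model only runs for a limited time.
    best = None
    for steps, (block, repeat) in enumerate(descriptions()):
        if steps >= budget:
            break
        if block * repeat == s:
            cost = len(block) + repeat.bit_length()  # crude code-length proxy
            if best is None or cost < best[0]:
                best = (cost, (block, repeat))
    return best

print(bounded_complexity("ababab", budget=10_000))  # (4, ('ab', 3))
print(bounded_complexity("ababab", budget=10))      # None: not enough time yet

With a small budget, no description of "ababab" is found; with a larger budget, the short description ('ab' repeated 3 times) is reached. This is the sense in which \(C_d(s)\) may vary through time.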
Cognitive Complexity
If \(C\) represents plain complexity, what should we say?

    A.    \(C(s) \leq C_d(s)\).
    B.    \(C(s) \geq C_d(s)\).
    C.    \(C(s)\) and \(C_d(s)\) cannot be compared.

    

Complexity of locations: more thrilling when close

Problem: Why does the interest of events vary with the distance from home?
This is a fundamental law of journalism: greater distance tends to make identical events less interesting. Nearby casualties in a tragic event are more newsworthy than distant ones. The phenomenon is known as the hierarchy of deaths, or as the "law of distant deaths".

Question: Does Algorithmic Information Theory have anything to say about the "hierarchy of deaths"?

Answer:   Yes, definitely! It even predicts the phenomenon.


On December 25, 2003, the French news reported three accidents in the following order:
  1. a small plane crash in the French Alps (2 casualties),
  2. another plane crash in Benin, Africa (113 victims),
  3. a gas plant explosion in China (230 victims).
Distance effects (1)
Distances from Paris to these three events are 0.5 Mm, 4 Mm and 8 Mm (Mm = mega-metre).
What is the difference in bits of complexity between these locations (B-A and C-A)?

B-A: 3 bits; C-A: 4 bits
B-A: 6 bits; C-A: 8 bits
B-A: 8 bits; C-A: 16 bits

    

One may consider that dramatic impact depends on the logarithm of the number of casualties. We may write interest I as depending on distance d and on the number of casualties N:

\(I = -2 \times \log_2(d) + b \times \log_2(1 + N/a)\)

Parameter \(a\) provides the unit value on the stimulus axis. Here we consider that \(a = 1\).
Parameter \(b\) measures the emotional interest due to the first casualty.
We ignore the effect of the cultural distance between the observer (here, the French audience) and the expected nationality of the casualties. Cultural distance adds to the cognitive complexity of the event, and thus contributes to diminishing its interest.

In the French news example, the most dramatic events were cited last: first A, then B and then C, though the ratio of casualties was 56 (B:A) and 115 (C:A). This means that the added complexity due to distance more than compensates for the emotion due to more numerous casualties.
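As a small check, here is a hedged Python sketch of this example, using the interest formula above with \(a = 1\) and assuming \(b = 1\) (our choice), distances being expressed in megametres:

from math import log2

def interest(d, n, a=1.0, b=1.0):
    # I = -2*log2(d) + b*log2(1 + n/a): distance adds description bits,
    # casualties add emotional bits.
    return -2 * log2(d) + b * log2(1 + n / a)

for name, (d, n) in [("A (Alps, 2 casualties)", (0.5, 2)),
                     ("B (Benin, 113 casualties)", (4, 113)),
                     ("C (China, 230 casualties)", (8, 230))]:
    print(f"{name}: I = {interest(d, n):.2f} bits")
# A: 3.58, B: 2.83, C: 1.85: the predicted order matches the broadcast order.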
Distance effects (2)
What is the maximal value of coefficient \(b\) in the French news example?

    \(b \leq 0.1\)                 \(b \leq 0.5\)                 \(b \leq 1\)

    

Measuring the distance of the event to the observer may not be the most economical way to assess description complexity. Quite often, events can be located in the vicinity of a known landmark (such as the Eiffel Tower in Paris). The chain rule then applies to get an upper bound on \(C_d(s)\) using landmark \(L\):

\(C_d(s) \leq C_d(L) + 2 \times \log_2(\mathrm{distance}(L, s))\)
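As a hedged Python sketch of this upper bound, assume (rank coding, our assumption) that the \(i\)-th most popular landmark costs about \(\log_2(i)\) bits to designate, with distances in kilometres:

from math import log2

def location_cost(rank, distance_km):
    # Upper bound on C_d(s): C_d(L) + 2*log2(distance), with C_d(L) ~ log2(rank).
    return log2(rank) + 2 * log2(distance_km)

print(location_cost(10, 200))   # popular but far landmark: about 18.6 bits
print(location_cost(100, 20))   # obscure but close landmark: about 15.3 bits

Depending on the trade-off between popularity and distance, an obscure nearby landmark may beat a famous distant one.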

Landmarks
Suppose that a dramatic event takes place in a foreign country. The media want to locate the event for the audience. A list of landmarks \(\{L_i\}\) (cities, monuments, points of interest, etc.) is available. It is ranked by decreasing popularity. The event is located at distance \(d_i\) from landmark \(L_i\).
Which is the best way to locate the event?

    \(i = 4\) and \(d_i = 114\ km\)
    \(i = 52\) and \(d_i = 15\ km\)

    

Simplicity and aesthetics

Problem: Aesthetics is an intuitive feeling that has nothing to do with computation!
What is to be considered beautiful? Giving a definitive and complete answer to this question is way beyond current theorizing. However, people such as Jürgen Schmidhuber, John Maeda and others suggest that simplicity is involved in aesthetics. This principle is well known in design. The French engineer Pierre Bézier used what are now known as Bézier curves as a way to draw harmonious automobile bodies.

Bézier_2_big.gif Bézier_3_big.gif Bézier_4_big.gif

[images from Wikipedia]

Bézier curves are constructed using a set \(\{P_i\}\) of "control" points and a simple recursive parametric definition. Let’s call \(B_{0..n}\) the Bézier curve based on the first \(n+1\) control points \(P_0, \dots, P_n\). It is defined for the value \(t\) of the parameter as \(B_0(t) = P_0\) and:

\(B_{0..n}(t) = (1-t)\,B_{0..n-1}(t) + t\,B_{1..n}(t)\)

The animated illustration (from Wikipedia) above shows how the recursive definition is implemented.
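Here is a minimal Python implementation of this recursive definition (De Casteljau’s algorithm); the control-point values are ours, chosen for illustration:

def bezier(points, t):
    # B_{0..n}(t) = (1-t) * B_{0..n-1}(t) + t * B_{1..n}(t), computed by
    # pointwise linear interpolation between consecutive control points.
    if len(points) == 1:
        return points[0]
    reduced = [((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
               for (x0, y0), (x1, y1) in zip(points, points[1:])]
    return bezier(reduced, t)

control = [(0, 0), (1, 2), (3, 3), (4, 0)]       # n + 1 = 4 control points
curve = [bezier(control, i / 20) for i in range(21)]
print(curve[0], curve[10], curve[-1])            # starts at P0, ends at Pn

The whole curve is thus generated by a handful of points and one short rule, which is exactly the kind of simplicity at stake here.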
Bézier curves
Explain in what way Bézier curves are simple.

    Because a Bézier curve of degree \(n\) can be defined with \(n+1\) points.
    Because Bézier curves are parametric curves that can be expressed as polynomials.
    Because a Bézier curve of degree \(n\) can be defined by adding two parameters to a Bézier curve of degree \(n-1\).

    


Aesthetics
Explain why Larry Kagan’s structures are unexpected (you may watch this).

    Because we go from a 3-D structure to a 2-D shadow.
    Because the structure has been designed to be simple.
    Because the shadow is complex for all light angles but one.
    Because the shadow is simpler than the whole structure.

    

I find the ability of AIT to predict these cognitive phenomena quite impressive... Yes it is! But you seem puzzled...

I wonder how it works. Are we supposing that our brains are Turing machines? This just means that we are able to perform computations.

No, more than that! It presupposes that we monitor our own computations. You mean, to decide that one is shorter than the other?

Yes. It is one thing to perform computations, and another thing to measure and compare their length. So that we know that something is simple?

Precisely. It presupposes some kind of self-monitoring. I’m not fully convinced. And yet it does explain a lot of things. The story about abnormal lottery outcomes, the effect of distance on interest, the role of simplicity in aesthetics. And, I guess, many other cognitive phenomena (we are just halfway through this chapter).

Sure. I don’t deny that. But it could be that our brain does all this through very different means. I just read that people are very good at detecting visual patterns based on their simplicity.

Yes, I know. Like in the reconstruction of hidden patterns. No. The point was to measure reaction time when people have to pick an object that doesn’t fit.

And? It was about geometrical shapes. Reaction times were predicted by the simplicity of the shapes.

You mean that people were quicker at designating anomalies when the shapes were simpler? That’s it.

It just shows that our brain prefers simplicity. Not that it can measure it explicitly. Maybe the two are inseparable.

What do you mean? Don’t know. Just a hunch.









Relevance: the Holy Grail of AI

Problem: How can we bring machines to be relevant?
Since 1950, when Alan Turing gave the first definition of artificial intelligence, relevance has been considered one of the most important attributes of intelligence. A relevant utterance not only makes sense, but also raises interest. On the other hand, by failing to be relevant, one runs the risk of being ignored or, worse, rejected. One of the main objectives of AI research for the future is not only to be efficient, but also to be able to sustain a coherent, rational dialogue. This is how people judge each other’s intellectual abilities and even mental health. This is why relevance represents a kind of Holy Grail for AI research.

Relevance and subjective information rely on two forms of complexity computations. We first examine the newcomer, causal complexity, before observing how they interact to generate relevance.

Causal complexity: the missing piece in the Relevance puzzle

Problem: Some events are exciting, not because they are simple, but because they seem incredible.

[Video] Understanding Artificial Intelligence through Algorithmic Information Theory: Causal Complexity (aiai.telecom-paris.fr)

Prestidigitateur.png There is nothing simple about a rabbit
that pops out of the magician’s hat.

On the contrary! There is so much complexity there that the layperson finds it hard to believe.
Prestidigitateur.png Before it appears, you are pretty sure that there’s no rabbit in the hat. You don’t expect to see one there.
This is the normal, simple situation.

The negation of this simple situation is hugely complex. You need to imagine complex circumstances to explain why a rabbit ends up in the hat (teleportation, contortionist rabbit, you’ve gone blind for 5 seconds and have no memory of it, etc.).
Prestidigitateur.png So is unexpectedness sometimes due to simplicity (as in remarkable lottery draws) and sometimes due to complexity, as here?

In a way, yes...

... but there is a possible mix-up here.
Prestidigitateur.png We defined unexpectedness as
the difference between
    expected complexity
    and observed complexity.

Be sure to mind the difference.

  • Expected complexity means that you mentally consider the event before its occurrence (ex ante) and measure the complexity of this occurrence. It answers the question "What were the odds?" in computational terms (i.e., in bits).
  • Observed complexity means that you consider the event after its occurrence (ex post) and measure the complexity of this occurrence.

    
In the case of apparent magic, the expected complexity of the event (which is the negation of a normal situation) is huge.

This is what creates the complexity gap, hence the unexpectedness.

So unexpectedness may result, not only from a small value of the observed complexity (i.e. simplicity), but also from a large value of the expected complexity.
Prestidigitateur.png On many occasions, expected complexity is assessed through an attempt to reconstruct a causal scenario that led the event to happen.

It is then called causal complexity,
also generation complexity,
or sometimes "world complexity", and is noted \(C_w\).

    \(C_w(s) = \min_p \{\, \mathrm{length}(p) \mid W(p) = s \,\}\)

\(W\) designates the observer’s causal model of the world.
The only admissible programs \(p\) are procedures that respect causality, as understood by the observer.

Quite often, causal complexity matches probability (in bits, i.e. \(C_w(s) = \log_2(1/\Pr(s))\)). However:

  • This is not always true, as for example in the case of the rabbit’s magic occurrence.
  • Causal complexity is more natural, as probability calculus must implicitly perform the same causal computations.


As we did for cognitive complexity, we consider a resource-bounded notion of causal complexity.
Causal complexity, also called world complexity, is defined as the shortest description (in bits) of a causal procedure that a human individual could reach at time \(t\) to account for the occurrence of a situation \(s\). In a semi-formal way:

Causal complexity

\(C_w(s) = \min_p \{\, \mathrm{length}(p) \mid W_t(p) = s \,\}\)

Here, \(W\) represents the observer’s causal model of the world. \(W_t\) represents the same model when allowed to run for \(t\) units of time. \(p\) is any procedure that respects causality as understood by the observer and that is available to \(W_t\). Note that \(C_w(s)\) may vary through time.

In many cases (e.g. when you meet a colleague on the way to the workplace), there is a single obvious scenario that only demands that parameters be set to precise values (such as the time of departure from home). If so, causal complexity amounts to the description of these values.
Causal Complexity
In 1991, the New Jersey three-digit number lottery issued the same winning number, 741, twice in a row. What is the causal complexity of the event?

    10 bits.    
    20 bits.

    

A formal definition of unexpectedness

Question: Unexpectedness should be a statistical notion, shouldn’t it?

Answer:   No! Statistics may even play no role whatsoever. For instance, many unexpected events are unique, unheard-of.

Unexpectedness is the core ingredient of relevance and subjective information.

Unexpected situations are situations
that require complex circumstances to occur,
while being simple to describe in comparison.


We already defined unexpectedness as the difference between expected complexity and description complexity. Now, we revisit this definition by observing that expected complexity, for rational agents, takes the form of causal complexity \(C_w\). We may write:

Unexpectedness

\(U = C_w - C_d\)
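A tiny Python sketch of this definition on ten coin flips; the description-complexity figures are our own rough assumptions:

def unexpectedness(causal_bits, description_bits):
    # U = C_w - C_d
    return causal_bits - description_bits

causal = 10   # ten fair flips: the world spends one causal bit per flip
print(unexpectedness(causal, 2))    # "ten tails in a row": short description, U = 8
print(unexpectedness(causal, 10))   # irregular sequence: ~ one bit per flip, U = 0

A regular outcome is unexpected because its description is much shorter than its causal generation; an irregular outcome of the same probability is not.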
Unexpectedness
What is the unexpectedness of the New Jersey Lottery event? (see preceding exercise)
(we ignore the complexity of locating the event in time and space)

     0 bits.
     0 or 10 bits, depending on the event we consider.
    10 bits.
    10 or 20 bits, depending on the event we consider.
    20 bits.

    

Blasé attitude
Can you describe the "blasé" attitude (most events aren’t worth noticing) in complexity terms?
The "blasé" attitude amounts to ...

1a.    increasing the causal complexity of all events
1b.    diminishing the causal complexity of all events
2a.    increasing the description complexity of all events
2b.    diminishing the description complexity of all events
        1a + 2a
        1a + 2b
        1b + 2a
        1b + 2b

    


Paranoia and surprise
When adopting a paranoiac attitude, you are not surprised by events such as accidents that affect you. Why is it so?

1a.    because the paranoiac attitude augments the causal complexity of such events
1b.    because the paranoiac attitude diminishes the causal complexity of such events
2a.    because the paranoiac attitude augments the observed complexity of such events
2b.    because the paranoiac attitude diminishes the description complexity of such events
        1a + 2a
        1a + 2b
        1b + 2a
        1b + 2b

    

Fortuitous encounter
Coming across a celebrity by chance near your doorstep might be perceived as an interesting event. The story would be even better if you lived in the suburbs rather than downtown. Why?

    It increases the causal complexity of the event.
    It diminishes the causal complexity of the event.
    It increases the description complexity of the event.
    It diminishes the description complexity of the event.

    

People use features, not programs

Problem: People are not computers. They use features rather than programs to characterize things.
Indeed, people use relevant features to characterize events (rather than producing compact programs that generate those events from scratch). Suppose we are talking about a person born in the US in 2002.
  1. We want to mention that the person’s given name is Shea. About 0.04% of children born that year received that name. If you regard causality in this case as a mere choice among options, causal complexity is about \(\log_2(1/0.0004) \approx 11\) bits.
  2. Now, mentioning the name comes at a price on the description complexity side. ‘Shea’ appears 900th or so in the list of names ranked by popularity. With only this source of information, we have to pay \(\log_2(900) \approx 10\) bits of description complexity for the mention of the name.
As we can see, the mention of the name hardly contributes by itself to making that person unexpected.
  3. Suppose now that the name is ‘Madonna’ instead of ‘Shea’. This would have made this person unique among 4 million for that birth year. For an observer aware of name distributions, causal complexity amounts to \(\log_2(4 \times 10^6) \approx 22\) bits. On the other hand, the name ‘Madonna’ is at least as popular as ‘Shea’ on the Web, which makes the former no more complex than the latter. Mentioning the first name ‘Madonna’ for a person born in 2002 is an efficient way of making her quite unexpected, as it reduces complexity from 22 (causal) bits down to less than 10 (description) bits.
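A quick check of the arithmetic above, as a Python sketch (the population figures come from the text; the rounding is ours):

from math import log2

print(log2(1 / 0.0004))   # 'Shea', causal bits: about 11.3
print(log2(900))          # 'Shea', description bits: about 9.8
print(log2(4e6))          # 'Madonna', causal bits: about 21.9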

Question: Can features be relevant?

Answer:   Yes, and this is precisely how people use language to talk about events.

[Video] Understanding Artificial Intelligence through Algorithmic Information Theory: How to tell a story? (aiai.telecom-paris.fr)

Storytelling.png Do you know how to tell a story?

Yes, of course, like everyone else on Earth.
But are you aware of what you are really doing when telling a story?
Probably not...

Let’s illustrate the underlying mechanism
through an example taken from everyday life.
- Do you remember my friend Jeannine?
- Sure.
- She installed an automatic night-vision camera in her backyard.

      Guess what she could film?
Renard.png - A hedgehog?
- No, that was a few weeks ago. But I just got this video from her. Look at this!
- Wow! A fox? But doesn’t she live in town?
- Yes. That’s amazing, isn’t it?
Let’s summarize the event as:
a fox visited Jeannine’s backyard two days ago.

Now let’s use "features" to describe the event:
  • the event involves Jeannine
  • the event took place in her backyard
  • the event took place two days ago
  • the event involves a fox

    

    
Features can be represented computationally using predicates, as in the Prolog programming language. We may translate the above features in Prolog’s style as:

involves(S, 'Jeannine').
location(S, L), backyard('Jeannine', L).
date(S, -2).
involves(S, F), fox(F).

These features act as constraints on situation \(s\).

Let’s represent them as predicates \(f_i(s)\) that are regarded as being true about situation \(s\).
The description complexity of a feature \(f_i\) is simply \(C_d(f_i)\), independently of \(s\).

The causal complexity \(C_w(f_i(s))\) of \(f_i(s)\) is the complexity of the circumstances that brought \(f_i(s)\) to hold in the known world.

A feature \(f_i(s)\) is unexpected if \(C_w(f_i(s)) > C_d(f_i)\)
(and the difference measures how unexpected it is).

Unexpected features contribute to making situations interesting.

involves(S, 'Jeannine').

    

    

The above feature is not unexpected. The balance is even negative: there is no causal cost, while the mention of ‘Jeannine’ costs a few description bits
(not much, crucially, as she is a close acquaintance).

location(S, L), backyard('Jeannine', L).
date(S, -2).

The same holds for the location in time and space.

involves(S, F), fox(F).

Now here comes the climax of the story.
The notion of ‘fox’ is not descriptively very costly.
However, it demands complex circumstances to become true, if one knows that Jeannine lives in town
(what would bring a fox to venture that deep into the city?).


The complexity drop between causality and description makes the event unexpected, and thus interesting.

The initial mentions of Jeannine and her backyard are descriptively (slightly) costly, but they are essential to make causal complexity large.


Fox.png The dialogue continued with a tentative explanation: the fox would have taken advantage of the general lockdown associated with the Covid-19 crisis to venture into the city. Note that the effect of the explanation is to diminish causal complexity, making the event "second-order relevant" (more on this there).


Question: Do unexpected features make events relevant?

Answer:   Yes, thanks to the chain rule.

Let’s observe how a feature’s unexpectedness is transferred to the situation. Suppose we recount an event \(s\) by mentioning an unexpected feature \(f(s)\) (e.g. a fox wandering in the city). On the causality side, we have to measure the complexity needed by the "world" to make \(f(s)\) true. Contrary to mere description, which only cares about \(f\), causality requires taking both \(f\) and \(s\) into account to compute the complexity of reaching a state of the world in which \(f(s)\) holds.

Chain rule for feature causality

\(C_w(s) = C_w(f(s)) + C_w(s \mid f(s))\)

Note that \(f(s)\) expresses a constraint. It is not a ‘function’, but a logical predicate. This is why one may write \(C_w(s \mid f(s))\) even if \(s\) is not fully characterized at this stage of the computation. Equality comes from the fact that \(f\) is the only available means to compute the causality of \(s\) at this point.

\(C_w(f(s))\) measures the complexity of generating a state of the world in which \(f\) is true when applied to the situation \(s\).
\(C_w(s \mid f(s))\) measures the causal complexity of the other aspects of \(s\).

Chain rule for feature description

\(C_d(s) < C_d(f) + C_d(s \mid f(s))\)

Note that \(f\) appears alone in \(C_d(f)\). This is a crucial point: description complexity is about the conceptual nature of features, whereas causal complexity is about making them true.
    

Relevant features are simple to describe, but hard to produce (as in our example about the fox). Features can be relevant for one observer while being irrelevant for another. For instance, in this excerpt from the animated series Futurama, the two robots are amazed at a mathematical coincidence that leaves their friends completely unmoved.
[I found initially that the second serial number 2716057 is not the sum of two cubes and that it should read 7216057 == 30**3 + 193**3; however, 2716057 == 952**3 + (-951)**3 - Thank you El Jj 🙂]     

Chain rule for unexpectedness

\(U(s) > U(f(s)) + U(s \mid f(s))\)

where \(U(f(s)) = C_w(f(s)) - C_d(f)\).

This chain rule, when repeatedly applied, offers an optimal guide for (artificial) storytelling: choose features and feature order that will eventually maximize unexpectedness.
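The following Python sketch illustrates this guide on the fox story. Every bit value in it is invented for illustration; the course gives no such numbers.

features = {
    "involves(S, 'Jeannine')":  {"causal": 0,  "descr": 3},
    "backyard('Jeannine', L)":  {"causal": 1,  "descr": 2},
    "date(S, -2)":              {"causal": 1,  "descr": 2},
    "involves(S, F), fox(F)":   {"causal": 12, "descr": 4},
}

def u(name):
    # U(f) = C_w(f(s)) - C_d(f)
    f = features[name]
    return f["causal"] - f["descr"]

for name in sorted(features, key=u):   # save the most unexpected feature for last
    print(f"{name}: U = {u(name)} bits")

Ordering features by increasing unexpectedness reproduces the storyteller’s strategy: cheap scene-setting first, the fox as the climax.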
Features
Suppose we mention now (i.e. in the 2020s) the first name of a person born in the US in 2002. How unexpected is this feature if the first name is ‘Michelle’?

    Not unexpected
    1 bit
    3 bits
    7 bits

    

Coincidences: a case study

Problem: Why are coincidences so thrilling?

The Lincoln-Kennedy coincidence (the famous list of parallels between the assassinations of the two presidents, one century apart) should not be considered an objective marvel. However, the fact that it is so fascinating to so many people reveals that some systematic cognitive phenomenon is at work, and this demands an explanation.

Let’s call \(s_1\) and \(s_2\) the two terms of the coincidence. Under the crucial hypothesis that the two events are independent, the causal complexity of the two events taken together equals the sum of their causal complexities (note that this is the proper definition of independence; it is more rigorous and general than the probabilistic notion):

\(C_w(s_1 \& s_2) = C_w(s_1) + C_w(s_2)\)

On the description side, however, we apply the chain rule:

\(C_d(s_1 \& s_2) < C_d(s_1) + C_d(s_2 \mid s_1)\)

Unexpectedness then reads:

\(U(s_1 \& s_2) = C_w(s_1) + C_w(s_2) - C_d(s_1) - C_d(s_2 \mid s_1)\)

Each element common to the two situations, such as the name of the successors in the Lincoln-Kennedy parallel, has to be generated twice causally, but has to be described only once. Its impact on the conditional complexity \(C_d(s_2 \mid s_1)\) is zero. Common elements thus add to the unexpectedness of the situation.

If there is a strong analogy, or at least a resemblance, between the two situations, then \(C_d(s_2 \mid s_1) \ll C_d(s_2)\). The conjunction of the two events turns out to be simpler than expected. This creates unexpectedness.
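A hedged Python sketch of this computation; all bit values are made up to illustrate the formula above:

def coincidence_unexpectedness(cw1, cw2, cd1, cd2_given_1):
    # U(s1 & s2) = Cw(s1) + Cw(s2) - Cd(s1) - Cd(s2|s1)
    return cw1 + cw2 - cd1 - cd2_given_1

# Strong analogy: once s1 is told, s2 is almost free to describe.
print(coincidence_unexpectedness(30, 30, 25, 3))    # 32 bits
# No resemblance: s2 must be described from scratch.
print(coincidence_unexpectedness(30, 30, 25, 25))   # 10 bits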


Coincidence (1)
Why does the mention:    "KENNEDY was shot in a car named LINCOLN"
affect the interest of the coincidence?


    

Coincidence (2)
Why is the time interval of 100 years more interesting than an interval of 87 years?

a. Because it augments \(C_w(s_2|s_1)\).
b. Because it diminishes \(C_w(s_2|s_1)\).
c. Because it augments \(C_d(s_2|s_1)\).
d. Because it diminishes \(C_d(s_2|s_1)\).

    

Coincidence (3)
Most people consider that the coincidence would have been even more impressive if both successors were named LICZNEROWITCH instead of JOHNSON (as the name LICZNEROWITCH is much less frequent in the US).
Let’s call \(D\) the difference \(C_d(LICZNEROWITCH) - C_d(JOHNSON)\).
Why is it so?


    


I wonder whether this definition of relevance is always valid. Which definition?

That interest, I mean relevance, is controlled by the gap between causal complexity and description complexity. My sister bought a bike.

What?! Ha, you see! It’s not relevant.

And? It’s not causally complex, so it is not relevant.

You mean, it would be more relevant if she were in prison for the next ten years. Or if she were blind? Yes. That would increase causal complexity considerably. What could bring her to buy a bike in such a case?

But what about an eclipse? ??

I mean, eclipses are events, and yet they are perfectly deterministic. By saying "deterministic", you mean that their causal complexity is zero?

Yes. I just read something about this example. I mean, about why the occurrence of an eclipse might be unexpected, and therefore exciting.

I can’t see how an eclipse can be unexpected! It all depends on the viewpoint you take.

What do you mean? Well, it doesn’t happen every day!

Oh, you mean that there’s some causal complexity to get an eclipse precisely today? Yes.

But that’s just probability! More or less. You still have to consider the description complexity of the concept "eclipse".

I thought it rather had to do with the causal complexity of having the Earth, the Moon and the Sun perfectly aligned. Yes, that’s another viewpoint. It’s all explained on the Website.

I’ll definitely have a look. Yes, you should. It’s cool!









A compressionist view of intelligent processing

Problem: Is AI more than a heterogeneous set of clever tools?
Algorithmic information can serve as a theoretical foundation for many aspects of intelligent processing. AIT offered us a definition of unexpectedness and of subjective information. We mentioned Gregory Chaitin’s famous sentence: "Comprehension is compression". Let’s build on the idea. The following table lists various aspects of intelligent processing and how they can be viewed from a compressionist perspective.

We start with statements "X is compression" that are most relevant to the present chapter.

Compression as unexpectedness
Subjective probability is compression
  • Subjective probability, as measured by unexpectedness, corresponds to the complexity drop between expected complexity and observed complexity.
    ➢ see this chapter
Aesthetics is compression
Creativity is compression
  • Actions or innovations are perceived as creative when they elicit surprise, and surprise is analyzed as a complexity drop.
    More on this.
Abduction is compression
  • Abduction is the ability to find out a cause for a problematic state of affairs. The best causal hypotheses, according to Occam’s razor, are the simplest ones. Their effect is to diminish the causal complexity of the path to a solution.
    More on this.
Relevance is compression
  • A relevant statement corresponds to a complexity gap
    • by pointing to an unexpected situation
    • by reducing causal complexity through explanation or action
    • by increasing description complexity through trivialization

    More on this.
Responsibility is compression
  • The judge decides on responsibility by showing that the defendant’s action diminished the causal complexity of the undesirable effect.
  • Moral judgment also takes the simplicity of the victim (as seen from the judge’s viewpoint) into account.
    More on this.
Emergence is compression
  • Emergence in collective systems can be defined as the difference between the sum of individual complexities and the collective complexity.
    ➢ Watch the video at the end of chapter 1.
Humour is compression
  • Many humorous effects result from unexpected simplicity.
    ➢ See next exercise.

    
More on this: visit the Simplicity Theory website.

Spider
–    Why did the fly fly?
–    Because the spider spied her!

The pun is due to...
The causal complexity of the event.
The causal complexity of the phonetic similarity between ‘spider’ and ‘spied her’.
The unexpected simplicity of the phonetic similarity between ‘spider’ and ‘spied her’.
The complexity drop created by the explanation.

    


    
Now, let’s look at some "X is compression" conjectures that are relevant to computer science in general.

Compression and computer science
Coding is compression
  • Shannon-Fano codes, Huffman codes and hash codes are methods for designing compact codes. These codes replace the data. Since they are more compact, they achieve compression (see the sketch after this list).
Storing is compression
  • Once data is stored, it can be defined by its address. As the latter is generally shorter than the data stored (which might sometimes be arbitrarily large), storing achieves compression.
  • Indexing methods are used to diminish the average size of addresses.
Forgetting is compression
  • One should forget only unimportant data. Importance, like relevance, is associated with unexpectedness. As time goes by, description complexity increases (as \(\log(t)\)), making some data less and less relevant, until they become good candidates for forgetting.
Pattern recognition is compression
  • Whenever a structure is recognized in an image, the complexity of the pixels or sub-structures it contains drops as they can be predicted from the structure.
Image synthesis is compression
  • Fixed or animated image synthesis is efficient when it can be controlled using only a few parameters (e.g. recursion depth in a fractal). The fact that realistic images can be compressed down to those few parameters is often remarkable (see the demoscene mentioned in Chapter 2).
Visualizing is compression
  • Large data visualization, for instance for the description of large social networks, requires lossy compression to generate pictures that a human mind can understand.
Schematizing is compression
  • A schematic results from lossy compression: contours are simplified and some objects are replaced by symbols. Note that the whole point is to keep the semantics and relevance of what is represented.
  • This is also true of comics.
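As an illustration of the first row of this table, here is a minimal Huffman coder in Python (our sketch, not code from the course):

import heapq
from collections import Counter

def huffman_codes(text):
    # Each heap entry: (frequency, tie-breaker, {symbol: partial code}).
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f0, _, c0 = heapq.heappop(heap)
        f1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c0.items()}        # left subtree
        merged.update({s: "1" + c for s, c in c1.items()})  # right subtree
        heapq.heappush(heap, (f0 + f1, tie, merged))
        tie += 1
    return heap[0][2]

text = "abracadabra"
codes = huffman_codes(text)
bits = sum(len(codes[ch]) for ch in text)
print(codes)
print(f"{bits} bits instead of {8 * len(text)} bits in 8-bit ASCII")  # 23 vs 88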
Hash codes
Some hash functions, such as SHA-256, are designed to be virtually impossible to reverse in practice. They correspond to lossy compression:

True
False

    


    
Finally, let’s look at some other "X is compression" conjectures that are relevant to Artificial Intelligence.

Compression and AI
Information is compression
  • The smallest program p that generates an object s represents its informational content. It is a compressed version of s.
    ➢ see Chapter 2.
Learning is compression
Probability is compression
  • The probability of occurrence of an object is measured by the cumulative probability of the minimal programs that generate it.
    ➢ see Chapter 3.
Analogy is compression
  • Good proportional analogies (A is to B as C is to D) minimize the overall complexity of (A, B, C, D).
    ➢ see Chapter 4.
"Comprehension is compression"
  • This is Gregory Chaitin’s great aphorism.
    Understanding a phenomenon means replacing observations by a theory. A good theory realizes a trade-off between simplicity and accuracy.
    More on this.
Proving is compression
  • A proof replaces many true statements by a theorem. The theorem is a compressed version of the (often infinite) set of statements it proves.
  • Conversely, the compression argument can be used in many proofs by contradiction.
    ➢ see Chapter 3.
Intelligence is compression
  • Intelligent behaviour amounts to taking the most rewarding action based on the most probable, i.e. the simplest, future.
    ➢ see Chapter 4.
  • And more generally, any intelligent device (and this includes the human brain) seeks to achieve maximal compression (within the constraints of available resources).

So, "everything" is compression! Why do you mean by "everything"?

From what I see here, every cognitive process generates complexity drop. I don’t see why.

So it seems that probability, aesthetics, creativity, relevance, responsibility, even humour, and many more, correspond to compression. Maybe it is just due to the way our brain functions.

Or maybe one could say that any computation leads to information compression. I disagree. For instance, your brain is processing the syntax of what I’m saying, and...

and you mean that there is no complexity drop when processing a sentence? Indeed! At least not at the syntactic level.

And yet, once you get the syntactic structure, words fall into place, and the sequence of words becomes less complex. OK. Now I see your point. But that form of complexity reduction is quite different.

You mean, it’s not a difference between expected complexity and observed complexity? That’s it.

Maybe we should say that we are more concerned about "explicit" complexity drop here? Yes. The kind of compression you can experience... and talk about!

This "explicit" form of compression is what determines surprise. To me, it’s the amazing side of complexity drop.

As in fortuitous encounters: you expect to meet someone complex and you see your neighbor. My favorite is the story about Larry Kagan’s structures: you expect a complex shadow and you see something familiar projected on the wall.

Did you see the next topic? It’s about emotion. Yes, I wonder what complexity drop has to do with emotion.

Emotional intensity

Problem: Nothing can be further from emotions than the cold computations of AIT!
And yet, Algorithmic Information is crucial to compute emotions. Not the emotions themselves, but their intensity. Emotions elicited by the occurrence of events may be intense. Their intensity is due to two factors:
  1. The type of event (are there casualties, or are we talking about a 10€ loss);
  2. The event’s unexpectedness.
Simplicity is of course at work in unexpectedness, since unexpected situations are by definition abnormally simple. As we saw when we discussed the "law of distant deaths", simplicity is involved in the amplitude of the emotional stimulus. For instance, under the hypothesis that the impact of each casualty is proportional to its complexity, we can derive that emotional intensity varies as the logarithm of their number.
Wheel of Fortune (1)
In a wheel of fortune game, the player launches the wheel which then stops at one of the \(N\) discrete spaces of the wheel. There is only one winning space.

Suppose that the player has won.
What is the contribution of \(N\) to her emotional intensity?

a.    \(\log_2(N)\)
b.    \(\log_2(2 \times N)\)
c.    \(2 \times \log_2(N)\)
d.    \(N\)
e.    \(2 \times N\)

    

Imagine for a moment that the player, though having won, is denied the reward. The value of emotional intensity would be the same as for winning, though with an opposite sign.

Let’s suppose now that the player did not win because she missed the winning space. She might be upset as well.
In these kinds of situations, people tend to consider the unexpectedness of the counterfactual win. They are all the more upset as they failed by a small margin.
Their (negative) emotion is, in absolute value, as intense as the emotion of winning, minus the counterfactual unexpectedness of reaching the winning space.
Wheel of Fortune (2)
Suppose that the wheel stopped at distance \(k\) from the winning space (near miss).

What is the contribution of \(k\) to the absolute value of the (now negative) emotional intensity?

a.    \(-(1 + \log_2(k))\)
b.    \(-(1 + 2 \times \log_2(k))\)

    

Interview








Recap

Take-home message

We hope you enjoyed this chapter on AIT and subjective information.
Let’s leave the last word to Angel@ and Alexandr@.

Subjective probability seems to function the other way around. What do you mean?

While algorithmic probability declares objects improbable if complex, situations are subjectively improbable if they are too simple! 123456.png
Oh, as in the lottery examples?

Yes. Improbable if simple. Not in the case of magic. Prestidigitateur.png

You’re right. Now I remember.
Improbability is due to causal complexity in that case.
Or rather to the difference between causal complexity and description complexity.

Yes. That’s how unexpectedness is defined. I was amazed by all the consequences of this definition.

You mean, concerning interest, relevance, aesthetics and all the other predictions? Yes. Relevance is my favorite one.

Why especially relevance? Because it defines intelligence, according to Turing’s imitation game.

Well, if I follow you correctly, relevance, and therefore complexity drop, define the way we think? Yes. It’s as if our mind was some kind of compression machine.
That’s something!

Indeed, it’s quite unexpected (oops) to see our minds put in equations that way! It seems you’ve become a supporter of compressionism now.

Ha ha, yes, I guess I’m officially converted. Me too!


Test (1)
People find that getting ten "tails" in a row is (subjectively) less probable than getting tail-head-tail-head-tail-head-tail-head-tail-head. Why?

    Because ten "tails" is causally harder to generate.
    Because ten "tails" is simpler to describe.

    

Test (2)
Cognitive complexity is always smaller than plain complexity.

    True
    False

    

Test (3)
The interest of news decays with time as \(2 \times \log_2(t)\), where \(t\) is the time interval since the news.

    True
    False

    

Test (4)
Are rhyming verses appreciated because the identity of sounds spares bits of information in description complexity?

    Yes
    No

    

Test (5)
When asked whether they are happy to find a folded 10-euro bill on the ground, people declare that they would be happier if it happened in the forest rather than on a commercial street in the evening.
Why?

    Because causal complexity is larger in the forest story.
    Because the 10-euro banknote is simpler to describe if found in the forest.

    

Test (6)
Why are important events celebrated more intensely after 10, 50 or 100 years?

    Because round anniversaries are improbable.
    Because round numbers are simple and diminish the description complexity of the event.
    Because round numbers are simple and diminish complexity drop.

    

    

Forum: Chapter5
Do you have feedback or suggestions about the topics addressed in this chapter?

        









Bibliography

Also, visit the Simplicity Theory website.
