Telecom Paris
Dép. Informatique & Réseaux

J-L. Dessalles, Home page

July 2021

IA325/IA703: Algorithmic Information and Artificial Intelligence

Lecturers:     Jean-Louis Dessalles,
               Pierre-Alexandre Murena,
               Étienne Houzé



Measuring Information through compression


What you will learn in this chapter:
  • How to compute complexity through compression
  • How to compare algorithmic information with Shannon’s information
  • How to detect languages through joint compression
  • How to use the Web to compute meaning similarity

    All the programs used in this page can be retrieved from

Information as compression

Problem: Can we define the quantity of information contained in a single, isolated object?
Information is what is left when any redundancy has been removed. Kolmogorov complexity does exactly that: it removes redundancy by performing ideal (lossless) compression. The output represents the irreducible information contained in the object.


Compressing binary strings

Problem: Can we compress everything, and how much?

Understanding Artificial Intelligence
through Algorithmic Information Theory


Shrinking objects




How much can objects be compressed?

Information is what is left after maximal compression.

How far can a given object be compressed?
Can we compress all objects?

For the kind of objects that can be stored on a computer, such as videos, file sizes give us a hint.

For instance, a three-minute video such as this one weighs 70 MB in mkv format.

Appearances may be deceptive, though.

This video has been generated by a 4 kB program! That is a shrinking ratio of about 17 000 to one.

This video won the Breakpoint PC 4K Intro competition in 2009. If you are aware of what 4 kB means, and if you watch the video, you can only be amazed. It is an example of what is known as the demoscene.
Whenever an object can be compressed
down to a small description,
we know that there is not much information in it.
Let’s consider these three sequences.
(a) 1100110011001100110011001100110011001100110011001100
(b) 1001001000011111101101010100010001000010110100011000
(c) 0010100101110110101101000101101011011010001011001011

Sequence (a) is obviously simple. It can be described concisely.
In python: '1100' * 13.
This program can be seen as a compressed version of (a), since (a) can be recovered from it.

Sequence (b) is an attempt to write down ‘random’ binary digits by hand.
There seems to be no easy way to compress it.

Sequence (c) seems complex at first sight.
But an informed observer would recognize sequence (c) as the first digits of π in binary representation.
So (c), and any longer version of it, can be compressed down to a short program that generates π.

    def pi(N):
        # Leibniz series: pi/4 = 1 - 1/3 + 1/5 - 1/7 + ...
        PiDividedByFour = 0
        for i in range(N):
            PiDividedByFour += (-1)**i / (2*i + 1)
        return 4 * PiDividedByFour

It seems that many sequences can be compressed, or shrunk, down to a short description that captures their information content.


. . .

Consider all binary sequences of size \(N\). There are \(2^N\) of them. For instance, there are 4096 binary sequences of length \(N = 12\).

Some of these strings, such as the first one (all 0s) and the last one (all 1s), are highly compressible.

For instance, we can represent the last one as '1' * N, and this can be coded with log2(N) + O(1) bits, which is much less than N if N is not too small.

How many of these N-long sequences can be shrunk?

. . .

Suppose we use a compressing program to compress the binary sequences of size N.

How many of them, at most, can be compressed by exactly k bits (< N)?

. . .

A sequence of \(N\) bits that is compressed by exactly \(k\) bits must be represented using \(N-k\) bits.

\(N-k\) bits can represent only \(2^{N-k}\) different sequences. Since we want compression to be reversible, at most \(2^{N-k}\) sequences of length \(N\) can be compressed by exactly \(k\) bits.

. . .

If we count all sequences of size \(N\) that are possibly compressible by at least \(k\) bits, we get:

\(\sum\limits_{i=k}^{N} 2^{N-i} = \sum\limits_{j=0}^{N-k} 2^{j} = 2^{N-k+1} - 1\)

Compared with \(2^N\), this makes a proportion of roughly \(\frac{2^{N-k+1}}{2^N} = 1/2^{k-1}\).

This means that only a tiny proportion of sequences may be compressed by at least k bits as soon as k is significant.
In other words, there is no miracle.
Shrinking has its limits.

If one considers all possible objects
(here, binary strings),
compression can spare you a lot of bits on some of them, but
these compressible objects are rare.
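The counting argument above can be checked numerically; a small sketch (the values of N and k are chosen arbitrarily):

```python
# Sanity check of the counting argument, for arbitrarily chosen N and k.
# Strings compressed by at least k bits need descriptions of at most
# N - k bits; counting all such descriptions gives the bound derived above.
N, k = 12, 4

available_descriptions = sum(2**(N - i) for i in range(k, N + 1))

# Closed form: 2^(N-k+1) - 1
assert available_descriptions == 2**(N - k + 1) - 1

# Proportion of the 2^N sequences that could be compressed that much
proportion = available_descriptions / 2**N
print(f"At most {proportion:.2%} of {N}-bit strings compress by >= {k} bits")
```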

Compressibility (1)
What is the proportion, among all binary sequences of size 128, that can be compressed down to 100 bits or less?

    \(1/2^{27}\)          \(1/2^{99}\)
    \(1/2^{100} - 1/2^{128}\)    
    \(1/{100} - 1/{128}\)    


Compressibility (2)
Consider a given compression method \(Z\). Are there binary sequences of size \(N\) that cannot be compressed through \(Z\)?
(you may count all compressible sequences.)

    Possibly none, for some \(Z\).
    Yes, at least one, whatever \(Z\) may be.
    Yes, at least \(N/2\), whatever \(Z\) may be.


Zipping sequences

Problem: How can we compress objects in practice?

Computer science offers a variety of tools to compress objects, such as the "jpeg" and "png" formats for images; for text or data files, one would use "zip", "bzip2" or "zpaq". The most basic mechanism at work in compressing programs consists of detecting duplicates in the sequence to be compressed.

Question: Does any duplicate pattern give rise to compression?

Answer:   No. The pattern has to be long enough and the two copies should not be located too far apart.

Can we quantify these conditions?

The program CopyDetection.py is able to detect copies within a binary sequence. For instance, it may spot the pattern 0110111001101 of length L = 13 at two different locations of a string.

When the program finds that a pattern of length L appears at location P1 and again at location P2, it refers back to the first occurrence instead of repeating the pattern. The second occurrence of 0110111001101 in the example below is replaced by 1 1100 1101. Our code uses space delimiters, as in the Morse code (see Chapter 1).
$ CopyDetection.py 000101 0110111001101 000010111110 0110111001101 10010001011100110001
Original:   0001010110111001101000010111110011011100110110010001011100110001 - length: 64
Encoded:  0001010110111001101000010111110 1 1100 1101 10010001011100110001 - length: 60
Decoded: 0001010110111001101000010111110011011100110110010001011100110001 - Correct

(for convenience, the program’s arguments are joined into a single string).
Note once more that, as explained in Chapter 1, we do not count space characters when computing code length (this choice will be discussed in Chapter 3).
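CopyDetection.py itself is not reproduced here, but the encoding it illustrates can be sketched as follows (the function name and details are ours):

```python
# Toy copy-detection encoder (illustrative; not the course's CopyDetection.py).
# The second occurrence of a repeated pattern is replaced by a marker '1',
# the gap between the two occurrences in binary, and the pattern length in
# binary. Space delimiters are used, and spaces are not counted as code bits.
def encode_copy(s, pattern):
    p1 = s.find(pattern)
    p2 = s.find(pattern, p1 + len(pattern))
    if p1 < 0 or p2 < 0:
        return s  # no repeat: nothing to encode
    gap = p2 - (p1 + len(pattern))  # distance between the two occurrences
    return s[:p2] + f" 1 {gap:b} {len(pattern):b} " + s[p2 + len(pattern):]

seq = ("000101" + "0110111001101" + "000010111110"
       + "0110111001101" + "10010001011100110001")
print(encode_copy(seq, "0110111001101"))
# -> 0001010110111001101000010111110 1 1100 1101 10010001011100110001
```

This reproduces the encoded string shown above: the gap of 12 is written 1100, and the length 13 is written 1101, for a total of 60 counted bits.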
Copy detection (1)
Use the program to compress the sequence:


What is the encoded version of the string?



When given no argument, the program is applied to a pseudo-random binary string of size 100. Try it several times. You should be able to verify that pseudo-random sequences exhibit repeated patterns and are thus, slightly, compressible.
Copy detection (2)
Compute the greatest distance between two repeats for which it is advantageous to code the repetition, if the repeated pattern is 0110111001101.
Note: you have to compute the distance between the onset of the two strings, so don’t forget to take the length of the pattern into account.

    13     15     64     73     140     146    


In what follows, \(\log_{2+}\) means that the logarithm is rounded to the upper integer.
We approximate the complexity of an integer \(n\) as \(\log_{2+}(n+1)\).
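These two definitions translate directly into code (the helper names are ours):

```python
import math

# log2+ : logarithm base 2, rounded up to the next integer
def log2plus(n):
    return math.ceil(math.log2(n))

# Approximate complexity (in bits) of an integer n, as defined above
def K_int(n):
    return log2plus(n + 1)

print(K_int(12), K_int(255))  # -> 4 8
```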
Copy detection (3)
What is the theoretical condition on \(L\) (length of the repeated pattern), \(P_1\) (location of the 1st occurrence) and \(P_2\) (location of the 2nd occurrence) to get compression with such a copy detection program?
The constant \(c\) represents the offset of the coding scheme (the extra heading 1 in the program).

    a. \(\log_{2+}(P_2 - P_1 - L + 1) \lt L - \log_{2+}(L + 1) - c\)
    b. \(\log_{2+}(P_2 - P_1 - L + 1) \lt L - \log_{2+}(P_2 - P_1 + 1) - c\)
    c. \(\log_{2+}(P_2 - P_1 - L + 1) \lt P_2 - P_1 - \log_{2+}(P_2 - P_1 + 1) - c\)


Information as absence of redundancy

Problem: Does compression measure redundancy?
Redundancy obviously does not contribute to information. This is why compression is the best tool to measure information. For instance, unbalanced binary sequences that have more 0s than 1s, or more 1s than 0s, are highly redundant.


They are expected to be compressible. Can we verify that?

The program BinSeqGenerator.py computes compression factors for biased sequences, using Zip (averaged over several trials; 100% compression corresponds here to the compression of a constant string).
For instance, if you execute:     BinSeqGenerator.py 1000 33
the program generates 1000-bit long sequences with 33% of 1s and tries to zip them.
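BinSeqGenerator.py is not reproduced here, but the experiment can be sketched with zlib standing in for Zip (all names below are ours):

```python
import random, zlib

# Generate biased 0/1 strings and measure their compressed size with zlib
# (a stand-in for Zip; BinSeqGenerator.py is the course's own tool).
def biased_compression(length=1000, ones_percent=33, trials=20):
    random.seed(0)  # reproducible runs
    total = 0
    for _ in range(trials):
        bits = ''.join('1' if random.random() < ones_percent / 100 else '0'
                       for _ in range(length))
        total += len(zlib.compress(bits.encode()))
    return total / trials

# Strongly biased sequences should compress better than balanced ones
print(biased_compression(ones_percent=10), '<', biased_compression(ones_percent=50))
```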
Biased sequence (1)
What average compression factor do you get for strings of length 1000 with a 22% bias?

    about  6% compression         about 17% compression
    about 24% compression         about 29% compression


Biased sequence (2)
You may try several biases. What is the convexity of compression rate, as a function of bias?

    roughly linear


Compressors such as Zip have their limitations. They can detect duplicate patterns, but they can’t see all forms of redundancy. The best example of these limitations is offered by π. No common compressor is able to "see" the extreme redundancy hidden in π’s decimals. They are so redundant that you can get all of them once you know that they come from π.

Let’s check whether Zip is blind to this redundancy.
Retrieve some of π’s decimals from the Web and save them to a file named 'Pi_decimals.txt'. Run the program in the presence of that file.
Read the output. This small experiment reveals an error that any beginner in compressing techniques must encounter.
The compressor takes text as input, and text is coded in ASCII (8 bits per character). Coding π’s digits as typographical characters (ASCII) introduces redundancy that the compressor is able to detect and eliminate. Digits are indeed represented with ASCII codes between 48 and 57 (\x30 to \x39).
Go into . At the end, look at the "Compressing Pi" section. Convert π’s decimals to base 256 by adding a relevant line (compare with the random number section). Verify that π’s decimals aren’t compressible after all.
Copy the added line below.
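The notebook’s exact line is left for the exercise, but the underlying effect can be sketched as follows (pseudo-random digits stand in for π’s decimals here):

```python
import zlib
from random import seed, randint

seed(0)
digits = ''.join(str(randint(1, 9)) for _ in range(10000))  # stand-in digits

ascii_form = digits.encode()  # one 8-bit character per digit: redundant
n = int(digits)
packed = n.to_bytes((n.bit_length() + 7) // 8, 'big')  # ~base-256 coding

# ASCII digits compress well; the packed form has little left to squeeze
print(len(zlib.compress(ascii_form)), 'out of', len(ascii_form))
print(len(zlib.compress(packed)), 'out of', len(packed))
```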


Iterated compression

Question: Should we rule out compressors that would expand some texts instead of reducing them?

Answer:   No. Any given (non-trivial) compressor will expand some of the texts given to it!

This paradoxical observation becomes obvious once one realizes that the compressor is an injective function within the set of all possible texts of size N or less: if Z(T1) = Z(T2), then T1 = T2. Since there are fewer short codes than texts, if some texts are compressed, some others must be expanded.
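This is easy to observe in practice: on incompressible input, zlib (for instance) almost always produces a longer output than the input.

```python
import os, zlib

# Pigeonhole in practice: random bytes are incompressible, so the
# compressor's own overhead makes the output longer than the input.
data = os.urandom(1000)
print(len(data), '->', len(zlib.compress(data)))
```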

It is easy to observe that "zip" does not re-zip its own output to a shorter string. One can even verify that applying "zip" repeatedly leads to an ever-growing string. To do so, execute the program on this file: Kolmogorov_complexity--Wikipedia.txt. The program keeps applying "zip" n times. For instance, for two iterations:

$ rzip.py 2 Kolmogorov_complexity--Wikipedia.txt
    Test for repeated zip

Iteration 0: length = 24552
Iteration 1: length = 8660
Iterated compression
Let \(Z\) represent "zip", and \(T\) our initial text. For which value of \(n\) do we get \({\rm length}(Z^n(T)) > {\rm length}(T)\)?



Information: Complexity vs. frequency

Algorithmic information and Shannon’s information

Problem: What is the link between Algorithmic Information and Shannon’s information?

Huffman codes

Problem: How can we take advantage of unbalanced statistics to compress information?
When symbols are repeated (as in digital communications) and frequencies are available, Shannon’s theory of information tells us that maximal compression on average can be achieved through variable-length codes. For instance, if you want to represent English texts in a compact binary code, you should choose a shorter code for ‘E’ than for ‘Z’. Huffman codes do this optimally.

You can generate a Huffman code online.
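A minimal Huffman coder fits in a few lines (a sketch; heap entries carry an index so the heap never has to compare dictionaries):

```python
import heapq
from collections import Counter

# Build a Huffman code: repeatedly merge the two lightest subtrees,
# prefixing their codes with '0' and '1'.
def huffman_code(text):
    heap = [[w, i, {s: ''}] for i, (s, w) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        for s in c1: c1[s] = '0' + c1[s]
        for s in c2: c2[s] = '1' + c2[s]
        heapq.heappush(heap, [w1 + w2, i, {**c1, **c2}])
        i += 1
    return heap[0][2]

code = huffman_code("this is an example of a huffman tree")
# Frequent symbols (here the space) get the shortest codes
print(sorted(code.items(), key=lambda kv: len(kv[1]))[:3])
```

The resulting code is prefix-free, so a coded message can be decoded without delimiters.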

I’m not sure whether Shannon’s information and Algorithmic information measure the same thing. You mean that they would be equivalent?

I know they are not. Yes, because Algorithmic information is about single objects.

Ah! Oh yes, you’re right. Shannon’s information is about events. Or rather, about occurrences of repeated events.

That’s why Shannon relies on the probability of events. And probabilities do not always apply. What’s the probability of π?

And yet, when probabilities make sense, both notions are equivalent. I’m not even sure of that.

Why? Because complexity is a minimum over all computations. And it may change through time.

You mean, it may depend on the context? Yes. A second occurrence of an event just requires a mere copy. It’s simple.

And yet, this second occurrence might be surprising if the event is rare. So it should carry much information. Yes. It was not totally clear to me. So I asked the teachers.

And...? I’ve been told that all this would become clear when we define algorithmic probability in Chapter 3.

That’s funny. Me, I’ve been told that I would understand all this in Chapter 5, when we talk about subjective information. Ok, let’s wait and see...

Inferring complexity from Web data

Problem: Is there a link between complexity and Web frequency?
As we saw, some integers, such as round numbers, are simpler than others. Does it affect their frequency on the Web? The picture below represents the number of hits (in 2009) declared by Yahoo and Google for requests consisting of numbers between 1000 and 2050 (the correlation for both search engines is 0.98; note the logarithmic vertical scale).


One can unambiguously see that larger frequencies on the Web correspond to "simple" numbers.

Question: stallion is more complex (and less frequent) than horse. What about quadruped?

Answer:   quadruped is indeed less frequent than horse. The point is that quadruped is more complex a concept than horse, even if a horse is a quadruped with additional characteristics (just ask children about horses and quadrupeds).

Some authors suggested the possibility of inverting the relation between complexity and frequency. The frequency of words in texts can be assessed using a search engine. Though it is difficult to infer the real number of pages indexed by search engines, it seems that dividing Google’s declared number of hits by \(2 \times 10^{12}\) gives a reasonable estimate of actual frequency. For instance, the word "spoon" can be considered to occur with a frequency of 0.02%, whereas the frequency of "book" is 0.87%.

If a word \(w\) occurs with a frequency \(f\), we get an estimate of the complexity \(K(w)\) of the corresponding concept through the formula:

\(K(w) \approx \log_2(1/f)\).

This formula, which holds up to an additive constant, will be further explored in Chapter 3. It serves as the basis for the use of \(\log_2(1/f)\) as an approximation of \(K(w)\).
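With the figures quoted above, the conversion is immediate (the \(2 \times 10^{12}\) divisor is the text's rough estimate, not an exact count):

```python
import math

# K(w) ~ log2(1/f): convert a declared hit count into a complexity estimate
def complexity_from_hits(hits, total_pages=2e12):  # divisor: rough estimate
    return math.log2(total_pages / hits)

# With the frequencies quoted above for "spoon" (0.02%) and "book" (0.87%):
print(round(math.log2(1 / 0.0002), 1), 'bits')  # -> 12.3 bits
print(round(math.log2(1 / 0.0087), 1), 'bits')  # -> 6.8 bits
```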
Forum: Web Complexity
You may look up these words on a search engine: rabbit, book, calligraphy, echinococcosis. Note the number of hits.
Convert these results into frequencies. Then convert these frequencies into complexities using the preceding formula. Compare your results (you may mention the search engine you are using).


Zipf’s law

Problem: Zipf’s law describes a spectacular phenomenon. Where does it come from?


Zipf’s law describes an absolutely remarkable and quite unexpected phenomenon.

It links the frequency of words to their rank by frequency.

The law was first discovered in 1928 by Edward Condon, and then by George Zipf.

What does the "Condon-Zipf law" say?
Zipf’s law:    \(r_f(w) \approx \frac{k}{f(w)}\)

or:    \(log(f(w)) \approx -log(r_f(w)) + c\)


\( f(w) \) is the frequency of the word \( w \) in the language, \( r_f(w) \) is the rank of \( w \) in the vocabulary sorted by frequency and \( c \) is a constant.

This means that the 10th most frequent word is 10 times as frequent as the 100th most frequent word, which is itself 10 times as frequent as the 1000th most frequent word, which is itself 10 times as frequent as the 10 000th most frequent word, and so on. Likewise, the 10th most frequent word is twice as frequent as the 20th most frequent word, four times as frequent as the 40th, and eight times as frequent as the 80th.

Why is it so?

This plot has been drawn based on the frequency of words in Alice’s Adventures in Wonderland.
Several similar plots are available in this compressed file (look into texts/_Zipf). Yes, it works for Chinese too!
You may try on any corpus in any language you may choose using .
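Such a check can be sketched in a few lines: under Zipf's law, \(\log f(w) + \log r_f(w)\) should stay roughly constant across ranks (the function name is ours):

```python
from collections import Counter
import math

# For the most frequent words, return (rank, count, log2(count) + log2(rank)):
# under Zipf's law the last column is roughly constant.
def zipf_check(text, top=20):
    counts = sorted(Counter(text.lower().split()).values(), reverse=True)
    return [(r, f, round(math.log2(f) + math.log2(r), 2))
            for r, f in enumerate(counts[:top], start=1)]

# Usage, with any corpus, e.g.:
# print(zipf_check(open('alice.txt', encoding='utf-8').read()))
print(zipf_check("a a a a b b c"))  # tiny smoke test
```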

If we plot the number of hits of a few words in a search engine against their frequency rank in the language, we get a linear tendency in log-log coordinates.
This conformity with Zipf’s law suggests that the Web, as seen through search engines, offers an accurate image of language use.

The data displayed above are available in this table
(format .xls, format .csv).

Word frequency ranks can be found in various Websites, such as this one, or this one or this one.
Forum: Checking Zipf’s law
Get frequency ranks for a few frequent and rare words (you may retrieve those ranks from one of the above sites).
Compare your measures.
Do they verify the inverse proportionality \(r \approx k/f\)?

Zipf’s law and Kolmogorov complexity

Zipf’s law is an empirical phenomenon. It is not easy to explain.

In 1936, George K. Zipf supposed that the frequency of words in language results from the strategy used by people during their mental search through the vocabulary to find appropriate words to express intended meanings. As a consequence, one may suppose that words are stored by decreasing frequency in our minds.

If so, Kolmogorov complexity may explain Zipf’s law: if words are stored by decreasing frequency, designating the word of rank \(r\) requires about \(\log_2(r)\) bits; combining this with \(K(w) \approx \log_2(1/f(w))\) gives \(\log_2(r_f(w)) \approx \log_2(1/f(w))\), that is, \(f(w) \approx k/r_f(w)\). This way, Kolmogorov complexity can be seen as acting as a go-between that links rank to frequency.

Zipf’s law
Let’s take two books of similar size, written in two different languages (we suppose they have no words in common). Will the two books, taken together, verify Zipf’s law?

    yes, because Zipf’s law applies to any text
    yes, because both word frequencies and ranks are affected by a factor two
    yes, because addition leaves inverse proportionality invariant
    no, because words with the same meaning in the two languages are equally frequent
    no, because word ranks no longer reflect their frequency of use
    no, because addition does not leave inverse proportionality invariant


Using AIT to detect similarity

Problem: Can we measure shared information using compression?
AIT principles, implemented through compression, are used in a variety of practical applications: rediscovering language genealogy without knowing linguistics, attributing authorship without knowing authors’ styles, determining concept similarity without any knowledge of semantics, finding out the SARS-CoV-2 phylogeny while ignoring biological knowledge, and more. We explore some of these techniques now.

Detecting language through compression





Intuitively, two texts T1 and T2, considered as mere strings, should share more information if they are written in the same language than when compared with a third text T3 written in a different language.
This is due to the greater redundancy when two texts share many words.
Conditional complexity is the most natural way to measure information sharing. We expect:

        C(T1 | T2) << C(T1 | T3)

For practical purposes, compression can be used as a proxy for complexity.
Using the Chain Rule, we can write:

        C(T1 | T2) ≥ C(T1, T2) − C(T2)

Measuring the output of compressor Z as a proxy for C, and using text concatenation instead of conjunction, we can estimate C(T1 | T2) by:

        length(Z(T1 + T2)) − length(Z(T2))

This measure of conditional complexity through joint compression may be used to represent the resemblance between objects such as texts.

In 2002, three scientists, Dario Benedetto, Emanuele Caglioti and Vittorio Loreto, were able to measure language proximity using a mere compressor such as Zip.

The language tree they were able to reconstruct with this estimate of conditional complexity matches the tree that linguists progressively developed over a century!

Let’s consider a set of reference texts Ti written in different languages. Such texts can be found in texts.zip. They have been stored in utf-8 format.
Then you may build a small text X (typically one-page long) in whatever language you want. This text will be used as a probe. You may cut-and-paste some text into your editor before saving it as X (instead of saving it directly from the browser) to eliminate html tags and links. Be sure it’s saved in utf-8 format.

We want to see if joint compression with the Ti can identify X’s language. To do so, we rank the Ti by increasing values of:

length(Z(Ti + X)) − length(Z(Ti))

where Z is your favourite compressor. It would be great if the Ti that comes first correctly identifies the language (or the author) of X.

Use the program to compute length(Z(Ti + YourText)) − length(Z(Ti)), where the Ti are the texts in various languages (make sure that YourText is small as compared to the Ti).
Forum: Joint compression
Indicate your results. Do they make sense?

You will then be able to compare with other students’ results.


NID: A universal distance

Problem: Can we define a concept similarity measure based on Algorithmic Information?
Complexity offers a natural measure of concept similarity. We may use conditional complexity K(x|y) to estimate how much x differs from y.
We can turn this measure into an operational method by estimating K using Web frequencies, as we did previously.    

Suppose you want to measure the conditional complexity K(rider | horse) between two concepts, rider and horse. You can get an estimate by computing log(1/f(rider | horse)). Conditional frequency f(rider | horse) can be computed the same way as conditional probability: f(rider | horse) = f(rider+horse)/f(horse).

Let’s perform some measures on a Web search engine:
- "rider" gets \(74 \times 10^6\) hits on Bing,
- "horse" gets \(72.5 \times 10^6\),
- "rider + horse" gets \(13 \times 10^6\).
The conditional frequency, according to Bing, is therefore f(rider | horse) = 13/72.5 = 0.18, and the conditional complexity K(rider | horse) = 2.47 bits.
By contrast, f(rider | the) = 71/13300 = 0.005 ≈ f(rider), and so K(rider | the) ≈ K(rider).
This makes sense: in a context in which "horse" has been mentioned, "rider" is less complex than in a context in which "the" has been mentioned (which amounts to no context at all).
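The computation above is easily reproduced (the hit counts are the document's Bing figures; rounding explains the small difference with 2.47):

```python
import math

hits_horse, hits_both = 72.5e6, 13e6  # quoted Bing hit counts

f_cond = hits_both / hits_horse   # f(rider | horse)
K_cond = math.log2(1 / f_cond)    # K(rider | horse), in bits

print(round(f_cond, 2), round(K_cond, 2))  # -> 0.18 2.48
```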

Question: Conditional complexity is asymmetrical. Can we define a genuine distance based on it?

Answer:   Yes, and this is what Paul Vitányi and his colleagues did in 2004 by defining the "Normalized Information distance" (NID).





In 1998, Charles Bennett and his colleagues introduced
the information distance, later normalized into the Normalized Information Distance (NID).

NID is a universal distance.

For any pair of objects, NID determines what is common to them, and only keeps their difference to measure the distance that separates them.

By doing so, NID can be regarded as "the most intelligent" distance!
NID is based on conditional complexity K(x|y).

Conditional complexity measures the amount of information contained in x that is not already present in y.

We’ll proceed from K(x|y) to NID(x,y) in three steps.

Step 1: symmetry.

Let’s make conditional complexity symmetrical, by considering max[K(x|y), K(y|x)] instead of K(x|y).

This will measure the amount of information that is not shared between x and y.
max[K(x|y), K(y|x)]

Step 2: normalization.

Suppose that two objects differ by n bits of information.

If n is significant compared to their complexity, then they are very different.

However, if the objects are complex, the same n bits represent a small amount compared to their complexity, and the two objects are then very similar.

Step 2: normalization.

To take this relative effect into account, let’s divide the similarity measure by the objects’ complexity.     $$ \frac{max[K(x|y), K(y|x)]}{max[K(x), K(y)]} $$


Step 3: chain rule.

The Chain Rule for K says that (up to an additive constant, neglected here): K(x, y) ≤ K(x|y) + K(y).

If we replace K(x|y) by K(x, y) – K(y) in the above expression, we get what Bennett and his colleagues defined as Normalized Information Distance, or NID. $$ NID(x, y) = \frac{K(x, y) - min[K(x), K(y)]}{max[K(x), K(y)]} $$

This distance has values between 0 and 1. Unlike usual distances that rely on predetermined computations, NID discovers for each pair {x,y} what x and y have in common.

For this reason, NID may be considered as the "most intelligent" distance one may think of.

This means that if two objects x and y are close to each other according to any given computable distance, then x and y will be found close to each other by NID.

See (Li, Chen & Li, 2004) for details.

The Google distance

Problem: Can we infer concept similarity by measuring conditional complexity on the Web?

From NID to NCD. The Normalized Information Distance (NID) remains an abstract notion, based on Kolmogorov complexity. If we substitute a mere compressor such as "zip" for Kolmogorov complexity, NID becomes NCD: the Normalized Compression Distance (here \(Z\) designates the length of the compressor output): $$ NCD(x, y) = \frac{Z(x, y) - min[Z(x), Z(y)]}{max[Z(x), Z(y)]} $$ The advantage of NCD is that it is computable and can be used in practice.

NCD has been used to compute distances between languages or distances between species based on DNA strings (Li, Chen & Li, 2004), and more recently to determine the phylogeny of the SARS-COV2 virus (Vitanyi & Cilibrasi, 2020) (see figure).
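As a sketch, NCD can be computed with zlib standing in for the compressor (the toy strings below are ours):

```python
import zlib

def Z(b: bytes) -> int:
    # length of the compressed version, used as a proxy for complexity
    return len(zlib.compress(b))

def ncd(x: bytes, y: bytes) -> float:
    return (Z(x + y) - min(Z(x), Z(y))) / max(Z(x), Z(y))

a = b"abcdefgh" * 50
b2 = b"abcdefgh" * 50          # identical content
c = bytes(range(256)) * 2      # unrelated content

# Identical strings yield a much smaller NCD than unrelated ones
print(round(ncd(a, b2), 2), '<', round(ncd(a, c), 2))
```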


Consider this Python program and its execution.

import zlib

# Opening four sample texts
SampleFiles = ['sample1.txt', 'sample2.txt', 'sample3.txt', 'sample4.txt']
Texts = [open(F, 'rb').read() for F in SampleFiles]

# compressing texts
ZSizes = [len(zlib.compress(T)) for T in Texts]
print('Compressed sizes:\t', ZSizes)

# co-compressing texts
print('Z(sample1 + sample2):\t', len(zlib.compress(Texts[0] + Texts[1])))
print('Z(sample3 + sample4):\t', len(zlib.compress(Texts[2] + Texts[3])))

$ SampleCompress.py
    Compressed sizes: [8247, 10793, 2151, 2829]
    Z(sample1 + sample2): 18022
    Z(sample3 + sample4): 4352

Which pair should be regarded as more similar according to NCD, and by how much?

    NCD(sample1, sample2) is smaller by 0.13
    NCD(sample3, sample4) is smaller by 0.13
    NCD(sample1, sample2) is smaller by 0.27
    NCD(sample3, sample4) is smaller by 0.27
    NCD(sample1, sample2) is smaller by 0.33
    NCD(sample3, sample4) is smaller by 0.33


Question: Can we go one step further and use the Web to define a complexity-based distance between concepts?

Answer:   Yes, and this is what Rudi Cilibrasi and Paul Vitányi did in 2007 by creating what they call the "Google distance".

From NCD to the "Google distance". We may use Web data to estimate K. Let’s start from the expression of NID: $$ NID(x, y) = \frac{K(x, y) - min[K(x), K(y)]}{max[K(x), K(y)]} $$ As we did to get an estimate of conditional complexity, we may replace K(x) by \(\log_2(1/f(x))\), where f(x) is the observed frequency of x on the Web, or rather by \(\log_2(N/g(x))\), where N is the corresponding total number of indexed pages and g(x) denotes the number of pages containing x. $$ NGD(x, y) = \frac{max[log(g(x)), log(g(y))] - log(g(x, y))}{log(N) - min[log(g(x)), log(g(y))]} $$ NGD stands for Normalized Google Distance, after the name of the search engine used by Cilibrasi and Vitányi. \(g(x, y)\) denotes the number of pages containing both \(x\) and \(y\), as reported by the Web search engine (for Google, it seems that \(N = 2 \times 10^{12}\) is a reasonable estimate, in relation to the declared number of hits).

Intuitively, NID is supposed to extract by whatever ideal means what is common to two objects. By estimating K using a Web search engine, as in NGD, the two objects are compared not through their common structural features, but through the human-created contexts in which they co-occur.
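For completeness, the formula can be typed directly (\(N = 2 \times 10^{12}\) is the text's rough estimate; the hit counts below are hypothetical):

```python
import math

# Normalized Google Distance from declared page counts g(x), g(y), g(x, y)
def ngd(g_x, g_y, g_xy, N=2e12):  # N: rough estimate of indexed pages
    lx, ly, lxy = math.log2(g_x), math.log2(g_y), math.log2(g_xy)
    return (max(lx, ly) - lxy) / (math.log2(N) - min(lx, ly))

# Hypothetical hit counts for two strongly associated words:
print(round(ngd(9e6, 8e6, 4e6), 3))  # a small distance
```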
Select all of the statements that are true.

Conditional complexity is replaced by \(K(x,y)-K(y)\) in NID.
Normalization consists of dividing by standard deviation.
In NGD, a Web search engine is used to count the number of differences in concepts’ definitions.
Conditional complexity is replaced by \(max[K(x,y), K(y,x)]\) in NID.
Normalization consists of dividing by the sum of the complexities.
NCD cannot be computed.
NGD cannot be computed.
In NGD, a Web search engine is used to retrieve the PageRank of concepts.
Conditional complexity is replaced by \(K(y) + K(x,y)\) in NID.
NID cannot be computed.


Forum: Search engine distance
Use the NGD to compare some distances between word pairs on your favourite search engine. Indicate the distances you get. You may take these pairs among:
    ‘Laurel’ and ‘Hardy’
    ‘complex’ and ‘simple’
    ‘artificial’ and ‘intelligence’
    ‘bicycle’ and ‘epistemology’
    or any word pair you might think of.






Take-home message

We hope you enjoyed this chapter on AIT and compression.
Let’s leave the last word to Angel@ and Alexandr@.

We saw that most objects can’t be significantly compressed.
That’s disturbing!

It means that AIT is concerned with only a small proportion of all possible objects. So what? It just means that most objects are boring.

And so, you mean that an AI program should pay attention only to compressible objects? Or only to objects that it is able to compress.

If I follow you correctly, Kolmogorov complexity can be used to measure intelligence. ??

Yes. Kolmogorov complexity is based on an ideal compressor. If making sense of data amounts to achieving compression, then KC offers an upper limit of what intelligent processing can do. Ah, OK... I see.
Personally, I felt more concerned by the practical side of it.

You mean, actual compression using Zip? Not only. It’s nice to know that we can assess complexity through frequency measures.

Yes. It connects Kolmogorov complexity to Shannon’s information. And using a search engine to compute semantic distance is really cool!

Ah, the story about the universal distance and the "Google distance"? Yes, nice!
But my favorite is still the story about Zipf’s law.
Oh yes. Incredible, isn’t it? The law is so simple: f = k/r. And it applies to everything we write!

It seems to make sense, though.
I mean, when you look at it in terms of minimal description.
Yes. It suggests that Kolmogorov complexity is not only a tool.
It explains things.

Test (1)
Most binary strings of size 1024 can be compressed by 10 bits.



Test (2)
In Huffman codes, code lengths are inversely proportional to probability.



Test (3)
In typical texts, the 10th most frequent word is 10 times more frequent than the 100th most frequent word.



Test (4)
The joint complexity \(K(w_1, w_2)\) between the semantics of two words \(w_1\) and \(w_2\) can be measured by \(-\log(g(w_1 + w_2))\), where \(g\) counts the number of occurrences in a search engine.



Test (5)
The normalized "Google" distance (NGD) can be in principle as large as \(log(N)/2\), where \(N\) is the number of pages indexed by the search engine.




Forum: Chapter2
Do you have feedback or suggestions about the topics addressed in this chapter?