Telecom Paris
Dep. Informatique & Réseaux

J-L. Dessalles

July 2021

Cognitive Approach to Natural Language Processing (SD213)



This is the advanced lab on parsing.
If you need to refresh your memory on the subject, or if you have never dealt with parsing before, you are invited to visit the light version first.


There is a trade-off between the expressiveness of a grammar formalism and the efficiency of parsing: the more restricted the formalism, the easier the parsing. Context-free grammars provide a good trade-off. Here is an example of a tiny grammar of English.

% partial elementary English grammar

% --- grammar
s --> np, vp.
np --> det, n.        % Simple noun phrase
np --> np, pp.        % Noun phrase + prepositional phrase
vp --> v.     % Verb phrase, intransitive verb
vp --> v, np.        % Verb phrase, verb + complement: like X
vp --> v, pp.        % Verb phrase, verb + indirect complement : think of X
vp --> v, np, pp.    % Verb phrase, verb + complement + indirect complement : give X to Y
vp --> v, pp, pp.    % Verb phrase, verb + indirect complement + indirect complement : talk to X about Y
pp --> p, np.        % prepositional phrase

% -- lexicon
np --> [kirk].
det --> [the]; [my]; [her]; [his]; [a].
det --> [some].
n --> [dog]; [daughter]; [son]; [sister]; [aunt]; [neighbour]; [cousin].
v --> [grumbles]; [likes]; [gives]; [talks]; [annoys]; [hates].
v --> [cries].
p --> [of]; [to].
p --> [about].

These clauses can be directly interpreted by Prolog to parse sentences. As it stands, Prolog’s naive top-down parser is however not robust enough to deal with natural language, as we will see.
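For instance, once the rules above are saved in a file (Grammar.pl, say) and loaded into SWI-Prolog, a sentence given as a list of words can be parsed by querying the corresponding non-terminal with an empty remainder:

```prolog
?- ['Grammar'].                                % load the DCG rules above
true.

?- s([the, cousin, hates, her, sister], []).   % a full sentence
true .

?- np([the, dog, of, my, aunt], []).           % a noun phrase containing a pp
true .
```

Beware, however, that the left-recursive rule np --> np, pp makes this naive top-down parser loop forever on some queries that should merely fail, such as s([the, dog, the], []). This lack of robustness is one reason for moving to chart parsing.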


The role of parsing is to analyse the syntactic structure of sentences. The meaning and relevance of language crucially depend on syntactic structure. For instance, the ambiguity of a sentence such as "She looks at the boy with a telescope" (is she using a telescope, or does the boy hold a telescope?) depends on whether the phrase "with a telescope" is a complement of "the boy" or of "looks". The following ambiguity is generated by the parser we will be using here. It is due to the fact that the small grammar given as input allows the verb "talk" to be used with any preposition (here "of" instead of "about").

Syntactic interpretation 1 ("of her sister" attached to "the neighbour"):

s
|__np
|  |__det : the
|  |__n : cousin
|__vp
   |__v : talks
   |__pp
      |__p : to
      |__np
         |__np
         |  |__det : the
         |  |__n : neighbour
         |__pp
            |__p : of
            |__np
               |__det : her
               |__n : sister

Syntactic interpretation 2 (accepting ‘talk...of': "of her sister" attached to "talks"):

s
|__np
|  |__det : the
|  |__n : cousin
|__vp
   |__v : talks
   |__pp
   |  |__p : to
   |  |__np
   |     |__det : the
   |     |__n : neighbour
   |__pp
      |__p : of
      |__np
         |__det : her
         |__n : sister

Parsing is a challenging operation, due not only to alternative syntactic structures (as in the above example), but also to the fact that many words are ambiguous. For instance, "flies" can be a noun or a verb, "like" can be a preposition or a verb, and so on. The accumulation of ambiguities generates combinatorial explosion, amplified by semantic ambiguities (e.g. "fly" may mean "to pilot" or merely the insect). It is therefore crucial to have efficient parsing methods. The algorithm studied below is based on one of the best-known parsing methods. It parses sentences of n words in O(n³) steps.

Top-down vs. bottom-up parsing

Naive top-down and bottom-up parsing methods are utterly inefficient: they re-analyze the same phrases each time they try a new rule. Inefficiency and endless looping are due to the fact that these naive methods are amnesic. Here is a partial trace of a top-down analysis showing the amnesic character of the method.

?- trace.
?- np([the,dog,of,the,neighbour,of,the,cousin],[]).
Call: (6) np([the, dog, of, the, neighbour, of, the, cousin], []) ?
Call: (7) det([the, dog, of, the, neighbour, of, the, cousin], _G6055) ?
Exit: (7) det([the, dog, of, the, neighbour, of, the, cousin], [dog, of, the, neighbour, of, the, cousin]) ?
Call: (7) n([dog, of, the, neighbour, of, the, cousin], []) ? skip
Fail: (7) n([dog, of, the, neighbour, of, the, cousin], []) ?
Redo: (6) np([the, dog, of, the, neighbour, of, the, cousin], []) ?
Call: (7) det([the, dog, of, the, neighbour, of, the, cousin], _G6055) ?
Exit: (7) det([the, dog, of, the, neighbour, of, the, cousin], [dog, of, the, neighbour, of, the, cousin]) ?
Call: (7) n([dog, of, the, neighbour, of, the, cousin], _G6055) ? skip
Exit: (7) n([dog, of, the, neighbour, of, the, cousin], [of, the, neighbour, of, the, cousin]) ?
Call: (7) pp([of, the, neighbour, of, the, cousin], []) ?
. . .

The parsing method studied below uses a smart method to record any partially recognized phrases. It thus avoids repeating the same analyses again and again.


Prolog is a unique programming language with the following main features:
  - programs are sets of logical clauses rather than sequences of instructions (declarative style);
  - computation proceeds by unification of terms;
  - alternative solutions are explored through automatic backtracking.

These characteristics are valuable for natural language processing.
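As a minimal illustration (a hypothetical two-fact database), unification binds variables while backtracking enumerates alternative solutions:

```prolog
parent(tom, ann).     % facts are logical clauses, not instructions
parent(tom, joe).

% ?- parent(tom, X).
% X = ann ;           typing ';' asks Prolog to backtrack
% X = joe.
```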

We will be using SWI-Prolog. It is perhaps already present on the machine you are using. Try to run it, usually through the command swipl, sometimes pl, or simply by clicking on a file with extension .pl. Otherwise, download it from the SWI-Prolog website.

Once in Prolog, you can load a program such as Grammar.pl by typing ['Grammar']. (or sometimes by giving Grammar.pl as an argument to the swipl command, or merely by clicking on Grammar.pl).

Bottom-up Chart parsing

Chart parsing is a clever way to memorize partial results obtained during parsing, so as to avoid redoing the same computations endlessly. The chart parsing method presented here is a bottom-up procedure, based on the CYK algorithm. For a corresponding top-down method, see Earley’s algorithm. These are among the most classical parsing methods. A description (in French) of chart parsing can be found in this text by François Yvon, sections 12.2.2 & 12.2.3.

Chart parsing consists in keeping and updating a table of well-formed phrases. The table is used to remember recognized phrases and to propose hypotheses about the next phrases to be parsed.
Location (i,j) in this table contains the list of phrases that start at word i, extend to word j, and are already recognized or currently under analysis. The point of the algorithm is to build this table dynamically.

Rather than as a table, the current state of the algorithm is more conveniently depicted as a graph, where the edges connect locations between words (see figure). An edge in this graph is associated with a phrase which is about to be analysed, or is partially or fully parsed. Once the phrase is parsed to the end, the edge is said to be inactive. It is active otherwise. Edges are labelled by ‘dotted rules’. A dot is used to mark the point that the analysis of the phrase has reached for the edge. For instance, np → •det n from node 1 to itself designates a candidate phrase that is not yet parsed. np → det •n from node 1 to node 2 marks the fact that the first word of the sentence, "my", has been parsed as a determiner and that the analysis of the candidate np phrase has progressed one step further (as compared with the initial edge). Lastly, the presence of the inactive edge np → det n• from node 1 to node 3 stores the fact that the three first words of the sentence have been parsed as an np.


Three main mechanisms are used:
  - lexical edge creation: each word of the input yields an inactive edge for its lexical category;
  - generation: whenever a phrase of category X has been recognized, a new active edge is created for every rule having X as its left corner;
  - extension: an active edge waiting for a phrase of category X is extended by an adjacent inactive edge of category X.

Example :
  1. I encounter "my" : it appears as a det in the grammar; I create an inactive edge (1,2) labelled det → 'my'•.
  2. det corresponds to the ‘left corner’ of the rule np → det, n; we insert an active edge (1,1), labelled np → •det n.
  3. an existing inactive edge labelled det can extend the newly created edge; we insert a new active edge (1,2) labelled np → det •n; this edge is an extension of the previous one.
  4. moving forward, we encounter the word "dog"; we insert an inactive edge labelled n.
  5. the newly created inactive edge can extend the edge inserted in 3. This extension generates an inactive edge corresponding to an np phrase.
  6. a new phrase has been parsed; new hypotheses can be generated from rules where that phrase appears at left-corner position (here s → •np vp and np → •np pp).
  7.  . . .


To build the graph dynamically, we use the Prolog database (using assert). Edges are stored as edge(StartPos, EndPos, Cat, Done, Rest, T). Such an edge indicates that a candidate phrase of type ‘Cat’ may start from ‘StartPos’. The seven arguments are:[Note]

When processing the preceding example, edges are created and produce this kind of trace:

Lexical edge: 1->2     det --> my .     det(my)
Generating edge: 1->1     np --> . det n     np
Extending edge: 1->2     np --> det . n     np(det(my))
Lexical edge: 2->3     n --> dog .     n(dog)
Extending edge: 1->3     np --> det n .     np(det(my),n(dog))
Generating edge: 1->1     s --> . np vp     s
Extending edge: 1->3     s --> np . vp     s(np(det(my),n(dog)))
Lexical edge: 3->4     v --> barks .     v(barks)
Generating edge: 3->3     vp --> . v     vp
Generating edge: 3->3     vp --> . v np     vp
Extending edge: 3->4     vp --> v .     vp(v(barks))
Extending edge: 1->4     s --> np vp .     s(np(det(my),n(dog)),vp(v(barks)))
Extending edge: 3->4     vp --> v . np     vp(v(barks))

The program chartparser.pl provides the outline of the parser. It uses util.pl that provides input/output routines. The program reads a file containing the grammar (e.g. Grammar.pl) and converts the DCG rules such as np → det, n. into predicate form rule(np, [det, n]).
Execute chartparser using SWI-Prolog (ignore warnings at this point). Type ask('Grammar.pl'). (or replace by the file name of your grammar), then enter a sentence that should be parsed by the grammar, such as: "the cousin hates her sister". As you can observe, the program as it stands recognizes words and does nothing more than generating lexical edges.

At this point, the parser gets terminals (words), generates lexical edges, and that’s it. We must add mechanisms to generate and extend hypotheses.

Generating edges

Look at the third clause of add_edge. Each time a phrase has been recognized, i.e. whenever an inactive edge has been recorded, add_edge should also insert new active edges that start from that new edge. For instance, the edges s → •np vp and np → •np pp should be generated as soon as np → det n• becomes inactive and is recorded.
We want to generate new edges (i.e. the ‘self-node edges’, that is, the blue lines in the graph). Complete the third clause of add_edge (the one with the "% generation" comment) and paste it below.
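One possible shape for this clause, as a sketch only: the predicate and argument order follow the edge(StartPos, EndPos, Cat, Done, Rest, T) term and the rule/2 representation described above, but the actual clause must fit the skeleton provided in chartparser.pl:

```prolog
% generation: an inactive edge for Cat (empty Rest) spawns, for every
% rule having Cat as its left corner, a fresh self-node active edge
add_edge(Start, _End, Cat, _Done, [], _Tree) :-
    rule(Mother, [Cat | Expected]),
    \+ edge(Start, Start, Mother, [], [Cat | Expected], _),  % avoid duplicates
    add_edge(Start, Start, Mother, [], [Cat | Expected], Mother),
    fail.   % explore all matching rules, then fall through to the next clause
```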


Run the program and verify that it correctly generates appropriate edges for a sentence such as "the cousin hates the neighbour of her sister".

Extending edges

The next thing to do when an inactive edge for category X is added is to extend all active edges which are waiting for X, i.e. those for which the first element in the list of elements still to be found is X. Note that this operation may lead to new edges becoming inactive, thus triggering a chain reaction of new edge generations and extensions.

Complete the fourth clause of add_edge (the one with the "% extension" comment) and paste it below.
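Again as a sketch under the same assumptions (the tree-combining step is left abstract, since it depends on how chartparser.pl represents partial trees):

```prolog
% extension: a new inactive edge (Start, End, Cat) extends every active
% edge ending at Start that is waiting for Cat
add_edge(Start, End, Cat, _Done, [], Tree) :-
    edge(Start0, Start, Mother, Done0, [Cat | Expected], T0),
    combine(T0, Tree, T1),   % hypothetical helper attaching Tree to T0
    add_edge(Start0, End, Mother, [Cat | Done0], Expected, T1),
    fail.   % extend all waiting edges, then fall through
```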


Run the program and verify that it is now able to parse simple sentences.

Extending current active edge

At this point, we have not yet used the full power of memory to accelerate parsing. When a phrase is fully recognized, i.e. when an edge E becomes inactive, we want to check among existing edges whether, by chance, another edge E’ could be extended by E, in which case we can generate the extended edge right away. This spares us the burden of performing that extension step by step.

To complete our parser accordingly, we need to check whether the newly added edge can be right away extended.

Add a penultimate clause to add_edge (just before the catch-all clause) to extend the current edge. This clause should scan all inactive edges to see if they match the first element in the list of waiting elements. If so, one should add an edge that spans one more phrase than the current edge. Note that this new clause more or less mirrors the third clause of add_edge. Paste the clause below.
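Under the same assumptions as the earlier sketches, this penultimate clause could look as follows (note how it mirrors the extension clause: here the new edge is the active one, and the chart is scanned for inactive edges):

```prolog
% extension of the current active edge by already-recorded inactive edges
add_edge(Start0, Start, Mother, Done0, [Cat | Expected], T0) :-
    edge(Start, End, Cat, _, [], Tree),   % an inactive Cat edge starts here
    combine(T0, Tree, T1),                % hypothetical tree-combining helper
    add_edge(Start0, End, Mother, [Cat | Done0], Expected, T1),
    fail.   % try all matching inactive edges, then record the edge itself
```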



The parser is now complete. The mere call of add_lexical_edges triggers a chain reaction that generates all edges that can possibly be added. Note that parse checks for the existence of an edge labelled s spanning the entire input sentence. If such an edge exists, the sentence is correct with regard to the grammar.
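The acceptance test described here could be sketched as follows (a sketch only; the actual parse predicate is already provided in chartparser.pl):

```prolog
% a sentence of N words is accepted if an s edge spans positions 1 .. N+1
parse(Sentence) :-
    length(Sentence, N),
    End is N + 1,
    edge(1, End, s, _Done, [], Tree),   % inactive s edge over the whole input
    write(Tree), nl.
```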

Test the program (you may cancel tracing). Provide output on a syntactically ambiguous sentence (if you change the grammar, paste the new DCG rules).


Naive top-down and bottom-up algorithms were unsatisfactory, not only because of their inefficiency, but also because they were prone to endless looping.

Experiments (1)
Experiment with the algorithm. Try to parse a sentence containing an unknown word.


Experiments (2)
Insert a faulty rule such as np --> np. into the grammar. Describe possible problems.




[Note] In SWI-Prolog, edge may need to be declared as a dynamic predicate before being asserted:
:- dynamic(edge/7).

