SIR - Semantic Information Retrieval

Introduction.

SIR was an early AI program written in Lisp by Bertram Rapheal as part of his PHD work. I was exposed to the original program in the early 70's in my Artificial Intelligence class. The program inputs information from the user and builds a semantic network, that is, a network of objects connected by their relationships to each other. It can then, in turn, answer questions from the user making deductions from those same relationships.

Objects and Relationships

Consider the following dialogue

? the cat owns a collar
  I understand
? fluff is a cat
  I understand
? does fluff own a collar
  Not sure
? every cat owns a collar
  I understand
? does fluff own a collar
  Yes
?

The program parses each sentence, which must be in a fairly rigid format, and outputs a response. The way this is done will become more clear as we proceed.

There are basically these kinds of objects

Specific objects such as "fluff" or "the cat"

Generic objects such as "collar" and "cat" and "animal"

Relationships between objects are the following

A specific object may be a member of set of generic objects

A specific object may have multiple names (equivalence). "Dave" and "David" may refer to the same person.

A generic object may be a subset of another generic

A specific object may own (possess) another specific or generic object

Each member of a generic set may own another generic object

The original Lisp program as published in Shapiro's "Techniques of Artificial Intelligence" is about 13 pages of code and difficult. Several years ago I tried to basically translate it into Python but with unsatisfactory results. Recently, I took another look at the program with the idea of simplifying it somewhat and of using regular expressions to both parse the input and to express validations for relationship paths when answering questions. The result is a fairly simple program that is only a page and half of Python and should be fairly easy to understand.

Just enough regular expression syntax

I don't want to spend much time on regular expression syntax. There are lots of resources on the internet. We can see what our program uses with a single example.

>>> import re
>>> m = re.match ("(every|any|an|a) (.*) is (a|an) (.*)", "a man is a person")
>>> m
<_sre.SRE_Match object at 0xb7ef4cd8>
>>> m.groups()
('a', 'man', 'a', 'person')
>>>

The "re" module contains a "match" function which takes 2 parameters; a pattern and a string to match against the pattern. If the match is successful a "match" object is returned. The "groups" method returns a tuple with the parts matching subpatterns enclosed in parentheses. The "|" character makes alternative matches possible. The "." character matches any single character and finally the "*" character indicates that the preceding character (or group) may be present zero or more times.

That's basically all we need to know

A Quick Overview of the Code

At this point, bring the code for sir.py into another window or make a printout.

The first thing you'll notice is the "rules" table that matches sentence patterns with actions. Statements that are input result in calls to the function "addFact". Questions result in calls to the function "getPath". There are also simple commands to dump the facts table, toggle the debug flag, and exit the program gracefully.

As always, the program starts in the function "main" which reads input from either a file (or files) followed by input from the user. A file name of "." bypasses input from the keyboard which makes it convenient to do regression testing as the program is developed

The interesting bits will now be looked at in more detail

Relationship Paths

It's pretty easy to see how regular expressions can match our input statements and questions. They can also be used to describe valid chains of relationships used to answer questions.

Inputting a sentence like "every man is a person" adds a relation to our "database" of relations. This is actually just a list of relations ("facts"). Each relation is a tuple. In this case the tuple will be ('man','s','person') where 's' means 'subset'.

The function "matchSent" matches your input sentence against the patterns in "rules". If a match is found, the function "addFact" is called with 2 arguments; the groups from the match and a command string like "1s3|3S1". Commands in the string are separated by a "|". This string means add the relation "group(1) is a subset of group(3)" and add the relation "group(3) is a superset of group(1). Here are some more examples.

rules = (
 ("(every|any|an|a) (.*) is (a|an) (.*)",lambda g: addFact(g,"1s3|3S1")),
 ("(.*) is (a|an) (.*)",                 lambda g: addFact(g,"0m2|2M0")),
 ("(.*) is (.*)",                        lambda g: addFact(g,"0e1|1e0")),
 ...

Single characters to denote relationships. Lower case letters go one way and upper case the opposite. The meanings are as follows

"s" is subset, "S" is superset
"m" is member, "M" is contains
"e" is equivalent
"p" is "possess", "P" is "possessed by"

If we run the program in debug mode, the function "addFact" gives us more information than just the mysterious "I understand".

: python sir.py
? debug
? every man is a person
  adding fact ('man', 's', 'person')
  adding fact ('person', 'S', 'man')
  I understand
?

Things get interesting when we start asking questions. Leaving debug on, watch the following

? every man is a person
  adding fact ('man', 's', 'person')
  adding fact ('person', 'S', 'man')
  I understand
? fred is a man
  adding fact ('fred', 'm', 'man')
  adding fact ('man', 'M', 'fred')
  I understand
? is fred a person
  path -  fred  to  person
    path -  man  to  person
  Yes e*ms* ['ms']
?

Now "fred" is a member of the set "man" and "man" is a subset of "person". Finding a path from "fred" to "person" becomes a two step process that we can denote with "ms" (member-subset)

The rule matching "is fred a person" is

( "is (.*) (a|an) (.*)",   lambda g: getPath(g,"0e*ms*2"))

which results in a call to the function "getPath" finding a path from "fred" (g[0]) to "person" (g[2]) using the pattern "e*ms**". A path "ms" (member-subset) was found to link "fred" to "person". This is the simplest case for this pattern; there were no "equivalence" or additional "subset" steps necessary.

Moving along, let's add an equivalence at the front end and another subset on the back end to see the full pattern put to use.

?  every person is a creature of god
    adding fact ('person', 's', 'creature of god')
    adding fact ('creature of god', 'S', 'person')
    I understand
? the big guy is fred
    adding fact ('the big guy', 'e', 'fred')
    adding fact ('fred', 'e', 'the big guy')
    I understand
? is the big guy a creature of god
    path -  the big guy  to  creature of god
      path -  fred  to  creature of god
        path -  man  to  creature of god
          path -  person  to  creature of god
        path -  the big guy  to  creature of god
    Yes e*ms* ['emss']
?

Forward and backward paths

Consider the following dialogue from the file sir.1.txt

$ python sir.py sir.1.txt .
bill is a man
  I understand
is bill a person
  Not sure
bill owns a porsche
  I understand
does any person own a car
  Not sure
a man is a person
  I understand
does any person own a car
  Not sure
a porsche is a car
  I understand
debug
does any person own a car
  path -  person  to  car
    path -  man  to  car
      path -  bill  to  car
        path -  porsche  to  car
  Yes S*Me*ps* ['SMps']

Now we can see the reason for adding bidirectional rules. The path is 'SMps'. "person" is a superset of "man". "man" contains "bill" as a member (M relation), "bill" owns a "porsche", and finally, "porsche" is a subset of "car".

Efficiency in path searching

The function "path" is recursive and left to its own devices can conduct a very large search. It needs to be controlled a bit. A couple of techniques are used.

The "used" dictionary at each recursion level keeps the search from getting into a loop. It requires that each relationship added to a path is "fresh", not used somewhere earlier in the path. The problem is easy to see if we have "Joe is Joseph" and an "e*" in the validation pattern. Then we quickly get "Joe->Joseph->Joe->Joseph..." until we have a stack overflow. The following clips this possibility.

if used.get(fact) : continue

At each level of recursion "used" carries all the earlier relationships and adds its own. A new dictionary is created and is updated with the preceding relations already used. Backing out again our new entries are forgotten so that other paths may still be explored that use the same relationships in other ways.

The function "okSoFar" checks that as each relationship is added to the path the pattern is still partially matched. If that is not the case then we have started down a blind alley.

A final check involves the variable "indent" in the function "path" which grows with each level of recursion. Simply checking that it is not growing too big will eventually abandon any path caught up in an infinite loop.

With a little unix magic we can get a feel for the importance of these checks. Let's run the above dialogue but pipe the output to "grep" to filter out all lines not containing the word "path".

$ python sir.py sir.inp . | grep path
  path -  person  to  car
    path -  man  to  car
      path -  bill  to  car
        path -  porsche  to  car

Now let's pipe this output to the program "wc" (word count) to see the number of lines

python sir.py sir.inp . | grep path | wc
     4      20     112

So, 4 lines, 20 words, 112 characters. If you now comment out the lines

sts = okSoFar(pat, sofar+rel)
if not sts : continue

you will see the 4 lines jump to an 8. If you additionally comment out the line

if used.get(fact) : continue

you see the 8 jump to 232! The contraints are quite important.