Binary Trees

Introduction

There are many ways to hold a collection of data in Python and most other modern languages. But depending on what we want to with the collection some approaches are much better than others.

In this project we'll look briefly at simple lists, linked lists and finally binary trees. In each case we want to be able to do the following efficiently.

  1. insert a new data item (value) into the collection
  2. see if a given value is in the collection
  3. remove a value from the collection

It usually helps if the collection is sorted so that searching is optimized. Most of the time searching for items is much more frequent that inserting or removing items.

We're going mostly assume that the collection needs to be sorted. So must be possible to compare items with the operators less than, equal to, and greater than

For simplicity, we will also assume the following

  1. Values in the collection are unique.
  2. Values will be simple integers. In a real situation the values might be pointer to data structures or records in a database

Simple Python List

The easiest way to make a collection is to simply use the built-in Python list.

>>> collection = [3,6,17,9]
>>> collection
[3, 6, 17, 9]

Here the data here is not sorted. We can insert a new value with the append method

>>> collection.append(5)
>>> collection
[3, 6, 17, 9, 5]

The second requirement to see if a given item is in the collection is trivial

>>> 9 in collection
True
>>> 8 in collection
False

And if Python didn't have the in operator, we could create function like the following. Notice that this is a recursive function. It could also be written with a while or for loop. But this recursion will gently lead into more complex structures later.

>>> def contains(collection, value) :
...     if not collection           : return False    # collection empty
...     elif value == collection[0] : return True     # value is in the first slot
...     else : return contains(collection[1:], value) # value is somewhere futher?
...
>>> contains(collection, 5)
True
>>> contains(collection, 10)
False

If the collection is sorted then average search time can be cut in half.

>>> def contains(collection, value) :
...     if not collection           : return False
...     elif value == collection[0] : return True
...     elif value <  collection[0] : return False     # only if sorted
...     else : return contains(collection[1:], value)

The third requirement to remove a value can be done by splicing around the value. Here we'll use a for loop to find the split point.

>>> def remove(collection, value) :
...     for i in range(len(collection)) :
...        if value == collection[i] :
...            return collection[0:i]+collection[i+1:]  # remove and rejoin

Linked Lists

Linked lists consists of nodes containing both a value and a pointer to the next node. Linked lists only make sense when they are sorted.

We can make nodes using a simple 2 element Python list for each node. The first element in the node is the value and second element points to the next node. In the last node the second element is the value None. Here is an example.

>>> import lists
>>> print lists.sampleList     # linked list is a nested
[10, [20, [30, [40, None]]]]
>>>

The Python function id will show the memory address of any value.

>>> a = 541
>>> id(a)         # shows value *541* resides in memory at 36300744
36300744
>>> b = a
>>> id(b)
36300744          # b == a since they point at the same value
>>> id(b), b
(36300744, 541)
>>> a = 542       # changing the value of *a*
>>> id(a), a
(36300768, 542)

Function lists.showLinks uses the system id function to show the links. Notice that even the value None has a home at 9392928.

>>> l = lists.sampleList
>>> lists.showLinks(l)              # id of node, value, id of next node
140107872758600: 10 140107872758672
140107872758672: 20 140107872685088
140107872685088: 30 140107872683864
140107872683864: 40 9392928

code for founction contains. Almost same as in section 1. Here is a function contains to walk a linked list

18 def contains(l, value) :
19     if   l == None     : return False
20     elif l[0] == value : return True
21     else               : return contains(l[1],value)
22

The traditional way to insert a node into a sorted list is manipulate the pointers. We have to consider 3 possible conditions. Will the node go at the beginning of the list, or the end or somewhere in the middle.

23 def insert (nod0, value) :
24     if   nod0   == None  : return [value, None]    # new list created
25     elif nod0[0] > value : return [value, nod0[1]] # insert as first node
26     else :
27         nod = nod0
28         while nod[1] :
29             if nod[1][0] >= value  :   # next node in the chain?
30                 ins = (value, nod[1])  # build a new node and
31                 nod[1] = ins           # squeeze it in
32                 break                  #
33             else : nod = nod[1]
34         if nod[1] == None : nod[1] = (value, None) # becomes the tail
35         return nod0
36

Using a more functional approach saves a great deal of code. In fact, just a single line for each condition mentioned above.

37 def insert2(l, value) :
38     if   l   == None  : return [value, None]
39     elif l[0] > value : return [value, l]
40     else              : return [l[0], insert2(l[1], value)]
41

It is critical that you understand what is happening in line 40. It replaces lines 26-35 in the earlier version. Take as long as need.

The function lists.showList prints the list in a more natural way. The last 4 digits of each nodes id and the nodes value, seperated with a :. Here is the code followed by a sample run

10 def showList(l, accum="") :
11     if l == None : return accum
12     else :
13         if accum : sep = " --> "
14         else     : sep = ""
15         accum += sep + "%d:%s" % (id(l)%10000,l[0]) # include last 4 dig of id
16         return showList(l[1], accum)
17
>>> l1 = lists.sampleList
>>> print l1
[10, [20, [30, [40, None]]]]
>>> l2 = lists.insert2(l1, 25)
>>> print l2                    <<<<<<<<<<<<<<< need answer
>>> print lists.showList(l1)
7472:10 --> 8696:20 --> 2280:30 --> 2208:40
>>> print lists.showList(l2)
3728:10 --> 3440:20 --> 2792:25 --> 2280:30 --> 2208:40
>>>

Notice please. l1 and l2 share the same nodes after the insertion point (value 25). l2 has a new front-end. This sharing of "tails" will prove to be a key feature for the efficiency of binary trees (coming up)

There is a test function that is invoked like the following

>>> import lists
>>> lists.testFunc(lists.insert2)
<function insert2 at 0x7f5a2d3f9848>
[10, [20, [30, [40, None]]]]
7456:10 --> 7096:20 --> 3512:30 --> 2288:40

input value: 25
[10, [20, [25, [30, [40, None]]]]]
7672:10 --> 7528:20 --> 7600:25 --> 3512:30 --> 2288:40

Tuples for linked lists

With the functional versions of insert, contains and delete, we no longer modify existing nodes. The 2 element lists can be replaced with 2 element tuples. This is a good idea for two reasons. One, tuples use much less memory than lists and two, not being able to modify nodes is a very good thing. Making modifications to list structures can lead to subtle bugs, especially if lists are shared by concurrent processes.

The python programs lists.py and tuples.py can be viewed seperately. They are quite similar.

zip file.

Binary Trees

Binary trees are like linked lists except that they are two dimensional. Each node has a value and two pointers to sub-trees, one left and one right. Both sub-trees can either be the value None or be a root node to another binary tree. The left subtree contains only nodes whose values are less than the parent node. The one on the right only nodes whose values are greater.

For convenience in this discussion, when referring to a node I will use its value to identify it. So "node 70" is the node containing the value of 70. Of course, just like in linked lists, values don't have to be just integers.

A binary tree is a recursive data structure. Here is an example. Notice that in every subtree values on the left are less than the parent above and those on the right are greater.

images/tree-01.png

The python representation of this tree is the following.

(50, (30, (20, (10, None, None), None), (40, None, None)),
     (80, (70, (60, None, None), None), (90, None, None)))

You can see how nodes 20 and 70 have no right hand subtree and how nodes 10, 40, 60 and 90 have no subtrees at all.

The tree is balanced. Although there are nine nodes in the tree each node can be reached from the root in three steps or less.

As with linked lists, we want to be able to do the following; find if the tree contains a value, insert a new value and delete a value. Of course, I'm really refering to a node with that value.

The code for finding a value is simple and looks like the following.

34 def contains(tree, target) :
35     if tree == None : return False
36     (value,left,right) = tree
37     if  value == target : return True
38     elif value > target : return contains(left,target)
39     else                : return contains(right,target)
40

Function contains starts at the root node and traverses the tree with each step either to the left or right. If it lands on the node with the target value, it returns True. Reaching the end returns False. Either value is passed right back up through the series of returns popping out at the top.

>>> from bintree import contains
>>> mytree = (50, (30, (20, (10, None, None), None), (40, None, None)),
...         (80, (70, (60, None, None), None), (90, None, None)))
>>> contains(mytree, 60)
True
>>> contains(mytree, 65)
False

Inserting a new node

Inserting a new node requires finding the right place for the insertion and then building a linked list in the nested returns that mimics the original but in fact creates a new tree that will mostly merge with the old one.

24 def insert(tree, newValue) :
25     if not tree : return (newValue,None,None)
26     (value,left,right) = tree
27     if value == None     : return (newValue,None,None)
28     if value == newValue : return tree  # already there
29     elif newValue < value :
30         return (value, insert(left,newValue), right)
31     else :
32         return (value, left, insert(right,newValue))
33

With each return using lines 25,27,28,30,32 we create a new node linked to those below it. Here's is an example of inserting a value of 65.

images/tree-02.png

The nodes in yellow are from the original tree. The red nodes leading from our new value back to the top are new nodes built with the values of the original returned through the recursion. The original tree still exists and if saved we can show this

Starting again from the first tree above we remove node 40. This requires new nodes for 30 and 50 with the yellow nodes still the original ones. Lines 30 and 32 above mimic line 26 in tuples.py

>>> mytree
(50, (30, (20, (10, None, None), None), (40, None, None)),
(80, (70, (60, None, None), None), (90, None, None)))
>>>
>>> oldtree = mytree
>>> from bintree import insert
>>> newtree = insert(oldtree, 65)
>>> newtree
(50, (30, (20, (10, None, None), None), (40, None, None)),
(80, (70, (60, None, (65, None, None)), None), (90, None, None)))
>>> oldtree
(50, (30, (20, (10, None, None), None), (40, None, None)),
(80, (70, (60, None, None), None), (90, None, None)))

So, no side-effects.

Deleting nodes

Deleting a node is somewhat more complex. The following shows the effect of deleteing node 40.

images/tree-03.png

Basically, when the root of a subtree is deleted, one of other nodes must take its place. The new root node must maintain the conditions. The minimum node in the right subtree works fine. Why? Its value is still larger than anything in the left node and smaller than any other node on the right:. An even simpler process can be used if either subtree is empty. Here is the code.

41 def delete(tree, target) :
42     if tree == None : return None
43     (value, left, right) = tree
44     if target == value :  # delete this node
45         if left and right :
46             swp = minval(right)
47             print "Swap val", swp
48             return (swp, left, delete(right,swp))
49         if left  == None : return right
50         if right == None : return left
51     elif target < value :
52         return (value, delete(left,target), right)
53     elif target > value :
54         return (value, left, delete(right,target))

57 def minval(tree) :
58     (val,left,right) = tree
59     if left == None : return val
60     else : return minval(left)
61

If we have found the node to delete (line 44) and if both left and right branches are present then we find the minimum value on the right side using the function minval. A new node is created (line 48) and a new branch built (in red) back to the top.

If either the left or right branch is empty then the other branch can simply take the place of the node deleted (lines 49,50). Lines 51 through 54 traverse down the tree to find the node to delete.

Balancing a tree

The optimum tree can be very time efficient. In a balanced tree with 1024 nodes every node is within 10 steps of the root. Deletes and Inserts also impact the tree by only rebuild new paths of this length. But this efficiency can be lost if the sequence of inserts is not random.

In the next picture 2 nodes have been inserted. They leave the tree somewhat out of balance. Instead of all nodes being within 3 step from the root, node 62 is five steps away

images/tree-04.png

After rebalancing, all nodes are again within 3 steps from the root

images/tree-05.png

Below is the code to make this happen. The idea is to convert the tree into a Python list that is sorted. Now the optimum root for the binary tree is in the middle. Take values to the left and create a sub-tree for the left link of our root and those values to the right for the right link. Do this recursively and tree simply unfolds.

The function values builds the sorted Python list. Sorting is automatic in how the tree is traversed recursively. Function balance2 is also recursive and a sub-root node is inserted at each iteration.

07
08 def values(tree) :
09     if not tree : return []
10     (value,left,right) = tree
11     return values(left) + [value] + values(right)
12
13 def balance(tree) :
14     return balance2(None, values(tree))
15
16 def balance2(btree, sorted) :
17     if not sorted  : return btree
18     split = len(sorted)/2
19     btree = insert  (btree,sorted[split])     # midpoint
20     btree = balance2(btree,sorted[0:split])   # left side
21     btree = balance2(btree,sorted[split+1:])  # right side
22     return btree
23

If the number of nodes is 2^n-1 for some number n then the tree will be perfectly balanced. Otherwise there will be a partially filled in row at the bottom from right to left. What change in line 18 will cause the row to be filled from left to right?

The graphics are produced from test.py which in turn uses displayTree.py. You can run tree.py if you have pygame installed. Download and unzip bintree.zip to tie into the pygame system.

$ python test.py
Type number or b or x: 25

Typing a number will insert or delete an existing node or insert a new one. Typing b will rebalance the tree. And x will exit.

The Wrapup

Recursive data structures should always be handled with recursive algorithms. With modern languages like Python this is generally not too difficult. But there is a learning curve before you can do it easily. Even so, it is definitely worth the effort. I once read a quote somewhere about functional programming that went something like

"I've never spent so much time writing so little code that does so much".

But with practice the "much time" becomes "just a little time".

If you have comments or suggestions You can email me at mail me

Copyright © 2019 Chris Meyers

.