The K-means Algorithm

The K-means algorithm is used to seek clusters in data. Clusters are formed from data points that are close to each other, but not so close to other clusters. The data may represent anything and in a world of commerce it is everywhere. For example, user preferences and finding clusters of like users can enable advertisers to target ads and offers selectively to individuals. Each ad or offer would be targeted to users in a given cluster and the ad itself modified to appeal to them.

As an simple and perhaps silly example lets consider the manufacturing and marketing of ketchup. Ketchup contains not only tomatoes and vinegar which are fairly healthy but also a lot of sugar and salt. Each ketchup buyer prefers a certain mix of salt and sugar and can specify exactly how much of each would be optimal for them. This example is nice because with only two variables the buyer can be simply assigned a spot on a 2D scatter plot.

The varieties of ketchup that can be produced is limited. When using the K-means algorithm we first choose a value for K which is this case will be number of ketchup varieties. Each variety also is a spot on the 2D plot corresponding to its salt and sugar density. Perhaps salt is on the X axis and sugar on the Y axis.

The K-means Algorithm in Action

As we start the scatter plot displays each buyer in lower case and each variety of ketchup in upper case. Initially, each buyer is assigned at random to a given variety. On the plot buyers are in the color of their current variety and to make the presentation a bit more dramatic, we connect the buyer to the variety with a line of the same color.

The following 3 pictures show the interactions of buyers and varieties. The Initial placement has buyers (lower case) scattered randomly in the plot and also four varietes (A, B, C and D) also scattered randomly. Since each buyer is assigned to a variety at random. The result is pretty chaotic.

In the second picture the buyers have been allowed to choose a more optimal variety and most do. But buyers f and j "decide" to stick with their original assignments. A great deal or order suddenly appears.

Following the swap, varieties A to D move to the mean of their buyer positions to be optimally close to their buyers. But then some buyers whose favorite variety moved away from them may decide to switch to another variety that is suddenly a better match to its preferences. In this example, after six pairs of shuffle-and-swap a stable configuration is reached; that is, no swaps occur after the last shuffle. This should an overall optimum placement. Interestingly, variety D lost all its buyers with the first swap but by the end has the most buyers. On the other hand B's three buyers stuck with it. All four varieties moved around quite a bit

images/km_001.png images/km_002.png images/km_012.png

To see this in action as a little movie, click on the animation link

Animated (return with the browser's Back button)

So, in summary, to create 4 varieties of ketchup that bring the most happiness to the most folks, create varieties that now have salt and sugar amounts shown in A, B, C, D in the scatter plot. Actually, this is overly optimistic.

Running kmeans

The images above can be produced by the simple command from a terminal window.

$ python kmeans.py seed=180

This will generate a 600x600 pixel presentation. To reproduce at half the size use

$ python kmeans.py seed=180 pixels=300

At the end you will be asked to "Hit Return to ext". You may have to click the terminal window before you can.

The seed= argument seeds the random number generator letting us reproduce exactly the same sequence. We always generate square frames and pixels= lets us specify the size of our window. Other flags you can specify are buyers= and varieties=. These numbers must be less or equal to 26, basically to fit in the alphabet. Try the following.

$ python kmeans.py varieties=3 buyers=26

These command line arguments look like arguments in Python function calls. This is all handeled in a small module sarg discussed below.

The Code

As with the other projects in this group, the sarg, camera and pgcon modules are described in the Visualization with Pygame If you haven't already done so, you may want to read about them first.

kmeans.py from the top down

Once the animated example is seen, the algorithm seems pretty clear and intuitive. And you can probably imagine how you might go about coding it. I'm going to approach my version more or less top-down. You can download the zip file with the necessary files in kmeans.zip

Two object classes, variety and buyer are pretty obvious.

class Variety :
   def __init__ (self, label) :
       self.x,self.y = randomXY()
       self.label = label
       self.buyers = []

   def attachBuyer (self, buyer) :
       self.buyers.append(buyer)

   def detachBuyer (self, buyer) :
       if buyer in self.buyers :
           self.buyers.remove(buyer)

   def centerToBuyers(self) :
       nBuyers = len(self.buyers)
       if nBuyers :
           xsum = ysum = 0.0
           for buyer in self.buyers :
               xsum += buyer.x
               ysum += buyer.y
           self.x = xsum/nBuyers
           self.y = ysum/nBuyers

   def distTo (self, buyer) :
       xdist = abs(self.x-buyer.x)
       ydist = abs(self.y-buyer.y)
       return math.sqrt(xdist*xdist + ydist*ydist)

A variety has an x/y position and a label. In addition it has a list of buyers that are attached to it. The methods attachBuyer and detachBuyer simply add or remove a buyer from the buyers list. The method centerToBuyers is also straight forward. The variety calculates the average x and y positions of all attached buyers and adjusts its postion accordingly. Of course, skip this step if there are no buyers attached.

class Buyer :
   def __init__ (self, label, varieties) :
       self.x,self.y = randomXY()
       self.label   = label
       self.variety = varieties[random.randint(0,len(varieties)-1)]
       self.variety.attachBuyer(self)

   def choose(self, varieties) :
       bestVariety = self.variety
       bestDist = bestVariety.distTo(self)
       for tryout in varieties :
           if tryout != self.variety :
               tryDist = tryout.distTo(self)
               if tryDist < bestDist :
                   bestVariety = tryout
                   bestDist    = tryDist
       if bestVariety != self.variety :
           self.variety.detachBuyer(self)
           bestVariety.attachBuyer(self)
           self.variety = bestVariety
           return True                # changed variety
       return False

The instance of the buyer class also is assigned a fixed x/y postion and a label. On initialization its initial variety is chosen at random. To make this easy, the varieties are created before the buyers.

The choose method allows a buyer to trade its current variety for another that is closer. It's a matter of simply looking at every alternative to find the bestVariety at the bestDistance. The method choose returns true if a swap took place.

Putting buyers and varieties through the paces

The following is the main engine driving the algorithm.

def kmeans(seed)  :
   from pgcon import Pgcon
   pgcon = Pgcon()
   labels = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
   blabels= labels.lower()

   varieties = [Variety(labels[i])        for i in range(nVarieties)]
   buyers = [Buyer(blabels[i], varieties) for i in range(nBuyers)]
   move = 0
   display(pgcon, labels, varieties, buyers, "Initial Placement")
   while True :
       changed = False
       move += 1
       for b in buyers :
           changed = b.choose(varieties) or changed
       display(pgcon, labels, varieties, buyers, "Move %d Swap"%move)

       if not changed : break

       for v in varieties :
           v.centerToBuyers()
       display(pgcon, labels, varieties, buyers, "Move %d Center"%move)

   pgcon.close(pause=True)

The Pgcon module built an instance pgcon that is simply passed through to the display function. Varieties are assigned their label from labels and buyers from blabels. The varieties and buyers are built efficiently using list comprehension. Then the Initial Placement displayed.

The while loop runs until changed no longer gets set to True. With each trip through the while loop all buyers are allowed to change their variety preference. If none do (and changed remains False) then we are done.

Displaying the changing scatter plot

The function display creates a screen image and perhaps saves it to disk as a png, which may later become a frame in a gif animation.

def getColor(label) :
   slot =  ord(label.lower())-ord('a')
   return colors[slot%len(colors)]  #nifty wrap-around

def display(pgcon, labels, varieties, buyers, banner) :
   pgcon.newScreen()

   for b in buyers :
       v = b.variety
       color = pgcon.lite(getColor(v.label))
       pgcon.lineDraw(color, (b.x,b.y),(v.x,v.y), 1)
       pgcon.textDraw(color, (b.x,b.y), b.label)

   for v in varieties :
       color = getColor(v.label)
       pgcon.textDraw(color, (v.x,v.y), v.label)
   pgcon.textDraw(WHITE, (pixels/2 ,20), banner)
   pgcon.camera.armed = True    # capture png to file if camera set up for it
   pgcon.writeScreen(.500)

Though we haven't looked at the Pgcon class yet, using it should be quite straight-forward. Each buyer is displayed by its label (a lower case letter) at its x/y position along with a line connect it to its current variety. The color of the buyer and line matches the color of the variety but at half the intensity. This makes each cluster stand out from the others.

Launching the program

Finally, at the bottom of kmeans.py we launch the function kmeans with a seed for generating random numbers. A seed may be provided on the command line with an argument like "seed=180". If none is supplied one is chosen from 899 possibilities. Being able to replay a run by providing a seed is pretty important for debugging.

if __name__ == "__main__" :
   seed    = sarg.Int("seed", 0)
   if seed : random.seed(seed)
   else :
       seed = random.randint(101,999)
       print "Seed is", seed
       random.seed(seed)
   kmeans(seed)

Here is a sample run creating the gif image you saw at the top

$ python kmeans.py pixels=300 prefix=km maxpics=40 gif=1 seed=180
Just captured frame: 1, km_001.png
Just captured frame: 2, km_002.png
Just captured frame: 3, km_003.png
Just captured frame: 4, km_004.png
Just captured frame: 5, km_005.png
Just captured frame: 6, km_006.png
Just captured frame: 7, km_007.png
Just captured frame: 8, km_008.png
Just captured frame: 9, km_009.png
Just captured frame: 10, km_010.png
Just captured frame: 11, km_011.png
Just captured frame: 12, km_012.png
Hit Return to exit
Making uniform GIF
km.gif has 12 frames

All of the code files can also be obtained as kmeans.zip A images/ directory is included that contains the png files and the resulting gif file in the example at the top. You can use an image viewer or browser to view a slide show with the pngs or the little movie with the gif. Have fun.