The K-means Algorithm


The K-means algorithm seeks out clusters of values in data.

In a simple (and perhaps silly) example let us consider the manufacturing and marketing of ketchup.

Note that this example (with only two variables) is nice because we can easily place each buyer on a 2D scatter plot.

When using the K-means algorithm, we first choose a value for K, which is this case, will be number of ketchup varieties.

The K-means Algorithm in Action

The program labels each buyer with a lower case letter, (a-p), and each variety of ketchup as an upper case letter, (A-D)

To start, it randomly assigns each buyer to a variety. and randomly scatters buyers and varieties, across the plot. So the resultant plot is pretty chaotic.

On the plot buyers have the color of their current chosen variety, and just to make the presentation a bit more dramatic, we connect the buyer to the variety with a line of the same color.

In the first step the buyers may chose a more optimal variety (and most do) than the one randomly assigned. Just buyers f and j "decide" to stick with their original assignments. A great deal of order suddenly appears.

Following the swap, varieties A and D move (change their formulations) to the mean of their buyer positions to be optimally close to their buyers.

But then some buyers whose favorite variety moved away from them may decide to switch to another variety that is suddenly a better match to their preferences.

In this example, after six iterations of shuffle-and-swap the program reaches a stable configuration and no swaps occur after the last shuffle.

This should be an overall optimum distribution.

Interestingly, variety D lost all its buyers with the first swap but by the end of the process, it had the most buyers. On the other hand B's three buyers stuck with it. All four varieties moved around quite a bit.

So, in summary, to create 4 varieties of ketchup that bring the most happiness to the most folks, we have computed formulations that now have salt and sugar amounts shown in (A, B, C, D) in the scatter plot.

You can see running by clicking this animation. Use the browsers back button to return here.

Running kmeans

You may use any version of python (2.7 or greater) to run these animations.

$ python kmeans.py seed=180

generates the animation above. The seed=180 presets the random number generator. The default screen size is 600x600 pixels.

$ python kmeans.py seed=180 pixels=300

will do the same thing at half the size.

You can watch the buyers switch to different varieties, and then how vendors adjust their formulation to better cater to their current set of customers. Click the mouse inside the graphic window to step through. If you want to exit early you can also click the window-close button in the upper right.

So, in nine steps, the program will

With every second click, buyers may switch to another variety and vendors may re-adjust their formulations.

The arguments on the command line are:

Try the following. Results will be random since we are not seeding the random number generator.

$ python kmeans.py varieties=3 buyers=26

You might notice that these command line arguments look much like arguments in Python function calls. This is all handled in the module sarg. I find it useful for ad-hoc programs.

The Program kmeans.py from the top down

Once the animated example is seen, the Python code should seem pretty clear and intuitive and you can probably imagine how you might go about coding it. My approach is to make two classes, Variety and Buyer. You can download the project code here. kmeans.zip

There are two object classes Variety and Buyer.

class Variety has the following attributes:

25 class Variety :
26     def __init__ (self, label) :
27         self.x,self.y = randomXY()
28         self.label = label
29         self.buyers = []
30
31     def attachBuyer (self, buyer) :
32         self.buyers.append(buyer)
33
34     def detachBuyer (self, buyer) :
35         if buyer in self.buyers :
36             self.buyers.remove(buyer)
37
38     def centerToBuyers(self) :
39         nBuyers = len(self.buyers)
40         if nBuyers :
41             xsum = ysum = 0.0
42             for buyer in self.buyers :
43                 xsum += buyer.x
44                 ysum += buyer.y
45             self.x = xsum/nBuyers
46             self.y = ysum/nBuyers
47
48     def distTo (self, buyer) :
49         xdist = abs(self.x-buyer.x)
50         ydist = abs(self.y-buyer.y)
51         return math.sqrt(xdist*xdist + ydist*ydist)
52

Instances of the Buyer class have attributes for a label, position and current variety. These latter two are initially assigned at random. (58). (To make this easy, the varieties are created before the buyers.)

Method choose(), (61), lets a buyer trade his current variety for another that is closer. It looks at every alternative to find the best Variety at the best Distance, and returns true if a swap took place.

54 class Buyer :
55     def __init__ (self, label, varieties) :
56         self.x,self.y = randomXY()
57         self.label   = label
58         self.variety = varieties[random.randint(0,len(varieties)-1)]
59         self.variety.attachBuyer(self)
60
61     def choose(self, varieties) :
62         best Variety = self.variety
63         bestDist = best Variety.distTo(self)
64         for tryout in varieties :
65             if tryout != self.variety :
66                 tryDist = tryout.distTo(self)
67                 if tryDist < bestDist :
68                     best Variety = tryout
69                     bestDist    = tryDist
70         if best Variety != self.variety :
71             self.variety.detachBuyer(self)
72             best Variety.attachBuyer(self)
73             self.variety = best Variety
74             return True                # changed variety
75         return False
76

Putting buyers and varieties through their paces

Function kmeans() is the main engine which drives the algorithm.

 81 def kmeans(seed)  :
 82     labels = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
 83     babels= labels.lower()
 84
 85     varieties = [Variety(labels[i])        for i in range(nVarieties)]
 86     buyers = [Buyer(babels[i], varieties) for i in range(nBuyers)]
 87     fig = 0
 88     display(varieties, buyers, "Initial Placement")
 89     while True :
 90         changed = False
 91         fig += 1
 92         for b in buyers :
 93             changed = b.choose(varieties) or changed
 94         display(varieties, buyers, "Fig %d Buyers Choose Closest"%fig)
 95
 96         if not changed : break
 97
 98         fig += 1
 99         for v in varieties :
100             v.centerToBuyers()
101         display(varieties, buyers, "Fig %d Varieties Move to Center"%fig)
102     display(varieties, buyers, "Fig %d Done. Stability found"%fig)
103

Displaying the changing scatter plot

The function display() erases and fills the graphic window with each step. When done it waits for the user to click the mouse (131) in the window before returning. The last click exits the program. You

getcolor() maps the a label to a color from the colors list.

 77 def getColor(label) :
 78     slot =  ord(label.lower())-ord('a')
 79     return colors[slot%len(colors)]  #nifty wrap-around
 80

104 def display(varieties, buyers, banner) :
105     for obj in drawn : obj.undraw()
106
107     txtBanner = Text(Point(pixels//2,20), banner)
108     txtBanner.draw(win)
109     drawn.append(txtBanner)
110
111     for b in buyers :
112         v = b.variety
113         color = getColor(v.label)
114         p1 = Point(b.x,b.y)
115         p2 = Point(v.x,v.y)
116
117         line = Line(p1,p2)
118         line.setFill(color)
119         line.draw(win)
120         drawn.append(line)
121
122         txtLabel = Text(p1, b.label)
123         txtLabel.draw(win)
124         drawn.append(txtLabel)
125
126     for v in varieties :
127         color = getColor(v.label)
128         txtVar = Text(Point(v.x,v.y),v.label)
129         txtVar.draw(win)
130         drawn.append(txtVar)
131     win.getMouse()
132

Running the program

At the bottom of kmeans.py we launch the function kmeans with a seed to generate random numbers.

For debugging purposes as well as for reproducing interesting runs, it is necessary to be able to use a known seed to exactly reproduce a computational run. If no seed= is provided the one chosen is printed so you can reuse it. (line 138)

133 if __name__ == "__main__" :
134     seed    = sarg.Int("seed", 0)
135     if seed : random.seed(seed)
136     else :
137         seed = random.randint(101,999)
138         print("Seed is %s" % seed)
139         random.seed(seed)
140     kmeans(seed)
141

Again, the code files can be found in kmeans.zip

If you have comments or suggestions You can email me at mail me

Copyright © 2014-2021 Chris Meyers and Fred Obermann