The K-means Algorithm

The K-means algorithm seeks out clusters of values in data.

• Clusters are formed from data points that are close in value to each other, but not so close to data points in other clusters.
• The data may represent anything, and in a world of commerce, the algorithm is used everywhere.
• Clustering by user-preferences can enable vendors to selectively target individuals with offers and advertisements.

In a simple (and perhaps silly) example let us consider the manufacturing and marketing of ketchup.

• The number of varieties of ketchup is limited.
• Each variety contains not only tomatoes and vinegar which are fairly healthy but also a lot of sugar and salt.
• Each ketchup buyer prefers a certain mix of salt and sugar, and can specify exactly how much of each would be optimal.

Note that this example (with only two variables) is nice because we can easily place each buyer on a 2D scatter plot.

When using the K-means algorithm, we first choose a value for K, which is this case, will be number of ketchup varieties.

• We plot each user-position based on his preferred density of salt and sugar on the X axis and Y axis, respectively.
• We plot each ketchup variety-position corresponding to its salt and sugar density.

The K-means Algorithm in Action

The program labels each buyer with a lower case letter, (a-p), and each variety of ketchup as an upper case letter, (A-D)

To start, it randomly assigns each buyer to a variety. and randomly scatters buyers and varieties, across the plot. So the resultant plot is pretty chaotic.

On the plot buyers have the color of their current chosen variety, and just to make the presentation a bit more dramatic, we connect the buyer to the variety with a line of the same color.

In the first step the buyers may chose a more optimal variety (and most do) than the one randomly assigned. Just buyers f and j "decide" to stick with their original assignments. A great deal of order suddenly appears.

Following the swap, varieties A and D move (change their formulations) to the mean of their buyer positions to be optimally close to their buyers.

But then some buyers whose favorite variety moved away from them may decide to switch to another variety that is suddenly a better match to their preferences.

In this example, after six iterations of shuffle-and-swap the program reaches a stable configuration and no swaps occur after the last shuffle.

This should be an overall optimum distribution.

Interestingly, variety D lost all its buyers with the first swap but by the end of the process, it had the most buyers. On the other hand B's three buyers stuck with it. All four varieties moved around quite a bit.

So, in summary, to create 4 varieties of ketchup that bring the most happiness to the most folks, we have computed formulations that now have salt and sugar amounts shown in (A, B, C, D) in the scatter plot.

You can see running by clicking this animation. Use the browsers back button to return here.

Running kmeans

You may use any version of python (2.7 or greater) to run these animations.

\$ python kmeans.py seed=180

generates the animation above. The seed=180 presets the random number generator. The default screen size is 600x600 pixels.

\$ python kmeans.py seed=180 pixels=300

will do the same thing at half the size.

You can watch the buyers switch to different varieties, and then how vendors adjust their formulation to better cater to their current set of customers. Click the mouse inside the graphic window to step through. If you want to exit early you can also click the window-close button in the upper right.

So, in nine steps, the program will

• compute and display the scatter plot based on the current formulations of ketchup, and customer choices
• wait for the user to click on the image
• re-compute and display the scatter plot based on the current formulations of ketchup, and customer choices

With every second click, buyers may switch to another variety and vendors may re-adjust their formulations.

The arguments on the command line are:

• seed=nnn – allows us to seed the random number generator (so we re-run interesting sequences).
• pixels=nnn – lets us specify the size of our square window.
• buyers=nn and varieties=nn – allow us to specifiy the number of buyers and varieties. These numbers must be less or equal to 26, (a limitation of the English alphabet).

Try the following. Results will be random since we are not seeding the random number generator.

You might notice that these command line arguments look much like arguments in Python function calls. This is all handled in the module sarg. I find it useful for ad-hoc programs.

The Program kmeans.py from the top down

Once the animated example is seen, the Python code should seem pretty clear and intuitive and you can probably imagine how you might go about coding it. My approach is to make two classes, Variety and Buyer. You can download the project code here. kmeans.zip

There are two object classes Variety and Buyer.

class Variety has the following attributes:

• an x/y position (line 27)
• a label (28)
• a list of buyers that are attached to it (29)
• centerToBuyers() which calculates a graph position based on the average tastes (x and y offsets) of its buyers.
``````25 class Variety :
26     def __init__ (self, label) :
27         self.x,self.y = randomXY()
28         self.label = label
30
33
37
41             xsum = ysum = 0.0
47
48     def distTo (self, buyer) :
51         return math.sqrt(xdist*xdist + ydist*ydist)
52``````

Instances of the Buyer class have attributes for a label, position and current variety. These latter two are initially assigned at random. (58). (To make this easy, the varieties are created before the buyers.)

Method choose(), (61), lets a buyer trade his current variety for another that is closer. It looks at every alternative to find the best Variety at the best Distance, and returns true if a swap took place.

``````54 class Buyer :
55     def __init__ (self, label, varieties) :
56         self.x,self.y = randomXY()
57         self.label   = label
58         self.variety = varieties[random.randint(0,len(varieties)-1)]
60
61     def choose(self, varieties) :
62         best Variety = self.variety
63         bestDist = best Variety.distTo(self)
64         for tryout in varieties :
65             if tryout != self.variety :
66                 tryDist = tryout.distTo(self)
67                 if tryDist < bestDist :
68                     best Variety = tryout
69                     bestDist    = tryDist
70         if best Variety != self.variety :
73             self.variety = best Variety
74             return True                # changed variety
75         return False
76``````

Putting buyers and varieties through their paces

Function kmeans() is the main engine which drives the algorithm.

• We build the varieties and buyers using list comprehensions (85, 86).
1. We assign a 1-letter upper-case label to each variety from labels.
2. We assign a 1-letter lower-case label to each buyer from babels.
• We display the Initial Placement (88).
• The while loop runs until changed no longer gets set to True.
1. With each trip through the while loop all buyers are allowed to change their variety preference.
2. If none do (and changed remains False) then we are done.
• Otherwise, we make the switches and let the varieties wiggle their way to their new centers of gravity (100).
• With each change we generate a new display.
• A last display() (102) shows the final stable configuration.
`````` 81 def kmeans(seed)  :
82     labels = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
83     babels= labels.lower()
84
85     varieties = [Variety(labels[i])        for i in range(nVarieties)]
87     fig = 0
89     while True :
90         changed = False
91         fig += 1
92         for b in buyers :
93             changed = b.choose(varieties) or changed
95
96         if not changed : break
97
98         fig += 1
99         for v in varieties :
101         display(varieties, buyers, "Fig %d Varieties Move to Center"%fig)
102     display(varieties, buyers, "Fig %d Done. Stability found"%fig)
103``````

Displaying the changing scatter plot

The function display() erases and fills the graphic window with each step. When done it waits for the user to click the mouse (131) in the window before returning. The last click exits the program. You

• display() works like other projects using John Zelle's graphics.py.
• It draws instances of Line and Text classes onto the window.
• It keeps a reference to each instance in the global list, drawn.
• At the start of a new frame
1. It iterates through the drawn list and un-draws everything it had previously drawn
2. It redraws every instance, adding the drawing artifacts to the drawn list.

getcolor() maps the a label to a color from the colors list.

`````` 77 def getColor(label) :
78     slot =  ord(label.lower())-ord('a')
79     return colors[slot%len(colors)]  #nifty wrap-around
80

104 def display(varieties, buyers, banner) :
105     for obj in drawn : obj.undraw()
106
107     txtBanner = Text(Point(pixels//2,20), banner)
108     txtBanner.draw(win)
109     drawn.append(txtBanner)
110
111     for b in buyers :
112         v = b.variety
113         color = getColor(v.label)
114         p1 = Point(b.x,b.y)
115         p2 = Point(v.x,v.y)
116
117         line = Line(p1,p2)
118         line.setFill(color)
119         line.draw(win)
120         drawn.append(line)
121
122         txtLabel = Text(p1, b.label)
123         txtLabel.draw(win)
124         drawn.append(txtLabel)
125
126     for v in varieties :
127         color = getColor(v.label)
128         txtVar = Text(Point(v.x,v.y),v.label)
129         txtVar.draw(win)
130         drawn.append(txtVar)
131     win.getMouse()
132``````

Running the program

At the bottom of kmeans.py we launch the function kmeans with a seed to generate random numbers.

• A seed may be provided on the command line with an argument like "seed=321".
• If none is supplied we choose one from 899 possibilities. It will then print just after starting.

For debugging purposes as well as for reproducing interesting runs, it is necessary to be able to use a known seed to exactly reproduce a computational run. If no seed= is provided the one chosen is printed so you can reuse it. (line 138)

``````133 if __name__ == "__main__" :
134     seed    = sarg.Int("seed", 0)
135     if seed : random.seed(seed)
136     else :
137         seed = random.randint(101,999)
138         print("Seed is %s" % seed)
139         random.seed(seed)
140     kmeans(seed)
141``````

Again, the code files can be found in kmeans.zip

If you have comments or suggestions You can email me at mail me