The K-means algorithm seeks out clusters of values in data.

- Clusters are formed from data points that are close in value to each other, but not so close to data points in other clusters.
- The data may represent anything, and in a world of commerce, the algorithm is used everywhere.
- Clustering by user-preferences can enable vendors to selectively target individuals with offers and advertisements.

In a simple (and perhaps silly) example let us consider the manufacturing and marketing of ketchup.

- The number of
*varieties*of ketchup is limited. - Each variety contains not only tomatoes and vinegar which are fairly healthy but also a lot of sugar and salt.
- Each ketchup
*buyer*prefers a certain mix of salt and sugar, and can specify exactly how much of each would be optimal.

Note that this example (with only two variables) is nice because we can easily place each buyer on a 2D scatter plot.

When using the K-means algorithm, we first choose a value for *K*, which is this case, will be number of ketchup varieties.

- We plot each user-position based on his preferred density of salt and sugar on the X axis and Y axis, respectively.
- We plot each ketchup variety-position corresponding to its salt and sugar density.

The program labels each *buyer* with a lower case letter, (a-p), and each *variety* of ketchup as an upper case letter, (A-D)

To start, it randomly assigns each *buyer* to a *variety*. and randomly scatters buyers and varieties, across the plot. So the resultant plot is pretty chaotic.

On the plot buyers have the color of their current chosen variety, and just to make the presentation a bit more dramatic, we connect the buyer to the variety with a line of the same color.

In the first step the buyers may chose a more optimal variety (and most do) than the one randomly assigned. Just buyers *f* and *j* "decide" to stick with their original assignments. A great deal of order suddenly appears.

Following the swap, varieties A and D move (change their formulations) to the mean of their buyer positions to be optimally close to their buyers.

But then some buyers whose favorite variety moved away from them may decide to switch to another variety that is suddenly a better match to their preferences.

In this example, after six iterations of shuffle-and-swap the program reaches a stable configuration and no swaps occur after the last shuffle.

This should be an overall optimum distribution.

Interestingly, variety *D* lost all its buyers with the first swap but by the end of the process, it had the most buyers. On the other hand *B's* three buyers stuck with it. All four varieties moved around quite a bit.

So, in summary, to create 4 varieties of ketchup that bring the most happiness to the most folks, we have computed formulations that now have salt and sugar amounts shown in (*A, B, C, D*) in the scatter plot.

You can see running by clicking this animation. Use the browsers back button to return here.

You may use any version of python (2.7 or greater) to run these animations.

$ python kmeans.py seed=180

generates the animation above. The seed=180 presets the random number generator. The default screen size is 600x600 pixels.

$ python kmeans.py seed=180 pixels=300

will do the same thing at half the size.

You can watch the buyers switch to different varieties, and then how vendors adjust their formulation to better cater to their current set of customers. Click the mouse inside the graphic window to step through. If you want to exit early you can also click the window-close button in the upper right.

So, in nine steps, the program will

- compute and display the scatter plot based on the current formulations of ketchup, and customer choices
- wait for the user to click on the image
- re-compute and display the scatter plot based on the current formulations of ketchup, and customer choices

With every second click, buyers may switch to another variety and vendors may re-adjust their formulations.

The arguments on the command line are:

*seed=nnn*– allows us to seed the random number generator (so we re-run interesting sequences).*pixels=nnn*– lets us specify the size of our square window.*buyers=nn*and*varieties=nn*– allow us to specifiy the number of buyers and varieties. These numbers must be less or equal to 26, (a limitation of the English alphabet).

Try the following. Results will be random since we are not seeding the random number generator.

$ python kmeans.py varieties=3 buyers=26

You might notice that these command line arguments look much like arguments in Python function calls. This is all handled in the module *sarg*. I find it useful for ad-hoc programs.

Once the animated example is seen, the Python code should seem pretty clear and intuitive and you can probably imagine how you might go about coding it. My approach is to make two classes, Variety and Buyer. You can download the project code here. kmeans.zip

There are two object classes *Variety* and *Buyer*.

class *Variety* has the following attributes:

- an x/y position (line 27)
- a label (28)
- a list of buyers that are attached to it (29)
- the methods
*attachBuyer()*and*detachBuyer()*that add and remove a buyer from the*buyers*list *centerToBuyers()*which calculates a graph position based on the average tastes (x and y offsets) of its buyers.

```
25 class Variety :
26 def __init__ (self, label) :
27 self.x,self.y = randomXY()
28 self.label = label
29 self.buyers = []
30
31 def attachBuyer (self, buyer) :
32 self.buyers.append(buyer)
33
34 def detachBuyer (self, buyer) :
35 if buyer in self.buyers :
36 self.buyers.remove(buyer)
37
38 def centerToBuyers(self) :
39 nBuyers = len(self.buyers)
40 if nBuyers :
41 xsum = ysum = 0.0
42 for buyer in self.buyers :
43 xsum += buyer.x
44 ysum += buyer.y
45 self.x = xsum/nBuyers
46 self.y = ysum/nBuyers
47
48 def distTo (self, buyer) :
49 xdist = abs(self.x-buyer.x)
50 ydist = abs(self.y-buyer.y)
51 return math.sqrt(xdist*xdist + ydist*ydist)
52
```

Instances of the *Buyer* class have attributes for a label, position and current variety. These latter two are initially assigned at random. (58). (To make this easy, the varieties are created before the buyers.)

Method *choose()*, (61), lets a buyer trade his current variety for another that is closer. It looks at every alternative to find the *best Variety* at the *best Distance*, and returns true if a swap took place.

```
54 class Buyer :
55 def __init__ (self, label, varieties) :
56 self.x,self.y = randomXY()
57 self.label = label
58 self.variety = varieties[random.randint(0,len(varieties)-1)]
59 self.variety.attachBuyer(self)
60
61 def choose(self, varieties) :
62 best Variety = self.variety
63 bestDist = best Variety.distTo(self)
64 for tryout in varieties :
65 if tryout != self.variety :
66 tryDist = tryout.distTo(self)
67 if tryDist < bestDist :
68 best Variety = tryout
69 bestDist = tryDist
70 if best Variety != self.variety :
71 self.variety.detachBuyer(self)
72 best Variety.attachBuyer(self)
73 self.variety = best Variety
74 return True # changed variety
75 return False
76
```

Function *kmeans()* is the main engine which drives the algorithm.

- We build the varieties and buyers using list comprehensions (85, 86).
- We assign a 1-letter upper-case label to each variety from
*labels*. - We assign a 1-letter lower-case label to each buyer from
*babels*.

- We assign a 1-letter upper-case label to each variety from
- We display the
*Initial Placement*(88). - The
*while*loop runs until*changed*no longer gets set to True.- With each trip through the while loop all buyers are allowed to change their variety preference.
- If none do (and
*changed*remains False) then we are done.- Otherwise, we make the switches and let the varieties wiggle their way to their new centers of gravity (100).

- With each change we generate a new display.
- A last display() (102) shows the final stable configuration.

```
81 def kmeans(seed) :
82 labels = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
83 babels= labels.lower()
84
85 varieties = [Variety(labels[i]) for i in range(nVarieties)]
86 buyers = [Buyer(babels[i], varieties) for i in range(nBuyers)]
87 fig = 0
88 display(varieties, buyers, "Initial Placement")
89 while True :
90 changed = False
91 fig += 1
92 for b in buyers :
93 changed = b.choose(varieties) or changed
94 display(varieties, buyers, "Fig %d Buyers Choose Closest"%fig)
95
96 if not changed : break
97
98 fig += 1
99 for v in varieties :
100 v.centerToBuyers()
101 display(varieties, buyers, "Fig %d Varieties Move to Center"%fig)
102 display(varieties, buyers, "Fig %d Done. Stability found"%fig)
103
```

The function *display()* erases and fills the graphic window with each step. When done it waits for the user to click the mouse (131) in the window before returning. The last click exits the program. You

*display()*works like other projects using John Zelle's graphics.py.- It draws instances of
*Line*and*Text*classes onto the window. - It keeps a reference to each instance in the global list,
*drawn*. - At the start of a new frame
- It iterates through the
*drawn*list and un-draws everything it had previously drawn - It redraws every instance, adding the drawing artifacts to the
*drawn*list.

- It iterates through the

*getcolor()* maps the a label to a color from the *colors* list.

```
77 def getColor(label) :
78 slot = ord(label.lower())-ord('a')
79 return colors[slot%len(colors)] #nifty wrap-around
80
104 def display(varieties, buyers, banner) :
105 for obj in drawn : obj.undraw()
106
107 txtBanner = Text(Point(pixels//2,20), banner)
108 txtBanner.draw(win)
109 drawn.append(txtBanner)
110
111 for b in buyers :
112 v = b.variety
113 color = getColor(v.label)
114 p1 = Point(b.x,b.y)
115 p2 = Point(v.x,v.y)
116
117 line = Line(p1,p2)
118 line.setFill(color)
119 line.draw(win)
120 drawn.append(line)
121
122 txtLabel = Text(p1, b.label)
123 txtLabel.draw(win)
124 drawn.append(txtLabel)
125
126 for v in varieties :
127 color = getColor(v.label)
128 txtVar = Text(Point(v.x,v.y),v.label)
129 txtVar.draw(win)
130 drawn.append(txtVar)
131 win.getMouse()
132
```

At the bottom of *kmeans.py* we launch the function *kmeans* with a seed to generate random numbers.

- A seed may be provided on the command line with an argument like "seed=321".
- If none is supplied we choose one from 899 possibilities. It will then print just after starting.

For debugging purposes as well as for reproducing interesting runs, it is necessary to be able to use a known seed to exactly reproduce a computational run. If no seed= is provided the one chosen is printed so you can reuse it. (line 138)

```
133 if __name__ == "__main__" :
134 seed = sarg.Int("seed", 0)
135 if seed : random.seed(seed)
136 else :
137 seed = random.randint(101,999)
138 print("Seed is %s" % seed)
139 random.seed(seed)
140 kmeans(seed)
141
```

Again, the code files can be found in kmeans.zip

If you have comments or suggestions You can email me at mail me

Copyright © 2014-2021 Chris Meyers and Fred Obermann

…