Step Four – Calculate the mean (average) of each cluster set Figure 6 Referring back to Figure 4, we then find the minimum distance for each case from each of the three start points – this tells us which cluster (1, 2 or 3) that the case is closest to – which is shown in the ‘initial choice column’. We can do it “mechanically” as shown here – but Excel has a built-in formula to use: SUMXMY2 – this is far more efficient to use. We calculate the difference between each of the three data points in the set, and then square the differences, and then sum them. We want to calculate the distance and we use the sum of squares method – as shown here.
Remember that we have arbitrarily designated Case 6 to be our random start point for Cluster 1. Let’s look at the first number in the table – case 1, start 1 = 10.54. You should note that the intersection of each of these gives a 0 (-) in the table. Start 1 is the data for case 6, start 2 is case 9 and start 3 is case 15. Referring to the table output – this is our first calculation in Excel and it generates our “initial choice” of clusters. Please refer to the article on why cluster analysis sometimes generates different results. This suggests that these three cases are somewhat different to each other, so good starting points as they are spread out. The reason I selected these cases is because – when looking at variable X only – case 6 was the median, case 9 was the maximum and case 15 was the minimum.
For these start points I have selected cases 6, 9 and 15 – but any random points could also be suitable. In this example – as I’m wanting to create three clusters, then I will need three starting points. Figure 4įor k-means clustering you typically pick some random cases (starting points or seeds) to get the analysis started. Please note that you can use this Excel approach to identify as many clusters as you like – just follow the same concept as explained below. Yes, there are four clusters evident in the diagram above, but that only looks at two of the variables. Step Three – Calculate the distance from each data point to the center of a clusterįor this walk-through example, let’s assume that we want to identify three segments/clusters only.
#Scatter plot in excel 2016 how to#
As I have suggested, a good approach when there are only two variables to consider – but is this case we have three variables (and you could have more), so this visual approach will only work for basic data sets – so now let’s look at how to do the Excel calculation for k-means clustering. With this next graph, I have visibly identified probable cluster and circled them. In this case, you could identify three or four relatively distinct clusters – as shown in this next chart. And, at times, you can cluster the data via visual means.Īs you can see in this scatter graph, each individual case (what I’m calling a consumer for this example) has been mapped, along with the average (mean) for all cases (the red circle).ĭepending upon how you view the data/graph – there appears to be a number of clusters. In this cluster analysis example we are using three variables – but if you have just two variables to cluster, then a scatter chart is an excellent way to start. Step Two – If just two variables, use a scatter graph on Excel Figure 2 You can see from this example set that three start positions have been highlighted – we will discuss those in Step Three below. It’s just easier for me to classify that person in the “over $250,000” income bracket and scale income 1-9 – but that’s up to you depending upon the data you are working with. Say, for example, I am using income data (a demographic measure) – most of the data might be around $40,000 to $100,000, but I have one person with an income of $5m. The reason for this is to “contain” any outliers. NOTE: I prefer to use scaled data – but it is not mandatory.