CMSC 210: Lecture 28

Machine Learning: Clustering

Announcements

Assignment 7: Due Wednesday May 18th
Last class!

Clustering

Clustering is an unsupervised ML method: it works on unlabeled data
Clustering algorithms attempt to divide a data set into two or more groups (clusters)
Points within a cluster should be as similar as possible
Points in different clusters should be as different as possible

How can we best draw N boundaries around this data?

Most algorithms require you to say how many clusters you want in advance
But there are also methods to suggest an optimum number of clusters from the data

K-Means Clustering

Simple, works fairly well
Pick N random points in your data, where N = number of clusters (the K in K-Means)
Each of these points is an initial centroid
Go through every point and assign it to the nearest centroid
For each cluster, compute the mean of its points; this becomes the new centroid
Now go through each point again and re-assign it to the closest centroid
Repeat the last two steps until the clusters stop changing
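
The steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the sample points, the fixed random seed, and the choice to leave an empty cluster's centroid where it is are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iterations=100):
    """Return (centroids, labels) for a list of points and k clusters."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable (assumption)
    # Pick k distinct data points as the starting centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assign every point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Stop once the assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    return centroids, labels

# Made-up sample data: two visibly separate groups of three points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, labels = k_means(points, k=2)
print(labels)
print(centroids)
```

On this sample the first three points end up sharing one label and the last three the other. Which group gets label 0 depends on the random starting centroids.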

Uses of clustering

Segmentation (audience/customer)
Document clustering
Image segmentation
Recommendation engines

How "well" have we clustered data?

Common measure: inertia (lower means tighter clusters)
To calculate:
1. measure the distance between each data point and its centroid
2. square this distance
3. sum the squares across all points
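
As a small worked example, the three steps can be written out directly. The sketch assumes Euclidean distance, and the points, centroids, and cluster labels below are made up for illustration.

```python
def inertia(points, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        # Steps 1 and 2: the squared Euclidean distance to the centroid
        # (squaring the distance cancels the square root, so it is skipped).
        squared = sum((a - b) ** 2 for a, b in zip(point, centroid))
        # Step 3: accumulate over all points.
        total += squared
    return total

# Hypothetical data: two clusters, each centroid is the mean of its points.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 4.0)]
centroids = [(0.0, 1.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]

print(inertia(points, centroids, labels))  # 1 + 1 + 4 + 4 = 10.0
```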

Let's do some clustering...

DBSCAN

DBSCAN = Density-Based Spatial Clustering of Applications with Noise

k-Means vs. DBSCAN

density - number of points within a circle of a given radius (epsilon) around a point
minimum points - the density a point must reach to be considered a core point
core point - a point whose density meets the minimum points threshold
border point - a point that does not meet the threshold but lies inside the circle of a core point
noise - a point that is not a core point and does not lie inside the circle of any core point
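
To make these definitions concrete, here is a rough sketch that only labels each point as core, border, or noise. It does not perform the full DBSCAN step of linking core points into clusters. The function name, the sample points, and the convention that a point counts toward its own density are assumptions for illustration.

```python
import math

def classify_points(points, epsilon, min_points):
    """Label each point 'core', 'border', or 'noise'."""
    # For each point, the indices of all points within epsilon of it.
    # The point itself is included (an assumed convention).
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= epsilon]
        for p in points
    ]
    # A core point has at least min_points points inside its circle.
    is_core = [len(n) >= min_points for n in neighbours]
    kinds = []
    for i in range(len(points)):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in neighbours[i]):
            # Not dense enough itself, but inside a core point's circle.
            kinds.append("border")
        else:
            kinds.append("noise")
    return kinds

# Made-up data: a tight square, one point on its edge, one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (10, 10)]
kinds = classify_points(points, epsilon=1.5, min_points=4)
print(kinds)
# ['core', 'core', 'core', 'core', 'border', 'noise']
```

With these settings the four corners of the square are core points, (2, 1) is a border point because it sits within epsilon of two core points, and (10, 10) is noise.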

Advantages over k-Means

no need to specify the number of clusters in advance
can find clusters of arbitrary shapes
has a concept of noise, good for data with outliers