KMeansModel

class pyspark.mllib.clustering.KMeansModel(centers)[source]

A clustering model derived from the k-means method.

New in version 0.9.0.

Examples

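>>> from numpy import array
>>> from pyspark.mllib.clustering import KMeans, KMeansModel
>>> from pyspark.mllib.linalg import SparseVector
>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)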
>>> model = KMeans.train(
...     sc.parallelize(data), 2, maxIterations=10, initializationMode="random",
...                    seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
True
>>> model.predict(array([8.0, 9.0])) == model.predict(array([9.0, 8.0]))
True
>>> model.k
2
>>> model.computeCost(sc.parallelize(data))
2.0
>>> model = KMeans.train(sc.parallelize(data), 2)
>>> sparse_data = [
...     SparseVector(3, {1: 1.0}),
...     SparseVector(3, {1: 1.1}),
...     SparseVector(3, {2: 1.0}),
...     SparseVector(3, {2: 1.1})
... ]
>>> model = KMeans.train(sc.parallelize(sparse_data), 2, initializationMode="k-means||",
...                                     seed=50, initializationSteps=5, epsilon=1e-4)
>>> model.predict(array([0., 1., 0.])) == model.predict(array([0, 1.1, 0.]))
True
>>> model.predict(array([0., 0., 1.])) == model.predict(array([0, 0, 1.1]))
True
>>> model.predict(sparse_data[0]) == model.predict(sparse_data[1])
True
>>> model.predict(sparse_data[2]) == model.predict(sparse_data[3])
True
>>> isinstance(model.clusterCenters, list)
True
>>> import tempfile
>>> path = tempfile.mkdtemp()
>>> model.save(sc, path)
>>> sameModel = KMeansModel.load(sc, path)
>>> sameModel.predict(sparse_data[0]) == model.predict(sparse_data[0])
True
>>> from shutil import rmtree
>>> try:
...     rmtree(path)
... except OSError:
...     pass
>>> data = array([-383.1,-382.9, 28.7,31.2, 366.2,367.3]).reshape(3, 2)
>>> model = KMeans.train(sc.parallelize(data), 3, maxIterations=0,
...     initialModel=KMeansModel([(-1000.0,-1000.0),(5.0,5.0),(1000.0,1000.0)]))
>>> model.clusterCenters
[array([-1000., -1000.]), array([ 5.,  5.]), array([ 1000.,  1000.])]

Methods

computeCost(rdd)

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

load(sc, path)

Load a model from the given path.

predict(x)

Find the cluster that each of the points belongs to in this model.

save(sc, path)

Save this model to the given path.

Attributes

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

k

Total number of clusters.

Methods Documentation

computeCost(rdd)[source]

Return the K-means cost (sum of squared distances of points to their nearest center) for this model on the given data.

New in version 1.4.0.

Parameters:
rdd : pyspark.RDD

The RDD of points to compute the cost on.
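
The cost described above can be illustrated without Spark. The following is a minimal plain-Python sketch, not the library's implementation: the helper names squared_distance and kmeans_cost are hypothetical, and the two centers are assumed values chosen as the means of the two obvious groups in the Examples data.

```python
# Plain-Python illustration of the k-means cost; this is NOT how Spark computes it.
# squared_distance and kmeans_cost are hypothetical helper names.

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_cost(points, centers):
    """Sum, over all points, of the squared distance to the nearest center."""
    return sum(min(squared_distance(p, c) for c in centers) for p in points)


# The four points from the Examples section, with two assumed centers.
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centers = [(0.5, 0.5), (8.5, 8.5)]

cost = kmeans_cost(points, centers)
print(cost)  # 2.0
```

Each point lies at squared distance 0.5 from its nearest assumed center, so the total is 2.0, the same value that computeCost returns in the Examples section.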

classmethod load(sc, path)[source]

Load a model from the given path.

New in version 1.4.0.

predict(x)[source]

Find the cluster that each of the points belongs to in this model.

New in version 0.9.0.

Parameters:
x : pyspark.mllib.linalg.Vector or pyspark.RDD

A data point (or RDD of points) to determine cluster index. pyspark.mllib.linalg.Vector can be replaced with equivalent objects (list, tuple, numpy.ndarray).

Returns:
int or pyspark.RDD of int

Predicted cluster index or an RDD of predicted cluster indices if the input is an RDD.
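
Conceptually, the predicted index is that of the nearest cluster center. The sketch below shows this assignment in plain Python under stated assumptions: nearest_center_index is a hypothetical helper, the centers are assumed values rather than the output of a trained model, and the tie-breaking rule (lowest index wins) is a choice made for this sketch, not documented Spark behavior.

```python
# Plain-Python illustration of nearest-center assignment; NOT the Spark implementation.
# nearest_center_index is a hypothetical helper name.

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_center_index(point, centers):
    """Index of the center closest to `point`; ties go to the lowest index."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


# Assumed centers; a trained model may order its clusters differently.
centers = [(0.5, 0.5), (8.5, 8.5)]
points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0), (8.0, 9.0)]

labels = [nearest_center_index(p, centers) for p in points]
print(labels)  # [0, 0, 1, 1]
```

Because the numbering of clusters in a trained model depends on initialization, the Examples section compares predictions for equality instead of checking specific index values.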

save(sc, path)[source]

Save this model to the given path.

New in version 1.4.0.

Attributes Documentation

clusterCenters

Get the cluster centers, represented as a list of NumPy arrays.

New in version 1.0.0.

k

Total number of clusters.

New in version 1.4.0.