Is my python implementation of the Davies-Bouldin Index correct?(我对 Davies-Bouldin 索引的 Python 实现是否正确?)
问题描述
我正在尝试计算 Davies-Bouldin 指数Python.
I'm trying to calculate the Davies-Bouldin Index in Python.
以下是代码尝试重现的步骤.
Here are the steps the code below tries to reproduce.
5 个步骤:
- 对于每个集群,计算每个点到质心的欧几里德距离
- 对于每个集群,计算这些距离的平均值
- 对于每对集群,计算它们的质心之间的欧几里德距离
那么,
- 对于每对聚类,求到它们各自质心的平均距离之和(在第 2 步计算),然后除以它们之间的距离(在第 3 步计算).
最后,
- 计算所有这些划分(= 所有索引)的平均值以获得整个聚类的 Davies-Bouldin 索引
代码
def daviesbouldin(X, labels, centroids):
import numpy as np
from scipy.spatial.distance import pdist, euclidean
nbre_of_clusters = len(centroids) #Get the number of clusters
distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
distances_means = [] #Store the mean of these distances
DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
second_cluster_idx = [] #Store index of the second cluster of each pair
first_cluster_idx = 0 #Set index of first cluster of each pair to 0
# Step 1: Compute euclidean distances between each point of a cluster to their centroid
for cluster in range(nbre_of_clusters):
for point in range(X[labels == cluster].shape[0]):
distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
# Step 2: Compute the mean of these distances
for e in distances:
distances_means.append(np.mean(e))
# Step 3: Compute euclidean distances between each pair of centroid
ctrds_distance = pdist(centroids)
# Tricky step 4: Compute Davies-Bouldin index of each pair of cluster
for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
second_cluster_idx.append(e)
if second_cluster_idx[i-1] == nbre_of_clusters - 1:
first_cluster_idx += 1
DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
# Step 5: Compute the mean of all DB_indexes
print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes))
在参数中:
X
是数据labels
,是由聚类算法计算的标签(即:kmeans)centroids
是每个集群的质心坐标(即:cluster_centers_
)
X
is the datalabels
, are the labels computed by a clustering algorithm (i.e: kmeans)centroids
are the coordinates of each cluster's centroid (i.e:cluster_centers_
)
另外,请注意我使用的是 Python 3
Also, note that I'm using Python 3
问题 1:计算每对质心之间的欧氏距离是否正确(步骤 3)?
QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?
问题 2:我对第 4 步的实施是否正确?
QUESTION2: Is my implementation of step 4 correct?
问题 3:我是否需要归一化簇内和簇间距离?
QUESTION3: Do I need to normalise intra and inter cluster distances ?
关于第 4 步的进一步说明
假设我们有 10 个集群.循环应该计算每对集群的数据库索引.
Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.
在第一次迭代时:
- 求和集群 1 的内距离平均值(
distances_means
的索引 0)和集群 2 的内距离平均值(distances_means
的索引 1) - 将这个总和除以 2 个集群之间的距离(
ctrds_distance
的索引 0)
- sums intra-distances mean of cluster 1 (index 0 of
distances_means
) and intra-distances mean of cluster 2 (index 1 ofdistances_means
) - divides this sum by the distance between the 2 clusters (index 0 of
ctrds_distance
)
在第二次迭代中:
- 求和集群 1 的内距离平均值(
distances_means
的索引 0)和集群 3 的内距离平均值(distances_means
的索引 2) - 将这个总和除以 2 个集群之间的距离(
ctrds_distance
的索引 1)
- sums intra-distances mean of cluster 1 (index 0 of
distances_means
) and intra-distances mean of cluster 3 (index 2 ofdistances_means
) - divides this sum by the distance between the 2 clusters (index 1 of
ctrds_distance
)
等等...
以 10 个集群为例,完整的迭代过程应该是这样的:
With the example of 10 clusters, the full iteration process should look like this:
intra-cluster distance intra-cluster distance distance between their
of cluster: of cluster: centroids(storage num):
0 + 1 / 0
0 + 2 / 1
0 + 3 / 2
0 + 4 / 3
0 + 5 / 4
0 + 6 / 5
0 + 7 / 6
0 + 8 / 7
0 + 9 / 8
1 + 2 / 9
1 + 3 / 10
1 + 4 / 11
1 + 5 / 12
1 + 6 / 13
1 + 7 / 14
1 + 8 / 15
1 + 9 / 16
2 + 3 / 17
2 + 4 / 18
2 + 5 / 19
2 + 6 / 20
2 + 7 / 21
2 + 8 / 22
2 + 9 / 23
3 + 4 / 24
3 + 5 / 25
3 + 6 / 26
3 + 7 / 27
3 + 8 / 28
3 + 9 / 29
4 + 5 / 30
4 + 6 / 31
4 + 7 / 32
4 + 8 / 33
4 + 9 / 34
5 + 6 / 35
5 + 7 / 36
5 + 8 / 37
5 + 9 / 38
6 + 7 / 39
6 + 8 / 40
6 + 9 / 41
7 + 8 / 42
7 + 9 / 43
8 + 9 / 44
这里的问题是我不太确定distances_means
的索引是否与ctrds_distance
的索引匹配.
The problem here is I'm not quite sure that the index of distances_means
matches the index of ctrds_distance
.
换句话说,我不确定计算的第一个簇间距离对应于簇 1 和簇 2 之间的距离.计算的第二个簇间距离对应于簇 3 和簇 1 之间的距离... 依此类推,遵循上述模式.
In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.
简而言之:恐怕我将簇内距离对除以不对应的簇间距离.
In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.
欢迎任何帮助!
推荐答案
这是上面 Davies-Bouldin 索引朴素实现的更短、更快速的更正版本.
Here is a shorter, faster corrected version of the Davies-Bouldin index naive implementation above.
def DaviesBouldin(X, labels):
n_cluster = len(np.bincount(labels))
cluster_k = [X[labels == k] for k in range(n_cluster)]
centroids = [np.mean(k, axis = 0) for k in cluster_k]
variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
db = []
for i in range(n_cluster):
for j in range(n_cluster):
if j != i:
db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
return(np.max(db) / n_cluster)
回答我自己的问题:
- 初稿(第 4 步)上的计数器是正确的,但无关紧要
- 无需标准化集群内和集群间距离
- 计算欧几里得距离时出错
请注意,您可以找到尝试改进此索引的创新方法,特别是新版本Davies-Bouldin 指数",用圆柱距离代替欧几里得距离.
Note you can find innovative approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces Euclidean distance by Cylindrical distance.
这篇关于我对 Davies-Bouldin 索引的 Python 实现是否正确?的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持编程学习网!
本文标题为:我对 Davies-Bouldin 索引的 Python 实现是否正确?
基础教程推荐
- 如何在海运重新绘制中自定义标题和y标签 2022-01-01
- 线程时出现 msgbox 错误,GUI 块 2022-01-01
- 筛选NumPy数组 2022-01-01
- 用于分类数据的跳跃记号标签 2022-01-01
- 使用PyInstaller后在Windows中打开可执行文件时出错 2022-01-01
- 何时使用 os.name、sys.platform 或 platform.system? 2022-01-01
- 如何让 python 脚本监听来自另一个脚本的输入 2022-01-01
- Python kivy 入口点 inflateRest2 无法定位 libpng16-16.dll 2022-01-01
- 在 Python 中,如果我在一个“with"中返回.块,文件还会关闭吗? 2022-01-01
- Dask.array.套用_沿_轴:由于额外的元素([1]),使用dask.array的每一行作为另一个函数的输入失败 2022-01-01