I have a node2vec embedding stored as a .csv file, values are a square symmetric matrix. I have two versions of this, one with node names in the first column and another with node names in the first row. I would like to cluster this data with DBSCAN, but I can't seem to figure out how to get the input right. I tried this:
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics
input_file = "node2vec-labels-on-columns.emb"
# for tab delimited use:
df = pd.read_csv(input_file, header = 0, delimiter = "\t")
# put the original column names in a python list
original_headers = list(df.columns.values)
emb = df.as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_
# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)
print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)
This leads to an error:
dbscan.py:14: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
emb = df.as_matrix()
Traceback (most recent call last):
File "dbscan.py", line 15, in <module>
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
File "C:\Python36\lib\site-packages\sklearn\cluster\_dbscan.py", line 312, in fit
X = self._validate_data(X, accept_sparse='csr')
File "C:\Python36\lib\site-packages\sklearn\base.py", line 420, in _validate_data
X = check_array(X, **check_params)
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
return f(**kwargs)
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 646, in check_array
allow_nan=force_all_finite == 'allow-nan')
File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 100, in _assert_all_finite
msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I've tried other input methods that lead to the same error. All the tutorials I can find use datasets imported form sklearn so those are of not help figuring out how to read from a file. Can anyone point me in the right direction?
source https://stackoverflow.com/questions/75128068/cant-get-correct-input-for-dbscan-clustersing
Comments
Post a Comment