
Can't get correct input for DBSCAN clustering

I have a node2vec embedding stored as a .csv file; the values form a square symmetric matrix. I have two versions of this: one with node names in the first column and another with node names in the first row. I would like to cluster this data with DBSCAN, but I can't figure out how to get the input right. I tried this:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn import metrics

input_file = "node2vec-labels-on-columns.emb"

# for tab delimited use:
df = pd.read_csv(input_file, header = 0, delimiter = "\t")

# put the original column names in a python list
original_headers = list(df.columns.values)

emb = df.as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print("Estimated number of clusters: %d" % n_clusters_)
print("Estimated number of noise points: %d" % n_noise_)

This leads to an error:

dbscan.py:14: FutureWarning: Method .as_matrix will be removed in a future version. Use .values instead.
  emb = df.as_matrix()
Traceback (most recent call last):
  File "dbscan.py", line 15, in <module>
    db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
  File "C:\Python36\lib\site-packages\sklearn\cluster\_dbscan.py", line 312, in fit
    X = self._validate_data(X, accept_sparse='csr')
  File "C:\Python36\lib\site-packages\sklearn\base.py", line 420, in _validate_data
    X = check_array(X, **check_params)
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 73, in inner_f
    return f(**kwargs)
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 646, in check_array
    allow_nan=force_all_finite == 'allow-nan')
  File "C:\Python36\lib\site-packages\sklearn\utils\validation.py", line 100, in _assert_all_finite
    msg_dtype if msg_dtype is not None else X.dtype)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

I've tried other input methods that lead to the same error. All the tutorials I can find use datasets imported from sklearn, so they are of no help in figuring out how to read from a file. Can anyone point me in the right direction?
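
For reference, here is a minimal sketch of one way to load such a file so that only numeric values reach DBSCAN. It assumes the version of the file with node names in the first column (the file name below is just a placeholder) and a tab-delimited layout; raw node2vec output is often space-delimited, in which case the separator would need to change:

import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN

# Hypothetical file name: the version with node names in the first column.
input_file = "node2vec-labels-on-rows.emb"

# Read the node names into the index so that only numbers remain as data.
# Change sep=" " if the file is actually space-delimited.
df = pd.read_csv(input_file, sep="\t", index_col=0)

# Coerce everything to float; any non-numeric cell becomes NaN so it is visible.
df = df.apply(pd.to_numeric, errors="coerce")

# Show how many NaN cells there are before deciding what to drop.
print(df.isna().sum().sum(), "NaN cells")

# Drop all-NaN columns (e.g. from a trailing delimiter) and any remaining NaN rows.
df = df.dropna(axis=1, how="all").dropna(axis=0, how="any")

emb = df.to_numpy()   # .to_numpy() / .values replace the removed .as_matrix()
db = DBSCAN(eps=0.3, min_samples=10).fit(emb)
labels = db.labels_

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters: %d" % n_clusters_)

The ValueError in the traceback simply means the array handed to DBSCAN contains NaN or infinite values; a common cause is a delimiter mismatch or a trailing tab producing an empty, all-NaN column, which the NaN count above makes visible before fitting.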



source https://stackoverflow.com/questions/75128068/cant-get-correct-input-for-dbscan-clustersing
