I am working on a dataset where a few values in one of the columns are strings. Because of that, I get errors when performing operations on the dataset.
Sample dataset:
1.99 LOHARU 0.3 2 0 2 0.3 5 2 0 2 2
1.99 31 0.76 2 0 2 0.76 5 2 7.48 4 2
1.99 4 0.96 2 0 2 0.96 5 2 9.45 4 2
1.99 14 1.26 4 0 2 1.26 5 2 0 2 2
1.99 NUH 0.55 2 0 2 0.55 5 2 0.67 2 2
1.99 99999 0.29 2 0 2 0.29 5 2 0.06 2 2
The full dataset can be found here: https://www.kaggle.com/sid321axn/audit-data?select=trial.csv
I need to find the missing values and outliers in the dataset. Below is the code I am using to find missing values:
import numpy as np
# Replace zeros and 99999 with np.nan
all_cols = list(range(17))
dataset[all_cols] = dataset[all_cols].replace(99999, np.nan)
# If columns 12, 14 and 17 can contain zeros, replace zeros only in the other columns
zero_cols = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15, 16]
dataset[zero_cols] = dataset[zero_cols].replace(0, np.nan)
print(dataset.isnull().sum())
But this doesn't replace 99999 with NaN.
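One likely cause, sketched here under assumptions: because the column also contains names such as LOHARU and NUH, pandas reads the whole column as strings, so the stored value is the text "99999" and `replace(99999, ...)` with an integer never matches. The sketch below rebuilds the six sample rows (the space-separated, header-less layout and the names `sample`, `df` and `numeric` are assumptions for illustration), coerces every column to numeric with `pd.to_numeric`, and only then replaces 99999.

```python
import io

import numpy as np
import pandas as pd

# The six sample rows from the question; the space-separated layout
# without a header row is an assumption made for this sketch.
sample = """1.99 LOHARU 0.3 2 0 2 0.3 5 2 0 2 2
1.99 31 0.76 2 0 2 0.76 5 2 7.48 4 2
1.99 4 0.96 2 0 2 0.96 5 2 9.45 4 2
1.99 14 1.26 4 0 2 1.26 5 2 0 2 2
1.99 NUH 0.55 2 0 2 0.55 5 2 0.67 2 2
1.99 99999 0.29 2 0 2 0.29 5 2 0.06 2 2
"""
df = pd.read_csv(io.StringIO(sample), sep=" ", header=None)

# Column 1 mixes names and numbers, so it is not stored as a numeric column.
print(pd.api.types.is_numeric_dtype(df[1]))

# Coerce every column to numeric; entries that are not numbers become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")

# 99999 is now a real number, so the replacement can match it.
numeric = numeric.replace(99999, np.nan)
print(numeric.isnull().sum())
```

Note that coercion turns the names into NaN as well, so they are counted as missing together with 99999. If the names carry meaning, they would need to be handled separately before this step.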
To find outliers, I am calculating the z-score:
import scipy.stats as stats
array = dataset.values
Z = stats.zscore(array)
But it gives me the error below:
- TypeError: unsupported operand type(s) for /: 'str' and 'int'
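The TypeError suggests that the array passed to `stats.zscore` still holds strings. A minimal sketch of one way around it, assuming the non-numeric entries and the 99999 code should both count as missing: convert the column to numeric first, then ask `zscore` to skip NaN values. The series `raw` below is a hypothetical stand-in for the mixed column in the sample.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the mixed column in the sample (names plus numbers).
raw = pd.Series(["LOHARU", "31", "4", "14", "NUH", "99999"])

# Strings that are not numbers become NaN; 99999 is treated here as a
# missing-value code, which is an assumption about the dataset.
clean = pd.to_numeric(raw, errors="coerce").replace(99999, np.nan)

# nan_policy="omit" leaves NaN out of the mean and standard deviation;
# the NaN positions stay NaN in the result.
z = stats.zscore(clean.to_numpy(), nan_policy="omit")
print(z)
```

With only three valid values the scores say little about outliers; on the full dataset the same steps would be applied per column, and a constant column would give NaN scores because its standard deviation is zero.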
source https://stackoverflow.com/questions/69381821/how-to-clean-a-dataset-having-string-values