I am conducting research on the relationships between different communities on Reddit. As part of my data research I have about 49,000 CSVs, one for each sub I scraped; for all the posts I could get, I recorded each commenter along with their total karma and number of comments (see pic for format). Each CSV contains all the commenters I have collected for that individual sub.
I want to take each sub and compare it to another sub to identify how many users they have in common. I don't need a list of which users they are, just the number the two have in common. I also want to set a karma threshold; in the code I currently have, it is set to more than 30 karma.
For the code I have, I prepare two lists of users who are over the threshold, then I convert both of those lists to sets and "&" them together.

Since I am re-reading the same CSVs over and over again to compare them against each other, can I prepare them beforehand to make this more efficient, e.g. by sorting user names alphabetically? I plan on re-doing all the CSVs first to get rid of the "removal users", which are all bots. Should I sort by karma instead of by user, so that rather than checking whether each karma value is above a certain level, the code only reads up to a certain point? Memory usage is not a problem, only speed. I have billions of comparisons to make, so any reduction in time is appreciated. (Sketches of both ideas follow the code below.)

Here is my code:
import datatable as dt
import pandas as pd

# Known bot accounts to exclude, loaded once up front
remove_list = dt.fread('D:/Spring 2023/red/c7/removal_users.csv')
drops = remove_list['C0'].to_list()[0]

def get_name(current):
    # Extract the sub name from the CSV file path
    return current.split('/')[-1].split('.')[-2]

def get_scores(current):
    # Each CSV has two columns: user and total karma, no header row
    sub = pd.read_csv(current, sep=',', header=None)
    sub.columns = ['user', 'score']
    sub = sub[~sub['user'].isin(drops)]  # drop the bot accounts
    sub = sub[sub['score'] > 30]         # karma threshold
    return list(sub['user'])

def common_users(list1, list2):
    return len(set(list1) & set(list2))
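For reference, here is a minimal sketch of the "prepare once, reuse many times" idea from the question above: parse and filter every CSV a single time, cache the resulting user sets in memory, then intersect the cached sets pairwise. It reuses get_scores() from the code above; the folder path is an assumption about how the files are laid out.

from pathlib import Path
from itertools import combinations

user_sets = {}
for path in Path('D:/Spring 2023/red/c7/subs').glob('*.csv'):  # hypothetical folder
    user_sets[path.stem] = set(get_scores(path))  # parse and filter each file once

overlaps = {}
for a, b in combinations(user_sets, 2):
    overlaps[(a, b)] = len(user_sets[a] & user_sets[b])  # reuse the cached sets

And a sketch of the "sort by karma" idea: if each preprocessed CSV is rewritten sorted by karma, highest first, a reader can stop at the first row at or below the threshold instead of scanning the whole file. The descending sort order, the two-column layout, and the drops_set variable are assumptions, not part of the original code.

import csv

drops_set = set(drops)  # set membership tests are O(1), unlike a list

def get_scores_sorted(current, threshold=30):
    # Assumes each row is "user,score" and rows are pre-sorted by score descending
    names = []
    with open(current, newline='') as f:
        for user, score in csv.reader(f):
            if int(score) <= threshold:
                break  # every remaining row is at or below the threshold
            if user not in drops_set:
                names.append(user)
    return names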
Source: https://stackoverflow.com/questions/76576093/fastest-way-to-find-the-size-of-the-union-of-two-lists-sets-coding-efficiency