Skip to main content

What is making this Python code so slow? How can I modify it to run faster?

I am writing a program in Python for a data analytics project involving advertisement performance data matched to advertisement characteristics aimed at identifying high performing groups of ads that share n similar characteristics. The dataset I am using has individual ads as rows, and characteristic, summary, and performance data as columns. Below is my current code - the actual dataset I am using has 51 columns, 4 are excluded, so it is running with 47 C 4, or 178365 iterations in the outer loop.

Currently, this code takes ~2 hours to execute. I know that nested for loops can be the source of such a problem, but I do not know why it is taking so long to run, and am not sure how I can modify the inner/outer for loops to improve performance. Any feedback on either of these topics would be greatly appreciated.

import itertools
import pandas as pd
import numpy as np

# Identify Clusters of Rows (Ads) that have a KPI value above a certain threshold
def set_groups(df, n):
    """This function takes a dataframe and a number n, and returns a list of lists. Each list is a group of n columns.
    The list of lists will hold all size n combinations of the columns in the dataframe.
    """
    # Create a list of all relevant column names
    columns = list(df.columns[4:]) # exclude first 4 summary columns
    # Create a list of lists, where each list is a group of n columns
    groups = []
    vals_lst = list(map(list, itertools.product([True, False], repeat=n))) # Create a list of all possible combinations of 0s and 1s
    for comb in itertools.combinations(columns, n): # itertools.combinations returns a list of tuples
        groups.append([comb, vals_lst])
    groups = np.array(groups,dtype=object)
    return groups  # len(groups) = len(columns(df)) choose n

def identify_clusters(df, KPI, KPI_threshhold, max_size, min_size, groups):
    """
    This function takes in a dataframe, a KPI, a threshhold value, a max and min size, and a list of lists of groupings.
    The function will identify groups of rows in the dataframe that have the same values for each column in each list of groupings.
    The function will return a list of lists with each list of groups, the values list, and the ad_ids in the cluster.
    """
    # Create a dictionary to hold the results
    output = []
    # Iterate through each list of groups
    for group in groups:
        for vals_lst in group[1]:  # for each pair of groups and associated value matrices
            # Create a temporary dataframe to hold the group of rows with matching values for columns in group
            temp_df = df
            for i in range(len(group[0])):
                temp_df = temp_df[(temp_df[group[0][i]] == vals_lst[i])]  # reduce the temp_df to only rows that match the values in vals_lst for each combination of values
            if temp_df[KPI].mean() > KPI_threshhold:  # if the mean of the KPI for the temp_df is above the threshhold
                output.append([group, vals_lst, temp_df['ad_id'].values])  # append the group, vals_lst, and ad_ids to the output list
    print(output)
    return output

## Main
df = pd.read_excel('data.xlsx', sheet_name='name')
groups = set_groups(df, 4)
print(len(groups))
identify_clusters(df, 'KPI_var', 0.0015, 6, 4, groups)

Any insight into why the code is taking such a long time to run, and/or any advice on improving the performance of this code would be extremely helpful.



source https://stackoverflow.com/questions/74158253/what-is-making-this-python-code-so-slow-how-can-i-modify-it-to-run-faster

Comments

Popular posts from this blog

ValueError: X has 10 features, but LinearRegression is expecting 1 features as input

So, I am trying to predict the model but its throwing error like it has 10 features but it expacts only 1. So I am confused can anyone help me with it? more importantly its not working for me when my friend runs it. It works perfectly fine dose anyone know the reason about it? cv = KFold(n_splits = 10) all_loss = [] for i in range(9): # 1st for loop over polynomial orders poly_order = i X_train = make_polynomial(x, poly_order) loss_at_order = [] # initiate a set to collect loss for CV for train_index, test_index in cv.split(X_train): print('TRAIN:', train_index, 'TEST:', test_index) X_train_cv, X_test_cv = X_train[train_index], X_test[test_index] t_train_cv, t_test_cv = t[train_index], t[test_index] reg.fit(X_train_cv, t_train_cv) loss_at_order.append(np.mean((t_test_cv - reg.predict(X_test_cv))**2)) # collect loss at fold all_loss.append(np.mean(loss_at_order)) # collect loss at order plt.plot(np.log(al...

Sorting large arrays of big numeric stings

I was solving bigSorting() problem from hackerrank: Consider an array of numeric strings where each string is a positive number with anywhere from to digits. Sort the array's elements in non-decreasing, or ascending order of their integer values and return the sorted array. I know it works as follows: def bigSorting(unsorted): return sorted(unsorted, key=int) But I didnt guess this approach earlier. Initially I tried below: def bigSorting(unsorted): int_unsorted = [int(i) for i in unsorted] int_sorted = sorted(int_unsorted) return [str(i) for i in int_sorted] However, for some of the test cases, it was showing time limit exceeded. Why is it so? PS: I dont know exactly what those test cases were as hacker rank does not reveal all test cases. source https://stackoverflow.com/questions/73007397/sorting-large-arrays-of-big-numeric-stings

How to load Javascript with imported modules?

I am trying to import modules from tensorflowjs, and below is my code. test.html <!DOCTYPE html> <html lang="en"> <head> <meta charset="UTF-8"> <title>Document</title </head> <body> <script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@2.0.0/dist/tf.min.js"></script> <script type="module" src="./test.js"></script> </body> </html> test.js import * as tf from "./node_modules/@tensorflow/tfjs"; import {loadGraphModel} from "./node_modules/@tensorflow/tfjs-converter"; const MODEL_URL = './model.json'; const model = await loadGraphModel(MODEL_URL); const cat = document.getElementById('cat'); model.execute(tf.browser.fromPixels(cat)); Besides, I run the server using python -m http.server in my command prompt(Windows 10), and this is the error prompt in the console log of my browser: Failed to loa...