
How can I wrap all BeautifulSoup existing find/select methods in order to add additional logic and parameters?

I have a repetitive sanity-check process I go through with most calls to a BeautifulSoup object where I:

  1. Make the function call (.find, .find_all, .select_one, and .select mostly)
  2. Check to make sure the element(s) were found
    • If not found, I raise a custom MissingHTMLTagError, stopping the process there.
  3. Attempt to retrieve attribute(s) from the element(s) (using .get or getattr)
    • If not found, I raise a custom MissingHTMLAttributeError
  4. Return either a:
    • string, when it's a single attribute of a single element (.find and .select_one)
    • list of strings, when it's a single attribute of multiple elements (.find_all and .select)
    • dict, when it's two attributes (key/value pairs) for multiple elements (.find_all and .select)
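Written out by hand, the routine above amounts to roughly the following. This is only a minimal sketch, separate from the proxy function shown later in this post: the two exception classes are hypothetical stand-ins for my custom errors, the helper names are made up for illustration, and the elements are assumed to expose a .get method the way BeautifulSoup tags do.

```python
class MissingHTMLTagError(Exception):
    """Hypothetical stand-in: the lookup returned None or an empty list."""


class MissingHTMLAttributeError(Exception):
    """Hypothetical stand-in: a found element lacks the requested attribute."""


def require_found(result, selector):
    # Step 2: find/select_one return None and find_all/select return an
    # empty list when nothing matches.
    if result is None or (isinstance(result, list) and not result):
        raise MissingHTMLTagError(f"no element matched {selector!r}")
    return result


def require_attribute(element, name, selector):
    # Step 3: .get returns None when the attribute is absent.
    # (This sketch accepts empty strings; a stricter check could reject them.)
    value = element.get(name)
    if value is None:
        raise MissingHTMLAttributeError(f"{selector!r} has no attribute {name!r}")
    return value


def extract(result, attribute, selector):
    # Step 4: shape the return value by what was asked for and what was found.
    require_found(result, selector)
    if isinstance(attribute, tuple):
        # Two attribute names: build key/value pairs, one per element.
        key, value = attribute
        elements = result if isinstance(result, list) else [result]
        return {require_attribute(el, key, selector): el.get(value, "") for el in elements}
    if isinstance(result, list):
        # One attribute name, many elements: list of strings.
        return [require_attribute(el, attribute, selector) for el in result]
    # One attribute name, one element: a single string.
    return require_attribute(result, attribute, selector)
```

With a real soup, a call would look like extract(soup.find_all("a"), "href", "a"), repeated with small variations at nearly every lookup, which is the boilerplate I want to get rid of.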

I've created the solution below, which acts as a (not-so-elegant) proxy to the BeautifulSoup methods, but I'm hoping there is an easier way to accomplish this. Basically, I want to be able to patch all the BeautifulSoup methods to:

  1. Allow for an extra parameter to be passed, so that the above steps are taken care of in a single call
  2. When the extra parameter is not provided, return the BeautifulSoup objects as normal, or raise the MissingHTMLTagError if the return value is None or an empty list.

Most of the time the below function is used with a class variable (self._soup), which is just a BeautifulSoup object of the most-recent requests.Response.

import requests
from bs4 import BeautifulSoup

def get_html_value(self, element, attribute=None, soup=None, func="find", **kwargs):
    """A one-step method to return html element attributes.

    A proxy function that handles passing parameters to BeautifulSoup object instances
    while reducing the amount of boilerplate code needed to get an element, validate its existence,
    then do the same for the attribute of that element. All while managing raising proper exceptions for debugging.
    
    **Examples:**
    # Get a single attribute from a single element using BeautifulSoup.find
    >> self.get_html_value("a", "href", attrs={"class": "report-list"})
    >> "example.com/page"
    # Get a single attribute from multiple elements using BeautifulSoup.find_all
    >> self.get_html_value("a", "href", func="find_all", attrs={"class": "top-nav-link"})
    >> ["example.com/category1", "example.com/category2", "example.com/category3"]
    # Getting key/value pairs (representing hidden input fields for POST requests)
    # from a fragment of the full html page (login_form) that only contains the form controls
    >> self.get_html_value("input", ("name", "value"), soup=login_form, func="find_all", attrs={"type": "hidden"})
    >> {"csrf_token": "a1b23c456def", "viewstate": "wxyzqwerty"}
    # Find an element based on one of its parents using func="select_one"
    >> account_balance = self.get_html_value("div#account-details > section > h1", func="select_one")
    >> account_balance.string
    >> "$12,345.67"
    # Using func="select" with no attribute will return BeautifulSoup objects
    >> self.get_html_value("div#accounts > div a", func="select")
    >> [<a href="...">Act. 1</a>, <a href="...">Act. 2</a>, <a href="...">Act. 3</a>]
    # Using func="select" with attribute will return list of values
    >> self.get_html_value("div#accounts > div a", attribute="href", func="select")
    >> ["example.com/account1", "example.com/account2", "example.com/account3"]
    """
    if not any([soup, self._soup]):
        raise ValueError("Class property soup not set and soup parameter not provided")
    elif soup:
        # provide parsing for strings and requests.Responses
        if isinstance(soup, str):
            soup = BeautifulSoup(soup, "html.parser")
        elif isinstance(soup, requests.Response):
            soup = BeautifulSoup(soup.text, "html.parser")
    else:
        soup = self._soup
 
    # attribute is optional; only validate its type when one was provided
    if attribute is not None and not isinstance(attribute, (str, tuple)):
        raise TypeError("attribute can only be a string or a tuple")
    if isinstance(attribute, tuple) and len(attribute) != 2:
        raise ValueError("attribute can only be a string or tuple of 2 strings (key/value pairing)")
 
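    # Look the method up on the class: attribute access on a soup instance
    # falls back to soup.find(name), so getattr(soup, func) can return a Tag
    # instead of raising for an invalid method name.
    if not callable(getattr(type(soup), func, None)):
        raise AttributeError("Method %s not found in the BeautifulSoup package" % func)
    bs_func = getattr(soup, func)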
 
    bs_element = bs_func(element, **kwargs) if kwargs else bs_func(element)
    if not bs_element:
        raise MissingHtmlError(self, element, None, soup, func, kwargs)
    if attribute:
        if isinstance(attribute, str):
            # attribute is a single name: return one value or a list of values
            if isinstance(bs_element, list):
                # single attribute for multiple elements
                bs_attributes = []
                for el in bs_element:
                    el_attribute = el.get(attribute)
                    if not el_attribute:
                        raise MissingHtmlError(self, element, attribute, soup, func, kwargs)
                    bs_attributes.append(el_attribute)
                return bs_attributes
            else:
                # single attribute for single element
                bs_attribute = bs_element.get(attribute)
                if not bs_attribute:
                    raise MissingHtmlError(self, element, attribute, soup, func, kwargs)
                return bs_attribute
        else:
            # attribute is a (key, value) pair of names: return a dict
            key, value = attribute
            if isinstance(bs_element, list):
                # attribute pairs for multiple elements
                bs_attributes = {}
                for el in bs_element:
                    el_key = el.get(key)
                    if el_key is None:
                        raise MissingHtmlError(self, element, attribute, soup, func, kwargs)
                    bs_attributes[el_key] = el.get(value, "")
                return bs_attributes
            else:
                # attribute pair for a single element
                el_key = bs_element.get(key)
                if el_key is None:
                    raise MissingHtmlError(self, element, attribute, soup, func, kwargs)
                return {el_key: bs_element.get(value, "")}
    # no attribute was provided, so return the requested element(s)
    return bs_element

Is there any way to wrap all of the exposed .find and .select-type methods of BeautifulSoup, so I can still use the methods normally (ex: soup.find()) instead of having to use my workaround function?
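To make the goal concrete, here is a minimal sketch of the kind of interface I have in mind, using plain delegation through __getattr__ rather than patching BeautifulSoup itself. Everything in it is an assumption for illustration: the CheckedSoup class, the attribute keyword, the helper names, and the two exception classes (stand-ins for my custom errors) are hypothetical, and it only relies on the wrapped object having the four lookup methods and on elements having .get.

```python
class MissingHTMLTagError(Exception):
    """Hypothetical stand-in for the custom tag error."""


class MissingHTMLAttributeError(Exception):
    """Hypothetical stand-in for the custom attribute error."""


# Lookup methods that get the extra checks; everything else passes through.
WRAPPED_METHODS = frozenset({"find", "find_all", "select_one", "select"})


def _get_required(element, name):
    value = element.get(name)
    if value is None:
        raise MissingHTMLAttributeError(f"attribute {name!r} not found")
    return value


def _extract(result, attribute):
    many = isinstance(result, list)
    if isinstance(attribute, tuple):
        # (key, value) pair of attribute names: build a dict
        key, value = attribute
        elements = result if many else [result]
        return {_get_required(el, key): el.get(value, "") for el in elements}
    if many:
        return [_get_required(el, attribute) for el in result]
    return _get_required(result, attribute)


class CheckedSoup:
    """Delegates to a wrapped soup, adding checks to the lookup methods."""

    def __init__(self, soup):
        self._soup = soup

    def __getattr__(self, name):
        # Only reached when normal lookup fails; guard against recursion
        # if _soup itself has not been set yet.
        if name == "_soup":
            raise AttributeError(name)
        target = getattr(self._soup, name)
        if name not in WRAPPED_METHODS:
            return target

        def checked(*args, attribute=None, **kwargs):
            result = target(*args, **kwargs)
            if result is None or (isinstance(result, list) and not result):
                raise MissingHTMLTagError(f"{name}{args} matched nothing")
            # Without the extra parameter, return the element(s) as normal.
            return result if attribute is None else _extract(result, attribute)

        return checked
```

Usage would then be soup = CheckedSoup(BeautifulSoup(html, "html.parser")) followed by soup.find("a", attribute="href") or soup.select("div#accounts > div a"). Two limitations of this sketch: the elements it returns are ordinary BeautifulSoup objects, so lookups chained on them are not checked, and the attribute keyword is intercepted by the wrapper, so it can no longer be used as a keyword filter for an HTML attribute literally named "attribute".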



source https://stackoverflow.com/questions/67756520/how-can-i-wrap-all-beautifulsoup-existing-find-select-methods-in-order-to-add-ad
