sgfixedincome_pkg.scraper
=========================

.. py:module:: sgfixedincome_pkg.scraper


Functions
---------

.. autoapisummary::

   sgfixedincome_pkg.scraper.fetch_webpage
   sgfixedincome_pkg.scraper.extract_table
   sgfixedincome_pkg.scraper.table_to_df
   sgfixedincome_pkg.scraper.parse_bounds
   sgfixedincome_pkg.scraper.parse_tenure
   sgfixedincome_pkg.scraper.clean_rate_value
   sgfixedincome_pkg.scraper.reshape_table
   sgfixedincome_pkg.scraper.scrape_deposit_rates


Module Contents
---------------

.. py:function:: fetch_webpage(url)

   Fetches webpage content from the given URL.

   :param url: The URL of the website to scrape.
   :type url: str

   :returns: Parsed HTML content of the page.
   :rtype: BeautifulSoup

   :raises Exception: If the webpage cannot be fetched or parsed.


.. py:function:: extract_table(soup, table_class)

   Locates tables with the specified class in the parsed HTML.

   :param soup: Parsed HTML content.
   :type soup: BeautifulSoup
   :param table_class: Class name of the table(s) to locate.
   :type table_class: str

   :returns: A list of located elements.
   :rtype: list

   :raises Exception: If no tables with the specified class are found.


.. py:function:: table_to_df(table)

   Converts an HTML
   ``<table>`` element into a pandas DataFrame.

   This function takes in a BeautifulSoup Tag object representing a table
   and extracts the rows and columns of data. It can handle both traditional
   tables where headers are inside ``<th>`` tags, as well as tables where the
   header row is indistinguishable from the other rows. In such cases, the
   header row would simply be the first row in ``<tbody>``, and contents would
   be found within ``<td>`` tags in the first ``<tr>`` row. Each row’s data is
   stored as a list of cell values, which are then used to construct a pandas
   DataFrame.

   :param table: A BeautifulSoup Tag object representing the ``<table>``.
   :type table: Tag

   :returns: A pandas DataFrame containing the extracted table data.
   :rtype: pd.DataFrame

   :raises Exception: If the table data extraction fails. For example, when there is
      an issue during the row data extraction process, such as missing ``<tr>``, ``<th>``, ``<td>``
      tags or malformed rows.


.. py:function:: parse_bounds(deposit_range)

   Parses deposit range to extract lower and upper bounds, ensuring inclusive bounds.

   :param deposit_range: String representing the deposit range.
   :type deposit_range: str

   :returns: A tuple containing the lower and upper bounds as floats.
             If only the upper bound exists, the lower bound is set to 0.
             If only the lower bound exists, the upper bound is set to 99,999,999.
   :rtype: tuple

   :raises ValueError: If the range cannot be parsed, if the lower bound is greater
      than the upper bound, or if the range is nonsensical
      (e.g., "<10000 - 20000" or "10000 - >20000").

   .. rubric:: Examples

   - "$1,000 - $9,999" -> (1000.0, 9999.0)
   - ">S$20,000 - S$50,000" -> (20000.01, 50000.0)
   - "Below S$50,000" -> (0.0, 49999.99)
   - "S$50,000 - S$249,999" -> (50000.0, 249999.0)
   - ">$5,000" -> (5000.01, 99999999.0)
   - "Above 30,000" -> (30000.01, 99999999.0)


.. py:function:: parse_tenure(period_str, header_str)

   Ensures the tenure period is in months and parses it.

   This function extracts the tenure information from `period_str`. The tenure
   is expected to be in months and indicated by keywords such as "month" or
   "mth" in either `period_str` or the `header_str`.

   :param period_str: Tenure period as a string (e.g., "6-12 months").
   :type period_str: str
   :param header_str: Column header to verify if data represents months.
   :type header_str: str

   :returns: List of integer months if the tenure is valid.
   :rtype: list

   :raises ValueError: If the tenure cannot be parsed or is not in months.

   .. rubric:: Examples

   >>> parse_tenure("9 mths", header_str="Period")
   [9]
   >>> parse_tenure("6-month", header_str="Tenor (% p.a.)")
   [6]
   >>> parse_tenure("6-8", header_str="Tenure (months)")
   [6, 7, 8]
   >>> parse_tenure("12", header_str="Tenure (months)")
   [12]
   >>> parse_tenure("6-12 weeks", header_str="Tenure in weeks")
   ValueError: Neither header 'Tenure in weeks' nor content '6-12 weeks' indicates months.


.. py:function:: clean_rate_value(rate_value)

   Cleans the rate value by removing any non-numeric characters and converting to a float.

   If the rate value is a string representing 'N.A', 'N.A.', or similar
   (case-insensitive), it returns None.

   :param rate_value: The rate value which may include non-numeric characters
                      (e.g., '%', 'N.A.') or be a valid numeric value.
   :type rate_value: str or float

   :returns: The cleaned rate value as a float, or None if the value represents 'N.A'.
   :rtype: float or None

   :raises ValueError: If the rate value cannot be converted to a float and isn't
      a valid 'N.A.' string.

   .. rubric:: Examples

   - clean_rate_value("5%") -> 5.0
   - clean_rate_value("N.A.") -> None
   - clean_rate_value("3.5") -> 3.5


.. py:function:: reshape_table(raw_df)

   Reshapes the raw DataFrame into a structured format for analysis.

   :param raw_df: The raw DataFrame containing fixed deposit rate data.
                  The first column contains tenure in months
                  (e.g., 'Period', 'Tenor', or 'Tenure'). The other columns
                  contain rates for different deposit ranges
                  (e.g., '$1,000-$9,999').
   :type raw_df: pd.DataFrame

   :returns: A reshaped DataFrame with the following columns:

             - Tenure: The duration in months (as float).
             - Rate: The deposit rates (as float).
             - Deposit lower bound: The lower bound of the deposit range (as float).
             - Deposit upper bound: The upper bound of the deposit range
               (as float, or None if not specified).
   :rtype: pd.DataFrame

   :raises ValueError: If the first column does not contain keywords indicating
      tenure information.


.. py:function:: scrape_deposit_rates(url, table_class, provider, req_multiples=None)

   Scrapes deposit rates from the given URL and manually adds extra information.

   Sometimes, bank websites use the same class for multiple tables, including
   the key table of interest with fixed deposit rates. To enable our scraper to
   work in such cases, we attempt to scrape data for each of these tables,
   starting with the first.
   We ignore additional tables once we find one that can be successfully
   scraped. The intuition is that we would only be able to successfully scrape
   tables with our desired data, and attempted scraping of tables containing
   other information would fail.

   :param url: URL of the website to scrape. The website should contain a table
               of fixed deposit rates.
   :type url: str
   :param table_class: Class name of the table to locate in the website.
   :type table_class: str
   :param provider: The name of the provider offering the fixed deposit products.
   :type provider: str
   :param req_multiples: The required multiples for the deposit, if applicable.
                         Defaults to None.
   :type req_multiples: float or None, optional

   :returns: A pandas DataFrame containing the reshaped deposit rates data,
             with additional columns:

             - Required multiples: The value provided in `req_multiples`.
             - Product provider: The value provided in `provider`.
             - Product: A static string "Fixed Deposit" indicating the type of product.
   :rtype: pd.DataFrame

   :raises Exception: If the scraping or data extraction process fails.
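
The "try each table in order and keep the first that parses" behaviour described for ``scrape_deposit_rates`` can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the package's actual implementation: ``first_successful`` and ``parse_rate_rows`` are hypothetical names invented for this example, and plain nested lists stand in for the HTML tables that the real function would obtain from the page.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def first_successful(candidates: Iterable[T], parse: Callable[[T], R]) -> R:
    """Return parse(c) for the first candidate that parses without raising.

    Hypothetical helper for illustration; not part of sgfixedincome_pkg.
    """
    errors: List[str] = []
    for index, candidate in enumerate(candidates):
        try:
            return parse(candidate)
        except Exception as exc:  # any failure simply means "try the next table"
            errors.append(f"candidate {index}: {exc}")
    # Every candidate failed (or there were none): report all collected errors.
    raise Exception("No candidate could be parsed: " + "; ".join(errors))


def parse_rate_rows(rows):
    """Toy parser (assumption): a 'Tenure' header row followed by numeric rows."""
    header, *body = rows
    if "Tenure" not in header[0]:
        raise ValueError(f"first column {header[0]!r} does not describe tenure")
    return [(int(months), float(rate)) for months, rate in body]


# Three stand-in "tables" that share the same class on a page.
tables = [
    [["Branch", "Address"], ["Main", "1 Example Road"]],         # unrelated, fails
    [["Tenure (months)", "Rate"], ["6", "2.5"], ["12", "2.9"]],  # rates, succeeds
    [["Tenure (months)", "Rate"], ["24", "3.1"]],                # never reached
]

result = first_successful(tables, parse_rate_rows)
print(result)  # [(6, 2.5), (12, 2.9)]
```

One consequence of this design is that the result depends on table order: if an earlier table parses successfully without being the intended one, the later, correct table is never inspected.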