sgfixedincome_pkg.scraper
Functions
fetch_webpage(url) | Fetches webpage content from the given URL.
extract_table(soup, table_class) | Locates tables with the specified class in the parsed HTML.
table_to_df(table) | Converts an HTML <table> element into a pandas DataFrame.
parse_bounds(deposit_range) | Parses deposit range to extract lower and upper bounds, ensuring inclusive bounds.
parse_tenure(period_str, header_str) | Ensures the tenure period is in months and parses it.
clean_rate_value(rate_value) | Cleans the rate value by removing any non-numeric characters and converting to a float.
reshape_table(raw_df) | Reshapes the raw DataFrame into a structured format for analysis.
scrape_deposit_rates(url, table_class, provider, req_multiples=None) | Scrapes deposit rates from the given URL and manually adds extra information.
Module Contents
- sgfixedincome_pkg.scraper.fetch_webpage(url)[source]
Fetches webpage content from the given URL.
- Parameters:
url (str) – The URL of the website to scrape.
- Returns:
Parsed HTML content of the page.
- Return type:
BeautifulSoup
- Raises:
Exception – If the webpage cannot be fetched or parsed.
- sgfixedincome_pkg.scraper.extract_table(soup, table_class)[source]
Locates tables with the specified class in the parsed HTML.
- Parameters:
soup (BeautifulSoup) – Parsed HTML content.
table_class (str) – Class name of the table(s) to locate.
- Returns:
A list of located <table> elements.
- Return type:
list
- Raises:
Exception – If no tables with the specified class are found.
- sgfixedincome_pkg.scraper.table_to_df(table)[source]
Converts an HTML <table> element into a pandas DataFrame.
This function takes a BeautifulSoup Tag object representing a table and extracts its rows and columns. It handles both conventional tables, whose headers sit inside <th> tags, and tables whose header row is marked up like any other row. In the latter case, the header is taken to be the first <tr> row in <tbody>, with its contents in <td> tags. Each row's data is stored as a list of cell values, and these lists are used to construct a pandas DataFrame.
- Parameters:
table (Tag) – A BeautifulSoup Tag object representing the <table>.
- Returns:
A pandas DataFrame containing the extracted table data.
- Return type:
pd.DataFrame
- Raises:
Exception – If the table data extraction fails, for example because <tbody>, <tr>, or <td> tags are missing or a row is malformed.
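The two header layouts described for table_to_df can be illustrated with a small standard-library sketch. It deliberately swaps in Python's html.parser instead of BeautifulSoup and returns plain lists instead of a DataFrame, so it is not the package's implementation; the names TableRows and sketch_table_rows are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collects the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def sketch_table_rows(html):
    """Return (header, body_rows); the first row is the header in both layouts."""
    parser = TableRows()
    parser.feed(html)
    if not parser.rows:
        raise Exception("No rows could be extracted from the table.")
    header, *body = parser.rows
    return header, body


# Header in <th> tags versus header as the first <td> row of <tbody>.
html_th = ("<table><thead><tr><th>Period</th><th>$1,000-$9,999</th></tr></thead>"
           "<tbody><tr><td>6 mths</td><td>2.5%</td></tr></tbody></table>")
html_td = ("<table><tbody><tr><td>Period</td><td>$1,000-$9,999</td></tr>"
           "<tr><td>6 mths</td><td>2.5%</td></tr></tbody></table>")
print(sketch_table_rows(html_th))
print(sketch_table_rows(html_td))
```

Because cells are collected per row regardless of tag type, both layouts yield the same header and body rows.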
- sgfixedincome_pkg.scraper.parse_bounds(deposit_range)[source]
Parses deposit range to extract lower and upper bounds, ensuring inclusive bounds.
- Parameters:
deposit_range (str) – String representing the deposit range
- Returns:
A tuple containing the lower and upper bounds as floats. If only an upper bound exists, the lower bound is set to 0. If only a lower bound exists, the upper bound is set to 99,999,999.
- Return type:
tuple
- Raises:
ValueError – If the range cannot be parsed, if the lower bound is greater than the upper bound, or if the range is nonsensical (e.g., “<10000 - 20000” or ‘10000 - >20000’).
Examples
“$1,000 - $9,999” -> (1000.0, 9999.0)
“>S$20,000 - S$50,000” -> (20000.01, 50000.0)
“Below S$50,000” -> (0.0, 49999.99)
“S$50,000 - S$249,999” -> (50000.0, 249999.0)
“>$5,000” -> (5000.01, 99999999.0)
“Above 30,000” -> (30000.01, 99999999.0)
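The examples above suggest that an exclusive bound is made inclusive by shifting it one cent. The following is a minimal sketch of that rule, not the package's code: the function name sketch_parse_bounds, the 0.01 step, and the exact set of rejected inputs are assumptions inferred from the documented examples.

```python
import re

MAX_DEPOSIT = 99_999_999.0  # documented default when only a lower bound exists
STEP = 0.01                 # assumed one-cent shift for exclusive bounds


def sketch_parse_bounds(deposit_range):
    """Hypothetical re-creation of the documented behaviour."""
    text = deposit_range.strip().lower()
    numbers = [float(n.replace(",", ""))
               for n in re.findall(r"\d[\d,]*(?:\.\d+)?", text)]
    above = text.startswith((">", "above"))
    below = text.startswith(("<", "below"))

    if len(numbers) == 2:
        _, dash, second = text.partition("-")
        # Reject ranges such as "<10000 - 20000" or "10000 - >20000".
        if not dash or below or second.lstrip().startswith((">", "<", "above", "below")):
            raise ValueError(f"Nonsensical deposit range: {deposit_range!r}")
        lower, upper = numbers
        if above:
            lower = round(lower + STEP, 2)
    elif len(numbers) == 1 and above:
        lower, upper = round(numbers[0] + STEP, 2), MAX_DEPOSIT
    elif len(numbers) == 1 and below:
        lower, upper = 0.0, round(numbers[0] - STEP, 2)
    else:
        raise ValueError(f"Cannot parse deposit range: {deposit_range!r}")

    if lower > upper:
        raise ValueError(f"Lower bound exceeds upper bound in {deposit_range!r}")
    return lower, upper


print(sketch_parse_bounds(">S$20,000 - S$50,000"))
print(sketch_parse_bounds("Below S$50,000"))
```

Rounding to two decimals after the shift keeps the result equal to the cent value written in the examples, avoiding floating-point noise.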
- sgfixedincome_pkg.scraper.parse_tenure(period_str, header_str)[source]
Ensures the tenure period is in months and parses it.
This function extracts the tenure information from period_str. The tenure is expected to be in months and indicated by keywords such as “month” or “mth” in either period_str or the header_str.
- Parameters:
period_str (str) – Tenure period as a string (e.g., “6-12 months”).
header_str (str) – Column header to verify if data represents months.
- Returns:
List of integer months if the tenure is valid.
- Return type:
list
- Raises:
ValueError – If the tenure cannot be parsed or is not in months.
Examples
>>> parse_tenure("9 mths", header_str="Period")
[9]
>>> parse_tenure("6-month", header_str="Tenor (% p.a.)")
[6]
>>> parse_tenure("6-8", header_str="Tenure (months)")
[6, 7, 8]
>>> parse_tenure("12", header_str="Tenure (months)")
[12]
>>> parse_tenure("6-12 weeks", header_str="Tenure in weeks")
ValueError: Neither header 'Tenure in weeks' nor content '6-12 weeks' indicates months.
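The month check and range expansion shown in these examples can be sketched in a few lines. This is an illustrative re-creation under the assumption that only the keywords "month" and "mth" are recognised and that a range expands to every whole month it covers; sketch_parse_tenure is a hypothetical name, not the package's function.

```python
import re

MONTH_PATTERN = re.compile(r"month|mth", re.IGNORECASE)


def sketch_parse_tenure(period_str, header_str):
    """Hypothetical re-creation: verify months, then expand the tenure."""
    if not (MONTH_PATTERN.search(period_str) or MONTH_PATTERN.search(header_str)):
        raise ValueError(
            f"Neither header {header_str!r} nor content {period_str!r} indicates months."
        )
    numbers = [int(n) for n in re.findall(r"\d+", period_str)]
    if len(numbers) == 1:
        return numbers
    if len(numbers) == 2 and numbers[0] <= numbers[1]:
        # A range such as "6-8" covers every whole month from 6 to 8.
        return list(range(numbers[0], numbers[1] + 1))
    raise ValueError(f"Cannot parse tenure from {period_str!r}")


print(sketch_parse_tenure("6-8", header_str="Tenure (months)"))
```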
- sgfixedincome_pkg.scraper.clean_rate_value(rate_value)[source]
Cleans the rate value by removing any non-numeric characters and converting to a float.
If the rate value is a string representing ‘N.A’, ‘N.A.’, or similar (case-insensitive), it returns None.
- Parameters:
rate_value (str or float) – The rate value which may include non-numeric characters (e.g., ‘%’, ‘N.A.’) or be a valid numeric value.
- Returns:
The cleaned rate value as a float, or None if the value represents ‘N.A’.
- Return type:
float or None
- Raises:
ValueError – If the rate value cannot be converted to a float and isn’t a valid ‘N.A.’ string.
Examples
clean_rate_value("5%") -> 5.0
clean_rate_value("N.A.") -> None
clean_rate_value("3.5") -> 3.5
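A minimal sketch of the cleaning rule described above follows. The regular expressions, including which spellings count as 'N.A', are assumptions, and sketch_clean_rate_value is a hypothetical stand-in for the package function.

```python
import re


def sketch_clean_rate_value(rate_value):
    """Hypothetical re-creation: return a float, or None for 'N.A'-style values."""
    if isinstance(rate_value, (int, float)):
        return float(rate_value)
    text = str(rate_value).strip()
    # Assumed spellings: "N.A", "N.A.", "NA", "n.a." and similar.
    if re.fullmatch(r"n\.?\s*a\.?", text, flags=re.IGNORECASE):
        return None
    cleaned = re.sub(r"[^0-9.\-]", "", text)  # drop '%', spaces, letters, ...
    try:
        return float(cleaned)
    except ValueError:
        raise ValueError(f"Cannot convert rate value {rate_value!r} to float") from None


print(sketch_clean_rate_value("5%"), sketch_clean_rate_value("N.A."))
```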
- sgfixedincome_pkg.scraper.reshape_table(raw_df)[source]
Reshapes the raw DataFrame into a structured format for analysis.
- Parameters:
raw_df (pd.DataFrame) – The raw DataFrame containing fixed deposit rate data. The first column contains tenure in months (e.g., ‘Period’, ‘Tenor’, or ‘Tenure’). The other columns contain rates for different deposit ranges (e.g., ‘$1,000-$9,999’).
- Returns:
A reshaped DataFrame with the following columns:
Tenure: The duration in months (as float).
Rate: The deposit rates (as float).
Deposit lower bound: The lower bound of the deposit range (as float).
Deposit upper bound: The upper bound of the deposit range (as float, or None if not specified).
- Return type:
pd.DataFrame
- Raises:
ValueError – If the first column does not contain keywords indicating tenure information.
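The wide-to-long reshaping described for reshape_table can be pictured with plain Python lists and dictionaries. This sketch uses no pandas, handles only single-number tenures and simple "lower-upper" range headers, and assumes the tenure keywords are "period", "tenor", and "tenure"; sketch_reshape and its inputs are hypothetical.

```python
TENURE_KEYWORDS = ("period", "tenor", "tenure")  # assumed keyword list


def sketch_reshape(header, rows):
    """Turn one row per tenure into one record per (tenure, deposit range)."""
    if not any(keyword in header[0].lower() for keyword in TENURE_KEYWORDS):
        raise ValueError(f"First column {header[0]!r} does not look like a tenure column.")
    records = []
    for row in rows:
        tenure = float(row[0])
        for deposit_range, cell in zip(header[1:], row[1:]):
            lower, upper = (float(part.strip(" $").replace(",", ""))
                            for part in deposit_range.split("-"))
            records.append({
                "Tenure": tenure,
                "Rate": float(cell.rstrip("%")),
                "Deposit lower bound": lower,
                "Deposit upper bound": upper,
            })
    return records


header = ["Tenure (months)", "$1,000-$9,999", "$10,000-$19,999"]
rows = [["6", "2.5%", "2.6%"], ["12", "2.7%", "2.8%"]]
for record in sketch_reshape(header, rows):
    print(record)
```

Two tenure rows with two deposit-range columns produce four records, one per combination.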
- sgfixedincome_pkg.scraper.scrape_deposit_rates(url, table_class, provider, req_multiples=None)[source]
Scrapes deposit rates from the given URL and manually adds extra information.
Sometimes, bank websites use the same class for multiple tables, including the key table of interest with fixed deposit rates. To enable our scraper to work in such cases, we attempt to scrape data for each of these tables, starting with the first. We ignore additional tables once we find one that can be successfully scraped. The intuition is that we would only be able to successfully scrape tables with our desired data, and attempted scraping of tables containing other information would fail.
- Parameters:
url (str) – URL of the website to scrape. The website should contain a table of fixed deposit rates.
table_class (str) – Class name of the table to locate in the website.
provider (str) – The name of the provider offering the fixed deposit products.
req_multiples (optional, float or None) – The required multiples for the deposit, if applicable. Defaults to None.
- Returns:
A pandas DataFrame containing the reshaped deposit rates data, with additional columns:
Required multiples: The value provided in req_multiples.
Product provider: The value provided in provider.
Product: A static string “Fixed Deposit” indicating the type of product.
- Return type:
pd.DataFrame
- Raises:
Exception – If the scraping or data extraction process fails.
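The "try each table, keep the first that works" strategy described for scrape_deposit_rates can be sketched independently of any HTML handling. The helper below is hypothetical: first_successful, its arguments, and the toy process function are illustrations of the pattern, not part of the package.

```python
def first_successful(candidates, process):
    """Return process(candidate) for the first candidate that does not raise."""
    errors = []
    for index, candidate in enumerate(candidates):
        try:
            return process(candidate)
        except Exception as exc:  # any failure means "not the table we want"
            errors.append(f"table {index}: {exc}")
    raise Exception("No table could be scraped: " + "; ".join(errors))


def toy_process(table):
    """Stand-in for the parse-and-reshape step; accepts only tenure tables."""
    if "Tenure" not in table[0]:
        raise ValueError("first column is not a tenure column")
    return {"header": table[0], "rows": len(table) - 1}


tables = [
    [["Fee type", "Amount"], ["Early withdrawal", "$50"]],  # unrelated table
    [["Tenure", "$1,000-$9,999"], ["6", "2.5%"]],           # rates table
]
print(first_successful(tables, toy_process))
```

Collecting the per-table error messages keeps the final exception informative when every candidate fails.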