utils
Utility functions for the Spark application.
check_columns_unique(df, columns)
Checks that each column in the given list contains only unique values in the DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | The input Spark DataFrame. | required |
columns | list | List of column names to check for uniqueness. | required |
Raises:
Type | Description |
---|---|
ValueError | If any column contains duplicate values. |
Source code in code\modules\utils.py
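The source of `check_columns_unique` is not reproduced on this page, so the following is only a minimal sketch of one way such a check could work. It assumes a column counts as unique when its distinct count equals the DataFrame's row count; the error message wording is also an assumption.

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:  # type hints only; the sketch uses DataFrame methods, not PySpark imports
    from pyspark.sql import DataFrame


def check_columns_unique(df: "DataFrame", columns: List[str]) -> None:
    """Raise ValueError if any listed column contains duplicate values."""
    total_rows = df.count()
    # Assumption: a column is unique when its distinct count equals the row count.
    duplicated = [
        name for name in columns
        if df.select(name).distinct().count() != total_rows
    ]
    if duplicated:
        raise ValueError(f"Columns with duplicate values: {', '.join(duplicated)}")
```

Under this approach each column triggers its own Spark job, which is simple to read but may be slow for long column lists. Repeated nulls also collapse to a single distinct value, so a column with more than one null is reported as non-unique.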
create_spark_session(app_name)
Create a Spark session.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
app_name | str | Name of the Spark application. | required |
Returns:
Name | Type | Description |
---|---|---|
SparkSession | SparkSession | Spark session object. |
Source code in code\modules\utils.py
get_logger(name)
Create a logger object with both console and file handlers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the logger. | required |
Returns:
Type | Description |
---|---|
logging.Logger | Logger object. |
Raises:
Type | Description |
---|---|
OSError | If unable to create logs directory or log file. |
Source code in code\modules\utils.py
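As an illustration of a logger with both a console and a file handler, here is a hedged sketch built on the standard `logging` module. The `log_dir` parameter, the `<name>.log` file name, the `INFO` level, and the message format are all assumptions introduced for the example; the documented function takes only `name`.

```python
import logging
import os


def get_logger(name: str, log_dir: str = "logs") -> logging.Logger:
    """Return a logger that writes to the console and to <log_dir>/<name>.log."""
    logger = logging.getLogger(name)
    if logger.handlers:
        # Already configured: avoid attaching duplicate handlers on repeat calls.
        return logger

    logger.setLevel(logging.INFO)  # assumed level
    formatter = logging.Formatter(
        "%(asctime)s | %(name)s | %(levelname)s | %(message)s"  # assumed format
    )

    # Both calls raise OSError if the directory or file cannot be created.
    os.makedirs(log_dir, exist_ok=True)
    file_handler = logging.FileHandler(os.path.join(log_dir, f"{name}.log"))
    console_handler = logging.StreamHandler()

    for handler in (console_handler, file_handler):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger
```

Returning early when handlers are already attached keeps repeated calls such as `get_logger("etl")` from duplicating every log line.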
mask_sensitive_columns(df, sensitive_columns)
Masks sensitive columns in the given Spark DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Input Spark DataFrame. | required |
sensitive_columns | list | List of column names to mask. | required |
Returns:
Name | Type | Description |
---|---|---|
DataFrame | DataFrame | Spark DataFrame with masked sensitive columns. |
Source code in code\modules\utils.py
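This page does not say how the masking is performed. The sketch below assumes one common choice, replacing each sensitive value with its SHA-256 digest through the Spark SQL `sha2` function; the actual function may redact, truncate, or hash differently.

```python
from typing import TYPE_CHECKING, List

if TYPE_CHECKING:  # type hints only; the sketch uses DataFrame methods, not PySpark imports
    from pyspark.sql import DataFrame


def mask_sensitive_columns(df: "DataFrame", sensitive_columns: List[str]) -> "DataFrame":
    """Return a DataFrame whose sensitive columns are replaced by SHA-256 digests."""
    to_mask = set(sensitive_columns)
    expressions = [
        # Assumption: masking means hashing; the documented function may do something else.
        f"sha2(cast(`{name}` as string), 256) as `{name}`" if name in to_mask
        else f"`{name}`"
        for name in df.columns
    ]
    return df.selectExpr(*expressions)
```

Column order and names are preserved, and names in `sensitive_columns` that are absent from the DataFrame are silently ignored in this version.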
profile_data(df)
Profile the data in a Spark DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
df | DataFrame | Spark DataFrame object. | required |
Source code in code\modules\utils.py
read_csv_file(spark, file_directory, infer_schema=True, schema=None)
Read a CSV file into a Spark DataFrame.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
spark | SparkSession | Spark session object. | required |
file_directory | str | Path to the CSV file. | required |
infer_schema | bool | Whether to infer the schema of the CSV file. | True |
schema | str | Schema of the CSV file. | None |
Returns:
Name | Type | Description |
---|---|---|
DataFrame | DataFrame | Spark DataFrame object. |
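To show how `infer_schema` and `schema` might interact, here is a minimal sketch built on `DataFrameReader.csv`. Two points are assumptions rather than documented behavior: the file is treated as having a header row, and an explicit `schema` takes precedence over inference.

```python
from typing import TYPE_CHECKING, Optional

if TYPE_CHECKING:  # type hints only; the sketch relies on the session passed in
    from pyspark.sql import DataFrame, SparkSession


def read_csv_file(
    spark: "SparkSession",
    file_directory: str,
    infer_schema: bool = True,
    schema: Optional[str] = None,
) -> "DataFrame":
    """Read a CSV file into a Spark DataFrame."""
    if schema is not None:
        # Assumption: an explicit schema (for example a DDL string) wins over inference.
        return spark.read.csv(file_directory, header=True, schema=schema)
    # Assumption: the first line of the file holds the column names.
    return spark.read.csv(file_directory, header=True, inferSchema=infer_schema)
```

With this version, a call such as `read_csv_file(spark, "data.csv", schema="id INT, name STRING")` skips the extra pass over the data that schema inference requires; the file name and schema string here are illustrative only.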