Apache Spark Machine Learning project using MLlib and Linear Regression on Databricks!

This project demonstrates machine learning on big data with PySpark, the Python API for Apache Spark. This guide walks through the entire process, from setting up a Databricks environment to performing data analysis and building a linear regression model.
Apache Spark is an open-source, distributed computing system designed for fast and efficient big data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.
PySpark is the Python API for Apache Spark. It lets Python developers tap into Spark's distributed computing capabilities while writing familiar, readable Python code.
You can either upload the data files directly to Databricks or use S3 for storage.
df = spark.read.csv("s3a://your-bucket-name/your-file.csv", header=True, inferSchema=True)
Repository: https://github.com/TravelXML/APACHE-SPARK-PYSPARK-DATABRICKS-MACHINE-LEARNING-MLIB
Notebooks: PYSPARK - LINER REGRESSION.ipynb and PYSPARK ML.ipynb
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator
# Optional: for visualizing predictions later
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
spark = SparkSession.builder.appName('Spark ML Example').getOrCreate()
Replace 's3a://your-bucket-name/your-file.csv' with the actual path to your data file.
# Load the data
file_path = '/FileStore/shared_uploads/[email protected]/test1.csv'
df = spark.read.csv(file_path, header=True, inferSchema=True)
# Index categorical columns
indexer = StringIndexer(inputCols=["sex", "smoker", "day", "time"],
                        outputCols=["sex_indexed", "smoker_indexed", "day_indexed", "time_index"])
df_r = indexer.fit(df).transform(df)
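To see what StringIndexer is doing, here is a minimal pure-Python sketch of its default "frequencyDesc" behavior: the most frequent label gets index 0.0, the next gets 1.0, and so on. The toy values below are illustrative; ties are broken alphabetically here for determinism, which may differ from Spark's exact tie order.

```python
from collections import Counter

def string_index(values):
    # Mimic StringIndexer's default ordering: most frequent label -> 0.0,
    # next most frequent -> 1.0, etc. Ties broken alphabetically here.
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values]

sex = ["Female", "Male", "Male", "Female", "Male"]
print(string_index(sex))  # "Male" is most frequent, so it maps to 0.0
```

This is only a sketch of the idea; in Spark the mapping is learned by `indexer.fit(df)` and applied by `.transform(df)` across the cluster.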
# Assemble features into a vector
featureassembler = VectorAssembler(
    inputCols=['tip', 'size', 'sex_indexed', 'smoker_indexed', 'day_indexed', 'time_index'],
    outputCol="Independent Features")
finalized_data = featureassembler.transform(df_r)
# Select relevant columns
finalized_data = finalized_data.select("Independent Features", "total_bill")
# Split into training (75%) and test (25%) sets
train_data, test_data = finalized_data.randomSplit([0.75, 0.25])
# Train the linear regression model
regressor = LinearRegression(featuresCol='Independent Features', labelCol='total_bill')
regressor = regressor.fit(train_data)
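Under the hood, LinearRegression fits coefficients that minimize squared error. For a single feature, the ordinary least squares solution has a simple closed form; here is a sketch using toy data (not the tips dataset):

```python
def ols_fit(xs, ys):
    # Closed-form OLS for one feature: slope and intercept that
    # minimize the sum of squared residuals.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy points lying exactly on y = 2x + 1
slope, intercept = ols_fit([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(slope, intercept)  # 2.0 1.0
```

After fitting in Spark, the learned parameters are available as `regressor.coefficients` and `regressor.intercept`.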
# Make predictions
predictions = regressor.transform(test_data)
# Evaluate the model
evaluator = RegressionEvaluator(labelCol="total_bill", predictionCol="prediction", metricName="r2")
r2 = evaluator.evaluate(predictions)
evaluator = RegressionEvaluator(labelCol="total_bill", predictionCol="prediction", metricName="mae")
mae = evaluator.evaluate(predictions)
evaluator = RegressionEvaluator(labelCol="total_bill", predictionCol="prediction", metricName="mse")
mse = evaluator.evaluate(predictions)
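To make the metrics concrete, here is what RegressionEvaluator computes, sketched in plain Python on a toy set of labels and predictions (illustrative numbers, not model output):

```python
labels      = [10.0, 20.0, 30.0]
predictions = [12.0, 18.0, 33.0]

n = len(labels)
errors = [p - y for y, p in zip(labels, predictions)]

mae = sum(abs(e) for e in errors) / n            # mean absolute error
mse = sum(e * e for e in errors) / n             # mean squared error

mean_y = sum(labels) / n
ss_res = sum(e * e for e in errors)              # residual sum of squares
ss_tot = sum((y - mean_y) ** 2 for y in labels)  # total sum of squares
r2 = 1 - ss_res / ss_tot                         # coefficient of determination

print(mae, mse, r2)
```

An R² close to 1 means the model explains most of the variance in `total_bill`; MAE and MSE are in the label's units (MSE in squared units), so lower is better.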
# Show predictions
predictions.select("Independent Features", "total_bill", "prediction").show()
# Print performance metrics
print(f"R²: {r2}")
print(f"Mean Absolute Error: {mae}")
print(f"Mean Squared Error: {mse}")
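Since matplotlib and pandas are already imported, you can also plot predicted against actual values. A sketch, assuming you convert the Spark DataFrame with `predictions.toPandas()`; the small DataFrame below is a hypothetical stand-in for that result so the snippet runs on its own:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; Databricks renders figures inline
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in for: pdf = predictions.select("total_bill", "prediction").toPandas()
pdf = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
    "prediction": [15.20, 11.10, 19.80, 22.50, 25.30],
})

plt.figure(figsize=(6, 6))
plt.scatter(pdf["total_bill"], pdf["prediction"])
lims = [pdf["total_bill"].min(), pdf["total_bill"].max()]
plt.plot(lims, lims, linestyle="--")  # perfect-prediction reference line
plt.xlabel("Actual total_bill")
plt.ylabel("Predicted total_bill")
plt.title("Predicted vs. actual")
plt.savefig("predicted_vs_actual.png")
```

Points close to the dashed line indicate accurate predictions; systematic deviation above or below it suggests bias in the model.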
Congratulations! You have successfully set up a Databricks environment, uploaded data, and performed machine learning analysis using PySpark. You have learned how to preprocess data, build a linear regression model, and evaluate its performance.
For more in-depth tutorials and articles on Apache Spark, PySpark, and big data analytics, subscribe to our updates.
Feel free to reach out if you have any questions or need further assistance. Happy coding!