2024-03-24

What we are Solving?

  • Acquiring thorough profit insights is critical in the current competitive online shopping platform scenario. These insights are crucial for developing successful sales tactics and offers, which in turn lead to higher sales and more profits.

  • We built a set of functions using linear regression, Elastic Net model, neural network model and Random Forest model on top of Amazon sales data. With the use of these features, we can identify weekly profit trends depending on consumer actions. Our ability to accurately and precisely improve sales performance can be enhanced by utilizing this analytical method.

Question: Can we develop a profit function using the provided sales data that offers insights into profitability, considering various aspects such as purchase day, repeat purchases, and product categories?

Libraries

Four distinct models were utilized in the search for the best function, hence a large number of libraries were required.

suppressWarnings({
library(ggplot2)
library(dplyr)
library(tidyr)
library(readxl)
library(nnet)
library(modelr)
library(tidyverse)
library(glmnet)
library(randomForest)
library(knitr)
library(caret)
library(kableExtra)
})

Importing Data

We used read_excel to import the data and altered the column names for more practical purposes..

sales.data <- read_excel("Amazon_2_Raw.xlsx")
colnames(sales.data) <- c("order_id","order_date","ship_date","email","geography","category","product_name",
                          "sales","quantity","profit")

Data Preprocessing and Feautre Engineering

It is necessary to organize the data in the right format because there are numerous figures and information that were separated.

sales.data$order_date <- as.Date(sales.data$order_date)
sales.data$ship_date <- as.Date(sales.data$ship_date)
sales.data$year <- format(sales.data$order_date, "%Y")
sales.data$month <- format(sales.data$order_date, "%m")
sales.data$day_of_week <- weekdays(sales.data$order_date)
sales.data$shipping_duration <- as.numeric(difftime(sales.data$ship_date, sales.data$order_date,
                                                    units = "days"))
sales.data$year <- as.factor(sales.data$year)
sales.data$month <- as.factor(sales.data$month)
sales.data$day_of_week <- as.factor(sales.data$day_of_week)
sales.data$category <- as.factor(sales.data$category)
sales.data$sales_normalized <- scale(sales.data$sales)
sales.data$quantity_normalized <- scale(sales.data$quantity)
sales.data$shipping_duration_normalized <- scale(sales.data$shipping_duration)
sales.data$order_day <- weekdays(as.Date(sales.data$order_date))
sales.data$repeat_purchases <- ave(sales.data$order_id, sales.data$email, FUN = length)
sales.data$repeat_purchases <- as.numeric(as.character(sales.data$repeat_purchases))
sales.data$repeat_purchases_normalized <- scale(sales.data$repeat_purchases)
sales.data$State <- sapply(strsplit(as.character(sales.data$geography), ","), function(x) x[3])

Customer Purchase Frequency

We want to analyze customer purchasing behavior over time, specifically identifying peak periods of purchases and lows throughout the year. I utilized a line graph plot for easy visualization of the peaks and troughs in customer purchasing patterns.

Total sales with respect to categories

  • Varied Sales Volume: The bar chart demonstrates significant variation in total sales among categories, with ‘Chairs’ and ‘Phones’ leading in sales volume.
  • Focus Areas for Growth: Categories such as ‘Fasteners’, ‘Envelopes’, and ‘Labels’ show lower sales, potentially indicating areas for strategic sales initiatives or product development.

Total profit with respect to categories

  • Profit Leaders: ‘Copiers’ and ‘Accessories’ categories show the highest total profit, indicating strong performance and possibly higher margins.
  • Areas for Improvement: ‘Bookcases’ and ‘Machines’ represent the lowest profit categories, suggesting a need for a review of pricing strategies, cost control, or sales tactics.

Time Series Analysis

Long term sales have increased but short-term behavior has been volatile. Profit, however, has been steady with a spike in early 2014. This data shows how Amazon has performed over time and can be used to predict its future value.

Geographical Analysis

After geographical analysis, California orders the most Amazon products by far. Washington is second, with the other states ordering similar amounts. This data could help Amazon build more facilities in California, where product demand is higher.

Product Performance Analysis

The dual-axis plot above shows sales and profit by product category, with bars representing sales and lines with markers representing profit. This visualization shows how sales translate into profit across product categories. From the plot, we can see each category’s sales and profit, which may help us decide which product categories to focus on or expand.

Linear Regression

  • The linear regression model uses category, repeat_purchases, and quantity as predictors to estimate profit.
  • The Mean Absolute Error (MAE) is 47.39, indicating that the model’s predictions are, on average, $47.39 away from the actual profit values.
  • The Mean Squared Error (MSE) is reported at 25325.26, suggesting that the model’s predictions are quite variable and may be influenced by outliers or extreme values.
  • The R-squared value of the model is 0.1643, which means that about 16.43% of the variation in profit is explained by the model. This is a relatively low value, implying that the model might not be capturing all the factors that influence profit.
Regression Coefficients
Estimate Std. Error t value Pr(>&#124;t&#124;)
(Intercept) 36.5942481 12.16953 3.0070395 0.0026587
categoryAppliances -0.2338952 16.95187 -0.0137976 0.9889923
categoryArt -52.2662349 14.19648 -3.6816324 0.0002356
categoryBinders -29.4721954 12.38528 -2.3796157 0.0173893
categoryBookcases -83.3536152 20.46702 -4.0725814 0.0000476

Regression Coefficients (Part 2)
Estimate Std. Error t value Pr(>&#124;t&#124;)
categoryChairs -41.2227905 14.9351472 -2.7601195 0.0058111
categoryCopiers 712.2774064 33.5139968 21.2531323 0.0000000
categoryEnvelopes -30.6072988 21.9446809 -1.3947480 0.1631892
categoryFasteners -57.4140393 21.3177668 -2.6932483 0.0071130
categoryFurnishings -37.8312663 13.5376149 -2.7945296 0.0052286
categoryLabels -45.1677310 17.8774378 -2.5265215 0.0115679
categoryMachines -77.7972172 27.4784041 -2.8312131 0.0046663
categoryPaper -35.3718518 12.4901703 -2.8319751 0.0046552
categoryPhones -29.9577117 13.8375138 -2.1649635 0.0304645
categoryStorage -30.8623088 13.9770623 -2.2080683 0.0273105
categorySupplies -50.8745020 21.6850309 -2.3460655 0.0190338
categoryTables -52.6158093 17.8954235 -2.9401824 0.0033038
repeat_purchases -0.3366328 0.6724205 -0.5006283 0.6166673
quantity 7.4423722 1.2536158 5.9367251 0.0000000

Interpretation

  • Pattern in residuals: The spread of residuals increases with fitted values, suggesting non-constant variance.
  • Presence of outliers: Specific data points, such as 2664, stand out with high residuals.

Normal Q-Q Plot

  • Distribution issues: The tail of the plot curves upwards, thus not a normal distribution.
  • Influential observations: Points like 2664 stray from the expected line, possibly exerting undue influence on the model.

Neural Network with k fold

Neural Network Training Introduction

  • Goal: Predict target using a neural network model.
  • Data: Features prepared from sales.data.
  • Method: 5-fold cross-validation to evaluate model performance.

Data Preparation and Model Configuration

  • Data matrix: data_for_nn prepared with predictors.
  • Target variable: profit from sales.data.
  • Neural network setup: nnet with size=5, decay=0.000005, max iterations=3000.

Neural Network Training Summary

Training Output Summary for Each Fold

  • Fold 1: Initial value: 44,034,439.54 | Final value: 35,729,033.63
  • Fold 2: Initial value: 88,105,695.11 | Final value: 58,442,771.49
  • Fold 3: Initial value: 93,857,949.85 | Final value: 71,036,755.92
  • Fold 4: Initial value: 94,524,748.27 | Final value: 74,115,030.54
  • Fold 5: Initial value: 82,663,911.40 | Final value: 56,772,514.61

Average Performance Across All Folds

  • Average Mean Absolute Error (MAE) across all folds: 50.21544
  • Average Mean Squared Error (MSE) across all folds: 31,663.92

Regression using Elastic Net model

Elastic Net Model Overview

  • Hybrid Model: Combines Lasso and Ridge attributes with alpha at 0.5.
  • Average Error: MAE of $39.38 shows the model’s close-fit predictions.

Model Performance

  • Precision: An MSE of 13,956.61 reflects a robust predictive accuracy.
  • Explained Variance: R-squared value at 53.94%, a substantial fit indicating the model’s effectiveness.

Actual vs Predicted Plot (Elastic Net model)

Prediction Alignment: “Actual vs. Predicted” visuals show many predictions closely match actual profits.

Residual Plot (Elastic Net model)

Residual Trends: “Residual Plot” reveals patterns suggesting the influence of outliers on prediction accuracy.

Random Forest

  • High Variance Explained: With an R-squared of 85.5%, the model accounts for a significant portion of profit variance.
  • Model Complexity: Utilizes 500 trees to achieve strong predictive accuracy and fit.
## Random Forest - Mean Absolute Error (MAE): 13.73329
## Random Forest - Mean Squared Error (MSE): 4395.276
## R-squared: 0.854963

Actual vs Predicted Plot (Random Forest)

The “Actual vs. Predicted Profits (Random Forest)” plot illustrates a tight clustering of points along the red line that indicates perfect predictions, demonstrating the model’s accuracy in predicting profits.

Residual Plot (Random Forest)

In the “Residual Plot (Random Forest)”, most residuals are distributed close to the horizontal line at 0, with fewer large residuals compared to previous models, suggesting that the Random Forest model has a consistent prediction quality across the range of actual profits.

Strategies for Profit Maximization

  • Leverage Predictive Insights: Utilize the Random Forest model’s feature importance to prioritize variables like category, quantity, and sales_normalized that are most predictive of profit.

  • Optimize Product and Sales Mix: Focus on high-profit categories and optimize inventory levels using insights from sales_normalized and quantity_normalized to meet demand without overstocking.

  • Refine Temporal Strategies: Analyze patterns from year, month, and day_of_week to capitalize on peak shopping times and tailor marketing efforts.

  • Efficient Logistics: Streamline the shipping process based on shipping_duration_normalized to improve customer satisfaction and reduce costs, thereby enhancing profit margins.