What is a Recommender System?

A recommender system is a software engine that suggests products and services to customers. These systems can generate recommendations in several ways, but the most common is to analyze each customer's past behavior: previous purchases, positive and negative reviews, items saved or added to lists, views, and more.

So why do businesses such as Amazon and Netflix spend small fortunes building and improving these systems? Because recommender systems boost sales significantly. By acting as each customer’s personal sales team, recommender systems provide each user with a unique and personalized experience. These systems can help customers identify their favorite movies, books, shows, articles, and more without having to parse through the millions of choices.

TensorFlow and TensorFlow Recommenders

Built on top of the well-known TensorFlow framework, TensorFlow Recommenders (TFRS) is a library created specifically for building recommender system models. In addition to being moderately easy to learn, TFRS helps with the entire process of building a recommender system, from data preparation and model training to evaluation and deployment.

In this short tutorial, we will use TensorFlow Recommenders to build our recommendation model. We’ll also explain the structure of the model, with a brief explanation of each step in the code.

The Goal of Our Recommender System

We will build a recommendation system that takes in a list of electronic products (provided by Amazon) along with a list of ratings of each item given by different users as the input data. The system will provide further suggestions for similar products with the highest ratings that each customer might be interested in buying.

The data set used in this tutorial is available at the following link: Amazon Product Review Data set. It contains four columns: the user ID, the product ID, the product rating, and the timestamp of each rating. Using these values, we will build the required recommender system.

TensorFlow Recommenders Tutorial

Note that you can test your code on Kaggle or Google Colab.

Step 1: Importing the Required Libraries

import numpy as np
import pandas as pd
from datetime import datetime,timedelta

We will also import the plotting libraries. These are libraries that allow us to draw and plot figures, graphs, etc. These figures will mostly be used for explanation purposes and will have no effect on the final result of our model whatsoever.

import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.ticker import MaxNLocator
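# The two learntools imports below are Kaggle helpers and are not called in the code shown in this tutorial.
from learntools.time_series.utils import plot_periodogram, seasonal_plot
from learntools.time_series.style import *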

import seaborn as sns
from IPython.display import Markdown, display
def printmd(string):
	display(Markdown(string))
from pathlib import Path
comp_dir = Path('../input/amazon-product-reviews')

Step 2: Importing the Amazon Reviews Data Set

In this step, we will import and read the electronic rating data set from Amazon.

electronics_data=pd.read_csv(comp_dir / "ratings_Electronics (1).csv", dtype={'rating': 'int8'},
names=['userId', 'productId','rating','timestamp'], index_col=None, header=0)
#electronics_data.drop("timestamp",axis=1, inplace=True)

Step 3: Printing the Data Set Info

In this part, we will print some information about the given dataset, such as the number of rows (ratings) available, the columns used (userID, productID, rating, and timestamp), the number of unique users, and finally the number of unique products.

printmd("Number of Rating: {:,}".format(electronics_data.shape[0]) )
printmd("Columns: {}".format( np.array2string(electronics_data.columns.values)) )
printmd("Number of Users: {:,}".format(len(electronics_data.userId.unique()) ) )
printmd("Number of Products: {:,}".format(len(electronics_data.productId.unique()) ) )
electronics_data.describe()['rating'].reset_index()

Output:

Number of Rating: 7,824,481
Columns: ['userId' 'productId' 'rating' 'timestamp']
Number of Users: 4,201,696
Number of Products: 476,001

Step 4: Checking for Missing Values

printmd('**Number of missing values**:')
pd.DataFrame(electronics_data.isnull().sum().reset_index()).rename( columns={0:"Total missing","index":"Columns"})

Luckily for us, this data set contains no missing values. We check for them because any missing values would have to be removed or filled in first; left in place, they would negatively affect the accuracy of our final model.
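Had the check reported missing values, one simple option would be to drop the affected rows. The sketch below shows this on a made-up three-row frame; the data is hypothetical and only the column names mirror the tutorial's data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; only the column names mirror the tutorial's data set.
toy = pd.DataFrame({
    "userId": ["A1", "A2", None],
    "productId": ["P1", "P2", "P3"],
    "rating": [5.0, np.nan, 3.0],
    "timestamp": [1365811200, 1341100800, 1367193600],
})

# Count the missing values in each column, as in the step above.
missing_per_column = toy.isnull().sum()

# Drop every row that has at least one missing value.
toy_clean = toy.dropna().reset_index(drop=True)

print(missing_per_column)
print(toy_clean)
```

Dropping rows is only one choice. Whether it is appropriate depends on how many rows are affected.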

Step 5: Printing the Number of Ratings Per Day

data_by_date = electronics_data.copy()
data_by_date.timestamp = pd.to_datetime(electronics_data.timestamp, unit="s")#.dt.date
data_by_date = data_by_date.sort_values(by="timestamp", ascending=False).reset_index(drop=True)
printmd("Number of Ratings each day:")
data_by_date.groupby("timestamp")["rating"].count().tail(10).reset_index()
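Grouping on the raw timestamp column only yields per-day counts when every timestamp falls exactly on midnight. As a hedged sketch, assuming timestamps may also carry a time of day, the values can be floored to the calendar day before grouping. The three timestamps below are hypothetical.

```python
import pandas as pd

# Hypothetical Unix timestamps (seconds): two on 2013-04-13, one on 2013-04-14.
toy = pd.DataFrame({
    "rating": [5, 4, 3],
    "timestamp": [1365811200, 1365854400, 1365897600],
})
toy["timestamp"] = pd.to_datetime(toy["timestamp"], unit="s")

# Floor each timestamp to its calendar day, then count the ratings per day.
ratings_per_day = toy.groupby(toy["timestamp"].dt.floor("D"))["rating"].count()
print(ratings_per_day)
```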

Step 6: Plotting the Number of Ratings Over the Years

data_by_date["year"] = data_by_date.timestamp.dt.year
data_by_date["month"] = data_by_date.timestamp.dt.month
rating_by_year = data_by_date.groupby(["year","month"])["rating"].count().reset_index()
rating_by_year["date"] = pd.to_datetime(rating_by_year["year"].astype("str") +"-"+rating_by_year["month"].astype("str") +"-1")
rating_by_year.plot(x="date", y="rating")
plt.title("Number of Rating over years")
plt.show()

Figure: Number of ratings over the years

Step 7: Sorting the Products by Number of Ratings

#rating_by_product = electronics_data.groupby(by='productId')['Rating'].count().sort_values(ascending=False).reset_index()
rating_by_product = electronics_data.groupby("productId").agg({"userId":"count","rating":"mean"}).rename(columns={"userId":"Number of Ratings", "rating":"Average Rating"}).reset_index()

We then set the cutoff to 50. Only products with more ratings than the cutoff are kept, so products with 50 or fewer ratings are excluded from the sorted list.

cutoff = 50
top_rated = rating_by_product.loc[rating_by_product["Number of Ratings"]>cutoff].sort_values(by="Average Rating",ascending=False).reset_index(drop=True)
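To see what the aggregation and cutoff do, here is the same pattern on a tiny made-up data set. The six rows and the cutoff of 1 are hypothetical values chosen so the result is easy to check by hand.

```python
import pandas as pd

# Hypothetical ratings: P1 has 3 ratings, P2 has 2, and P3 has 1.
toy = pd.DataFrame({
    "productId": ["P1", "P1", "P1", "P2", "P2", "P3"],
    "userId": ["A", "B", "C", "A", "B", "A"],
    "rating": [5, 4, 3, 5, 5, 1],
})

# One row per product: how many ratings it has and their mean.
rating_by_product = (
    toy.groupby("productId")
    .agg({"userId": "count", "rating": "mean"})
    .rename(columns={"userId": "Number of Ratings", "rating": "Average Rating"})
    .reset_index()
)

cutoff = 1  # toy value; the tutorial uses 50
top_rated = (
    rating_by_product.loc[rating_by_product["Number of Ratings"] > cutoff]
    .sort_values(by="Average Rating", ascending=False)
    .reset_index(drop=True)
)
print(top_rated)
```

P3 is dropped because it has only one rating, and P2 ranks above P1 because its average rating is higher.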

Step 8: Installing TensorFlow Recommenders (TFRS)

To start with, we will upgrade pip to the latest available version. After that, we will install the TensorFlow Recommenders library using pip. The Python path in the first command is specific to the notebook environment used here, so adjust it if your interpreter lives elsewhere.

!/opt/conda/bin/python3.7 -m pip install --upgrade pip
!pip install -q tensorflow-recommenders

Step 9: Importing the TensorFlow and TensorFlow Recommenders Libraries

We will import the two libraries that provide the base classes for the two neural network models used in this tutorial.

import tensorflow as tf
import tensorflow_recommenders as tfrs

Step 10: Importing Our Models

RankingModel is a class that inherits from tf.keras.Model, the base model class of the TensorFlow library. Inside it, we set embedding_dimension to 32. This hyperparameter of the embedding layer specifies the size of each embedding vector.

Next, inside the RankingModel we build embeddings for the users and the products. Because our model can’t work with raw text, the string IDs first have to be converted into numbers that it can process.

This happens in two stages. A StringLookup layer maps every user or product ID to an integer index, and an Embedding layer then maps each index to a vector of 32 numbers. These vectors are learned during training so that they capture whatever about each user and product helps predict ratings.

Finally, for the ratings, we stack 3 dense (fully connected) layers that perform regression. The first layer contains 256 neurons, the second 64, and the last only 1, so the model outputs a single number for each user and product pair: the predicted rating.
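The lookup-then-embed idea can be sketched without TensorFlow. The NumPy code below is only a stand-in for the StringLookup and Embedding layers, not their actual implementation; the vocabulary, the helper names lookup and embed, and the dimension of 4 are all hypothetical.

```python
import numpy as np

# Hypothetical vocabulary of known user IDs.
vocabulary = ["A1", "A2", "A3"]
embedding_dimension = 4  # toy size; the tutorial uses 32

# Index 0 is reserved for IDs that are not in the vocabulary,
# so the known IDs start at index 1.
id_to_index = {user_id: i + 1 for i, user_id in enumerate(vocabulary)}

def lookup(user_id):
    """Return the integer index for an ID, or 0 for an unknown ID."""
    return id_to_index.get(user_id, 0)

# One row per index: len(vocabulary) + 1 rows, matching the "+1" in the model.
# Random values stand in for weights that a real model learns during training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocabulary) + 1, embedding_dimension))

def embed(user_id):
    """Look up the embedding vector (one table row) for an ID."""
    return embedding_table[lookup(user_id)]

print(lookup("A2"), lookup("never-seen"))
print(embed("A2").shape)
```

The extra row is what lets the model handle a user or product ID it never saw during training instead of failing on it.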

class RankingModel(tf.keras.Model):
	def __init__(self):
		super().__init__()
		embedding_dimension = 32

		# Map each user ID string to an integer index, then to an embedding vector.
		# In TensorFlow 2.6 and later, this layer is also available as tf.keras.layers.StringLookup.
		self.user_embeddings = tf.keras.Sequential([
			tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=unique_userIds, mask_token=None),
			# add additional embedding to account for unknown tokens
			tf.keras.layers.Embedding(len(unique_userIds)+1, embedding_dimension)
		])

		# Same two stages for the product IDs.
		self.product_embeddings = tf.keras.Sequential([
			tf.keras.layers.experimental.preprocessing.StringLookup(vocabulary=unique_productIds, mask_token=None),
			# add additional embedding to account for unknown tokens
			tf.keras.layers.Embedding(len(unique_productIds)+1, embedding_dimension)
		])

		# Dense layers that turn the two embeddings into one predicted rating.
		self.ratings = tf.keras.Sequential([
			tf.keras.layers.Dense(256, activation="relu"),
			tf.keras.layers.Dense(64, activation="relu"),
			tf.keras.layers.Dense(1)
		])

	def call(self, userId, productId):
		user_embeddings = self.user_embeddings(userId)
		product_embeddings = self.product_embeddings(productId)
		# Concatenate both embeddings and predict a single rating per pair.
		return self.ratings(tf.concat([user_embeddings, product_embeddings], axis=1))
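To make the shapes in the forward pass concrete, here is a NumPy sketch of the same concatenate-then-dense structure. It uses random, untrained weights and a hypothetical dense helper, so it illustrates only the data flow and is not the Keras implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, embedding_dimension = 3, 32

# Stand-ins for the user and product embeddings of a batch of 3 pairs.
user_vectors = rng.normal(size=(batch_size, embedding_dimension))
product_vectors = rng.normal(size=(batch_size, embedding_dimension))

def dense(inputs, units, activation=None):
    """A dense layer with random (untrained) weights and no bias term."""
    weights = rng.normal(scale=0.1, size=(inputs.shape[1], units))
    outputs = inputs @ weights
    return np.maximum(outputs, 0.0) if activation == "relu" else outputs

# Concatenate along the feature axis: (3, 32) and (3, 32) become (3, 64).
features = np.concatenate([user_vectors, product_vectors], axis=1)
hidden = dense(dense(features, 256, "relu"), 64, "relu")
predicted_ratings = dense(hidden, 1)  # one unbounded number per pair

print(features.shape, predicted_ratings.shape)
```

Because the last layer has no activation, its output is an unbounded real number, which is what a regression on ratings needs.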

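The amazonModel, which is the final model used, inherits from tfrs.models.Model, the base model class of the TFRS library. It creates an instance of the ranking model built in the previous step and defines a ranking task with a mean squared error loss and a root mean squared error (RMSE) metric.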

class amazonModel(tfrs.models.Model):
	def __init__(self):
		super().__init__()
		self.ranking_model: tf.keras.Model = RankingModel()
		# The ranking task bundles the loss (MSE) with the evaluation metric (RMSE).
		self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
			loss=tf.keras.losses.MeanSquaredError(),
			metrics=[tf.keras.metrics.RootMeanSquaredError()])

	def compute_loss(self, features, training=False):
		rating_predictions = self.ranking_model(features["userId"], features["productId"])
		# Compare the predicted ratings with the true ratings.
		return self.task(labels=features["rating"], predictions=rating_predictions)
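The loss and the metric used by the ranking task are simple enough to compute by hand. The sketch below does so in plain Python for three hypothetical ratings; it shows the arithmetic only and does not call TensorFlow.

```python
import math

true_ratings = [5.0, 3.0, 4.0]       # hypothetical labels
predicted_ratings = [4.0, 3.0, 2.0]  # hypothetical model outputs

# Mean squared error: the average of the squared differences.
squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
mse = sum(squared_errors) / len(squared_errors)

# Root mean squared error: the square root of the MSE, in rating units.
rmse = math.sqrt(mse)

print(mse, rmse)
```

RMSE is often easier to read than MSE because it is expressed in the same units as the ratings themselves.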

Step 11: Filtering Out the Dataset

In this step, we will remove any product with fewer than 50 ratings and any rating before the year 2012 from our data set. The reason for filtering out these values is that they will weaken the accuracy of the model. In the case of products with fewer than 50 ratings, there are not enough ratings for the model to correctly judge the product.

We filter out all ratings made before 2012 because older ratings may have a negative effect on the model's accuracy. In addition, removing them shortens the model's running time and saves memory.

cutoff_no_rat = 50 # Keep only products with at least 50 ratings
cutoff_year = 2011 # Keep only ratings made after 2011
recent_data = data_by_date.loc[data_by_date["year"] > cutoff_year]
print("Number of Rating: {:,}".format(recent_data.shape[0]))
print("Number of Users: {:,}".format(len(recent_data.userId.unique())))
print("Number of Products: {:,}".format(len(recent_data.productId.unique())))
del data_by_date # Free up memory
recent_prod = recent_data.loc[recent_data.groupby("productId")["rating"].transform('count').ge(cutoff_no_rat)].reset_index(drop=True).drop(["timestamp","year","month"],axis=1)
del recent_data # Free up memory

Output:

Number of Rating: 5,566,858
Number of Users: 3,142,438
Number of Products: 382,245
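The product filter relies on groupby(...).transform('count'), which returns one value per row rather than one per group. A minimal sketch on made-up data, with a hypothetical cutoff of 2 in place of 50:

```python
import pandas as pd

# Hypothetical ratings: P1 has 3 ratings, P2 has 1, and P3 has 2.
toy = pd.DataFrame({
    "productId": ["P1", "P1", "P1", "P2", "P3", "P3"],
    "rating": [5, 4, 3, 5, 2, 1],
})
cutoff_no_rat = 2  # toy value; the tutorial uses 50

# For every row, the number of ratings its product received.
ratings_per_row = toy.groupby("productId")["rating"].transform("count")

# Keep only the rows whose product has at least cutoff_no_rat ratings.
kept = toy.loc[ratings_per_row.ge(cutoff_no_rat)].reset_index(drop=True)

print(ratings_per_row.tolist())
print(kept)
```

Because the counts line up with the original rows, they can be used directly as a boolean mask without a merge.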

Step 12: Storing the Final Ratings for Our Model

We will save the new values of unique users, products, and ratings.

userIds  = recent_prod.userId.unique()
productIds = recent_prod.productId.unique()
total_ratings= len(recent_prod.index)

We will store the final ratings in a tf.data.Dataset called ratings.

ratings = tf.data.Dataset.from_tensor_slices( {"userId":tf.cast( recent_prod.userId.values ,tf.string),
"productId":tf.cast( recent_prod.productId.values,tf.string),
"rating":tf.cast( recent_prod.rating.values ,tf.int8,) } )

Step 13: Shuffle the Rating Data Set

We will shuffle the ratings and store the result in a new data set called shuffled.

tf.random.set_seed(42)
shuffled = ratings.shuffle(100_000, seed=42, reshuffle_each_iteration=False)
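A buffer-based shuffle only mixes elements that are in the buffer at the same time, so a buffer smaller than the data set gives a local rather than a full shuffle. The pure-Python function below is a simplified model of that behavior, not TensorFlow's implementation; buffered_shuffle and its parameters are hypothetical names.

```python
import random

def buffered_shuffle(items, buffer_size, seed=42):
    """Yield items in shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything.
            buffer.append(item)
            continue
        # Emit a random buffered element and put the new item in its slot.
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # The input is exhausted: emit what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

# 1,000 sorted items shuffled with a buffer of only 10.
shuffled_small = list(buffered_shuffle(range(1000), buffer_size=10))
print(shuffled_small[:10])
```

In this model no element can appear more than buffer_size positions earlier than where it started. This matters here because the ratings were sorted by timestamp before shuffling, and the buffer of 100,000 is much smaller than the number of ratings.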

Step 14: Splitting the Data Set

We will split the shuffled data set, storing the first 80 percent of it in train and the remaining 20 percent in test.

train = shuffled.take( int(total_ratings*0.8) )
test = shuffled.skip(int(total_ratings*0.8)).take(int(total_ratings*0.2))
unique_productIds = productIds
unique_userIds  = userIds
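The take and skip calls behave like list slicing, which makes the split easy to check on a small example. The sketch below uses a plain Python list of 7 hypothetical items as a stand-in for the data set.

```python
total_ratings = 7  # hypothetical tiny data set
shuffled = list(range(total_ratings))

train_size = int(total_ratings * 0.8)  # int() truncates 5.6 to 5
test_size = int(total_ratings * 0.2)   # int() truncates 1.4 to 1

# Equivalent of shuffled.take(train_size)
train = shuffled[:train_size]
# Equivalent of shuffled.skip(train_size).take(test_size)
test = shuffled[train_size:train_size + test_size]

unused = total_ratings - len(train) - len(test)
print(len(train), len(test), unused)
```

Because both sizes are truncated independently, a record can end up in neither split, as happens to one record here. On millions of ratings this is negligible.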

Step 15: Running Our Final Model

model = amazonModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad( learning_rate=0.1 ))
cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()
model.fit(cached_train, epochs=10)

Step 16: Evaluating the Model

In the evaluation step, we first compute the RMSE on the test set. We then pick a user (the one at index 123 of userIds), score five products taken from the test set for that user, and print them from the highest to the lowest predicted rating.

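model.evaluate(cached_test, return_dict=True)
user_rand = userIds[123]
test_rating = {}
for m in test.take(5):
	# Score each candidate with the trained ranking model, not a new, untrained RankingModel().
	score = model.ranking_model(tf.convert_to_tensor([user_rand]), tf.convert_to_tensor([m["productId"]]))
	test_rating[m["productId"].numpy()] = float(tf.squeeze(score))
print("Top 5 recommended products for User {}: ".format(user_rand))
for m in sorted(test_rating, key=test_rating.get, reverse=True):
	print(m.decode())

Example output (the exact products and their order depend on the trained weights):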

Top 5 recommended products for User A32PYU1S3Y7QFY:
B002FFG6JC
B004ABO7QI
B006YW3DI4
B0012YJQWQ
B006ZBWV0K
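The final ranking step is just a sort of the candidates by predicted rating. A minimal sketch with hypothetical scores shows how to keep only the top k:

```python
# Hypothetical predicted ratings for five candidate products.
predicted = {"P1": 3.9, "P2": 4.6, "P3": 2.1, "P4": 4.8, "P5": 4.1}
top_k = 3

# Sort the product IDs by predicted rating, highest first, and keep the top k.
ranked = sorted(predicted, key=predicted.get, reverse=True)[:top_k]
print(ranked)
```

In a real system the candidate set would usually be much larger than five products, with the top few shown to the user.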

To check the original code for the example used in this tutorial, please check the following link: Tensorflow Recommenders: Amazon Review Dataset.

Final Notes on Building a Recommender System Using the TFRS Library

A recommender system is a tool that helps businesses such as Amazon and Netflix recommend products or services to customers based on their previous purchasing patterns and other data. These systems can greatly increase sales by providing a personalized experience for each user, helping them discover their favorite items without having to search through a vast selection.

TensorFlow, together with TensorFlow Recommenders (TFRS), makes it straightforward to develop a recommender system. With this tutorial and our brief explanation of the model's structure, you hopefully have a better grasp of how to tackle the development of your own recommender system.
