# 第二章：推荐系统入门

• 推荐系统工作原理
• 社会化协同过滤工作原理
• 如何找到相似物品
• 曼哈顿距离
• 欧几里得距离
• 闵可夫斯基距离
• 皮尔逊相关系数
• 余弦相似度
• 使用Python实现K最邻近算法
• 图书漂流站（BookCrossing）数据集

## 你喜欢的东西我也喜欢

last.fm上对音乐和演唱会的推荐（相似歌手）：

### 推广：闵可夫斯基距离

• `r = 1` 该公式即曼哈顿距离
• `r = 2` 该公式即欧几里得距离
• `r = ∞` 极大距离

## 使用Python代码来表示数据（终于要开始编程了）

``````users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0, "Norah Jones": 4.5, "Phoenix": 5.0, "Slightly Stoopid": 1.5, "The Strokes": 2.5, "Vampire Weekend": 2.0},
"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5, "Deadmau5": 4.0, "Phoenix": 2.0, "Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},
"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0, "Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5, "Slightly Stoopid": 1.0},
"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0, "Deadmau5": 4.5, "Phoenix": 3.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 2.0},
"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0, "Norah Jones": 4.0, "The Strokes": 4.0, "Vampire Weekend": 1.0},
"Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0, "Norah Jones": 5.0, "Phoenix": 5.0, "Slightly Stoopid": 4.5, "The Strokes": 4.0, "Vampire Weekend": 4.0},
"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0, "Norah Jones": 3.0, "Phoenix": 5.0, "Slightly Stoopid": 4.0, "The Strokes": 5.0},
"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
}
``````

``````>>> users["Veronica"]
{"Blues Traveler": 3.0, "Norah Jones": 5.0, "Phoenix": 4.0, "Slightly Stoopid": 2.5, "The Strokes": 3.0}
>>>
``````

### 计算曼哈顿距离

``````def manhattan(rating1, rating2):
"""计算曼哈顿距离。rating1和rating2参数中存储的数据格式均为
{'The Strokes': 3.0, 'Slightly Stoopid': 2.5}"""
distance = 0
for key in rating1:
if key in rating2:
distance += abs(rating1[key] - rating2[key])
return distance
``````

``````>>> manhattan(users['Hailey'], users['Veronica'])
2.0
>>> manhattan(users['Hailey'], users['Jordyn'])
7.5
>>>
``````

``````def computeNearestNeighbor(username, users):
distances = []
for user in users:
distances.append((distance, user))
# 按距离排序——距离近的排在前面
distances.sort()
return distances
``````

``````>>> computeNearestNeighbor("Hailey", users)
[(2.0, 'Veronica'), (4.0, 'Chan'), (4.0, 'Sam'), (4.5, 'Dan'), (5.0, 'Angelica'), (5.5, 'Bill'), (7.5, 'Jordyn')]
``````

``````def recommend(username, users):
"""返回推荐结果列表"""
# 找到距离最近的用户
recommendations = []
# 找出这位用户评价过、但自己未曾评价的乐队
neighborRatings = users[nearest]
for artist in neighborRatings:
if not artist in userRatings:
recommendations.append((artist, neighborRatings[artist]))
# 按照评分进行排序
return sorted(recommendations, key=lambda artistTuple: artistTuple[1], reverse = True)
``````

``````>>> recommend('Hailey', users)
[('Phoenix', 4.0), ('Blues Traveler', 3.0), ('Slightly Stoopid', 2.5)]
``````

``````>>> recommend('Chan', users)
[('The Strokes', 4.0), ('Vampire Weekend', 1.0)]
>>> recommend('Sam', users)
``````

``````>>> recommend('Angelica', users)
[]
``````

``````>>> computeNearestNeighbor('Angelica', users)
[(3.5, 'Veronica'), (4.5, 'Chan'), (5.0, 'Hailey'), (8.0, 'Sam'), (9.0, 'Bill'), (9.0, 'Dan'), (9.5, 'Jordyn')]
``````

Angelica最相似的用户是Veronica，让我们回头看看数据：

``````def minkowski(rating1, rating2, r):
distance = 0
for key in rating1:
if key in rating2:
distance += pow(abs(rating1[key] - rating2[key]), r)
return pow(distance, 1.0 / r)

# 修改computeNearestNeighbor函数中的一行
# 这里2表示使用欧几里得距离
``````

### 用户的问题

• Bill没有打出极端的分数，都在2至4分之间；
• Jordyn似乎喜欢所有的乐队，打分都在4至5之间；
• Hailey是一个有趣的人，他的分数不是1就是4。

• 左：我非常喜欢Broken Bells乐队，所以我给他们打4分！
• 右：Broken Bells乐队还可以，我打4分。

### 皮尔逊相关系数

Clara的总评分是22.5， Robert是15，他们评价了5支乐队，因此：

``````from math import sqrt

def pearson(rating1, rating2):
sum_xy = 0
sum_x = 0
sum_y = 0
sum_x2 = 0
sum_y2 = 0
n = 0
for key in rating1:
if key in rating2:
n += 1
x = rating1[key]
y = rating2[key]
sum_xy += x * y
sum_x += x
sum_y += y
sum_x2 += pow(x, 2)
sum_y2 += pow(y, 2)
# 计算分母
denominator = sqrt(sum_x2 - pow(sum_x, 2) / n) * sqrt(sum_y2 - pow(sum_y, 2) / n)
if denominator == 0:
return 0
else:
return (sum_xy - (sum_x * sum_y) / n) / denominator
``````

``````>>> pearson(users['Angelica'], users['Bill'])
-0.9040534990682699
>>> pearson(users['Angelica'], users['Hailey'])
0.42008402520840293
>>> pearson(users['Angelica'], users['Jordyn'])
0.7639748605475432
``````

## 最后一个公式：余弦相似度

### 应该使用哪种相似度？

• 如果数据存在“分数膨胀”问题，就使用皮尔逊相关系数。
• 如果数据比较“密集”，变量之间基本都存在公有值，且这些距离数据是非常重要的，那就使用欧几里得或曼哈顿距离。
• 如果数据是稀疏的，则使用余弦相似度。

• Jake（左）：乡村音乐的忠实听众。
• Linda和Eric（右）：我们爱六十年代的摇滚乐！

Linda和Eric喜欢相同的音乐，他们的评分列表中有20首相同的的歌曲，且评分均值相差不到0.5！所以他们之间的曼哈顿距离为20 x 0.5 = 10，欧几里得距离则为：

Linda和Jake只共同评分了一首歌曲：Chris Cagle的 What a Beautiful Day 。Linda打了3分，Jake打了5分，所以他们之间的曼哈顿距离为2，欧几里得距离为：

Cooper评价了26首歌曲，其中25首和Jake是一样的。他们对每首歌曲的评价差值只有0.25！

Kelsey在我们网站上评价了150首歌曲，其中25首和Jake相同。和Cooper一样，她和Jake之间的评价差值也只有0.25！

## Python推荐模块

``````import codecs
from math import sqrt

users = {"Angelica": {"Blues Traveler": 3.5, "Broken Bells": 2.0,
"Norah Jones": 4.5, "Phoenix": 5.0,
"Slightly Stoopid": 1.5,
"The Strokes": 2.5, "Vampire Weekend": 2.0},

"Bill":{"Blues Traveler": 2.0, "Broken Bells": 3.5,
"Slightly Stoopid": 3.5, "Vampire Weekend": 3.0},

"Chan": {"Blues Traveler": 5.0, "Broken Bells": 1.0,
"Deadmau5": 1.0, "Norah Jones": 3.0, "Phoenix": 5,
"Slightly Stoopid": 1.0},

"Dan": {"Blues Traveler": 3.0, "Broken Bells": 4.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 2.0},

"Hailey": {"Broken Bells": 4.0, "Deadmau5": 1.0,
"Norah Jones": 4.0, "The Strokes": 4.0,
"Vampire Weekend": 1.0},

"Jordyn":  {"Broken Bells": 4.5, "Deadmau5": 4.0,
"Norah Jones": 5.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.5, "The Strokes": 4.0,
"Vampire Weekend": 4.0},

"Sam": {"Blues Traveler": 5.0, "Broken Bells": 2.0,
"Norah Jones": 3.0, "Phoenix": 5.0,
"Slightly Stoopid": 4.0, "The Strokes": 5.0},

"Veronica": {"Blues Traveler": 3.0, "Norah Jones": 5.0,
"Phoenix": 4.0, "Slightly Stoopid": 2.5,
"The Strokes": 3.0}
}

class recommender:

def __init__(self, data, k=1, metric='pearson', n=5):
""" 初始化推荐模块
data   训练数据
k      K邻近算法中的值
metric 使用何种距离计算方式
n      推荐结果的数量
"""
self.k = k
self.n = n
self.userid2name = {}
self.productid2name = {}
# 将距离计算方式保存下来
self.metric = metric
if self.metric == 'pearson':
self.fn = self.pearson
#
# 如果data是一个字典类型，则保存下来，否则忽略
#
if type(data).__name__ == 'dict':
self.data = data

def convertProductID2name(self, id):
"""通过产品ID获取名称"""
if id in self.productid2name:
return self.productid2name[id]
else:
return id

def userRatings(self, id, n):
"""返回该用户评分最高的物品"""
print ("Ratings for " + self.userid2name[id])
ratings = self.data[id]
print(len(ratings))
ratings = list(ratings.items())
ratings = [(self.convertProductID2name(k), v)
for (k, v) in ratings]
# 排序并返回结果
ratings.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
ratings = ratings[:n]
for rating in ratings:
print("%s\t%i" % (rating[0], rating[1]))

"""加载BX数据集，path是数据文件位置"""
self.data = {}
i = 0
#
# 将书籍评分数据放入self.data
#
f = codecs.open(path + "BX-Book-Ratings.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
user = fields[0].strip('"')
book = fields[1].strip('"')
rating = int(fields[2].strip().strip('"'))
if user in self.data:
currentRatings = self.data[user]
else:
currentRatings = {}
currentRatings[book] = rating
self.data[user] = currentRatings
f.close()
#
# 将书籍信息存入self.productid2name
# 包括isbn号、书名、作者等
#
f = codecs.open(path + "BX-Books.csv", 'r', 'utf8')
for line in f:
i += 1
#separate line into fields
fields = line.split(';')
isbn = fields[0].strip('"')
title = fields[1].strip('"')
author = fields[2].strip().strip('"')
title = title + ' by ' + author
self.productid2name[isbn] = title
f.close()
#
#
f = codecs.open(path + "BX-Users.csv", 'r', 'utf8')
for line in f:
i += 1
#print(line)
#separate line into fields
fields = line.split(';')
userid = fields[0].strip('"')
location = fields[1].strip('"')
if len(fields) > 3:
age = fields[2].strip().strip('"')
else:
age = 'NULL'
if age != 'NULL':
value = location + '  (age: ' + age + ')'
else:
value = location
self.userid2name[userid] = value
f.close()
print(i)

def pearson(self, rating1, rating2):
sum_xy = 0
sum_x = 0
sum_y = 0
sum_x2 = 0
sum_y2 = 0
n = 0
for key in rating1:
if key in rating2:
n += 1
x = rating1[key]
y = rating2[key]
sum_xy += x * y
sum_x += x
sum_y += y
sum_x2 += pow(x, 2)
sum_y2 += pow(y, 2)
if n == 0:
return 0
# 计算分母
denominator = (sqrt(sum_x2 - pow(sum_x, 2) / n)
* sqrt(sum_y2 - pow(sum_y, 2) / n))
if denominator == 0:
return 0
else:
return (sum_xy - (sum_x * sum_y) / n) / denominator

"""获取邻近用户"""
distances = []
for instance in self.data:
self.data[instance])
distances.append((instance, distance))
# 按距离排序，距离近的排在前面
distances.sort(key=lambda artistTuple: artistTuple[1],
reverse=True)
return distances

def recommend(self, user):
"""返回推荐列表"""
recommendations = {}
# 首先，获取邻近用户
nearest = self.computeNearestNeighbor(user)
#
# 获取用户评价过的商品
#
userRatings = self.data[user]
#
# 计算总距离
totalDistance = 0.0
for i in range(self.k):
totalDistance += nearest[i][1]
# 汇总K邻近用户的评分
for i in range(self.k):
# 计算饼图的每个分片
weight = nearest[i][1] / totalDistance
# 获取用户名称
name = nearest[i][0]
# 获取用户评分
neighborRatings = self.data[name]
# 获得没有评价过的商品
for artist in neighborRatings:
if not artist in userRatings:
if artist not in recommendations:
recommendations[artist] = (neighborRatings[artist]
* weight)
else:
recommendations[artist] = (recommendations[artist]
+ neighborRatings[artist]
* weight)
# 开始推荐
recommendations = list(recommendations.items())
recommendations = [(self.convertProductID2name(k), v)
for (k, v) in recommendations]
# 排序并返回
recommendations.sort(key=lambda artistTuple: artistTuple[1],
reverse = True)
# 返回前n个结果
return recommendations[:self.n]
``````

``````>>> r = recommender(users)
>>> r.recommend('Jordyn')
[('Blues Traveler', 5.0)]
>>> r.recommend('Hailey')
[('Phoenix', 5.0), ('Slightly Stoopid', 4.5)]
``````

### 新的数据集

CSV文件包含了三张表：

• 用户表，包括用户ID、位置、年龄等信息。其中用户的姓名已经隐去；
• 书籍表，包括ISBN号、标题、作者、出版日期、出版社等；
• 评分表，包括用户ID、书籍ISBN号、以及评分（0-10分）。

``````>>> r.loadBookDB('/Users/raz/Downloads/BX-Dump/')
1700018
>>> r.recommend('171118')
``````

### 项目实践

1. 实现一个计算曼哈顿距离和欧几里得距离的方法；
2. 本书的网站上有一个包含25部电影评价的数据集，实现一个推荐算法。