# 3.6 scikit-learn: machine learning in Python

In [5]:

``````%matplotlib inline
import numpy as np
``````

Prerequisites:

- Numpy, Scipy
- IPython
- matplotlib
- scikit-learn (http://scikit-learn.org)

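Chapter contents:

- Loading an example dataset
  - Learning and predicting
- Classification
  - KNN classifier
  - Support vector machines (SVMs) for classification
- Clustering: grouping observations together
  - K-means clustering
- Dimension reduction with principal component analysis
- Putting it all together: face recognition
- Linear model: from regression to sparsity
  - Sparse models
- Model selection: choosing estimators and their parameters
  - Grid-search and cross-validated estimators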

## 3.5.1 Loading an example dataset

In [1]:

``````from sklearn import datasets
iris = datasets.load_iris()
``````

In [2]:

``````iris.data.shape
``````

Out[2]:

``````(150, 4)
``````

In [3]:

``````iris.target.shape
``````

Out[3]:

``````(150,)
``````

In [4]:

``````import numpy as np
np.unique(iris.target)
``````

Out[4]:

``````array([0, 1, 2])
``````
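As a quick orientation, the sketch below inspects the object returned by `load_iris`. The `feature_names` and `target_names` attributes are not used elsewhere in this section; they appear here only to make the meaning of the two arrays concrete.

```python
from sklearn import datasets

iris = datasets.load_iris()

# data: one row per flower, one column per measurement
n_samples, n_features = iris.data.shape
print(n_samples, n_features)

# feature_names describes the columns of iris.data
for name in iris.feature_names:
    print(name)

# target_names maps the integer codes in iris.target to species names
for code, species in enumerate(iris.target_names):
    print(code, species)
```

The integer codes returned by `np.unique(iris.target)` index into `target_names`.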

The digits dataset contains 1797 images. Each one is an 8x8 pixel picture representing a handwritten digit.

In [15]:

``````digits = datasets.load_digits()
digits.images.shape
``````

Out[15]:

``````(1797, 8, 8)
``````

In [8]:

``````import pylab as pl
pl.imshow(digits.images[0], cmap=pl.cm.gray_r)
``````

Out[8]:

``````<matplotlib.image.AxesImage at 0x109abd990>
``````

In [9]:

``````# Flatten each 8x8 image into a feature vector of length 64
data = digits.images.reshape((digits.images.shape[0], -1))
``````

### 3.5.1.1 Learning and predicting

In [11]:

``````from sklearn import svm
clf = svm.LinearSVC()
clf.fit(iris.data, iris.target) # learn from the data
``````

Out[11]:

``````LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
intercept_scaling=1, loss='squared_hinge', max_iter=1000,
multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
verbose=0)
``````

In [12]:

``````clf.predict([[ 5.0,  3.6,  1.3,  0.25]])
``````

Out[12]:

``````array([0])
``````

In [13]:

``````clf.coef_
``````

Out[13]:

``````array([[ 0.18424728,  0.45122657, -0.80794162, -0.45070597],
[ 0.05691797, -0.89245895,  0.39682582, -0.92882381],
[-0.85072494, -0.98678239,  1.38091241,  1.86550868]])
``````
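The fitted coefficients can be tied back to `predict`. The sketch below assumes a one-vs-rest linear model, as the `multi_class='ovr'` entry in the output above suggests: each row of `coef_`, together with the matching entry of `intercept_`, scores one class, and the predicted class is the one with the highest score. The `max_iter` and `random_state` values are choices made for this illustration.

```python
import numpy as np
from sklearn import datasets, svm

iris = datasets.load_iris()
clf = svm.LinearSVC(max_iter=10000, random_state=0)
clf.fit(iris.data, iris.target)

# One linear score per class and per sample: X @ coef_.T + intercept_
scores = iris.data @ clf.coef_.T + clf.intercept_

# The class with the largest score is the prediction
manual_pred = clf.classes_[np.argmax(scores, axis=1)]
same = np.array_equal(manual_pred, clf.predict(iris.data))
print(same)
```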

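## 3.5.2 Classification

### 3.5.2.1 KNN classifier

The k-nearest neighbors classifier internally uses an algorithm based on ball trees to represent the samples it is trained on.

An example of KNN (k-nearest neighbors) classification: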

In [14]:

``````# Create and fit a nearest-neighbor classifier
from sklearn import neighbors
knn = neighbors.KNeighborsClassifier()
knn.fit(iris.data, iris.target)
``````

Out[14]:

``````KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=5, p=2, weights='uniform')
``````

In [15]:

``````knn.predict([[0.1, 0.2, 0.3, 0.4]])
``````

Out[15]:

``````array([0])
``````
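The neighbor-search structure mentioned above is selected through the `algorithm` parameter of `KNeighborsClassifier`. As a sketch, assuming the default Euclidean metric, a ball tree and a brute-force search should find neighbors at the same distances, so the choice affects speed rather than predictions. The query point is an arbitrary illustrative value.

```python
import numpy as np
from sklearn import datasets, neighbors

iris = datasets.load_iris()

knn_tree = neighbors.KNeighborsClassifier(algorithm='ball_tree')
knn_brute = neighbors.KNeighborsClassifier(algorithm='brute')
knn_tree.fit(iris.data, iris.target)
knn_brute.fit(iris.data, iris.target)

query = [[5.0, 3.6, 1.3, 0.25]]

# Distances to the 5 nearest training samples under each search strategy
dist_tree, _ = knn_tree.kneighbors(query)
dist_brute, _ = knn_brute.kneighbors(query)
print(np.allclose(dist_tree, dist_brute, atol=1e-6))
print(knn_tree.predict(query), knn_brute.predict(query))
```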

In [16]:

``````perm = np.random.permutation(iris.target.size)
iris.data = iris.data[perm]
iris.target = iris.target[perm]
knn.fit(iris.data[:100], iris.target[:100])
``````

Out[16]:

``````KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_neighbors=5, p=2, weights='uniform')
``````

In [17]:

``````knn.score(iris.data[100:], iris.target[100:])
``````

Out[17]:

``````0.95999999999999996
``````
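The cells above shuffle the data by hand and hold out the last 50 samples, so that the score is measured on data the classifier never saw during fitting. A roughly equivalent sketch uses `train_test_split` from `sklearn.model_selection`. The `test_size`, `random_state`, and `stratify` settings are choices made for this illustration, not values from the text.

```python
from sklearn import datasets, neighbors
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()

# Hold out 50 samples for testing, keeping class proportions similar
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=50, random_state=0,
    stratify=iris.target)

knn = neighbors.KNeighborsClassifier()
knn.fit(X_train, y_train)

# Accuracy on samples that were not used for fitting
accuracy = knn.score(X_test, y_test)
print(accuracy)
```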

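### 3.5.2.2 Support vector machines (SVMs) for classification

#### 3.5.2.2.1 Linear support vector machines

SVMs try to construct a hyperplane that maximizes the margin between the two classes. To do so, they select a subset of the input, called the support vectors, which are the observations closest to the separating hyperplane.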

In [18]:

``````from sklearn import svm
svc = svm.SVC(kernel='linear')
svc.fit(iris.data, iris.target)
``````

Out[18]:

``````SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
``````
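The support vectors described above can be inspected on a fitted `SVC`. This is a minimal sketch using the `support_vectors_`, `n_support_`, and `support_` attributes. It refits the classifier on a fresh copy of the iris data so that it runs on its own.

```python
from sklearn import datasets, svm

iris = datasets.load_iris()
svc = svm.SVC(kernel='linear')
svc.fit(iris.data, iris.target)

# Rows of the training data that were kept as support vectors
print(svc.support_vectors_.shape)

# Number of support vectors per class
print(svc.n_support_)

# Indices of the first few support vectors within the training set
print(svc.support_[:5])
```

Only a subset of the 150 training samples ends up as support vectors; the remaining samples do not influence the fitted decision function.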

#### 3.5.2.2.2 Using kernels

In [19]:

``````svc = svm.SVC(kernel='linear')
``````

In [20]:

``````svc = svm.SVC(kernel='poly', degree=3)
# degree: polynomial degree
``````

RBF kernel (radial basis function)

In [21]:

``````svc = svm.SVC(kernel='rbf')
# gamma: inverse of the size of the radial kernel
``````
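One way to choose among these kernels is to compare their cross-validated accuracy. The sketch below does this on the iris data with `cross_val_score`. The use of 5 folds and of default hyperparameters for each kernel is an assumption made for illustration, and results on other datasets may rank the kernels differently.

```python
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()

mean_scores = {}
for kernel in ('linear', 'poly', 'rbf'):
    svc = svm.SVC(kernel=kernel)
    # Mean accuracy over 5 cross-validation folds for this kernel
    scores = cross_val_score(svc, iris.data, iris.target, cv=5)
    mean_scores[kernel] = scores.mean()
    print(kernel, mean_scores[kernel])
```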

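## 3.5.3 Clustering: grouping observations together

### 3.5.3.1 K-means clustering

(An alternative implementation of k-means is available in SciPy's `cluster` package. The `scikit-learn` implementation differs in that it offers an object API and several additional features, including smart initialization.)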

In [2]:

``````from sklearn import cluster, datasets
k_means = cluster.KMeans(n_clusters=3)
k_means.fit(iris.data)
``````

Out[2]:

``````KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=0)
``````

In [25]:

``````print(k_means.labels_[::10])
``````
``````[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
``````

In [26]:

``````print(iris.target[::10])
``````
``````[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]
``````
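Cluster labels are arbitrary: K-means may call the first species cluster 1 and the second cluster 0, so comparing `labels_` with `target` element by element can be misleading. A permutation-invariant comparison, sketched here with `adjusted_rand_score` from `sklearn.metrics`, avoids this. The `n_init` and `random_state` values are choices made for this illustration.

```python
from sklearn import cluster, datasets
from sklearn.metrics import adjusted_rand_score

iris = datasets.load_iris()
k_means = cluster.KMeans(n_clusters=3, n_init=10, random_state=0)
k_means.fit(iris.data)

# 1.0 means identical groupings; values near 0 mean chance-level agreement
ari = adjusted_rand_score(iris.target, k_means.labels_)
print(ari)
```

A score below 1.0 is expected here, since the clustering never sees the species labels and two of the species overlap in feature space.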

K-means (3 clusters)

K-means (8 clusters)

In [5]:

``````import scipy.datasets
# scipy.misc.lena() has been removed from SciPy; scipy.datasets.face
# (SciPy >= 1.10, fetches the image with the pooch package) is used instead
face = scipy.datasets.face(gray=True).astype(np.float32)
X = face.reshape((-1, 1)) # We need an (n_sample, n_feature) array
k_means = cluster.KMeans(n_clusters=5)
k_means.fit(X)
``````

Out[5]:

``````KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=5, n_init=10,
n_jobs=1, precompute_distances='auto', random_state=None, tol=0.0001,
verbose=0)
``````

In [6]:

``````values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_
face_compressed = np.choose(labels, values)
face_compressed.shape = face.shape
``````

K-means quantization
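The compression step above replaces every pixel with the center of its cluster, so the result uses only as many gray levels as there are clusters. The sketch below shows the same idea on a small synthetic image, a stand-in invented for this example so that nothing needs to be downloaded. It uses plain indexing, `values[labels]`, which gives the same result as `np.choose(labels, values)`.

```python
import numpy as np
from sklearn import cluster

# Synthetic stand-in image with random gray levels (not real image data)
rng = np.random.RandomState(0)
image = rng.uniform(0, 255, size=(32, 32)).astype(np.float32)

# One sample per pixel, one feature per sample: the pixel intensity
X = image.reshape((-1, 1))
k_means = cluster.KMeans(n_clusters=5, n_init=10, random_state=0)
k_means.fit(X)

values = k_means.cluster_centers_.squeeze()  # one gray level per cluster
labels = k_means.labels_                     # cluster index of each pixel

# Replace each pixel by the gray level of its cluster
compressed = values[labels].reshape(image.shape)
print(np.unique(compressed).size)
```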

## 3.5.4 Dimension reduction with principal component analysis

In [3]:

``````from sklearn import decomposition
pca = decomposition.PCA(n_components=2)
pca.fit(iris.data)
``````

Out[3]:

``````PCA(copy=True, n_components=2, whiten=False)
``````

In [4]:

``````X = pca.transform(iris.data)
``````

In [6]:

``````import pylab as pl
pl.scatter(X[:, 0], X[:, 1], c=iris.target)
``````

Out[6]:

``````<matplotlib.collections.PathCollection at 0x107502b90>
``````

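PCA is not only useful for visualizing high-dimensional datasets. It can also serve as a preprocessing step that speeds up supervised methods which are inefficient in high dimensions.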
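Two things can be sketched here: how much of the variance the two retained components capture, and PCA used as a preprocessing step inside a pipeline ahead of a classifier. The choice of the digits data, 16 components, and a KNN classifier is an assumption made for illustration only.

```python
from sklearn import datasets, decomposition, neighbors
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

iris = datasets.load_iris()
pca = decomposition.PCA(n_components=2).fit(iris.data)

# Fraction of the total variance carried by each retained component
print(pca.explained_variance_ratio_)

digits = datasets.load_digits()

# Reduce the 64 pixel features to 16 components before classifying
model = make_pipeline(decomposition.PCA(n_components=16),
                      neighbors.KNeighborsClassifier())
score = cross_val_score(model, digits.data, digits.target, cv=5).mean()
print(score)
```

Putting the PCA inside the pipeline means it is refit on each training fold, so the held-out fold does not leak into the projection.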

## 3.5.5 Putting it all together: face recognition

In [ ]:

``````"""
Stripped-down version of the face recognition example by Olivier Grisel

http://scikit-learn.org/dev/auto_examples/applications/face_recognition.html

## original shape of images: 50, 37
"""
import numpy as np
import pylab as pl
from sklearn import datasets, decomposition, svm
from sklearn.model_selection import StratifiedKFold

# ..
# .. load data ..
lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, resize=0.4)
perm = np.random.permutation(lfw_people.target.size)
lfw_people.data = lfw_people.data[perm]
lfw_people.target = lfw_people.target[perm]
faces = np.reshape(lfw_people.data, (lfw_people.target.shape[0], -1))
# Use the first of 4 stratified folds as the train/test split
train, test = next(StratifiedKFold(n_splits=4).split(faces, lfw_people.target))
X_train, X_test = faces[train], faces[test]
y_train, y_test = lfw_people.target[train], lfw_people.target[test]

# ..
# .. dimension reduction ..
# RandomizedPCA was removed from scikit-learn; PCA with the randomized
# SVD solver takes its place
pca = decomposition.PCA(n_components=150, whiten=True, svd_solver='randomized')
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# ..
# .. classification ..
clf = svm.SVC(C=5., gamma=0.001)
clf.fit(X_train_pca, y_train)

# ..
# .. predict on new images ..
for i in range(10):
    # predict expects a 2D array, so pass a one-row slice
    print(lfw_people.target_names[clf.predict(X_test_pca[i:i + 1])[0]])
    _ = pl.imshow(X_test[i].reshape(50, 37), cmap=pl.cm.gray)
    _ = input()
``````

## 3.5.6 Linear model: from regression to sparsity

In [8]:

``````diabetes = datasets.load_diabetes()
diabetes_X_train = diabetes.data[:-20]
diabetes_X_test  = diabetes.data[-20:]
diabetes_y_train = diabetes.target[:-20]
diabetes_y_test  = diabetes.target[-20:]
``````

### 3.5.6.1 Sparse models

In [9]:

``````from sklearn import linear_model
regr = linear_model.Lasso(alpha=.3)
regr.fit(diabetes_X_train, diabetes_y_train)
``````

Out[9]:

``````Lasso(alpha=0.3, copy_X=True, fit_intercept=True, max_iter=1000,
normalize=False, positive=False, precompute=False, random_state=None,
selection='cyclic', tol=0.0001, warm_start=False)
``````

In [10]:

``````regr.coef_ # very sparse coefficients
``````

Out[10]:

``````array([   0.        ,   -0.        ,  497.34075682,  199.17441034,
-0.        ,   -0.        , -118.89291545,    0.        ,
430.9379595 ,    0.        ])
``````

In [11]:

``````regr.score(diabetes_X_test, diabetes_y_test)
``````

Out[11]:

``````0.55108354530029779
``````

In [12]:

``````lin = linear_model.LinearRegression()
lin.fit(diabetes_X_train, diabetes_y_train)
``````

Out[12]:

``````LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
``````

In [13]:

``````lin.score(diabetes_X_test, diabetes_y_test)
``````

Out[13]:

``````0.58507530226905713
``````
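The Lasso coefficients above are sparse because the L1 penalty drives some of them to exactly zero, while ordinary least squares generally keeps all of them nonzero. The sketch below counts nonzero coefficients for a few values of `alpha`. The particular values are arbitrary choices for illustration, and the dictionary `n_nonzero` is a name introduced here.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
X_train, y_train = diabetes.data[:-20], diabetes.target[:-20]

# Ordinary least squares: no penalty on the coefficients
lin = linear_model.LinearRegression().fit(X_train, y_train)
n_nonzero = {'least squares': int(np.sum(lin.coef_ != 0))}

for alpha in (0.01, 0.3, 1.0):
    lasso = linear_model.Lasso(alpha=alpha).fit(X_train, y_train)
    # A stronger penalty typically leaves fewer nonzero coefficients
    n_nonzero[alpha] = int(np.sum(lasso.coef_ != 0))

print(n_nonzero)
```

The two test scores reported above are close, which suggests that on this dataset the sparser model gives up little predictive accuracy while using fewer features.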

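## 3.5.7 Model selection: choosing estimators and their parameters

### 3.5.7.1 Grid-search and cross-validated estimators

#### 3.5.7.1.1 Grid-search

scikit-learn provides an object that, given data, computes the score of an estimator over a grid of parameters and selects the parameters that maximize the cross-validation score. This object takes an estimator at construction time and itself exposes the estimator API: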

In [16]:

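``````from sklearn import svm
# sklearn.grid_search was removed; GridSearchCV lives in sklearn.model_selection
from sklearn.model_selection import GridSearchCV
gammas = np.logspace(-6, -1, 10)
svc = svm.SVC()
clf = GridSearchCV(estimator=svc, param_grid=dict(gamma=gammas), n_jobs=-1)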
clf.fit(digits.data[:1000], digits.target[:1000])
``````

Out[16]:

``````GridSearchCV(cv=None, error_score='raise',
estimator=SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False),
fit_params={}, iid=True, loss_func=None, n_jobs=-1,
param_grid={'gamma': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
2.78256e-02,   1.00000e-01])},
pre_dispatch='2*n_jobs', refit=True, score_func=None, scoring=None,
verbose=0)
``````

In [20]:

``````clf.best_score_
``````

Out[20]:

``````0.93200000000000005
``````

In [22]:

``````clf.best_estimator_.gamma
``````

Out[22]:

``````0.00059948425031894088
``````
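Conceptually, the grid search above amounts to the loop sketched below: cross-validate the estimator once per candidate `gamma` and keep the best one. This illustrates the idea and is not the library's internal implementation. The use of `cv=3` and a shorter grid of 6 values are choices made to keep the example fast.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import cross_val_score

digits = datasets.load_digits()
X, y = digits.data[:1000], digits.target[:1000]

gammas = np.logspace(-6, -1, 6)
mean_scores = []
for gamma in gammas:
    svc = svm.SVC(gamma=gamma)
    # Mean accuracy over the cross-validation folds for this gamma
    mean_scores.append(cross_val_score(svc, X, y, cv=3).mean())

# Keep the gamma with the highest mean cross-validation score
best_gamma = gammas[int(np.argmax(mean_scores))]
print(best_gamma, max(mean_scores))
```

`GridSearchCV` additionally refits the best estimator on the full data it was given (`refit=True` in the output above), which this loop does not do.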

#### 3.5.7.1.2 Cross-validated estimators

In [23]:

``````from sklearn import linear_model, datasets
lasso = linear_model.LassoCV()
X_diabetes = diabetes.data
y_diabetes = diabetes.target
lasso.fit(X_diabetes, y_diabetes)
``````

Out[23]:

``````LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
verbose=False)
``````

In [26]:

``````# The estimator chose its lambda automatically:
lasso.alpha_
``````

Out[26]:

``````0.012291895087486173
``````
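`LassoCV` selects `alpha_` from a path of candidate values by cross-validation. The sketch below inspects that path through the fitted `alphas_` and `mse_path_` attributes and recovers the selected value by hand. Setting `cv=5` explicitly is an illustrative choice, so the selected value may differ from the output above.

```python
import numpy as np
from sklearn import datasets, linear_model

diabetes = datasets.load_diabetes()
lasso = linear_model.LassoCV(cv=5)
lasso.fit(diabetes.data, diabetes.target)

# mse_path_ has one row per candidate alpha and one column per fold
mean_mse = lasso.mse_path_.mean(axis=1)

# The selected alpha is the candidate with the lowest mean error
best_alpha = lasso.alphas_[np.argmin(mean_mse)]
print(best_alpha, lasso.alpha_)
```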