# 7.4 Introducing Pandas Objects

``````
import numpy as np
import pandas as pd
``````

## The Pandas Series Object

A Pandas `Series` is a one-dimensional array of indexed data. It can be created from a list or array as follows:

``````
data = pd.Series([0.25, 0.5, 0.75, 1.0])
data

'''
0    0.25
1    0.50
2    0.75
3    1.00
dtype: float64
'''
``````

The `values` attribute is simply a familiar NumPy array:

``````
data.values

# array([ 0.25,  0.5 ,  0.75,  1.  ])
``````

The `index` is an array-like object of type `pd.Index`, which we will discuss in detail later.

``````
data.index

# RangeIndex(start=0, stop=4, step=1)
``````

``````
data[1]

# 0.5

data[1:3]

'''
1    0.50
2    0.75
dtype: float64
'''
``````
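The relationship between a `Series`, its values, and its index can be checked directly. The following is a minimal sketch; the names `values`, `index_labels`, and `subset` are illustrative, and `.iloc` (a standard pandas positional indexer not used in the text above) stands in for the plain slice so that the behavior does not depend on the pandas version.

```python
import numpy as np
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0])

# The values live in a NumPy array; the index is a separate object.
values = data.values
index_labels = list(data.index)

print(type(values).__name__)  # ndarray
print(index_labels)           # [0, 1, 2, 3]

# A positional slice returns a new Series that keeps the original labels.
subset = data.iloc[1:3]
print(list(subset.index))     # [1, 2]
```

Note that the slice keeps the labels 1 and 2 rather than renumbering from 0, which is one way the index differs from implicit array positions.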

### `Series` as a Generalized NumPy Array

``````
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=['a', 'b', 'c', 'd'])
data

'''
a    0.25
b    0.50
c    0.75
d    1.00
dtype: float64
'''
``````

``````
data['b']

# 0.5
``````

``````
data = pd.Series([0.25, 0.5, 0.75, 1.0],
                 index=[2, 5, 3, 7])
data

'''
2    0.25
5    0.50
3    0.75
7    1.00
dtype: float64
'''

data[5]

# 0.5
``````
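With a non-sequential integer index, a label and a position can name different elements. The sketch below makes the distinction explicit with `.loc` (label-based) and `.iloc` (position-based); these are standard pandas indexers that the text above does not introduce, so treat this as a supplementary illustration.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=[2, 5, 3, 7])

by_label = data.loc[5]      # element whose label is 5
by_position = data.iloc[1]  # second element, regardless of label
print(by_label, by_position)  # 0.5 0.5

# Label 3 is the third element, but position 3 is the fourth.
label_three = data.loc[3]
position_three = data.iloc[3]
print(label_three, position_three)  # 0.75 1.0
```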

### `Series` as a Specialized Dictionary

``````
population_dict = {'California': 38332521,
                   'Texas': 26448193,
                   'New York': 19651127,
                   'Florida': 19552860,
                   'Illinois': 12882135}
population = pd.Series(population_dict)
population

'''
California    38332521
Florida       19552860
Illinois      12882135
New York      19651127
Texas         26448193
dtype: int64
'''
``````

``````
population['California']

# 38332521
``````

``````
population['California':'Illinois']

'''
California    38332521
Florida       19552860
Illinois      12882135
dtype: int64
'''
``````
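The dictionary analogy extends to membership tests and key listing, while label slicing is something a plain dictionary cannot do. A minimal sketch follows; the explicit `sort_index()` call is an addition made here so that the slice result does not depend on how a given pandas version orders dictionary keys.

```python
import pandas as pd

population = pd.Series({'California': 38332521,
                        'Texas': 26448193,
                        'New York': 19651127,
                        'Florida': 19552860,
                        'Illinois': 12882135})

# Sort by label so the slice below is independent of key order.
population = population.sort_index()

# Dictionary-style membership test and key listing
has_texas = 'Texas' in population
keys = list(population.keys())
print(has_texas)  # True
print(keys)

# Unlike Python slices, label-based slices include the end label.
sliced = population['California':'Illinois']
print(list(sliced.index))  # ['California', 'Florida', 'Illinois']
```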

### Constructing Series Objects

``````
>>> pd.Series(data, index=index)
``````

``````
pd.Series([2, 4, 6])

'''
0    2
1    4
2    6
dtype: int64
'''
``````

`data` can be a scalar, which is repeated to fill the specified index:

``````
pd.Series(5, index=[100, 200, 300])

'''
100    5
200    5
300    5
dtype: int64
'''
``````

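`data` can be a dictionary, in which case `index` defaults to the dictionary keys. The output below comes from an older pandas release that sorted the keys; recent releases keep the dictionary's insertion order:

``````
pd.Series({2:'a', 1:'b', 3:'c'})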

'''
1    b
2    a
3    c
dtype: object
'''
``````

``````
pd.Series({2:'a', 1:'b', 3:'c'}, index=[3, 2])

'''
3    c
2    a
dtype: object
'''
``````
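When an explicit `index` is passed with a dictionary, only the listed keys are kept, and a requested key that is absent from the dictionary gets a missing value. The sketch below illustrates this; the extra label `9` is a hypothetical key chosen here because it is not in the dictionary.

```python
import pandas as pd

# 3 and 2 are dictionary keys; 9 is not, so its value is missing (NaN).
s = pd.Series({2: 'a', 1: 'b', 3: 'c'}, index=[3, 2, 9])
print(s)

missing = s.isna().tolist()
print(missing)  # [False, False, True]
```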

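## The Pandas DataFrame Object

The next fundamental structure in Pandas is the `DataFrame`. Like the `Series` object discussed in the previous section, the `DataFrame` can be thought of either as a generalization of a NumPy array or as a specialization of a Python dictionary. We'll now look at each of these perspectives.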

### `DataFrame` as a Generalized NumPy Array

``````
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)
area

'''
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
dtype: int64
'''
``````

``````
states = pd.DataFrame({'population': population,
                       'area': area})
states
``````

|            | area   | population |
|------------|--------|------------|
| California | 423967 | 38332521   |
| Florida    | 170312 | 19552860   |
| Illinois   | 149995 | 12882135   |
| New York   | 141297 | 19651127   |
| Texas      | 695662 | 26448193   |

``````
states.index

# Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')
``````

``````
states.columns

# Index(['area', 'population'], dtype='object')
``````

### `DataFrame` as a Specialized Dictionary

``````
states['area']

'''
California    423967
Florida       170312
Illinois      149995
New York      141297
Texas         695662
Name: area, dtype: int64
'''
``````
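The dictionary view means that square brackets on a `DataFrame` select a column, whereas the same notation on a two-dimensional NumPy array selects a row. The following sketch contrasts the two; the names `column` and `first_row` are illustrative, and the row check looks only at the shape because column order can differ between pandas versions.

```python
import pandas as pd

population = pd.Series({'California': 38332521, 'Texas': 26448193,
                        'New York': 19651127, 'Florida': 19552860,
                        'Illinois': 12882135})
area = pd.Series({'California': 423967, 'Texas': 695662, 'New York': 141297,
                  'Florida': 170312, 'Illinois': 149995})
states = pd.DataFrame({'population': population, 'area': area})

# Dictionary-style access: a column name maps to a Series.
column = states['area']
print(type(column).__name__)  # Series
print('area' in states)       # True: membership tests the column names
print(column['Texas'])        # 695662

# On the underlying 2D NumPy array, [0] selects the first row instead.
first_row = states.values[0]
print(first_row.shape)        # (2,)
```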

### Constructing `DataFrame` Objects

A Pandas `DataFrame` can be constructed in a variety of ways. Here we'll give several examples.

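#### From a Single `Series` Object

A `DataFrame` is a collection of `Series` objects, and a single-column `DataFrame` can be constructed from a single `Series`:

``````
pd.DataFrame(population, columns=['population'])
``````

|            | population |
|------------|------------|
| California | 38332521   |
| Florida    | 19552860   |
| Illinois   | 12882135   |
| New York   | 19651127   |
| Texas      | 26448193   |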

#### From a List of Dictionaries

``````
data = [{'a': i, 'b': 2 * i}
        for i in range(3)]
pd.DataFrame(data)
``````

|   | a | b |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 1 | 2 |
| 2 | 2 | 4 |

``````
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])
``````

|   | a   | b | c   |
|---|-----|---|-----|
| 0 | 1.0 | 2 | NaN |
| 1 | NaN | 3 | 4.0 |
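The table above shows that keys missing from some dictionaries are filled with `NaN`. As a small sketch, the missing entries and the resulting column types can be inspected with `isna()` and `dtypes`; the name `df` is illustrative.

```python
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# True marks the positions that were filled with NaN.
missing = df.isna()
print(missing)

# Columns that contain NaN are stored as float; the complete column stays int.
print(df.dtypes)
```

This explains why columns `a` and `c` display `1.0` and `4.0` while `b` keeps integer values.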

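#### From a Dictionary of Series Objects

``````
pd.DataFrame({'population': population,
              'area': area})
``````

|            | area   | population |
|------------|--------|------------|
| California | 423967 | 38332521   |
| Florida    | 170312 | 19552860   |
| Illinois   | 149995 | 12882135   |
| New York   | 141297 | 19651127   |
| Texas      | 695662 | 26448193   |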

#### From a Two-Dimensional NumPy Array

``````
pd.DataFrame(np.random.rand(3, 2),
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])
``````

|   | foo      | bar      |
|---|----------|----------|
| a | 0.865257 | 0.213169 |
| b | 0.442759 | 0.108267 |
| c | 0.047110 | 0.905718 |
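If `columns` or `index` is omitted when building from a two-dimensional array, pandas falls back to integer labels. The sketch below compares the two cases; the seeded generator `np.random.default_rng(0)` is an assumption made here only to keep the example reproducible, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
values = rng.random((3, 2))

# Without labels, both axes get a default integer index.
df_default = pd.DataFrame(values)
print(list(df_default.index))    # [0, 1, 2]
print(list(df_default.columns))  # [0, 1]

# With labels, the same data is addressed by name.
df_named = pd.DataFrame(values, columns=['foo', 'bar'], index=['a', 'b', 'c'])
print(df_named.loc['b', 'foo'] == values[1, 0])  # True
```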

#### From a NumPy Structured Array

``````
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

'''
array([(0, 0.0), (0, 0.0), (0, 0.0)],
dtype=[('A', '<i8'), ('B', '<f8')])
'''

pd.DataFrame(A)
``````
|   | A | B   |
|---|---|-----|
| 0 | 0 | 0.0 |
| 1 | 0 | 0.0 |
| 2 | 0 | 0.0 |
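The field names and types of a structured array carry over to the `DataFrame` columns. A minimal sketch with non-zero data makes this easier to see; the two records used here are made-up values for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative records: field 'A' is a 64-bit integer, 'B' a 64-bit float.
records = np.array([(1, 0.5), (2, 1.5)], dtype=[('A', 'i8'), ('B', 'f8')])
df = pd.DataFrame(records)

print(list(df.columns))  # ['A', 'B']
print(df.dtypes)         # A is int64, B is float64
```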

## The Pandas Index Object

``````
ind = pd.Index([2, 3, 5, 7, 11])
ind

# Int64Index([2, 3, 5, 7, 11], dtype='int64')
``````

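### `Index` as an Immutable Array

An `Index` behaves like an array in many ways. For example, we can use standard Python indexing notation to retrieve values or slices:

``````
ind[1]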

# 3

ind[::2]

# Int64Index([2, 5, 11], dtype='int64')
``````

`Index` objects also have many of the attributes familiar from NumPy arrays:

``````
print(ind.size, ind.shape, ind.ndim, ind.dtype)

# 5 (5,) 1 int64
``````

One difference between `Index` objects and NumPy arrays is that indices are immutable; that is, they cannot be modified by the usual means:

``````
ind[1] = 0

'''
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-34-40e631c82e8a> in <module>()
----> 1 ind[1] = 0

/Users/jakevdp/anaconda/lib/python3.5/site-packages/pandas/indexes/base.py in __setitem__(self, key, value)
1243
1244     def __setitem__(self, key, value):
-> 1245         raise TypeError("Index does not support mutable operations")
1246
1247     def __getitem__(self, key):

TypeError: Index does not support mutable operations
'''
``````
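The immutability can be confirmed without stopping a script by catching the `TypeError`. The sketch below also shows one possible way to get a modified index, namely building a new `Index` from an edited list; the names `raised`, `labels`, and `new_ind` are illustrative.

```python
import pandas as pd

ind = pd.Index([2, 3, 5, 7, 11])

# In-place assignment is rejected.
try:
    ind[1] = 0
except TypeError as err:
    raised = True
    print("TypeError:", err)
else:
    raised = False

# To change a label, build a new Index; the original is left untouched.
labels = ind.tolist()
labels[1] = 0
new_ind = pd.Index(labels)
print(new_ind.tolist())  # [2, 0, 5, 7, 11]
print(ind.tolist())      # [2, 3, 5, 7, 11]
```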

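### `Index` as an Ordered Set

Pandas objects are designed to facilitate operations such as joins across datasets, which depend on many aspects of set arithmetic. The `Index` object follows many of the conventions used by Python's built-in `set` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way: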

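``````
indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Named methods are used here because the &, |, and ^ operators
# no longer perform set operations on an Index in pandas 2.x.
indA.intersection(indB)  # intersection

# Int64Index([3, 5, 7], dtype='int64')

indA.union(indB)  # union

# Int64Index([1, 2, 3, 5, 7, 9, 11], dtype='int64')

indA.symmetric_difference(indB)  # symmetric difference

# Int64Index([1, 2, 9, 11], dtype='int64')
``````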
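The text mentions differences, which the example above does not show. A short sketch using the `difference` method follows, with results converted to lists so the printed form does not depend on the pandas version; the names `only_in_a` and `only_in_b` are illustrative.

```python
import pandas as pd

indA = pd.Index([1, 3, 5, 7, 9])
indB = pd.Index([2, 3, 5, 7, 11])

# Labels in one index that do not appear in the other
only_in_a = indA.difference(indB)
only_in_b = indB.difference(indA)
print(only_in_a.tolist())  # [1, 9]
print(only_in_b.tolist())  # [2, 11]

# Set operations return new Index objects; the inputs are unchanged.
print(indA.tolist())       # [1, 3, 5, 7, 9]
```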