7.9 Combining Datasets: Concat and Append

```python
import pandas as pd
import numpy as np
```

```python
def make_df(cols, ind):
    """Quickly make a DataFrame"""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

# example DataFrame
make_df('ABC', range(3))
```

|   | A  | B  | C  |
|---|----|----|----|
| 0 | A0 | B0 | C0 |
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |

```python
class display(object):
    """Display HTML representation of multiple objects"""
    template = """<div style="float: left; padding: 10px;">
    <p style='font-family:"Courier New", Courier, monospace'>{0}</p>{1}
    </div>"""
    def __init__(self, *args):
        self.args = args

    def _repr_html_(self):
        return '\n'.join(self.template.format(a, eval(a)._repr_html_())
                         for a in self.args)

    def __repr__(self):
        return '\n\n'.join(a + '\n' + repr(eval(a))
                           for a in self.args)
```

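Recall: Concatenation of NumPy Arrays

Concatenation of `Series` and `DataFrame` objects is very similar to concatenation of NumPy arrays, which can be done with the `np.concatenate` function, as discussed in "The Basics of NumPy Arrays". Recall that with it, you can combine the contents of two or more arrays into a single array: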

```python
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])

# array([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

```python
x = [[1, 2],
     [3, 4]]
np.concatenate([x, x], axis=1)

'''
array([[1, 2, 1, 2],
       [3, 4, 3, 4]])
'''
```
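
To make the role of the `axis` argument concrete, the following minimal sketch concatenates the same 2×2 array along both directions and compares the resulting shapes. The names `a`, `rows`, and `cols` are illustrative and do not come from the original text.

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])

# axis=0 (the default) stacks the arrays vertically: (2, 2) + (2, 2) -> (4, 2)
rows = np.concatenate([a, a], axis=0)

# axis=1 places the arrays side by side: (2, 2) + (2, 2) -> (2, 4)
cols = np.concatenate([a, a], axis=1)

print(rows.shape)  # (4, 2)
print(cols.shape)  # (2, 4)
```

The same idea carries over to `pd.concat`, where the axis can also be given by name.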

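Simple Concatenation with `pd.concat`

Pandas has a function, `pd.concat()`, whose syntax is similar to that of `np.concatenate` but which takes a number of options that we will discuss: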

```python
# Signature in Pandas v0.18
pd.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False,
          keys=None, levels=None, names=None, verify_integrity=False,
          copy=True)
```

`pd.concat()` can be used for simple concatenation of `Series` or `DataFrame` objects, just as `np.concatenate()` can be used for simple concatenation of arrays:

```python
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1, ser2])

'''
1    A
2    B
3    C
4    D
5    E
6    F
dtype: object
'''
```

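```python
df1 = make_df('AB', [1, 2])
df2 = make_df('AB', [3, 4])
display('df1', 'df2', 'pd.concat([df1, df2])')
```

`df1`

|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |

`df2`

|   | A  | B  |
|---|----|----|
| 3 | A3 | B3 |
| 4 | A4 | B4 |

`pd.concat([df1, df2])`

|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |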

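```python
df3 = make_df('AB', [0, 1])
df4 = make_df('CD', [0, 1])
display('df3', 'df4', "pd.concat([df3, df4], axis='columns')")
```

`df3`

|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |

`df4`

|   | C  | D  |
|---|----|----|
| 0 | C0 | D0 |
| 1 | C1 | D1 |

`pd.concat([df3, df4], axis='columns')`

|   | A  | B  | C  | D  |
|---|----|----|----|----|
| 0 | A0 | B0 | C0 | D0 |
| 1 | A1 | B1 | C1 | D1 |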
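
As a small illustration of column-wise concatenation, the sketch below checks that the axis name and the axis number select the same direction, and shows that rows are matched by index label rather than by position. The frames `left`, `right`, and `shifted` are made-up examples, not objects from the text.

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=[0, 1])

# The axis can be given by name or by number
by_name = pd.concat([left, right], axis='columns')
by_number = pd.concat([left, right], axis=1)
print(by_name.equals(by_number))  # True

# Rows are aligned on index labels: with labels [0, 1] and [1, 2],
# the result covers labels 0, 1, 2 and fills the gaps with NaN
shifted = right.copy()
shifted.index = [1, 2]
aligned = pd.concat([left, shifted], axis=1)
print(aligned.shape)  # (3, 4)
```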

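Duplicate indices

One important difference between `np.concatenate` and `pd.concat` is that Pandas concatenation preserves indices, even if the result will have duplicate indices! Consider this simple example: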

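```python
x = make_df('AB', [0, 1])
y = make_df('AB', [2, 3])
y.index = x.index  # make duplicate indices!
display('x', 'y', 'pd.concat([x, y])')
```

`x`

|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |

`y`

|   | A  | B  |
|---|----|----|
| 0 | A2 | B2 |
| 1 | A3 | B3 |

`pd.concat([x, y])`

|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 0 | A2 | B2 |
| 1 | A3 | B3 |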
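
To see why repeated labels matter in practice, here is a minimal sketch that rebuilds the two frames directly (without the `make_df` helper) and inspects the result. The name `combined` is illustrative.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y])

# The index now contains each label twice
print(combined.index.is_unique)  # False

# A label lookup returns every matching row, not a single one
print(combined.loc[0].shape)  # (2, 2)
```

Code that expects `combined.loc[0]` to be a single row would receive a two-row `DataFrame` here, which is one reason to handle duplicates explicitly.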

Catching the repeats as an error

```python
try:
    pd.concat([x, y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

'''
ValueError: Indexes have overlapping values: [0, 1]
'''
```
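
As an alternative to catching the exception, the overlap can be inspected before concatenating. The following is only a sketch under the assumption that discarding the clashing labels is acceptable; the names `overlap` and `result` are illustrative.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Index labels that appear in both inputs
overlap = x.index.intersection(y.index)
print(list(overlap))  # [0, 1]

if len(overlap) == 0:
    # No clash: the original labels can be kept safely
    result = pd.concat([x, y], verify_integrity=True)
else:
    # Illustrative fallback: drop the clashing labels and renumber the rows
    result = pd.concat([x, y], ignore_index=True)

print(list(result.index))  # [0, 1, 2, 3]
```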

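Ignoring the index

```python
display('x', 'y', 'pd.concat([x, y], ignore_index=True)')
```

`x`

|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |

`y`

|   | A  | B  |
|---|----|----|
| 0 | A2 | B2 |
| 1 | A3 | B3 |

`pd.concat([x, y], ignore_index=True)`

|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |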
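
One way to think about `ignore_index=True` is as a plain concatenation followed by a reset of the row labels. The sketch below compares the two; `fresh` and `manual` are illustrative names.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# Let pd.concat build a new integer index
fresh = pd.concat([x, y], ignore_index=True)

# Concatenate first, then drop the old labels explicitly
manual = pd.concat([x, y]).reset_index(drop=True)

print(fresh.equals(manual))  # True
print(list(fresh.index))     # [0, 1, 2, 3]
```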

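Adding `MultiIndex` keys

```python
display('x', 'y', "pd.concat([x, y], keys=['x', 'y'])")
```

`x`

|   | A  | B  |
|---|----|----|
| 0 | A0 | B0 |
| 1 | A1 | B1 |

`y`

|   | A  | B  |
|---|----|----|
| 0 | A2 | B2 |
| 1 | A3 | B3 |

`pd.concat([x, y], keys=['x', 'y'])`

|   |   | A  | B  |
|---|---|----|----|
| x | 0 | A0 | B0 |
|   | 1 | A1 | B1 |
| y | 0 | A2 | B2 |
|   | 1 | A3 | B3 |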
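
The hierarchical index produced by `keys` keeps track of where each row came from. The sketch below shows how the pieces might be selected again afterwards; the level names `'source'` and `'row'` and the variable `stacked` are assumptions chosen for illustration.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# keys labels each input; names (optional) labels the index levels
stacked = pd.concat([x, y], keys=['x', 'y'], names=['source', 'row'])

print(stacked.index.nlevels)        # 2
print(stacked.loc['y'])             # the rows that came from y
print(stacked.loc[('x', 1), 'A'])   # A1
```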

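Concatenation with joins

```python
df5 = make_df('ABC', [1, 2])
df6 = make_df('BCD', [3, 4])
display('df5', 'df6', 'pd.concat([df5, df6])')
```

`df5`

|   | A  | B  | C  |
|---|----|----|----|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |

`df6`

|   | B  | C  | D  |
|---|----|----|----|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |

`pd.concat([df5, df6])`

|   | A   | B  | C  | D   |
|---|-----|----|----|-----|
| 1 | A1  | B1 | C1 | NaN |
| 2 | A2  | B2 | C2 | NaN |
| 3 | NaN | B3 | C3 | D3  |
| 4 | NaN | B4 | C4 | D4  |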

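```python
display('df5', 'df6',
        "pd.concat([df5, df6], join='inner')")
```

`df5`

|   | A  | B  | C  |
|---|----|----|----|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |

`df6`

|   | B  | C  | D  |
|---|----|----|----|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |

`pd.concat([df5, df6], join='inner')`

|   | B  | C  |
|---|----|----|
| 1 | B1 | C1 |
| 2 | B2 | C2 |
| 3 | B3 | C3 |
| 4 | B4 | C4 |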
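
The two join modes can be summarized as a union versus an intersection of the column names. The following sketch rebuilds the two frames without the `make_df` helper and compares the results; `outer` and `inner` are illustrative names.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

outer = pd.concat([df5, df6])                # default: join='outer'
inner = pd.concat([df5, df6], join='inner')

# Outer join keeps the union of the columns and fills gaps with NaN
print(sorted(outer.columns))     # ['A', 'B', 'C', 'D']
print(outer.isna().sum().sum())  # 4

# Inner join keeps only the columns shared by every input
print(sorted(inner.columns))     # ['B', 'C']
```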

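```python
# Note: join_axes belongs to the v0.18 signature shown earlier;
# it was removed in pandas 1.0.
display('df5', 'df6',
        "pd.concat([df5, df6], join_axes=[df5.columns])")
```

`df5`

|   | A  | B  | C  |
|---|----|----|----|
| 1 | A1 | B1 | C1 |
| 2 | A2 | B2 | C2 |

`df6`

|   | B  | C  | D  |
|---|----|----|----|
| 3 | B3 | C3 | D3 |
| 4 | B4 | C4 | D4 |

`pd.concat([df5, df6], join_axes=[df5.columns])`

|   | A   | B  | C  |
|---|-----|----|----|
| 1 | A1  | B1 | C1 |
| 2 | A2  | B2 | C2 |
| 3 | NaN | B3 | C3 |
| 4 | NaN | B4 | C4 |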
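
On pandas versions where the `join_axes` argument is not available, the same table can be obtained in other ways. One possible sketch, not taken from the original text, is to concatenate with the default outer join and then keep only the columns of the first frame using `reindex`; the name `aligned` is illustrative.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Outer-join first, then restrict the result to df5's columns
aligned = pd.concat([df5, df6]).reindex(columns=df5.columns)

print(list(aligned.columns))        # ['A', 'B', 'C']
print(aligned['A'].isna().tolist()) # [False, False, True, True]
```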

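The `append()` method

```python
# Note: DataFrame.append was removed in pandas 2.0;
# this call requires an earlier version.
display('df1', 'df2', 'df1.append(df2)')
```

`df1`

|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |

`df2`

|   | A  | B  |
|---|----|----|
| 3 | A3 | B3 |
| 4 | A4 | B4 |

`df1.append(df2)`

|   | A  | B  |
|---|----|----|
| 1 | A1 | B1 |
| 2 | A2 | B2 |
| 3 | A3 | B3 |
| 4 | A4 | B4 |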
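
Since `df1.append(df2)` gives the same table as `pd.concat([df1, df2])`, code that has to run on pandas 2.0 or later, where `DataFrame.append` no longer exists, can use `pd.concat` instead. The sketch below also shows a common pattern for combining many pieces: collect them in a list and concatenate once. The names `appended`, `pieces`, and `combined` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2']}, index=[1, 2])
df2 = pd.DataFrame({'A': ['A3', 'A4'], 'B': ['B3', 'B4']}, index=[3, 4])

# Equivalent of df1.append(df2); neither input is modified
appended = pd.concat([df1, df2])
print(list(appended.index))  # [1, 2, 3, 4]

# For many pieces, build a list and concatenate a single time
pieces = [pd.DataFrame({'A': ['A' + str(i)], 'B': ['B' + str(i)]}, index=[i])
          for i in range(1, 5)]
combined = pd.concat(pieces)
print(combined.equals(appended))  # True
```

Concatenating once avoids creating a new intermediate `DataFrame` at every step of a loop.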