第二章

原文:Chapter 2

译者:飞龙

协议:CC BY-NC-SA 4.0

# 通常的开头
import pandas as pd
# 使图表更大更漂亮
pd.set_option('display.mpl_style', 'default') 
pd.set_option('display.line_width', 5000) 
pd.set_option('display.max_columns', 60) 

figsize(15, 5)

我们将在这里使用一个新的数据集,来演示如何处理更大的数据集。 这是来自 NYC Open Data 的 311 个服务请求的子集。

complaints = pd.read_csv('../data/311-service-requests.csv')

2.1 里面究竟有什么?(总结)

当你查看一个大型数据框架,而不是显示数据框架的内容,它会显示一个摘要。 这包括所有列,以及每列中有多少非空值。

complaints
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 52 columns):
Unique Key                        111069  non-null values
Created Date                      111069  non-null values
Closed Date                       60270  non-null values
Agency                            111069  non-null values
Agency Name                       111069  non-null values
Complaint Type                    111069  non-null values
Descriptor                        111068  non-null values
Location Type                     79048  non-null values
Incident Zip                      98813  non-null values
Incident Address                  84441  non-null values
Street Name                       84438  non-null values
Cross Street 1                    84728  non-null values
Cross Street 2                    84005  non-null values
Intersection Street 1             19364  non-null values
Intersection Street 2             19366  non-null values
Address Type                      102247  non-null values
City                              98860  non-null values
Landmark                          95  non-null values
Facility Type                     110938  non-null values
Status                            111069  non-null values
Due Date                          39239  non-null values
Resolution Action Updated Date    96507  non-null values
Community Board                   111069  non-null values
Borough                           111069  non-null values
X Coordinate (State Plane)        98143  non-null values
Y Coordinate (State Plane)        98143  non-null values
Park Facility Name                111069  non-null values
Park Borough                      111069  non-null values
School Name                       111069  non-null values
School Number                     111052  non-null values
School Region                     110524  non-null values
School Code                       110524  non-null values
School Phone Number               111069  non-null values
School Address                    111069  non-null values
School City                       111069  non-null values
School State                      111069  non-null values
School Zip                        111069  non-null values
School Not Found                  38984  non-null values
School or Citywide Complaint      0  non-null values
Vehicle Type                      99  non-null values
Taxi Company Borough              117  non-null values
Taxi Pick Up Location             1059  non-null values
Bridge Highway Name               185  non-null values
Bridge Highway Direction          185  non-null values
Road Ramp                         184  non-null values
Bridge Highway Segment            223  non-null values
Garage Lot Name                   49  non-null values
Ferry Direction                   37  non-null values
Ferry Terminal Name               336  non-null values
Latitude                          98143  non-null values
Longitude                         98143  non-null values
Location                          98143  non-null values
dtypes: float64(5), int64(1), object(46)

2.2 选择列和行

为了选择一列,使用列名称作为索引,像这样:

complaints['Complaint Type']
0      Noise - Street/Sidewalk
1              Illegal Parking
2           Noise - Commercial
3              Noise - Vehicle
4                       Rodent
5           Noise - Commercial
6             Blocked Driveway
7           Noise - Commercial
8           Noise - Commercial
9           Noise - Commercial
10    Noise - House of Worship
11          Noise - Commercial
12             Illegal Parking
13             Noise - Vehicle
14                      Rodent
...
111054    Noise - Street/Sidewalk
111055         Noise - Commercial
111056      Street Sign - Missing
111057                      Noise
111058         Noise - Commercial
111059    Noise - Street/Sidewalk
111060                      Noise
111061         Noise - Commercial
111062               Water System
111063               Water System
111064    Maintenance or Facility
111065            Illegal Parking
111066    Noise - Street/Sidewalk
111067         Noise - Commercial
111068           Blocked Driveway
Name: Complaint Type, Length: 111069, dtype: object

要获得DataFrame的前 5 行,我们可以使用切片:df [:5]

这是一个了解数据框架中存在什么信息的很好方式 - 花一点时间来查看内容并获得此数据集的感觉。

complaints[:5]
Unique Key Created Date Closed Date Agency Agency Name Complaint Type Descriptor Location Type Incident Zip Incident Address Street Name Cross Street 1 Cross Street 2 Intersection Street 1 Intersection Street 2 Address Type City Landmark Facility Type Status Due Date Resolution Action Updated Date Community Board Borough X Coordinate (State Plane) Y Coordinate (State Plane) Park Facility Name Park Borough School Name School Number School Region School Code School Phone Number School Address School City School State School Zip School Not Found School or Citywide Complaint Vehicle Type Taxi Company Borough Taxi Pick Up Location Bridge Highway Name Bridge Highway Direction Road Ramp Bridge Highway Segment Garage Lot Name Ferry Direction Ferry Terminal Name Latitude Longitude Location
0 26589651 10/31/2013 02:08:41 AM NaN NYPD New York City Police Department Noise - Street/Sidewalk Loud Talking Street/Sidewalk 11432 90-03 169 STREET 169 STREET 90 AVENUE 91 AVENUE NaN NaN ADDRESS JAMAICA NaN Precinct Assigned 10/31/2013 10:08:41 AM 10/31/2013 02:35:17 AM 12 QUEENS QUEENS 1042027 197389 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.708275 -73.791604 (40.70827532593202, -73.79160395779721)
1 26593698 10/31/2013 02:01:04 AM NaN NYPD New York City Police Department Illegal Parking Commercial Overnight Parking Street/Sidewalk 11378 58 AVENUE 58 AVENUE 58 PLACE 59 STREET NaN NaN BLOCKFACE MASPETH NaN Precinct Open 10/31/2013 10:01:04 AM NaN 05 QUEENS QUEENS 1009349 201984 Unspecified QUEENS Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.721041 -73.909453 (40.721040535628305, -73.90945306791765)
2 26594139 10/31/2013 02:00:24 AM 10/31/2013 02:40:32 AM NYPD New York City Police Department Noise - Commercial Loud Music/Party Club/Bar/Restaurant 10032 4060 BROADWAY BROADWAY WEST 171 STREET WEST 172 STREET NaN NaN ADDRESS NEW YORK NaN Precinct Closed 10/31/2013 10:00:24 AM 10/31/2013 02:39:42 AM 12 MANHATTAN MANHATTAN 1001088 246531 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.843330 -73.939144 (40.84332975466513, -73.93914371913482)
3 26595721 10/31/2013 01:56:23 AM 10/31/2013 02:21:48 AM NYPD New York City Police Department Noise - Vehicle Car/Truck Horn Street/Sidewalk 10023 WEST 72 STREET WEST 72 STREET COLUMBUS AVENUE AMSTERDAM AVENUE NaN NaN BLOCKFACE NEW YORK NaN Precinct Closed 10/31/2013 09:56:23 AM 10/31/2013 02:21:10 AM 07 MANHATTAN MANHATTAN 989730 222727 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.778009 -73.980213 (40.7780087446372, -73.98021349023975)
4 26590930 10/31/2013 01:53:44 AM NaN DOHMH Department of Health and Mental Hygiene Rodent Condition Attracting Rodents Vacant Lot 10027 WEST 124 STREET WEST 124 STREET LENOX AVENUE ADAM CLAYTON POWELL JR BOULEVARD NaN NaN BLOCKFACE NEW YORK NaN N/A Pending 11/30/2013 01:53:44 AM 10/31/2013 01:59:54 AM 10 MANHATTAN MANHATTAN 998815 233545 Unspecified MANHATTAN Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified Unspecified N NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 40.807691 -73.947387 (40.80769092704951,

我们可以组合它们来获得一列的前五行。

complaints['Complaint Type'][:5]
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

并且无论我们以什么方向:

complaints[:5]['Complaint Type']
0    Noise - Street/Sidewalk
1            Illegal Parking
2         Noise - Commercial
3            Noise - Vehicle
4                     Rodent
Name: Complaint Type, dtype: object

2.3 选择多列

如果我们只关心投诉类型和区,但不关心其余的信息怎么办? Pandas 使它很容易选择列的一个子集:只需将所需列的列表用作索引。

complaints[['Complaint Type', 'Borough']]
<class 'pandas.core.frame.DataFrame'>
Int64Index: 111069 entries, 0 to 111068
Data columns (total 2 columns):
Complaint Type    111069  non-null values
Borough           111069  non-null values
dtypes: object(2)

这会向我们展示总结,我们可以获取前 10 列:

complaints[['Complaint Type', 'Borough']][:10]
Complaint Type Borough
0 Noise - Street/Sidewalk QUEENS
1 Illegal Parking QUEENS
2 Noise - Commercial MANHATTAN
3 Noise - Vehicle MANHATTAN
4 Rodent MANHATTAN
5 Noise - Commercial QUEENS
6 Blocked Driveway QUEENS
7 Noise - Commercial QUEENS
8 Noise - Commercial MANHATTAN
9 Noise - Commercial BROOKLYN

2.4 什么是最常见的投诉类型?

这是个易于回答的问题,我们可以调用.value_counts()方法:

complaints['Complaint Type'].value_counts()
HEATING                     14200
GENERAL CONSTRUCTION         7471
Street Light Condition       7117
DOF Literature Request       5797
PLUMBING                     5373
PAINT - PLASTER              5149
Blocked Driveway             4590
NONCONST                     3998
Street Condition             3473
Illegal Parking              3343
Noise                        3321
Traffic Signal Condition     3145
Dirty Conditions             2653
Water System                 2636
Noise - Commercial           2578
...
Opinion for the Mayor                2
Window Guard                         2
DFTA Literature Request              2
Legal Services Provider Complaint    2
Open Flame Permit                    1
Snow                                 1
Municipal Parking Facility           1
X-Ray Machine/Equipment              1
Stalled Sites                        1
DHS Income Savings Requirement       1
Tunnel Condition                     1
Highway Sign - Damaged               1
Ferry Permit                         1
Trans Fat                            1
DWD                                  1
Length: 165, dtype: int64

如果我们想要最常见的 10 个投诉类型,我们可以这样:

complaint_counts = complaints['Complaint Type'].value_counts()
complaint_counts[:10]
HEATING                   14200
GENERAL CONSTRUCTION       7471
Street Light Condition     7117
DOF Literature Request     5797
PLUMBING                   5373
PAINT - PLASTER            5149
Blocked Driveway           4590
NONCONST                   3998
Street Condition           3473
Illegal Parking            3343
dtype: int64

但是还可以更好,我们可以绘制出来!

complaint_counts[:10].plot(kind='bar')
<matplotlib.axes.AxesSubplot at 0x7ba2290>

results matching ""

    No results matching ""