1. Data Science

Plotting the Classics

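``````
# Read two books, fast!
import re
from urllib.request import urlopen

def read_url(url):
    # Assumed helper (not shown in this section): fetch a URL and collapse whitespace.
    return re.sub(r'\s+', ' ', urlopen(url).read().decode())

huck_finn_url = 'https://www.inferentialthinking.com/chapters/01/3/huck_finn.txt'
huck_finn_text = read_url(huck_finn_url)
huck_finn_chapters = huck_finn_text.split('CHAPTER ')[44:]

little_women_url = 'https://www.inferentialthinking.com/chapters/01/3/little_women.txt'
little_women_text = read_url(little_women_url)
little_women_chapters = little_women_text.split('CHAPTER ')[1:]
``````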
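To see why the two slices (`[44:]` and `[1:]`) can differ, it may help to look at what `split` returns on a small example. The sketch below uses a made-up text (`toy_text` is hypothetical and not taken from either novel) whose table of contents repeats the word `CHAPTER`. The Huckleberry Finn pieces before index 44 are presumably skipped for a similar reason, although this section does not say so.

```python
# Toy text standing in for a book: front matter, a short table of contents,
# then the chapters themselves. (Hypothetical content, for illustration only.)
toy_text = (
    "A TOY BOOK CONTENTS "
    "CHAPTER I. CHAPTER II. "
    "CHAPTER I. It began. "
    "CHAPTER II. It ended."
)

# Splitting on 'CHAPTER ' yields one piece before the first occurrence
# and one piece after every occurrence.
pieces = toy_text.split('CHAPTER ')

# pieces[0] is the front matter, pieces[1:3] come from the table of contents,
# and the remaining pieces are the chapter bodies.
toy_chapters = pieces[3:]

print(len(pieces))
print(toy_chapters)
```

The slice index is simply the number of pieces to discard: one for the front matter, plus one for each table-of-contents entry when the text has one.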

``````
# Display the chapters of Huckleberry Finn in a table.
from datascience import Table

Table().with_column('Chapters', huck_finn_chapters)
``````
Chapters
I. YOU don't know about me without you have read a book ...
II. WE went tiptoeing along a path amongst the trees bac ...
III. WELL, I got a good going-over in the morning from o ...
IV. WELL, three or four months run along, and it was wel ...
V. I had shut the door to. Then I turned around and ther ...
VI. WELL, pretty soon the old man was up and around agai ...
VII. "GIT up! What you 'bout?" I opened my eyes and look ...
VIII. THE sun was up so high when I waked that I judged ...
IX. I wanted to go and look at a place right about the m ...
X. AFTER breakfast I wanted to talk about the dead man a ...

... (33 rows omitted)

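Literary Characters

The Adventures of Huckleberry Finn describes a journey that Huck and Jim take along the Mississippi River. Tom Sawyer joins them later, as the action unfolds. Once the text is loaded, we can quickly see how many times each of these characters has been mentioned up to any point in the book.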

``````
# Count how many times the names Jim, Tom, and Huck appear in each chapter.
import numpy as np
import matplotlib.pyplot as plots

counts = Table().with_columns([
    'Jim', np.char.count(huck_finn_chapters, 'Jim'),
    'Tom', np.char.count(huck_finn_chapters, 'Tom'),
    'Huck', np.char.count(huck_finn_chapters, 'Huck')
])

# Plot the cumulative counts:
# how many times in Chapter 1, how many times in Chapters 1 and 2, and so on.
# The book has 43 chapters, so the chapter numbers run from 1 to 43.

cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 44, 1))
cum_counts.plot(column_for_xticks=3)
plots.title('Cumulative Number of Times Each Name Appears', y=1.08);
``````
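The counting and accumulating steps can be sketched on a tiny example using only the standard library. This is an illustration under stated assumptions, not the code used above: `toy_chapters`, `count_per_chapter`, and `cumulative` are hypothetical names, `str.count` stands in for `np.char.count` (which counts non-overlapping occurrences of a substring in each element), and `itertools.accumulate` stands in for the table's cumulative sum.

```python
from itertools import accumulate

# Three made-up "chapters" (hypothetical text, for illustration only).
toy_chapters = [
    "I. Jim and Huck set out. Jim rowed.",
    "II. Huck slept.",
    "III. Tom arrived and greeted Jim and Huck.",
]

def count_per_chapter(chapters, name):
    """Number of times the substring `name` occurs in each chapter."""
    return [chapter.count(name) for chapter in chapters]

def cumulative(counts):
    """Running total: entry k is the sum of counts[0] through counts[k]."""
    return list(accumulate(counts))

for name in ['Jim', 'Tom', 'Huck']:
    per_chapter = count_per_chapter(toy_chapters, name)
    print(name, per_chapter, cumulative(per_chapter))
```

Because these are substring counts, a short name also matches inside longer words: counting `'Jo'` would also count `'John'`, and `'Tom'` would also count `'Tomorrow'`. The curves are therefore best read as approximate mention counts.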

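Little Women is the story of four sisters growing up together during the Civil War. In this book, chapter numbers are spelled out and chapter titles are written in capital letters.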

``````
# The chapters of Little Women, in a table

Table().with_column('Chapters', little_women_chapters)
``````
Chapters
ONE PLAYING PILGRIMS "Christmas won't be Christmas witho ...
TWO A MERRY CHRISTMAS Jo was the first to wake in the gr ...
THREE THE LAURENCE BOY "Jo! Jo! Where are you?" cried Me ...
FOUR BURDENS "Oh, dear, how hard it does seem to take up ...
FIVE BEING NEIGHBORLY "What in the world are you going t ...
SIX BETH FINDS THE PALACE BEAUTIFUL The big house did pr ...
SEVEN AMY'S VALLEY OF HUMILIATION "That boy is a perfect ...
EIGHT JO MEETS APOLLYON "Girls, where are you going?" as ...
NINE MEG GOES TO VANITY FAIR "I do think it was the most ...
TEN THE P.C. AND P.O. As spring came on, a new set of am ...

... (37 rows omitted)

``````
# Counts of names in the chapters of Little Women

counts = Table().with_columns([
    'Amy', np.char.count(little_women_chapters, 'Amy'),
    'Beth', np.char.count(little_women_chapters, 'Beth'),
    'Jo', np.char.count(little_women_chapters, 'Jo'),
    'Meg', np.char.count(little_women_chapters, 'Meg'),
    'Laurie', np.char.count(little_women_chapters, 'Laurie'),
])

# Plot the cumulative counts.

cum_counts = counts.cumsum().with_column('Chapter', np.arange(1, 48, 1))
cum_counts.plot(column_for_xticks=5)
plots.title('Cumulative Number of Times Each Name Appears', y=1.08);
``````

Another Kind of Character

``````
# In each chapter, count the number of all characters;
# call this the "length" of the chapter.
# Also count the number of periods.

chars_periods_huck_finn = Table().with_columns([
    'Huck Finn Chapter Length', [len(s) for s in huck_finn_chapters],
    'Number of Periods', np.char.count(huck_finn_chapters, '.')
])
chars_periods_little_women = Table().with_columns([
    'Little Women Chapter Length', [len(s) for s in little_women_chapters],
    'Number of Periods', np.char.count(little_women_chapters, '.')
])
``````

`chars_periods_huck_finn`

Huck Finn Chapter Length | Number of Periods
7026 | 66
11982 | 117
8529 | 72
6799 | 84
8166 | 91
14550 | 125
13218 | 127
22208 | 249
8081 | 71
7036 | 70

... (33 rows omitted)

`chars_periods_little_women`

Little Women Chapter Length | Number of Periods
21759 | 189
22148 | 188
20558 | 231
25526 | 195
23395 | 255
14622 | 140
14431 | 131
22476 | 214
33767 | 337
18508 | 185

... (37 rows omitted)

``````
# Scatter plot of chapter length against number of periods, one point per chapter:
# Huckleberry Finn in dark blue, Little Women in gold.
plots.figure(figsize=(6, 6))
plots.scatter(chars_periods_huck_finn.column(1),
              chars_periods_huck_finn.column(0),
              color='darkblue')
plots.scatter(chars_periods_little_women.column(1),
              chars_periods_little_women.column(0),
              color='gold')
plots.xlabel('Number of periods in chapter')
plots.ylabel('Number of characters in chapter');
``````
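One way to put a number on what the scatter plot shows is to divide total characters by total periods for each book. The sketch below does this for only the ten rows of each table displayed above, since the remaining rows are elided here. The function name `chars_per_period` is an illustrative assumption, and the ratio is a rough summary rather than an analysis taken from the text.

```python
# First ten rows of each table as displayed above:
# (chapter length in characters, number of periods).
huck_finn_rows = [
    (7026, 66), (11982, 117), (8529, 72), (6799, 84), (8166, 91),
    (14550, 125), (13218, 127), (22208, 249), (8081, 71), (7036, 70),
]
little_women_rows = [
    (21759, 189), (22148, 188), (20558, 231), (25526, 195), (23395, 255),
    (14622, 140), (14431, 131), (22476, 214), (33767, 337), (18508, 185),
]

def chars_per_period(rows):
    """Total characters divided by total periods over the given chapters."""
    total_chars = sum(length for length, _ in rows)
    total_periods = sum(periods for _, periods in rows)
    return total_chars / total_periods

huck_ratio = chars_per_period(huck_finn_rows)
little_women_ratio = chars_per_period(little_women_rows)
print(round(huck_ratio, 1), round(little_women_ratio, 1))
```

For these twenty chapters, both ratios come out a little over 100 characters per period, even though the Little Women chapters shown are much longer. The figures for the full books may differ, because most rows are not shown in this section.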