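12.5. Parsers
Text search parsers are responsible for splitting raw document text into tokens and identifying each token's type, where the set of possible types is defined by the parser itself. Note that a parser does not modify the text at all; it simply identifies plausible word boundaries. Because of this limited scope, there is less need for application-specific custom parsers than for custom dictionaries. At present PostgreSQL provides just one built-in parser, which has proven useful for a wide range of applications.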
The built-in parser is named pg_catalog.default. It recognizes 23 token types, shown in Table 12-1.
Table 12-1. Default Parser's Token Types
Alias | Description | Example |
---|---|---|
asciiword | Word, all ASCII letters | elephant |
word | Word, all letters | mañana |
numword | Word, letters and digits | beta1 |
asciihword | Hyphenated word, all ASCII | up-to-date |
hword | Hyphenated word, all letters | lógico-matemática |
numhword | Hyphenated word, letters and digits | postgresql-beta1 |
hword_asciipart | Hyphenated word part, all ASCII | postgresql in the context postgresql-beta1 |
hword_part | Hyphenated word part, all letters | lógico or matemática in the context lógico-matemática |
hword_numpart | Hyphenated word part, letters and digits | beta1 in the context postgresql-beta1 |
email | Email address | foo@example.com |
protocol | Protocol head | http:// |
url | URL | example.com/stuff/index.html |
host | Host | example.com |
url_path | URL path | /stuff/index.html, in the context of a URL |
file | File or path name | /usr/local/foo.txt, if not within a URL |
sfloat | Scientific notation | -1.234e56 |
float | Decimal notation | -1.234 |
int | Signed integer | -1234 |
uint | Unsigned integer | 1234 |
version | Version number | 8.3.0 |
tag | XML tag | <a href="dictionaries.html"> |
entity | XML entity | &amp; |
blank | Space symbols | (any whitespace or punctuation not otherwise recognized) |
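The token types listed above can also be inspected on a running server. The following is a minimal sketch using the ts_token_type and ts_parse functions; the sample string is only an illustration, and the order of the returned rows is not guaranteed.

```sql
-- List every token type the built-in parser can report.
SELECT tokid, alias, description
FROM ts_token_type('default');

-- Tokenize an illustrative string and attach the alias of each token type.
-- ts_parse returns only numeric token ids, so join against ts_token_type.
SELECT t.alias, p.token
FROM ts_parse('default', 'beta1 up-to-date 8.3.0') AS p
JOIN ts_token_type('default') AS t ON t.tokid = p.tokid;
```

The first query should return one row per token type, which makes it a convenient way to check the count stated above against the server version in use.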
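Note: The parser's notion of a "letter" is determined by the database's locale setting, specifically lc_ctype. Words containing only the basic ASCII letters are reported as a separate token type, since it is sometimes useful to distinguish them. In most European languages, token types word and asciiword should be treated alike.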
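As a rough illustration of treating word and asciiword alike, the sketch below checks the current lc_ctype setting and then maps both token types to the same dictionary. The configuration name my_config is hypothetical, and english_stem is used only as an example dictionary; substitute whatever suits the language in question.

```sql
-- Show the locale category that decides what the parser considers a letter.
SHOW lc_ctype;

-- my_config is a hypothetical name; it starts as a copy of the built-in
-- english configuration.
CREATE TEXT SEARCH CONFIGURATION my_config (COPY = pg_catalog.english);

-- Send both token types through the same dictionary so that ASCII-only
-- and non-ASCII words are normalized identically.
ALTER TEXT SEARCH CONFIGURATION my_config
    ALTER MAPPING FOR asciiword, word WITH english_stem;
```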
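It is possible for the parser to produce overlapping tokens from the same piece of text. As an example, a hyphenated word will be reported both as the entire word and as each component: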
SELECT alias, description, token FROM ts_debug('foo-bar-beta1');
      alias      |               description                |     token
-----------------+------------------------------------------+---------------
 numhword        | Hyphenated word, letters and digits      | foo-bar-beta1
 hword_asciipart | Hyphenated word part, all ASCII          | foo
 blank           | Space symbols                            | -
 hword_asciipart | Hyphenated word part, all ASCII          | bar
 blank           | Space symbols                            | -
 hword_numpart   | Hyphenated word part, letters and digits | beta1
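This behavior is desirable since it allows searches to work for both the whole compound word and for its components. Here is another instructive example: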
SELECT alias, description, token FROM ts_debug('http://example.com/stuff/index.html');
  alias   |  description  |            token
----------+---------------+------------------------------
 protocol | Protocol head | http://
 url      | URL           | example.com/stuff/index.html
 host     | Host          | example.com
 url_path | URL path      | /stuff/index.html
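To see why overlapping tokens help searching, the following sketch indexes the two sample strings and queries them by component. It assumes the built-in simple configuration, chosen here so that stemming and stop-word removal do not affect the result; the column aliases are illustrative.

```sql
-- Show which lexemes end up in the tsvector for the hyphenated word.
SELECT to_tsvector('simple', 'foo-bar-beta1') AS indexed_lexemes;

-- A query for one component and a query for the whole compound are both
-- expected to match, because the parser emitted tokens for each.
SELECT to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'beta1')         AS matches_part,
       to_tsvector('simple', 'foo-bar-beta1') @@ to_tsquery('simple', 'foo-bar-beta1') AS matches_whole;

-- The same idea applied to the URL example: search by host name alone.
SELECT to_tsvector('simple', 'http://example.com/stuff/index.html')
       @@ to_tsquery('simple', 'example.com') AS matches_host;
```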