Analyzing Unicode Text with Regular Expressions
Andy Heninger
IBM Corporation
******@us.
Abstract
For decades, regular expressions have been used to analyze text data: to search for
key words, to extract desired fields or substrings from larger bodies of text, and to
edit or transform text.
This paper will discuss the application of regular expressions to Unicode text data,
including the approaches and extensions that are required to work effectively with the
very large Unicode character repertoire. The emphasis is on Unicode specifically, not on
the features of regular expressions in general, a subject on which entire books
have been written.
A Very Quick Look At Regular Expressions
Although this paper deals primarily with Unicode-related questions, a regular
expression language is still needed for discussion and for use in examples. Here is a
minimalist one, smaller than most real implementations, but sufficient for the purpose.
26th Internationalization and Unicode Conference, San Jose, CA, September 2004
Item                            Definition
.                               Match any single character.
[range or set of characters]    Match any character of a class or set of characters.
                                Set expressions will be described later.
*                               Match 0 or more occurrences of the preceding item.
+                               Match 1 or more occurrences of the preceding item.
Literal Characters              Match themselves.
\udddd, \Udddddddd              Unicode Code Point Values, 16 or 32 bits.
( sub-expression )              Grouping. (abc)*, for example.
a|b|c                           Alternation. Match any one of 'a' or 'b' or 'c'.
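As a rough illustration of the items listed above, the following sketch exercises each one using Python's re module. Python is an assumption made here for illustration only; the minimalist language of this paper is not tied to any particular engine, and the helper name matches and the sample strings are hypothetical.

```python
import re


def matches(pattern: str, text: str) -> bool:
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None


# One entry per item of the minimalist language. Python's `re` stands in
# for the paper's notation; the sample strings are illustrative only.
results = {
    ".": matches(r".", "\u00e9"),            # any single character (U+00E9)
    "[a-z]": matches(r"[a-z]", "q"),         # one character from a set
    "*": matches(r"ab*", "a"),               # zero occurrences of 'b' is fine
    "+": matches(r"ab+", "a"),               # needs at least one 'b', so no match
    "literal": matches(r"Hello", "Hello"),   # literal characters match themselves
    "\\udddd": matches(r"\u0041", "A"),      # 16-bit code point escape (U+0041)
    "\\Udddddddd": matches(r"\U0001D11E", "\U0001D11E"),  # 32-bit escape
    "group": matches(r"(abc)*", "abcabc"),   # grouping with repetition
    "alternation": matches(r"a|b|c", "b"),   # any one of 'a', 'b' or 'c'
}

for item, matched in results.items():
    print(f"{item:12} {matched}")
```

Every entry except the one for + reports a match; that entry is deliberately a non-match, showing that + requires at least one occurrence of the preceding item where * does not.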
And, to make things more concrete, here are a few samples of simple expressions:
Expression Description
Hello Match or select appearances of the word “Hello” in the
target text.
aa[a-z]* Match any word beginning with “aa” and consisting of
only the lower case letters a-z. (Just what is in the range